I've covered Benford's law for first-digit fraud analysis before, and now Nate Silver has applied a similar method to polling results. He looked at the last digit of each number in various polls (e.g. a 48% McCain, 49% Obama, 3% undecided poll would be recorded as an 8, a 9 and a 3) and compiled histograms of their frequencies. Following up on his suspicions that all was not right with one polling firm in particular, Nate noticed that their results did not conform to the expected random distribution:
This data is not random at all. For instance, the trailing digit was '8' on 676 occasions, almost 60 percent more often than the 431 times that it was '1'. Over a sample of more than 5,000 data points, such an outcome occurring by chance alone would be an incredible fluke -- millions to one against. Bad luck can essentially be ruled out as an explanation.
One of two things seems to have happened, then.
One possibility is that there is some intrinsic, mathematical reason that certain trailing digits are more likely to come up than others. This is certainly possible -- and in fact, it would be somewhat likely if the polling data that we were looking at were homogeneous -- McCain versus Obama polls in Ohio, for instance.
But Strategic Vision's polls cover a wide array of topics: Presidential horse race numbers in any of a dozen or so states, senate and gubernatorial polling, primary polling, approval ratings of various kinds, polling on issues like the war in Iraq, and more abstract questions such as whether voters think that 'experience' or 'change' is the more important quality in a Presidential candidate. No one type of question, in no one state, represents more than a relatively small fraction of the sample. Under those circumstances, I can't think of any reason why the trailing digit wouldn't approach being random -- although there absolutely might be reasons that I haven't thought of.
But this data is not random. It's not close to random. It's not close to close. Which brings up the other possibility: Strategic Vision is cooking the books. And whoever is doing so is doing a pretty sloppy job. They'd seem to have a strong, unconscious preference for numbers ending in '7', for instance, as opposed to those ending in '6'. They tend to go with round numbers that end in '5' or '0' slightly too often. And they much prefer numbers with high trailing digits like 49 and 38 to those with low ones like 51 and 42.
I haven't really seen anyone approach polling data like this before, and I certainly haven't done so myself. So, we cannot rule out the possibility that there is some mathematical rationale for this that I haven't thought of. But it looks really, really bad. There is a substantial possibility -- far from a certainty -- that much of Strategic Vision's polling over the past several years has been forged.
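For anyone who wants to try this kind of check on their own data, here is a minimal sketch of a trailing-digit test. It is not Nate's code, and the quote doesn't say which test he ran; I'm assuming a plain Pearson chi-square goodness-of-fit test against a uniform distribution over the ten digits. The function names are my own, and 16.919 is the standard 5% critical value for 9 degrees of freedom.

```python
from collections import Counter

# Upper 5% critical value of the chi-square distribution with 9 degrees of freedom
# (10 digit categories minus 1).
CHI2_CRIT_9DF_05 = 16.919


def trailing_digit_counts(percentages):
    """Count how often each last digit 0-9 appears among integer poll percentages."""
    counts = Counter(p % 10 for p in percentages)
    return [counts.get(d, 0) for d in range(10)]


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts in every category."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def looks_non_uniform(counts):
    """True if the digit counts reject uniformity at the 5% level (assumed threshold)."""
    return chi_square_uniform(counts) > CHI2_CRIT_9DF_05


# The example poll from above: 48% / 49% / 3% contributes the digits 8, 9 and 3.
example_counts = trailing_digit_counts([48, 49, 3])
```

To give a sense of scale: if one assumes roughly 500 expected occurrences per digit (the quote says only "more than 5,000" data points, so this is a guess), the two quoted counts of 676 and 431 would by themselves contribute about 71 to the statistic, far beyond the 16.919 threshold.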
Is there a mathematical reason for such a discrepancy in poll results? I can think of only one possibility - there is a weak dependence structure in the data: the last digit of one candidate's number exerts some influence over the last digit of the other's. With no one undecided, the dependence is perfect: a 4 for one candidate requires a 6 for the other (e.g. 44% and 56%; 34% and 66%). With a fixed level of undecided people, the dependence remains perfect. If the undecided level is stochastic, then the dependence becomes weaker. However, it's unclear to me why this would skew the results in the manner this firm exhibits; dependence could cause high digits to be paired with low digits, or high digits with high digits (or low with low), but it wouldn't lead to more high digits in and of itself.
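The dependence argument can be checked on a toy model. The sketch below is purely illustrative and uses made-up ranges, not Strategic Vision's data: the first candidate's share takes every integer from 30 to 59 equally often, the second candidate gets whatever is left after the undecideds, and every combination is enumerated exactly rather than sampled. The function name `digit_tables` and all the ranges are my own assumptions.

```python
from collections import Counter
from itertools import product


def digit_tables(first_shares, undecided_levels):
    """Enumerate every (first share, undecided) combination with equal weight.

    Returns (joint, marginal_second): joint counts of
    (last digit of first share, last digit of second share), and the
    marginal counts of the second share's last digit.
    """
    joint = Counter()
    for a, u in product(first_shares, undecided_levels):
        b = 100 - a - u  # the second candidate gets the remainder
        joint[(a % 10, b % 10)] += 1
    marginal_second = Counter()
    for (_, second_digit), n in joint.items():
        marginal_second[second_digit] += n
    return joint, marginal_second


def partners(joint):
    """For each first-share digit, the set of second-share digits it is paired with."""
    return {d: {db for (da, db) in joint if da == d} for d in range(10)}


shares = range(30, 60)  # hypothetical: each last digit appears exactly three times

# Fixed undecided level (3%): each first digit pairs with exactly one second digit.
joint_fixed, marginal_fixed = digit_tables(shares, [3])

# Undecided level varying from 2% to 6%: each first digit pairs with several second digits.
joint_varying, marginal_varying = digit_tables(shares, range(2, 7))

partners_fixed = partners(joint_fixed)
partners_varying = partners(joint_varying)
```

In this toy setup the pairing between digits goes from one-to-one to one-to-five as the undecided level starts to vary, yet the marginal counts of the second candidate's last digit stay perfectly flat in both cases. That is consistent with the point above: dependence changes which digits travel together, not how often each digit shows up.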
I'm very curious (and not just as a statistician) to see what comes of this... as much as we'd like statistics to give us firm answers, often the limit of its ability is to reveal probable courses of investigation, or lend strong - but at some level uncertain - backing to an argument.