A Derivation of Benford's Law or: Roll Your Own Fraud Detector

May 1, 2009 in Math

Recently a favorite statistical anomaly of mine came up in the course of my being constantly and utterly fascinated by our universe. It is called Benford's law, and may be paraphrased like this:

Given some number randomly generated by a natural process, there is roughly a 30% chance that the first digit of that number is a 1. The chance falls for each successively higher digit, bottoming out at a roughly 5% probability that the first digit is a nine.

The law is exceptionally simple, but nonetheless surprising - shouldn't every digit have an equal chance of leading off a number? And yet it holds true for countless datasets: RBIs in a season, mountain heights, CEO salaries, even the street addresses of the first 342 names listed in the book American Men of Science.  In fact, your own address has a 30% chance of beginning with a 1!

In the course of trying to track down the exact math behind this phenomenon, I was struck by how few of the explanations floating around the internet described it in plain English. So, in the words of Tom Lehrer, "I have a modest example here."

Newcomb's Discovery

Back in 1881 (I hope you didn't seriously expect me to get straight to the point), an astronomer named Simon Newcomb was looking up values in a book of logarithmic tables. Logarithms are, of course, a rather boring piece of math that people like me delight over and people who trade bonds used to, before they began trusting computers to be delighted on their behalf. A logarithm answers the question, "10 raised to what power equals X?" So the log of 10 is 1; the log of 100 is 2; and so forth. Logs of numbers that aren't simply a 1 followed by zeros are tedious to calculate by hand, so in the days before calculators mathematicians kept books of logarithmic tables for looking up the answers. The pages at the front contained the logarithms of numbers starting with the digit one (like 1, 1.2, 1.5, etc.), and the pages at the back contained those starting with the digit nine (like 9, 9.2, 9.5, etc.). This was all that was needed, since a nice property of logarithms is that log(xy) = log(x) + log(y): the log of a large number like 927 (which is 2.967) can simply be restated as log(9.27) + log(100), or 0.967 + 2. You might notice a connection to scientific notation here; it will come in handy later.
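
If you'd like to check that arithmetic yourself, here is a minimal sketch in Python (my own illustration; no log table required):

    import math

    # A table user would peel off the power of 10, then look up the small part.
    x = 927.0
    whole = math.log10(x)                       # 2.9671...
    parts = math.log10(9.27) + math.log10(100)  # 0.9671... + 2
    print(round(whole, 4), round(parts, 4))     # both print 2.9671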

Anyway, Newcomb was looking through just such a book when he noticed that the pages at the front were more worn than the pages near the end (the pre-digital equivalent of planned obsolescence). This was very surprising, for it suggested that he was looking up the logarithms of small numbers far more often than those of large numbers. He would have expected to consult the book uniformly at random; instead, numbers with low first digits dominated his work. Newcomb was so intrigued by this that he examined the frequency of each first digit and even published a paper containing a formula describing the empirical relationship:

p(n) = log(n+1)-log(n)

(p(n) means "the probability that a given number begins with the digit n.") I find it interesting that Newcomb came up with a logarithmic law because his book of logarithms was dirty. It is even more interesting, however, that Newcomb's proposed relationship was actually correct, though he did not prove it at the time. In fact, following its publication Newcomb's observation was promptly forgotten for more than half a century, which - hard as it may be to believe - should give you some sense of how exciting the world of mathematics was at the time.
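
To see what the formula actually predicts, here is a quick Python sketch (mine, using the base-10 logs Newcomb would have used):

    import math

    # Newcomb's formula: p(n) = log(n+1) - log(n)
    for n in range(1, 10):
        p = math.log10(n + 1) - math.log10(n)
        print(n, f"{p:.1%}")

    # Prints 30.1% for 1, 17.6% for 2, ... down to 4.6% for 9.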

Benford's Law

In 1938, Newcomb's puzzle resurfaced, and strangely enough in exactly the same manner. A scientist named Frank Benford rediscovered the pattern after noticing how worn the early pages of his logarithm book had become. Benford proceeded to comb through more than 20,000 numbers in search of the strange pattern, examining all manner of measurements including the aforementioned 342 street addresses, molecular weights, even numbers appearing in newspaper articles. Everywhere, he found the same result: numbers began with the digit 1 about 30% of the time.

Benford ultimately rederived Newcomb's formula, and it has been known as Benford's law ever since. But the law remained elusive: a mathematically rigorous proof did not arrive until 1995, courtesy of Theodore Hill - more than a century after the original finding (and long after the last book of logarithmic tables was printed).

Benford's law is most useful (aside from being a curiosity) in detecting fraud, a trick it's actually quite good at because humans tend to ignore Benford's law when inventing numbers. In fact, we typically do our best to make sure no digit appears more often than any other. This is because we are very bad at dealing with randomness. Theodore Hill used to ask half his students to record 200 real coin flips, and the other half to invent 200 fake coin flips. He would then collect their work and - with 95% accuracy - identify the fake records. How? People trying to mimic randomness almost never put 6 heads or tails in a row, since such a pattern seems "unrandom." But 6 in a row is just as "random" as any other combination, and occurs in 200 flips with surprisingly high probability. Benford's law may be applied similarly, and has been used to catch tax and corporate fraud, since invented numbers invariably fail to align with Benford's predicted frequencies.
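
If you doubt the claim about runs, a short simulation settles it. Here is a minimal sketch in Python (my own toy, not Hill's actual classroom procedure):

    import random

    # Does a sequence of 200 fair coin flips contain a run of 6 or more
    # identical outcomes?
    def has_long_run(flips, length=6):
        run = 1
        for prev, cur in zip(flips, flips[1:]):
            run = run + 1 if cur == prev else 1
            if run >= length:
                return True
        return False

    trials = 10_000
    hits = sum(has_long_run(random.choices("HT", k=200)) for _ in range(trials))
    print(hits / trials)  # lands around 0.96 -- a run of six is the norm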

Why low numbers are more likely

The central question is: why should a number be more likely to start with a relatively small digit than a relatively large one? Aren't digits distributed evenly?

Logically, the probability of choosing a certain first digit is the count of numbers beginning with that digit divided by the count of all the numbers in your dataset. Since we can't know the numbers in the dataset a priori, let's assume every number is equally likely, and our "dataset" is a subset of the positive integers.

We begin with the simplest case, in which the dataset is just the number 1. Here the probability of choosing a 1 as the first digit is 1/1, or 100%. Expanding the dataset to include a 2 drops the probability of choosing a 1 to 50%, and adding a 3 reduces the chances to 33%. By the time the dataset runs all the way up to 9, the probability of choosing a 1 is a lowly 11%. But watch what happens when a 10 is added: the probability of selecting a 1 jumps back to 20%. And why stop at 10? After 11, the probability of getting a 1 is up to 27%. Meanwhile, the probability of selecting every other first digit is falling.

As we keep expanding the dataset - through the 20's and 30's - the low first digits become more and more frequent, while the probability of choosing a high one falls. If the scale runs to 78, the chance of picking a 9 is just over 1%, and 9 will remain less likely than every other digit until the dataset reaches 99, where all nine digits briefly equalize. But passing 99 opens the door to a run of one hundred numbers starting with 1, and so the low digits quickly dominate again.
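
The same thought experiment takes only a few lines of Python (a toy of mine, counting over the set {1, ..., N}):

    # Share of the numbers 1..N whose first digit is d.
    def first_digit_share(N, d):
        hits = sum(1 for k in range(1, N + 1) if str(k)[0] == str(d))
        return hits / N

    for N in (9, 10, 11, 78, 99, 199):
        print(N, f"1: {first_digit_share(N, 1):.0%}", f"9: {first_digit_share(N, 9):.1%}")

    # The 1's share jumps from 11% at N=9 to 20% at N=10 to 27% at N=11,
    # while the 9's share sinks to about 1.3% at N=78.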

Exponential growth

The problem with the thought experiment we just performed is that each time the scale passes through another power of 10 (first the 1's, then the 10's, then the 100's, etc.), the numbers in that order of magnitude swamp the previous findings, because there are ten times as many of them. This is a problem because our probabilities are based on counts of observations.

The simplest way to overcome this problem is to alter our methodology somewhat: instead of adding 1 to our dataset at each step, let's add 10%. Start with the number 1, and increase it 10% to 1.1. Another 10% gets you to 1.21, and so on. After 8 increases, you will finally have a number starting with a 2 (1.1^8 is about 2.14). Another 4 increases and you have a first digit of 3. By the time we climb back to a number beginning with a 1, the empirical frequency of seeing a 1 is slightly more than 30%, and of seeing a 9 about 4% - fairly close to Benford's logarithmic law. As we continue counting, the frequencies converge to Benford's values.
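
Here is that procedure as a minimal Python sketch (my own; it shifts the running value back into the range [1, 10) so the first digit is easy to read off):

    from collections import Counter

    tallies = Counter()
    x = 1.0
    for _ in range(100_000):
        tallies[int(x)] += 1  # x stays in [1, 10), so int(x) is its first digit
        x *= 1.1              # grow by 10%...
        while x >= 10:
            x /= 10           # ...and drop back a power of 10 when we pass one

    for d in range(1, 10):
        print(d, f"{tallies[d] / 100_000:.1%}")

    # 1 lands near 30.1% and 9 near 4.6%, matching the logarithmic law.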

What we've done is ensure that our measurements are scale invariant by taking fewer observations as the numbers get larger. Previously we grew the dataset arithmetically; now we grow it exponentially, which matches how the number line sprawls across orders of magnitude. This ensures that even though there are far more numbers starting with a 1 between 1000 and 1999 than between 10 and 19, we don't give them extra weight in our calculation.

Logarithmic scaling

We've established that our number scale grows exponentially, which is to say by powers of 10. Taking fewer observations by growing our dataset exponentially gets us to the right answer, but it is not an elegant solution in the sense that it relies on "brute force" - count, measure probability, repeat. We are forced to do this because the number of observations of each digit increases as the number line grows, and our probability calculation depends on the number of observations.

Instead, let's rescale the number line using a logarithm to eliminate exponential growth entirely (recall that a logarithm undoes raising 10 to a power). On a logarithmic scale, the weight given to the numbers from 1 to 10 is the same as the weight given to the numbers from 10 to 100. Witness the following:

[Figure: the numbers 1 through 9 drawn as blocks of equal width on an arithmetic scale, and as progressively narrower blocks on a logarithmic scale]

Think of the horizontal width of each block as the weight given to an observation of that number. On a normal arithmetic scale, each block has equal weight - the jump from 1 to 2 is the same as that from 101 to 102. By contrast, the logarithmic scale de-emphasizes larger numbers. You may think of the logarithmic blocks as getting smaller because the next 90 blocks have to fit in the same horizontal space as the first 9; the weight given to 10-99 is equal to the weight given to 1-9. Here is the scale expanded to include the next power of 10, meaning all the numbers through 99:

[Figure: the logarithmic scale extended through 99; the blocks for 10-19 together span the same width as the block for 1]

Note that all the 10's take up exactly as much space on the scale as the number 1; the 20's take up as much space as 2, and so on. This is exactly what we wanted, because we are only interested in the relative frequencies of the first digit. Based on this chart, the probability of getting a first digit of 1 is simply the amount of horizontal space it takes up (i.e. the weight given to it) divided by the total length of the chart. This is an elegant way of thinking about it, without resorting to counting observations.

We can determine the length of any logarithmic block by subtracting its logarithm from the logarithm of the block to its right. Therefore, the total length of the chart covering 1 through 9 is 1, because log(1) = 0 and log(10) = 1. The logarithm of 2 is .301, so the 1's take up .301/1, or 30.1% of the logarithmic number line.

If we want to include all numbers through 99, the math is almost identical. The total length of the number line is 2, because log(100) = 2. We know already that the number 1 takes up .301 units, and we've stated that the 10's take up exactly the same amount of space, but let's prove it out anyway. log(20) = 1.301 and log(10) = 1, so the 10's take up .301 units as expected. The total length taken up by numbers starting with 1 is then (.301 + .301) / 2, or 30.1%.
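
The same computation in a few lines of Python (a sketch of mine mirroring the arithmetic above):

    import math

    # Total length of the log number line from 1 to 100: two decades.
    total = math.log10(100) - math.log10(1)  # = 2.0

    ones = math.log10(2) - math.log10(1)     # the block for 1        (0.301)
    tens = math.log10(20) - math.log10(10)   # the blocks for 10-19   (0.301)
    print((ones + tens) / total)             # 0.3010..., i.e. 30.1%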

It should be clear that on the logarithmic scale we never need to deal with numbers greater than 9, because the pattern simply repeats with every power of 10. So the math for determining the probability of seeing a given first digit simplifies nicely: the probability of observing a first digit n is the width of n's logarithmic block, which is the difference between the log of the number just above n and the log of n:

p(n) = log(n+1)-log(n)

Hey -- that's Benford's law!

Also, I should note that with a little manipulation you can adjust Benford's law to work in counting systems that aren't base-10 like ours. I leave that as an exercise for the reader, which, curiously enough, is what my textbooks always used to say right when I needed them most.

So, now what?

Well, that's really it - a (relatively) plain English derivation of a relatively obscure piece of math that confounded mathematicians for a very long time. If you've actually read all the way down here, then I hope you're the sort of person (read: math nerd) for whom the explanation is sufficient reward in and of itself; if you've struggled to the bottom of this post and find yourself disappointed, maybe it's time to lay off the math.

An interesting point is that you don't really need to wade through all the "counting observations" stuff; you can skip straight to the logarithms with this insight: scale invariance is only satisfied by logarithmic scales. I hinted at this earlier. If you have rivers measured in miles and convert those measurements to feet, or inches, Benford's law of first digits must still hold true. This means the measuring scale must compress large numbers, and the logarithmic scale is the one that meets that requirement. I only include the counting part because, for someone totally unfamiliar with the idea of number scales, the logarithmic twist would be a bit much coming out of nowhere. The counting method is a fairly friendly way of legging into the idea that first-digit probabilities rely heavily on the scale you choose.
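
To watch scale invariance in action, here is a small Python sketch (my own toy: log-uniform data standing in for river lengths, converted at 5,280 feet and 63,360 inches to the mile):

    import math
    import random

    def first_digit(x):
        return int(x / 10 ** math.floor(math.log10(x)))

    # Log-uniform data spread over six decades obeys Benford's law exactly.
    rivers_in_miles = [10 ** random.uniform(0, 6) for _ in range(100_000)]

    for unit, factor in (("miles", 1), ("feet", 5280), ("inches", 63360)):
        ones = sum(1 for r in rivers_in_miles if first_digit(r * factor) == 1)
        print(unit, f"{ones / len(rivers_in_miles):.1%}")

    # All three print roughly 30% -- changing units doesn't disturb the
    # first-digit frequencies.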

Benford's law will come up in very surprising places, but there are a few places it will not come up - the populations of congressional districts, for example. That's because such numbers are deliberately engineered (districts are drawn to hold roughly equal populations) and cannot be thought of as random or having been generated by a random process. For appropriate numbers, however, the law can provide a simple validity check that a dataset has not been tampered with. Indeed, even after you follow all the math, it is still somewhat surprising that the number of times the digit 1 leads off the entries in a list can be enough evidence to declare that list fraudulent.
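
And in the spirit of this post's title, here is a minimal fraud-detector sketch in Python (a toy of mine; a real forensic test would use a proper goodness-of-fit statistic such as chi-squared):

    import math
    import random
    from collections import Counter

    def first_digit(x):
        x = abs(x)
        return int(x / 10 ** math.floor(math.log10(x)))

    # Largest gap between a dataset's first-digit frequencies and Benford's.
    def benford_deviation(numbers):
        counts = Counter(first_digit(n) for n in numbers if n != 0)
        total = sum(counts.values())
        return max(
            abs(counts[d] / total - (math.log10(d + 1) - math.log10(d)))
            for d in range(1, 10)
        )

    natural = [10 ** random.uniform(0, 4) for _ in range(10_000)]   # Benford-ish
    invented = [random.uniform(1, 10_000) for _ in range(10_000)]   # uniform "fakes"
    print(benford_deviation(natural), benford_deviation(invented))

    # The uniformly invented numbers deviate far more. A big deviation is a
    # flag worth investigating, not proof of fraud on its own.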

Just don't go too crazy with all the Benford-ing.

People might think you're a little weird.
