Posts tagged as:

probability

He clearly didn’t give it 110%

November 17, 2009 in Math

Silicon Alley Insider is running a series of posts called “15 _______ questions that will make you feel stupid.” The blank has been filled twice with “Google interview” and most recently with “management consultant interview.”  I particularly enjoyed one of the Google questions:

If the probability of observing a car in 30 minutes on a highway is 0.95, what is the probability of observing a car in 10 minutes (assuming constant default probability)?

I have no idea how the word “default” snuck in there – I’m guessing whoever wrote this had a need to relate things back to dangerous CDS! – but the question is a good one. However, the answers posted on the site are absolutely horrendous. One ardent commentator wrote:

“observing a car in 30 minutes on a highway” 
If 30 min = 95%, then 100% probability = 30/0.95 => 31.5 min 
(ie, the max interval between 2 cars could be 31.5 min) 
Probability in 10 min = 10/31.5

You have to wonder if, by his logic, there’s really a 110% chance of seeing a car in 34.7 minutes?

The correct answer is below…

  1. The probability of observing no cars in 30 minutes is 1-95%, or 5%
  2. The probability of observing no cars in 10 minutes, p, must agree with the statement p^3 = 5%, since three consecutive carless 10 minute periods will pass with 5% probability.
  3. Therefore, p = 36.8%
  4. And the probability of observing a car in 10 minutes is 1-p, or 63.2%.
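The four steps above can be checked numerically. This is only a sketch: it assumes, as the question does, that every 10-minute window is independent and equally likely to contain a car, and the function name is made up for illustration.

```python
def prob_car_in_shorter_window(p_long, ratio):
    """Chance of seeing at least one car in a window `ratio` times shorter.

    Assumes the short windows are independent and identical, which is
    the "constant probability" condition stated in the question.
    """
    p_none_long = 1.0 - p_long                    # no car in the long window
    p_none_short = p_none_long ** (1.0 / ratio)   # no car in one short window
    return 1.0 - p_none_short


p10 = prob_car_in_shorter_window(0.95, 3)
print(f"{p10:.1%}")  # about 63.2%
```

Cubing the "no car" probability for 10 minutes should give back the 5% for 30 minutes, which is a handy sanity check on any proposed answer.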

{ 0 comments }

Living in a Bayesian world

October 30, 2009 in Math

Increasingly, I’ve noted in my discussions with statisticians and practitioners a reliance on Bayesian methods. Bayesian statistics rely on an understanding of the uncertainty of a hypothesis. For example, Bayesian hypotheses are literally updated as new information becomes available. Bayesian analyses will also rely heavily on conditional probabilities, or the understanding of likelihoods that depend on the occurrence of related events. One of the biggest Bayesian proponents is Professor Andrew Gelman, who maintains an excellent blog and is involved in fivethirtyeight.com.

In some ways, Bayesian methods have become a bit fad-like and, as with many fads (I’m looking at you, VaR), there should be concern that they will be applied blindly, without thought. Like anything else, it’s possible to do Bayesian statistics wrong – and even extremely wrong – but when wielded correctly, they make for an excellent investigative resource.

New Scientist has an article on the use – and misuse – of probability in criminal cases. Naturally, it focuses on Bayesian statistics. The key point the article makes is that while it’s important to consider the odds of something happening, it is just as critical to account for the odds of it happening by chance. That may seem contradictory (isn’t an event’s likelihood, by definition, the probability it happens by chance?) so let’s use a classic example, lifted from the article:

You have just tested positive for a disease that affects 1 in every 10,000 people. The test is 99% accurate. On the surface, that sounds like a sound diagnosis, and most people would say they are 99% confident that they do, in fact, have the disease. But consider the following: if every one of the 10,000 people took the same test, then 1 of them would yield a true positive and about 100 more (1% of the 9,999 healthy people) would exhibit false positives just by chance. Therefore, among people who have tested positive, there is only about a 1% chance of actually having the disease – not the 99% likelihood we naively assumed before!

How does that work – wasn’t there only a 1% chance of the test being wrong? Well, yes – but if you think about it, that 1% chance of error is much larger than the 0.01% chance of having the disease in the first place and the test result must be placed in that context. For the more spatial readers, here is a picture from New Scientist:

The false positive problem is a classic textbook example of how Bayesian reasoning (that is, accounting for the ways in which chance can manifest itself) can affect a seemingly obvious result. It’s a very important consideration which could be overlooked without care. And besides, it makes for interesting pop sci articles.
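For readers who prefer arithmetic to pictures, the disease example can be run through Bayes' rule directly. The sketch below assumes that "99% accurate" means the test is right 99% of the time for sick and healthy people alike; the article does not spell that out, and the function and parameter names are mine.

```python
def chance_sick_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule for a positive test result.

    prevalence:  share of the population with the disease
    sensitivity: P(test positive | sick)
    specificity: P(test negative | healthy)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumption: "99% accurate" applies to both kinds of error.
p_sick = chance_sick_given_positive(1 / 10_000, 0.99, 0.99)
print(f"{p_sick:.2%}")  # just under 1%
```

Raising the prevalence to, say, 1 in 100 pushes the answer to 50%, which shows how strongly the result depends on the base rate rather than on the test alone.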

{ 0 comments }

Lottery math is not so easy

September 23, 2009 in Math

Carl Bialik has written about lottery coincidences in his WSJ print column and on The Numbers Guy blog, inspired of course by the recent consecutive draws in the Bulgarian lottery. Addressing my recent confusion, he sheds a little light on why likelihood estimates varied so much:

The probability of Bulgaria’s repeated winning numbers became a subject of some disagreement. A Bulgarian mathematician estimated the probability at 1 in 4.2 million, a figure that was widely reported. Clio Cresswell, a mathematician at the University of Sydney in Australia, came up with 1 in 14 million. Many others arrived at 1 in 5.2 million.

One explanation for the wide range is that Bulgaria has multiple lotteries. Dr. Cresswell’s calculations relied on a different Bulgarian lottery with numbers ranging from 1 to 49. Mr. Smith and others made their calculations assuming the possible numbers went up to 42, the correct range for this particular lottery. As for the 1-in-4.2 million estimate, the Bulgarian mathematician didn’t respond to requests for comment.

The blog post in particular is full of really interesting links – I especially enjoyed Professor Leonard Stefanski’s account (pdf) of trying to reconcile accurate statistics with the media’s desire for sensationalism.

{ 0 comments }

Adventures in probability

September 17, 2009 in Math, News

A funny thing happened in the Bulgarian national lottery this week: the same numbers were drawn as last week.

The BBC and the AP both report the odds at 1 in 4 million; ABC Australia calls it 1 in 14 million. People are demanding that the Bulgarian lottery perform an investigation because no one can believe the result. Since it’s not every day this sort of probability question comes up in the news, let’s take a second to walk through the problem.

The Bulgarian lottery format consists of 6 numbers drawn from a collection of 42, without replacement.  Let’s start by considering the probability of observing any single combination of balls. You might begin like this: The first ball could be any of 42. Once it is chosen, the next ball could be any of the remaining 41. After that, the third ball has 40 possibilities, etc. The number of possible outcomes is therefore

42\times41\times40\times39\times38\times37=3.8 \textrm{ billion.}

However, this would consider 1 – 2 – 3 – 4 – 5 – 6 to be different from 6 – 5 – 4 – 3 – 2 – 1, which is wrong because order does not matter in the Bulgarian lottery. To figure out the correct number of combinations, we instead need to use the choose function, or the binomial coefficient.

The choose function \binom{n}{k} is pronounced as “n choose k” and yields the number of ways that samples of size k can be chosen from a population of size n, if order does not matter. This is exactly what we are looking for – how many combinations of 6 balls can be formed from a group of 42, irrespective of order? The answer is

\binom{42}{6} = 5.2\textrm{ million.}

So the chance of seeing any specific outcome in the lottery (or, put another way, the chance of winning the lottery) is 1 in 5.2mm.
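Both counts are easy to reproduce with Python's standard library (`math.perm` and `math.comb`, available since Python 3.8). This is just a check of the numbers above, with variable names of my own choosing.

```python
import math

# Ordered draws: 42 * 41 * 40 * 39 * 38 * 37
ordered = math.perm(42, 6)

# Unordered draws: "42 choose 6"
unordered = math.comb(42, 6)

print(ordered)    # 3776965920, about 3.8 billion
print(unordered)  # 5245786, about 5.2 million

# Each unordered combination can be arranged in 6! = 720 different orders.
print(ordered // math.factorial(6) == unordered)  # True
```

The last line shows where the gap between the two figures comes from: dividing out the 720 orderings of six balls turns 3.8 billion into 5.2 million.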

What’s the probability of seeing the same outcome in two consecutive weeks? One’s first impulse might be to say it’s 1 in 5.2 million squared, or nearly 1 in 28 trillion. But, unsurprisingly, our first instinct is wrong. The chance of seeing a specific combination of balls (such as 1 – 2 – 3 – 4 – 5 – 6) in two consecutive weeks is indeed 1 in 28 trillion. However, the chance of seeing any combination at all repeat in consecutive draws is… 1 in 5.2 million.

How so? Start with what we know: for a given outcome, the chance is 1 in 28 trillion of seeing it twice in a row. But there are 5.2 million possible outcomes, any of which could have a double header. Thus, the math for any outcome repeating is the 1 in 28 trillion chance of a repeat times the 5.2 million different outcomes, for a final likelihood of 1 in 5.2 million.

This holds true for any population with n choices – and is always one factor smaller than what the human brain naively believes. Consider a coin flip. It has two outcomes, heads and tails (n = 2). After f flips, there are n^f possible outcomes, and exactly n of those outcomes exhibit the same result in every flip. This is because the number of outcomes increases geometrically but the number of repeated items can never exceed the number of initial states. So, after two flips there are four outcomes (HH, TT, HT, TH) with two repeated results and after three flips there are eight outcomes (HHH, TTT, HHT, HTT, HTH, THT, THH, TTH) and still just two repeats. We may conclude that after any number of flips, the probability of seeing a repeated outcome is

\frac{n}{n^f}=\frac{1}{n^{f-1}}.

After two coin flips, the probability is 2 in 4 or 50%; after three flips it is 2 in 8 or 25%. These simple cases extend nicely to the case where n = 5.2 million and f = 2, where it should be 5.2 million in 5.2 million squared. That simplifies back to 1 in 5.2 million.
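The small cases can be confirmed by brute force. The sketch below lists every possible sequence and counts the ones where all results match; it is only practical for tiny n and f, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_all_same(outcomes, draws):
    """Exact chance that `draws` independent, equally likely picks from
    `outcomes` all come out the same, found by listing every sequence."""
    sequences = list(product(outcomes, repeat=draws))
    repeats = [s for s in sequences if len(set(s)) == 1]
    return Fraction(len(repeats), len(sequences))


print(prob_all_same("HT", 2))      # 1/2
print(prob_all_same("HT", 3))      # 1/4
print(prob_all_same(range(6), 2))  # 1/6, e.g. two rolls of a die
```

In each case the result matches 1 over n to the power f - 1, the same pattern that gives 1 in 5.2 million for two lottery draws.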

I prefer to think of the problem like this: “Given a draw in the first week, what’s the probability of seeing that draw again?” In other words, conditional on the first week’s number, what’s the probability that the second week’s number is the same as the first? Since all combinations are equally likely, the first week’s numbers have only a 1 in 5.2 million chance of being drawn the second week as well. Some people may not like this logic because they feel it ignores the 5.2 million outcomes in the first week, but they actually are accounted for. By conditioning on the first week, we no longer need to consider all of its possibilities.

Let’s take it one step further. These calculations give the probability of seeing consecutive draws in a two week span, but ignore the fact that this lottery is played every single week. In a year, that’s 51 chances at getting a consecutive draw – surely that improves the odds of observing this result! The probability of not drawing consecutive outcomes in any two weeks is

\frac{\binom{42}{6} -1}{\binom{42}{6}}= 0.9999998.

It is expressed as the number of non-consecutive outcomes divided by the total number of outcomes. Unsurprisingly, it is 5.2 million-less-one to 1 against. After 51 weeks, the probability of not seeing any consecutive outcomes is that number to the 51st power, or 0.9999903. Therefore, the probability of at least one repeat during the year is the complementary probability, or roughly 1 in 103 thousand. Take that out over a number of years and the odds of observing repeats continue to increase. Remember that even the most unlikely event has a fairly high chance of being observed if the experiment is run many times. If you asked the question, “What’s the probability of seeing consecutive draws at any point in the history of the Bulgarian lottery?” you might find the answer surprisingly high. It’s just the probability of this specific week in September 2009 being the repeat which is so low.
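The yearly figure can be reproduced in a few lines. The sketch assumes one draw per week, so 52 draws give 51 back-to-back pairs, and treats each pair as an independent chance of a repeat, as the calculation above does. Variable names are mine.

```python
import math

combos = math.comb(42, 6)                 # 5,245,786 possible draws

p_no_repeat_pair = (combos - 1) / combos  # one pair of weeks, no repeat
p_no_repeat_year = p_no_repeat_pair ** 51 # 51 pairs in a year, none repeat
p_repeat_year = 1 - p_no_repeat_year      # at least one repeat in a year

print(f"{p_no_repeat_year:.7f}")          # 0.9999903
print(f"1 in {1 / p_repeat_year:,.0f}")   # roughly 1 in 103,000
```

Swapping the exponent for the number of consecutive pairs in the lottery's whole history gives the "at any point" probability mentioned above; I have not looked up how long the lottery has run, so that number is left to the reader.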

The most interesting thing to me is that no one won the lottery the first week but a record 18 people won the second week – it would appear that playing the previous week’s numbers is a strategy people follow. Since the numbers are independent, there’s nothing smart or foolish about this from a probabilistic standpoint, though if you employed a little psychology you would stay away from “popular” numbers in order to avoid sharing the pot if you won.

Anyway, I digress. I put the probability of this observed outcome at 1 in 5.2 million. I’m curious to know how the other news agencies calculated their odds; perhaps there’s some twist in the lottery I’m unaware of?

{ 4 comments }

When worlds collide

June 17, 2009 in Math

I just learned from Andrew Gelman that Mandelbrot wrote a paper on taxonomies… in 1955.

{ 0 comments }

Recently a favorite statistical anomaly of mine came up in the course of my being constantly and utterly fascinated by our universe. It is called Benford’s law, and may be paraphrased like this:

Given some number randomly generated by a natural process, there is roughly a 30% chance that the first digit of that number is a 1. There is a decreasing chance of the number starting with each successively higher digit, culminating with a 5% probability that the first digit is a nine.

The law is exceptionally simple, but nonetheless surprising – shouldn’t every digit have an equal chance of leading off a number? And yet it holds true for countless datasets: RBIs in a season, mountain heights, CEO salaries, even the street addresses of the first 342 names listed in the book American Men of Science.  In fact, your own address has a 30% chance of beginning with a 1!

In the course of trying to source the exact math behind this phenomenon, I was struck by how few of the explanations floating around the internet described this phenomenon in plain English, and in the words of Tom Lehrer, “I have a modest example here.”

Newcomb’s Discovery

Back in 1881 (I hope you didn’t seriously expect me to get straight to the point), an astronomer named Simon Newcomb was looking up values in a book of logarithmic tables.  Logarithms are, of course, a rather boring piece of math that people like me delight over and people who trade bonds used to, before they began trusting computers to be delighted on their behalf. A logarithm answers the question, “10 raised to what power equals X?” So the log of 10 is 1; the log of 100 is 2; and so forth.  Logs of numbers which don’t begin with a 1 and end with many 0’s are extremely difficult to calculate by hand, and in the days before calculators mathematicians would keep books of logarithmic tables for referencing the solution to these tedious functions. The pages at the front contained the logarithms of numbers starting with the number one (like 1, 1.2, 1.5, etc.) and the back pages listed those of numbers starting with the number nine (like 9, 9.2, 9.5, etc.)  This was all that was needed, since a nice property of logarithms is that log(xy) = log(x) + log(y), so the log of a large number like 927 (which is 2.967) can be simply restated as log(9.27) + log(100), or 0.967 + 2. You might note a connection to scientific notation, which might come in handy later.

Anyway, Newcomb was looking through just such a book when he noticed that the pages at the front of the book were more worn than the pages near the end (the pre-digital equivalent of planned obsolescence). This was very surprising, for it suggested that he was looking up the logarithms of small numbers far more than those of large numbers. He would have expected that he would look up numbers uniformly randomly throughout the book. Instead, numbers with low first digits seemed to dominate his work.  Newcomb was so intrigued by this that he examined the frequency of each first digit and even published a paper containing a formula describing their empirical relationships:

p(n) = log(n+1)-log(n)

(p(n) means “the probability that a given number begins with the digit n.”) I find it interesting that Newcomb came up with a logarithmic law because his book of logarithms was dirty.  It is even more interesting, however, that Newcomb’s proposed relationship was actually correct, though he did not prove so at the time. In fact, following its publication Newcomb’s observation was promptly forgotten for nearly half a century, which – hard as it may be to believe – should give you some sense of how exciting the world of mathematics was at the time.
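To see what the formula predicts for each digit, here is a short sketch that tabulates p(n) for n from 1 to 9 using base-10 logarithms. The function name is mine.

```python
import math


def benford(n):
    """p(n) = log(n + 1) - log(n), with logs taken base 10."""
    return math.log10(n + 1) - math.log10(n)


for n in range(1, 10):
    print(n, f"{benford(n):.1%}")

# The nine probabilities telescope to log(10) - log(1) = 1.
print(round(sum(benford(n) for n in range(1, 10)), 12))  # 1.0
```

The first line of output is the familiar 30.1% for a leading 1, and the last digit, 9, comes in a little under 5%.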

Benford’s Law

In 1938, Newcomb’s puzzle resurfaced, and strangely enough in exactly the same manner. A scientist named Frank Benford rediscovered the pattern after noticing how worn the early pages of his logarithm book had become. Benford proceeded to comb roughly 20,000 observations drawn from 20 datasets for evidence of the strange pattern, examining all manner of measurements including the aforementioned 342 street addresses, molecular weights, even numbers appearing in newspaper articles.  Everywhere, he found the same result: numbers began with the number 1 about 30% of the time.

Benford ultimately rederived Newcomb’s formula, and it has been known as Benford’s law ever since. But the law remained elusive, as a mathematically rigorous proof was not derived until 1995 by Theodore Hill – more than a century after the original finding (and long after the last logarithmic book was printed).

Benford’s law is most useful (aside from being a curiosity) in detecting fraud, a trick it’s actually quite good at because humans tend to ignore Benford’s law when inventing numbers.  In fact, we typically do our best to make sure no digit is more present than any other. This is because we are very bad at dealing with randomness. Theodore Hill used to ask half his students to record 200 coin flips, and ask the other half to invent 200 fake coin flips.  He would then collect their work and – with 95% accuracy – identify the fake records.  How?  People trying to mimic randomness never put 6 heads or tails in a row, since such a pattern seems “unrandom.”  But 6 in a row is actually just as “random” as any other combination, and occurs in 200 flips with surprisingly high probability.  Benford’s law may be applied similarly, and has been used to catch tax and corporate fraud, since the invented numbers invariably fail to align with Benford’s predicted frequencies.
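The claim about streaks can be checked exactly. The sketch below assumes a fair coin and independent flips, and tracks the length of the current streak from flip to flip; it illustrates the probability involved and is not a reconstruction of how Hill graded his students. The function name is hypothetical.

```python
def prob_streak(n_flips, streak):
    """Chance that n_flips fair coin flips contain a run of at least
    `streak` identical results (all heads or all tails)."""
    if n_flips < streak:
        return 0.0
    if streak <= 1:
        return 1.0
    # alive[r]: probability that no qualifying run has appeared yet and
    # the current run of identical results has length r (1 <= r < streak).
    alive = [0.0] * streak
    alive[1] = 1.0  # after the first flip the run length is always 1
    for _ in range(n_flips - 1):
        nxt = [0.0] * streak
        for r in range(1, streak):
            nxt[1] += alive[r] / 2          # next flip differs: run resets
            if r + 1 < streak:
                nxt[r + 1] += alive[r] / 2  # next flip matches: run grows
            # otherwise the run hits `streak` and leaves the "alive" set
        alive = nxt
    return 1.0 - sum(alive)


print(f"{prob_streak(6, 6):.4f}")    # 0.0312, i.e. 2 sequences out of 64
print(f"{prob_streak(200, 6):.3f}")  # well above 90%
```

So a record of 200 flips with no run of six is itself the unusual result, which is what makes invented sequences stand out.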

Why low numbers are more likely

The central question is, why should a number be more likely to start with a relatively small digit than with a relatively higher one; aren’t digits distributed evenly?

Logically, the probability of choosing a certain first digit is the count of numbers beginning with that digit divided by the count of all the numbers in your dataset. Since we can’t know the numbers in the dataset a priori, let’s assume every number is equally likely, and our “dataset” is a subset of the positive integers.

We begin with the simplest case, in which the dataset is just the number 1.  Here the probability of choosing a 1 as the first digit is 1/1, or 100%.  Expanding the dataset to include a 2 drops the probability of choosing a 1 to 50%, and adding a 3 reduces the chances to 33%. By the time a 9 is sequentially added, the probability of choosing a 1 is a lowly 11%.  But watch what happens when a 10 is added: the probability of selecting a 1 jumps back to 20%.  And why stop at 10? After 11, the probability of getting a 1 is up to 27%.  Meanwhile, the probability of selecting every other first digit is falling.

As we keep expanding the dataset – through the 20’s and 30’s – the low numbers become more and more frequent, while the probability of choosing a high number falls. If the scale runs to 78, the chance of picking a 9 is just 1%, and 9 will remain less likely than every other number until the dataset includes 99. But reaching 99 opens the door to a run of one hundred numbers starting with 1, and so the low numbers quickly dominate again.
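The counting experiment is simple to replay. This sketch takes the dataset to be the integers from 1 up to some upper limit, each equally likely, exactly as assumed above; the function name is mine.

```python
from fractions import Fraction


def first_digit_share(digit, upper):
    """Share of the integers 1..upper whose first digit is `digit`."""
    hits = sum(1 for k in range(1, upper + 1) if str(k)[0] == str(digit))
    return Fraction(hits, upper)


for upper in (1, 2, 3, 9, 10, 11, 99, 199):
    share = first_digit_share(1, upper)
    print(upper, share, f"{float(share):.0%}")

print(first_digit_share(9, 78))  # 1/78, a bit over 1%
```

The share of leading 1s swings between 11% (at 9 or 99) and well over half (at 199), which is the instability the next section sets out to remove.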

Exponential growth

The problem with the thought experiment we just performed is that each time the scale passes through another power of 10 (meaning 1’s, then 10’s, then 100’s, etc.), the numbers in that order of magnitude dominate previous findings because there are ten times as many observations. This is a problem because our probabilities are based on counts of observations.

The simplest way to overcome this problem is to alter our methodology somewhat: instead of adding 1 to our dataset each step, let’s add 10%. Start with the number 1, and increase it 10% to 1.1.  Another 10% gets you to 1.21, and so on. After 8 increases, you will finally have a number starting with a 2.  Another 4 increases and you have a first digit 3. By the time we reach another number beginning with a 1, the empirical frequency of seeing a 1 is slightly more than 30% and a 9 is 4% – fairly close to Benford’s logarithmic law.  As we continue counting, the frequencies converge to Benford’s values.
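Here is the 10% experiment as a short sketch. It covers a single pass from 1 up to (but not including) 10, so the first digit of each value is just its integer part; the function name and defaults are mine.

```python
from collections import Counter


def leading_digit_counts(growth=1.10, start=1.0, stop=10.0):
    """Grow `start` by a fixed factor and count first digits until `stop`.

    Only valid for values in [1, 10), where int(value) is the first digit.
    """
    counts = Counter()
    value = start
    while value < stop:
        counts[int(value)] += 1
        value *= growth
    return counts


counts = leading_digit_counts()
total = sum(counts.values())
for digit in range(1, 10):
    print(digit, counts[digit], f"{counts[digit] / total:.0%}")
# 25 values in all: 8 start with 1 (32%) and 1 starts with 9 (4%)
```

A smaller growth step, such as 1%, gives many more values per pass and frequencies that sit much closer to the logarithmic law.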

What we’ve done is ensure that our measurements are scale invariant by taking fewer observations as the numbers get larger.  Previously we increased the dataset arithmetically; now we increase it exponentially (which matches how the number line grows). This ensures that we don’t give the numbers from 1000-1999 extra weight in our calculation just because there are far more of them than there are numbers from 10-19.

Logarithmic scaling

We’ve established that our number scale grows exponentially, which is to say by powers of 10. Taking fewer observations by growing our dataset exponentially gets us to the right answer, but it is not an elegant solution in the sense that it relies on “brute force” – count, measure probability, repeat. We are forced to do this because the number of observations of each digit increases as the number line grows, and our probability calculation depends on the number of observations.

Instead, let’s rescale the number line using a logarithm to eliminate exponential growth entirely (recall that a logarithm is the mathematical opposite of a power of 10). In a logarithmic scale, the weight given to the numbers from 1 to 10 is the same as the weight given to the numbers 10 to 100. Witness the following:

[Figure logs1: the numbers 1 through 9 as blocks on an arithmetic scale and on a logarithmic scale]

Think of the horizontal width of each block as the weight given to an observation of that number.  In a normal arithmetic scale, each block has equal weight – the jump from 1 to 2 is the same as that from 101 to 102. Conversely, the logarithmic scale de-emphasizes larger numbers.  You may think of the logarithmic blocks as getting smaller because the next 90 blocks have to fit in the same horizontal space as the first 9; the weight given to 10-99 is equal to the weight on 1-9.  Here is the scale expanded to include the next power of 10, meaning all the numbers through 99:

[Figure logs2: the logarithmic scale extended through 99]

Note that all the 10’s take up exactly as much space on the scale as the number 1; the 20’s take up as much space as 2, etc. This is exactly what we wanted, because we are only interested in the relative frequencies of the first digit. Based on this chart, the probability of getting a first digit 1 is simply the amount of horizontal space it takes up (i.e. the weight given to it) divided by the total length of the chart. This is an elegant way of thinking of it, without resorting to counting observations.

We can determine the length of any logarithmic block by subtracting its logarithm from the logarithm of the block to its right.  Therefore, the total length of the logarithm chart from 1 through 9 is 1, because log(1) = 0 and log(10) = 1. The logarithm of 2 is .301, so 1’s take up .301/1, or 30.1% of the logarithmic numberline.

If we want to include all numbers through 99, the math is almost identical. The total length of the numberline is 2, because log(100) = 2.  We know already that the number 1 takes up .301 units, and we’ve stated that the 10’s take up exactly the same amount of space, but let’s prove it out anyway.  Log(20) = 1.301 and log(10) = 1, so the 10’s take up .301 as expected.  The total area taken up by numbers starting with 1 is then (.301 + .301) / 2, or 30.1%.
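The block-width arithmetic can be written out for any number of powers of 10. In this sketch, "decades" is my own label for how many powers of 10 the chart covers (1 for the numbers through 9, 2 for the numbers through 99, and so on).

```python
import math


def log_scale_share(digit, decades):
    """Share of a base-10 logarithmic number line, running from 1 to
    10**decades, that is covered by numbers whose first digit is `digit`."""
    covered = 0.0
    for j in range(decades):
        low = digit * 10**j         # e.g. 1, 10, 100, ...
        high = (digit + 1) * 10**j  # e.g. 2, 20, 200, ...
        covered += math.log10(high) - math.log10(low)
    return covered / decades        # total length of the line is `decades`


print(f"{log_scale_share(1, 1):.1%}")  # 30.1%
print(f"{log_scale_share(1, 2):.1%}")  # 30.1%
print(f"{log_scale_share(1, 5):.1%}")  # 30.1%
```

Every decade contributes the same width for a given digit, so the share does not change however far the line is extended.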

It should be clear that in the logarithmic scale we never need to deal with numbers greater than 9, because it merely repeats. So, the math to determine the probability of seeing a first digit simplifies nicely: the probability of observing a first digit n is the width of n’s logarithmic block, which is the difference between the log of the number above n and the log of n:

p(n) = log(n+1)-log(n)

Hey — that’s Benford’s law!

Also, I should note that with a little manipulation you can adjust Benford’s law to work in counting systems that aren’t base-10 like ours. I leave that as an exercise for the reader, which, curiously enough, is what my textbooks always used to say right when I needed them most.

So, now what?

Well, that’s really it – a (relatively) plain English derivation of a relatively obscure piece of math that has confounded mathematicians for a very long time. If you’ve actually read all the way down here, then I hope you’re the sort of person (read: math nerd) for whom the explanation will be sufficient reward in and of itself; if you’ve struggled to the bottom of this post and find yourself disappointed at this time, maybe it’s time to lay off the math.

An interesting point is that you really don’t need to waste all the time wading through the “counting observations” stuff; you can skip straight to the logarithms with this insight: scale invariance is only satisfied by logarithmic scales. I hinted at this earlier.  If you have rivers measured in miles, and convert that to feet, or inches, Benford’s law of first digits must still hold true.  This means that a scale must be used with the property of compressing large numbers, and a logarithmic scale is the one which meets that requirement. I only include the counting part because for someone totally unfamiliar with the idea of number scales, the logarithmic twist would be a bit much coming out of nowhere. The counting method is a fairly friendly way of legging into the idea that first digit probabilities rely heavily on the scale you choose.
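Scale invariance is easy to test on a stand-in dataset. I have no table of river lengths to hand, so the sketch below uses the first 1,000 powers of 2 as a hypothetical set of "lengths in miles" (powers of 2 are a standard example of numbers that follow Benford's law) and converts them to feet and inches.

```python
from collections import Counter


def first_digit_freq(values):
    """Relative frequency of each first digit in a list of positive ints."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Hypothetical data: powers of 2 standing in for lengths in miles.
miles = [2**n for n in range(1000)]
feet = [5280 * m for m in miles]     # 5,280 feet per mile
inches = [63360 * m for m in miles]  # 63,360 inches per mile

for name, data in (("miles", miles), ("feet", feet), ("inches", inches)):
    freq = first_digit_freq(data)
    print(name, f"1: {freq[1]:.1%}", f"9: {freq[9]:.1%}")
```

All three rows should land near 30% for a leading 1 and a bit under 5% for a leading 9, whichever unit is used.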

Benford’s law will come up in very surprising places, but there are a few places it will not come up – the populations of congressional districts, for example.  That’s because such numbers are manipulated to a certain level, and cannot be thought of as random or having been generated by a random process. For appropriate numbers, however, the law can provide a simple validity check that a dataset has not been tampered with. Indeed, even after you follow all the math, it is still somewhat surprising that the number of times the number 1 appears in a list can be enough evidence to declare that list fraudulent.

Just don’t go too crazy with all the Benford-ing.

People might think you’re a little weird.

{ 1 comment }

Obviously the most exciting thing that happened last week was the publication of a popular article all about statistics. Yes, listeners (readers?), the NYTimes reviewed a possible flaw in a well-known psychology experiment based on the mathematics of a well-known game show prize.

Of course I’m referring to the Monty Hall problem, which used to be an interesting riddle (in fact, it used to be an interesting game show) and now has become more of a novelty. Apparently, however, solving it remains the defining criterion to join an elite blackjack squad.

The problem goes like this: Monty Hall shows you three doors, one of which has a car behind it. The other two hide goats. All you have to do is pick the right door. The trick is that after you make a selection, Monty opens a door you did not choose and reveals one of the goats. He then gives you the option to stay with the door you originally chose or switch to the (sole) other door. So the question is, Should You Stay or Should You Go?

The intuitive answer is that it doesn’t matter – there are two doors, one with a car and one with a goat, so it should be 50-50. But since I said that’s intuitive, you know it must be wrong. In fact, it turns out you should always switch to the other door.

When you made your original selection, you had a 1/3 chance of being correct. When Monty eliminates a door, you STILL only have a 1/3 chance of being right. That’s the tricky realization: because you choose before the elimination, the elimination does not affect your probability of being correct. However if you change your choice, then you can take advantage of the fact that there was a 2/3 chance your original pick was wrong. Earlier, you couldn’t make use of that 2/3 probability because it was divided evenly over two other doors. Now, there is only one “other door,” so the 2/3 probability that a door other than yours is correct has been concentrated in one place. Switch, and you double your chances of winning.
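If the argument still feels slippery, the game is small enough to enumerate completely. The sketch assumes the car is equally likely to be behind any door and that, when Monty has two goats to choose from, he opens either with equal probability; the function name is mine.

```python
from fractions import Fraction
from itertools import product


def monty_hall():
    """Exact win probabilities for always staying and always switching."""
    doors = range(3)
    stay = Fraction(0)
    switch = Fraction(0)
    for car, pick in product(doors, repeat=2):
        p_setup = Fraction(1, 9)  # car and first pick are both uniform
        # Monty may only open a door that is neither your pick nor the car.
        goats = [d for d in doors if d != pick and d != car]
        for opened in goats:
            p = p_setup / len(goats)
            other = next(d for d in doors if d not in (pick, opened))
            stay += p * (pick == car)
            switch += p * (other == car)
    return stay, switch


stay, switch = monty_hall()
print(stay, switch)  # 1/3 2/3
```

Switching wins in exactly the cases where the first pick was wrong, which is why the two probabilities add up to 1.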

And so The Clash was quite correct:

If I go there will be trouble / And if I stay it will be double.

(I can only assume they were referring to the probability of picking a goat.)

Honorable mention goes to the Numbers Guy at the WSJ, who ran an actual probability quiz in his blog last week.

{ 0 comments }