From an NYT article on Google’s translation services, this excerpt sums up the most critical transition in machine learning that has happened thus far:
Creating a translation machine has long been seen as one of the toughest challenges in artificial intelligence. For decades, computer scientists tried using a rules-based approach — teaching the computer the linguistic rules of two languages and giving it the necessary dictionaries.
But in the mid-1990s, researchers began favoring a so-called statistical approach. They found that if they fed the computer thousands or millions of passages and their human-generated translations, it could learn to make accurate guesses about how to translate new texts.
A beautiful article in the NYTimes contrasts abstract mathematics with the chilling reality of the Mexican drug cartel wars:
I was born in Mexico City, in a world that seems less and less familiar to me. I live now in the opposite corner of the continent. I am training to be a political scientist at Harvard. My passion has remained the afflictions of my homeland, but at Harvard I have found new ways to address them, to use mathematical models — matrices, vectors, equations, regressions — to understand the Mexican drug crisis.
The cartel wars are extremely violent, and the gangs are responsible for reprehensible kidnappings and deaths. They rank among the most deadly periods of organized crime in human history. The author’s goal isn’t to explain how she can analyze the wars from up in an ivory tower; it’s to describe how her mindset and toolkit inform her understanding of the world in any situation.
The article captured me because it never mentions what the author actually models. Instead, it presents her frightened thoughts and her efforts to calm herself by looking at the world through a mathematical lens. But it’s not what you think; there are no emotionally-distant mathematicians here. The author communicates her fascination with tying reality to abstract models, expecting and preempting the protest that reality is too complex and math too simple:
In this violent world, with the man in the blue Chevy whispering at me behind the window, math is my shield. Speaking up about drugs is in these parts a dangerous game. But not if you speak in the language of sigma and conditional expectations. Math protects me from the immediacy of the violence, and it protects me from them.
The beauty of my method lies in its simplicity. With mathematics I’m able to codify and simplify reality to make it manageable and, more important, malleable. I represent each possible individual as an equation in which each term symbolizes tastes, goals, profession and abilities. All people get portrayed: Policemen, politicians, citizens and drug cartels start living in this mathematical world as planes and hyperplanes and, as in real life, they interact and affect one another, sometimes colluding, sometimes colliding, sometimes neither.
I then use optimization to predict the form of interaction that will be the most probable to emerge and remain over time. Math starts speaking. It tells me, for example, under what conditions the outcome would be a drug war; when would the government prefer to cooperate with cartels; or when cruel intra-cartel purges will become the norm.
There is a part of every modeler’s mind which is constantly teasing out variables from constants. The statisticians among us may take a frequentist view, and wonder what would happen if a scene played itself out a million times; the programmers will deduce the underlying algorithms from the fuzzy result; the pure mathematicians will see manifolds everywhere:
In this abstract microcosmos, reality can be frozen or just slightly changed. I move and look at my hyperplanes from different angles. Let’s change the penalty code. No, let’s increase patrolling. Or reduce wages. Allow less contact between policemen and dealers. Assume the police force is corrupt. Assume it is not. I solve the equations and there it is. My answers come as Greek letters and probabilities.
But we all admit:
I know, I know, this is weird.
Ultimately, “free will” becomes the clarion of the independent. At least, it’s the best response to this explanation:
It may seem strange to examine this shadowy world with equations. But mathematics is transforming the social sciences. In the same way that physicists can predict the movement of atoms in space, we can use mathematics to model how individuals and groups will make decisions and interact in a society.
But free will has a (somewhat tentative) analogue in Heisenberg’s uncertainty principle, and with that philosophy and math (or theology and physics) are combined — but there’s been plenty of pop-sci written on that topic.
I found this brief article remarkable in how it demonstrates the overlay of mathematical thought on an extremely “human” subject without ever needing to explain either one.
(Via Drew Conway)
In his Chart Wars talk, Alex Lundry mentions a quote which he attributes to H. G. Wells:
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
However, that statement was actually made by Samuel Wilks, who was paraphrasing a line in Wells’ book Mankind in the Making:
The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex worldwide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write.
December 20, 2009 in Math
The NYT recently ran an article on the math behind the recent and controversial mammogram advisory change. Unsurprisingly, it is heavily centered on a Bayesian argument. Of course, the key point here is not that the statistics dictated the change, but that budgets and political agendas dictated an acceptable level, which the statistics subsequently informed:
Let’s suppose 100,000 screenings for this cancer are conducted. Of these, how many are positive? On average, 500 of these 100,000 people (0.5 percent of 100,000) will have cancer, and so, since 95 percent of these 500 people will test positive, we will have, on average, 475 positive tests (.95 x 500). Of the 99,500 people without cancer, 1 percent will test positive for a total of 995 false-positive tests (.01 x 99,500 = 995). Thus of the total of 1,470 positive tests (995 + 475 = 1,470), most of them (995) will be false positives, and so the probability of having this cancer given that you tested positive for it is only 475/1,470, or about 32 percent! This is to be contrasted with the probability that you will test positive given that you have the cancer, which by assumption is 95 percent.
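The arithmetic in the quoted passage is an instance of Bayes' rule, and it is easy to check. The following is a minimal sketch: the function and parameter names are my own, and the three inputs (0.5 percent prevalence, 95 percent detection rate, 1 percent false-positive rate) are taken directly from the excerpt.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Return P(cancer | positive test) using Bayes' rule."""
    # Share of everyone screened who has cancer AND tests positive.
    true_positives = prevalence * sensitivity
    # Share of everyone screened who is healthy AND tests positive.
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Inputs from the quoted passage.
ppv = positive_predictive_value(
    prevalence=0.005, sensitivity=0.95, false_positive_rate=0.01
)
print(f"{ppv:.1%}")  # 32.3%
```

Changing the prevalence argument shows how strongly the answer depends on how rare the condition is, which is the point of the excerpt.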
David Spiegelhalter is the Professor of the Public Understanding of Risk at Cambridge University. He has recently produced the following video to encourage better practices in the casual perception of risky behaviors:
I think it’s a brilliant video and would love to have been one of Professor Spiegelhalter’s students. I firmly believe that the study of risk and statistics more generally suffers more than anything from a particularly awful and dare I say boring curriculum, not to mention one which many teachers choose to render in terms beyond the grasp of many students. Efforts like this go a long way toward alleviating that obstacle and I applaud the professor for his work.
Psychologist Daniel Wright has published a list of ten statisticians every psychologist should know.
The list comprises The Founding Fathers:
1. Karl Pearson – who established statistics as an academic discipline
2. Ronald Fisher – who developed much of statistics’ mathematical foundation, including ANOVA and maximum likelihood, and the importance of p-values
3. Jerzy Neyman – who developed the null and alternative hypothesis framework and confidence intervals
A selection of Statistical Heroes:
4. John Tukey – who legitimized the use of graphs in science and developed robust statistical methods
5. Donald Rubin – who developed methods of establishing causality
6. Brad Efron – who developed the bootstrap resampling method
And four statisticians who devised Particularly Useful Techniques:
7. David Cox – who developed methods of transforming data along with George Box
8. Leo Goodman – who advanced categorical data analysis
9. John Nelder – who developed generalized linear models
10. Robert Tibshirani – who developed the lasso
I’m not sure if this is exactly the same list I’d come up with if it were aimed at statisticians rather than psychologists (Bayes is notably absent), but it’s an excellent overview nonetheless and the paper is worth a few minutes’ read.
Increasingly, I’ve noted in my discussions with statisticians and practitioners a reliance on Bayesian methods. Bayesian statistics treats the uncertainty of a hypothesis as a probability, which is updated as new information becomes available. Bayesian analyses also rely heavily on conditional probabilities, that is, probabilities that depend on the occurrence of related events. One of the biggest Bayesian proponents is Professor Andrew Gelman, who maintains an excellent blog and is involved in fivethirtyeight.com.
In some ways, Bayesian methods have become a bit fad-like and, as with many fads (I’m looking at you, VaR), there should be concern that they will be applied blindly, without thought. Like anything else, it’s possible to do Bayesian statistics wrong – and even extremely wrong – but when wielded correctly, they make for an excellent investigative resource.
New Scientist has an article on the use – and misuse – of probability in criminal cases. Naturally, it focuses on Bayesian statistics. The key point the article makes is that while it’s important to consider the odds of something happening, it is just as critical to account for the odds of it happening by chance. That may seem contradictory (isn’t an event’s likelihood, by definition, the probability it happens by chance?) so let’s use a classic example, lifted from the article:
You have just tested positive for a disease that affects 1 in every 10,000 people. The test is 99% accurate. On the surface, that sounds like a sound diagnosis, and most people would say they are 99% confident that they do, in fact, have the disease. But consider the following: if every one of the 10,000 people took the same test, then 1 of them would yield a true positive and about 100 more would exhibit false positives just by chance. Therefore, among people who have tested positive, there is only about a 1% chance of actually having the disease – not the 99% likelihood we naively assumed before!
How does that work – wasn’t there only a 1% chance of the test being wrong? Well, yes – but if you think about it, that 1% chance of error is much larger than the 0.01% chance of having the disease in the first place and the test result must be placed in that context. For the more spatial readers, here is a picture from New Scientist:

The false positive problem is a classic textbook example of how Bayesian reasoning (that is, accounting for the ways in which chance can manifest itself) can affect a seemingly obvious result. It’s a very important consideration which could be overlooked without care. And besides, it makes for interesting pop sci articles.
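For readers who prefer the formula to the head count, the example above can be written out with Bayes' theorem. This is a sketch under one assumption that the example does not spell out: "99% accurate" is read as meaning the test is right 99% of the time for both sick and healthy people. Here D stands for having the disease and + for a positive result.

```latex
\[
\begin{aligned}
P(D \mid +) &= \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} \\
            &= \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \\
            &\approx 0.0098
\end{aligned}
\]
```

That is a shade under 1%, which matches the head count of one true positive among roughly a hundred false ones.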
I’ve covered Benford’s method for first-digit fraud analysis before, and now Nate Silver has applied a similar method to polling results. He looked at the last digit of various polls (i.e. a 48% McCain, 49% Obama, 3% undecided poll would be recorded as an 8 and a 9) and compiled histograms of their frequencies. Following up on his suspicions that all was not right with one polling firm in particular, Nate noticed that their results did not conform to the expected random distribution:

This data is not random at all. For instance, the trailing digit was ‘8’ on 676 occasions, almost 60 percent more often than the 431 times that it was ‘1’. Over a sample of more than 5,000 data points, such an outcome occurring by chance alone would be an incredible fluke — millions to one against. Bad luck can essentially be ruled out as an explanation.
One of two things seems to have happened, then.
One possibility is that there is some intrinsic, mathematical reason that certain trailing digits are more likely to come up than others. This is certainly possible — and in fact, it would be somewhat likely if the polling data that we were looking at were homogeneous — McCain versus Obama polls in Ohio, for instance.
But Strategic Vision’s polls cover a wide array of topics: Presidential horse race numbers in any of a dozen or so states, senate and gubernatorial polling, primary polling, approval ratings of various kinds, polling on issues like the war in Iraq, and more abstract questions such as whether voters think that ‘experience’ or ‘change’ is the more important quality in a Presidential candidate. No one type of question, in no one state, represents more than a relatively small fraction of the sample. Under those circumstances, I can’t think of any reason why the trailing digit wouldn’t approach being random — although there absolutely might be reasons that I haven’t thought of.
But this data is not random. It’s not close to random. It’s not close to close. Which brings up the other possibility: Strategic Vision is cooking the books. And whoever is doing so is doing a pretty sloppy job. They’d seem to have a strong, unconscious preference for numbers ending in ‘7’, for instance, as opposed to those ending in ‘6’. They tend to go with round numbers that end in ‘5’ or ‘0’ slightly too often. And they much prefer numbers with high trailing digits like 49 and 38 to those with low ones like 51 and 42.
I haven’t really seen anyone approach polling data like this before, and I certainly haven’t done so myself. So, we cannot rule out the possibility that there is some mathematical rationale for this that I haven’t thought of. But it looks really, really bad. There is a substantial possibility — far from a certainty — that much of Strategic Vision’s polling over the past several years has been forged.
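One standard way to put a number on "not close to random" is a Pearson chi-square test of the trailing-digit counts against a uniform distribution. The sketch below is not Nate Silver's method (the excerpt does not say what he computed), and the digit counts in it are invented purely for illustration; they are not Strategic Vision's data. The critical value 16.919 is the 5% cutoff for a chi-square distribution with 9 degrees of freedom.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic for observed counts vs. a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)  # same expected count for every digit
    return sum((observed - expected) ** 2 / expected for observed in counts)


# 5% critical value for 9 degrees of freedom (ten digits minus one).
CRITICAL_5PCT_DF9 = 16.919

# HYPOTHETICAL counts for trailing digits 0-9, made up for this example.
hypothetical_counts = [80, 70, 90, 100, 95, 110, 85, 130, 140, 100]

stat = chi_square_uniform(hypothetical_counts)
print(stat, stat > CRITICAL_5PCT_DF9)
```

A statistic above the critical value means counts this uneven would be unusual if every trailing digit were equally likely. As the quoted passage is careful to say, that flags something worth investigating rather than proving forgery.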
Is there a mathematical reason for such a discrepancy in poll results? I can think of only one possibility – there is a weak dependence structure in the data. One last digit exerts some influence over the other. With no one undecided, the dependence is perfect: a 4 on one sample requires a 6 on the other (i.e. 44% and 56%; 34% and 66%). With a fixed level of undecided people, the dependence remains perfect. If the undecided level is stochastic, then the dependence becomes weaker. However, it’s unclear to me why this would skew the results in the manner this firm exhibits; this could require high numbers to be paired with low numbers, or high numbers to be paired with high numbers (or vice versa), but wouldn’t lead to more high numbers in and of itself.
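The claim that perfect dependence pairs digits up without favoring any of them can be checked by brute force. This is a toy sketch under assumptions of my own: a fixed undecided share of 3% and a first candidate whose share is equally likely to be any whole number from 30 to 69. Neither figure comes from the polling data.

```python
from collections import Counter

UNDECIDED = 3  # hypothetical fixed undecided share, in percent


def pooled_last_digits(first_shares, undecided=UNDECIDED):
    """Count trailing digits of both candidates' shares when the two
    shares plus the undecided share always sum to 100."""
    digits = Counter()
    for a in first_shares:
        b = 100 - undecided - a  # second share is fully determined by the first
        digits[a % 10] += 1
        digits[b % 10] += 1
    return digits


# Hypothetical range: first candidate polls anywhere from 30% to 69%.
counts = pooled_last_digits(range(30, 70))
print(sorted(counts.items()))
```

Every digit from 0 to 9 appears the same number of times in the pooled counts, even though each pair of digits is perfectly linked. In this toy setup, at least, the dependence alone does not produce an excess of high trailing digits.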
I’m very curious (and not just as a statistician) to see what comes of this… as much as we’d like statistics to give us firm answers, often the limit of its ability is to reveal probable courses of investigation, or lend strong – but at some level uncertain – backing to an argument.
September 23, 2009 in Math
Carl Bialik has written about lottery coincidences in his WSJ print column and on The Numbers Guy blog, inspired of course by the recent consecutive draws in the Bulgarian lottery. Addressing my recent confusion, he sheds a little light on why likelihood estimates varied so much:
The probability of Bulgaria’s repeated winning numbers became a subject of some disagreement. A Bulgarian mathematician estimated the probability at 1 in 4.2 million, a figure that was widely reported. Clio Cresswell, a mathematician at the University of Sydney in Australia, came up with 1 in 14 million. Many others arrived at 1 in 5.2 million.
One explanation for the wide range is that Bulgaria has multiple lotteries. Dr. Cresswell’s calculations relied on a different Bulgarian lottery with numbers ranging from 1 to 49. Mr. Smith and others made their calculations assuming the possible numbers went up to 42, the correct range for this particular lottery. As for the 1-in-4.2 million estimate, the Bulgarian mathematician didn’t respond to requests for comment.
The blog post in particular is full of really interesting links – I especially enjoyed Professor Leonard Stefanski’s account (pdf) of trying to reconcile accurate statistics with the media’s desire for sensationalism.
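The 1-in-5.2-million and 1-in-14-million figures quoted above fall out of a simple count of combinations. The sketch below assumes six numbers are drawn each time, which the excerpt does not state but which is consistent with both figures; the function name is my own.

```python
from math import comb

NUMBERS_DRAWN = 6  # assumption: six numbers per draw (not stated in the excerpt)


def repeat_odds(pool_size, drawn=NUMBERS_DRAWN):
    """Number of equally likely draws, i.e. the N in '1 in N' odds
    that one draw exactly matches another given draw."""
    return comb(pool_size, drawn)


print(repeat_odds(42))  # 5245786, about 1 in 5.2 million
print(repeat_odds(49))  # 13983816, about 1 in 14 million
```

The gap between the two published estimates is therefore just the gap between a pool of 42 numbers and a pool of 49.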
Inferred ratings and modelling teacher comments
June 24, 2009
Another aspect of my conversation dealt with inferred ratings, a problem I’ve come across before in other areas. There are two primary cases in which this arises: censored data and self-selection bias.
In the first case of censored data, a problem is caused by the ratings system not eliciting useful responses. An example is a system in [...]