Posts tagged as:

statistics

From an NYT article on Google’s translation services, this excerpt sums up the most critical transition in machine learning that has happened thus far:

Creating a translation machine has long been seen as one of the toughest challenges in artificial intelligence. For decades, computer scientists tried using a rules-based approach — teaching the computer the linguistic rules of two languages and giving it the necessary dictionaries.

But in the mid-1990s, researchers began favoring a so-called statistical approach. They found that if they fed the computer thousands or millions of passages and their human-generated translations, it could learn to make accurate guesses about how to translate new texts.

{ 0 comments }

The mathematician’s lens

January 25, 2010 in Data, Math

A beautiful article in the NYTimes contrasts abstract mathematics with the chilling reality of the Mexican drug cartel wars:

I was born in Mexico City, in a world that seems less and less familiar to me. I live now in the opposite corner of the continent. I am training to be a political scientist at Harvard. My passion has remained the afflictions of my homeland, but at Harvard I have found new ways to address them, to use mathematical models — matrices, vectors, equations, regressions — to understand the Mexican drug crisis.

The cartel wars are extremely violent, and the gangs are responsible for reprehensible kidnappings and deaths. They rank among the most deadly periods of organized crime in human history. The author’s goal isn’t to explain how she can analyze the wars from up in an ivory tower; it’s to describe how her mindset and toolkit inform her understanding of the world in any situation.

The article captured me because it never mentions what the author actually models. Instead, it presents her frightened thoughts and her efforts to calm herself by looking at the world through a mathematical lens. But it’s not what you think; there are no emotionally-distant mathematicians here. The author communicates her fascination with tying reality to abstract models, expecting and preempting the protest that reality is too complex and math too simple:

In this violent world, with the man in the blue Chevy whispering at me behind the window, math is my shield. Speaking up about drugs is in these parts a dangerous game. But not if you speak in the language of sigma and conditional expectations. Math protects me from the immediacy of the violence, and it protects me from them.

The beauty of my method lies in its simplicity. With mathematics I’m able to codify and simplify reality to make it manageable and, more important, malleable. I represent each possible individual as an equation in which each term symbolizes tastes, goals, profession and abilities. All people get portrayed: Policemen, politicians, citizens and drug cartels start living in this mathematical world as planes and hyperplanes and, as in real life, they interact and affect one another, sometimes colluding, sometimes colliding, sometimes neither.

I then use optimization to predict the form of interaction that will be the most probable to emerge and remain over time. Math starts speaking. It tells me, for example, under what conditions the outcome would be a drug war; when would the government prefer to cooperate with cartels; or when cruel intra-cartel purges will become the norm.

There is a part of every modeler’s mind which is constantly teasing out variables from constants. The statisticians among us may take a frequentist view, and wonder what would happen if a scene played itself out a million times; the programmers will deduce the underlying algorithms from the fuzzy result; the pure mathematicians will see manifolds everywhere:

In this abstract microcosmos, reality can be frozen or just slightly changed. I move and look at my hyperplanes from different angles. Let’s change the penalty code. No, let’s increase patrolling. Or reduce wages. Allow less contact between policemen and dealers. Assume the police force is corrupt. Assume it is not. I solve the equations and there it is. My answers come as Greek letters and probabilities.

But we all admit:

I know, I know, this is weird.

Ultimately, “free will” becomes the clarion of the independent. At least, it’s the best response to this explanation:

It may seem strange to examine this shadowy world with equations. But mathematics is transforming the social sciences. In the same way that physicists can predict the movement of atoms in space, we can use mathematics to model how individuals and groups will make decisions and interact in a society.

But free will has a (somewhat tentative) analogue in Heisenberg’s uncertainty principle, and with that, philosophy and math (or theology and physics) are combined. Plenty of pop-sci has already been written on that topic, though.

I found this brief article remarkable in how it was able to demonstrate the overlay of mathematical thought on an extremely “human” subject without ever needing to explain either one.

(Via Drew Conway)

{ 0 comments }

Suggestions

January 24, 2010 in Data

(Via Piled Higher and Deeper)

{ 0 comments }

Never more true than today

January 10, 2010 in Quotes

In his Chart Wars talk, Alex Lundry mentions a quote which he attributes to H. G. Wells:

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

However, that statement was actually made by Samuel Wilks, who was paraphrasing a line in Wells’ book Mankind in the Making:

The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex worldwide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write.

{ 0 comments }

More mainstream Bayesians

December 20, 2009 in Math

The NYT recently ran an article on the math behind the controversial change to the mammogram advisory. Unsurprisingly, it is heavily centered on a Bayesian argument. Of course, the key point here is not that the statistics dictated the change, but that budgets and political agendas dictated an acceptable level, which the statistics subsequently informed:

Let’s suppose 100,000 screenings for this cancer are conducted. Of these, how many are positive? On average, 500 of these 100,000 people (0.5 percent of 100,000) will have cancer, and so, since 95 percent of these 500 people will test positive, we will have, on average, 475 positive tests (.95 x 500). Of the 99,500 people without cancer, 1 percent will test positive for a total of 995 false-positive tests (.01 x 99,500 = 995). Thus of the total of 1,470 positive tests (995 + 475 = 1,470), most of them (995) will be false positives, and so the probability of having this cancer given that you tested positive for it is only 475/1,470, or about 32 percent! This is to be contrasted with the probability that you will test positive given that you have the cancer, which by assumption is 95 percent.
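The arithmetic in that excerpt is easy to check. Here is a minimal sketch that reproduces it from the three rates the article quotes (0.5% prevalence, 95% detection, 1% false positives); the function and variable names are my own, chosen for illustration:

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate,
                              screened=100_000):
    """Return (true positives, false positives, P(cancer | positive test))."""
    with_cancer = prevalence * screened
    without_cancer = screened - with_cancer
    true_positives = sensitivity * with_cancer
    false_positives = false_positive_rate * without_cancer
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# Rates taken from the quoted passage above.
tp, fp, ppv = positive_predictive_value(0.005, 0.95, 0.01)
print(f"{tp:.0f} true positives, {fp:.0f} false positives, "
      f"P(cancer | positive) = {ppv:.1%}")
```

Changing the prevalence argument shows how strongly the answer depends on how rare the condition is, which is the heart of the Bayesian argument.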

{ 0 comments }

Professor Risk

December 13, 2009 in Math, Risk

David Spiegelhalter is the Professor of the Public Understanding of Risk at Cambridge University. He has recently produced the following video to encourage better practices in the casual perception of risky behaviors:

[Embedded YouTube video]

I think it’s a brilliant video and would love to have been one of Professor Spiegelhalter’s students. I firmly believe that the study of risk, and of statistics more generally, suffers more than anything from a particularly awful and, dare I say, boring curriculum, not to mention one which many teachers choose to render in terms beyond the grasp of many students. Efforts like this go a long way toward alleviating that obstacle, and I applaud the professor for his work.

{ 0 comments }

Psychologist Daniel Wright has published a list of ten statisticians every psychologist should know.

The list comprises The Founding Fathers:

1. Karl Pearson – who established statistics as an academic discipline
2. Ronald Fisher – who developed much of statistics’ mathematical foundation, including ANOVA and maximum likelihood, and the importance of p-values
3. Jerzy Neyman – who developed the null and alternative hypothesis framework and confidence intervals

A selection of Statistical Heroes:

4. John Tukey – who legitimized the use of graphs in science and developed robust statistical methods
5. Donald Rubin – who developed methods of establishing causality
6. Brad Efron – who developed the bootstrap resampling method

And four statisticians who devised Particularly Useful Techniques:

7. David Cox – who developed methods of transforming data along with George Box
8. Leo Goodman – who advanced categorical data analysis
9. John Nelder – who developed generalized linear models
10. Robert Tibshirani – who developed the lasso

I’m not sure if this is exactly the same list I’d come up with if it were aimed at statisticians rather than psychologists (Bayes is notably absent), but it’s an excellent overview nonetheless and the paper is worth a few minutes’ read.

{ 0 comments }

Living in a Bayesian world

October 30, 2009 in Math

Increasingly, I’ve noted in my discussions with statisticians and practitioners a reliance on Bayesian methods. Bayesian statistics rely on an understanding of the uncertainty of a hypothesis. For example, Bayesian hypotheses are literally updated as new information becomes available. Bayesian analyses will also rely heavily on conditional probabilities, or the understanding of likelihoods that depend on the occurrence of related events. One of the biggest Bayesian proponents is Professor Andrew Gelman, who maintains an excellent blog and is involved in fivethirtyeight.com.
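To make "updated as new information becomes available" concrete, here is a minimal sketch of one textbook case: a Beta prior on a coin's probability of heads, updated with batches of flips. The coin, the prior, and the flip counts are all hypothetical, chosen only for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: start from a uniform Beta(1, 1) prior,
# then observe two batches of (heads, tails).
belief = (1, 1)
for heads, tails in [(7, 3), (12, 8)]:
    belief = update_beta(*belief, heads, tails)
    print(belief, round(beta_mean(*belief), 3))
```

Each batch of data turns the current belief into the prior for the next round, which is all "updating" means here.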

In some ways, Bayesian methods have become a bit fad-like and, as with many fads (I’m looking at you, VaR), there should be concern that they will be applied blindly, without thought. Like anything else, it’s possible to do Bayesian statistics wrong – and even extremely wrong – but when wielded correctly, they make for an excellent investigative resource.

New Scientist has an article on the use – and misuse – of probability in criminal cases. Naturally, it focuses on Bayesian statistics. The key point the article makes is that while it’s important to consider the odds of something happening, it is just as critical to account for the odds of it happening by chance. That may seem contradictory (isn’t an event’s likelihood, by definition, the probability it happens by chance?) so let’s use a classic example, lifted from the article:

You have just tested positive for a disease that affects 1 in every 10,000 people. The test is 99% accurate. On the surface, that sounds like a solid diagnosis, and most people would say they are 99% confident that they do, in fact, have the disease. But consider the following: if every one of the 10,000 people took the same test, then 1 of them would yield a true positive and roughly 100 more would exhibit false positives just by chance (1% of the 9,999 healthy people). Therefore, among people who have tested positive, there is only about a 1% chance of actually having the disease – not the 99% likelihood we naively assumed before!

How does that work – wasn’t there only a 1% chance of the test being wrong? Well, yes – but if you think about it, that 1% chance of error is much larger than the 0.01% chance of having the disease in the first place and the test result must be placed in that context. For the more spatial readers, here is a picture from New Scientist:

The false positive problem is a classic textbook example of how Bayesian reasoning (that is, accounting for the ways in which chance can manifest itself) can affect a seemingly obvious result. It’s a very important consideration which could be overlooked without care. And besides, it makes for interesting pop sci articles.
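For readers who prefer a formula, the same example can be worked through Bayes’ rule. This is a sketch under one assumption the example does not spell out: “99% accurate” is read as a 99% true-positive rate and a 1% false-positive rate. Here D means having the disease and + means testing positive.

```latex
\begin{align*}
P(D \mid +) &= \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} \\
            &= \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \\
            &= \frac{0.000099}{0.010098} \approx 0.0098
\end{align*}
```

That is just under 1%, in line with the counting argument.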

{ 0 comments }

Suspicious poll distributions

September 25, 2009 in Data, Math

I’ve covered Benford’s method for first-digit fraud analysis before, and now Nate Silver has applied a similar method to polling results. He looked at the last digit of each figure in various polls (e.g., a 48% McCain, 49% Obama, 3% undecided poll would be recorded as an 8 and a 9) and compiled histograms of their frequencies. Following up on his suspicions that all was not right with one polling firm in particular, Nate noticed that their results did not conform to the expected random distribution:

This data is not random at all. For instance, the trailing digit was ‘8’ on 676 occasions, almost 60 percent more often than the 431 times that it was ‘1’. Over a sample of more than 5,000 data points, such an outcome occurring by chance alone would be an incredible fluke, millions to one against. Bad luck can essentially be ruled out as an explanation.

One of two things seems to have happened, then.

One possibility is that there is some intrinsic, mathematical reason that certain trailing digits are more likely to come up than others. This is certainly possible — and in fact, it would be somewhat likely if the polling data that we were looking at were homogeneous — McCain versus Obama polls in Ohio, for instance.

But Strategic Vision’s polls cover a wide array of topics: Presidential horse race numbers in any of a dozen or so states, senate and gubernatorial polling, primary polling, approval ratings of various kinds, polling on issues like the war in Iraq, and more abstract questions such as whether voters think that ‘experience’ or ‘change’ is the more important quality in a Presidential candidate. No one type of question, in no one state, represents more than a relatively small fraction of the sample. Under those circumstances, I can’t think of any reason why the trailing digit wouldn’t approach being random — although there absolutely might be reasons that I haven’t thought of.

But this data is not random. It’s not close to random. It’s not close to close. Which brings up the other possibility: Strategic Vision is cooking the books. And whoever is doing so is doing a pretty sloppy job. They’d seem to have a strong, unconscious preference for numbers ending in ‘7’, for instance, as opposed to those ending in ‘6’. They tend to go with round numbers that end in ‘5’ or ‘0’ slightly too often. And they much prefer numbers with high trailing digits like 49 and 38 to those with low ones like 51 and 42.

I haven’t really seen anyone approach polling data like this before, and I certainly haven’t done so myself. So, we cannot rule out the possibility that there is some mathematical rationale for this that I haven’t thought of. But it looks really, really bad. There is a substantial possibility — far from a certainty — that much of Strategic Vision’s polling over the past several years has been forged.
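The kind of check described in that excerpt can be sketched as a chi-square goodness-of-fit test of trailing digits against a uniform distribution. This is not necessarily how Silver computed his odds, and the sample values below are made up for illustration; they are not Strategic Vision’s data. The function name is my own.

```python
from collections import Counter


def trailing_digit_chi_square(percentages):
    """Chi-square statistic comparing the last digits of integer poll
    results with a uniform distribution over 0-9 (9 degrees of freedom)."""
    counts = Counter(abs(int(p)) % 10 for p in percentages)
    expected = len(percentages) / 10
    return sum((counts[d] - expected) ** 2 / expected for d in range(10))


# Made-up samples: one where every digit is equally common,
# and one with extra results ending in 8 and 9 piled on.
balanced = list(range(30, 70)) * 5
skewed = balanced + [38, 48, 58, 49, 39] * 20

CRITICAL_0_001 = 27.877  # chi-square critical value, 9 d.f., 0.1% level
for name, sample in [("balanced", balanced), ("skewed", skewed)]:
    stat = trailing_digit_chi_square(sample)
    verdict = "suspicious" if stat > CRITICAL_0_001 else "consistent with uniform"
    print(name, round(stat, 1), verdict)
```

A large statistic only says the digits are unlikely to be uniform; it says nothing about why, which is exactly the caveat raised in the excerpt.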

Is there a mathematical reason for such a discrepancy in poll results? I can think of only one possibility – there is a weak dependence structure in the data. One last digit exerts some influence over the other. With no one undecided, the dependence is perfect: a 4 on one sample requires a 6 on the other (e.g., 44% and 56%; 34% and 66%). With a fixed level of undecided people, the dependence remains perfect. If the undecided level is stochastic, then the dependence becomes weaker. However, it’s unclear to me why this would skew the results in the manner this firm exhibits; this could require high numbers to be paired with low numbers, or high numbers to be paired with high numbers (or vice versa), but wouldn’t lead to more high numbers in and of itself.
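A toy enumeration illustrates the point that pairing alone should not favor any digit. It is a sketch under strong assumptions of my own: candidate A polls anywhere from 30% to 69% with equal likelihood, candidate B gets whatever is left after the undecided share, and the undecided share is either fixed at 3% or equally likely to be anything from 0% to 9%.

```python
from collections import Counter
from itertools import product


def trailing_digit_tables(a_values, undecided_values):
    """Tally trailing digits for candidates A and B, where
    B's share is 100 - A - undecided."""
    a_digits, b_digits, pairs = Counter(), Counter(), set()
    for a, undecided in product(a_values, undecided_values):
        b = 100 - a - undecided
        a_digits[a % 10] += 1
        b_digits[b % 10] += 1
        pairs.add((a % 10, b % 10))
    return a_digits, b_digits, pairs


# Hypothetical scenarios: fixed versus varying undecided share.
for label, undecided_values in [("fixed", [3]), ("varying", range(10))]:
    a_digits, b_digits, pairs = trailing_digit_tables(range(30, 70), undecided_values)
    print(label, len(pairs), set(a_digits.values()), set(b_digits.values()))
```

In both scenarios every trailing digit turns up equally often for each candidate. Only the number of distinct digit pairs changes: 10 when the undecided share is fixed (perfect dependence) and 100 when it varies.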

I’m very curious (and not just as a statistician) to see what comes of this… as much as we’d like statistics to give us firm answers, often the limit of its ability is to reveal probable courses of investigation, or lend strong – but at some level uncertain – backing to an argument.

{ 0 comments }

Lottery math is not so easy

September 23, 2009 in Math

Carl Bialik has written about lottery coincidences in his WSJ print column and on The Numbers Guy blog, inspired of course by the recent consecutive draws in the Bulgarian lottery. Addressing my recent confusion, he sheds a little light on why likelihood estimates varied so much:

The probability of Bulgaria’s repeated winning numbers became a subject of some disagreement. A Bulgarian mathematician estimated the probability at 1 in 4.2 million, a figure that was widely reported. Clio Cresswell, a mathematician at the University of Sydney in Australia, came up with 1 in 14 million. Many others arrived at 1 in 5.2 million.

One explanation for the wide range is that Bulgaria has multiple lotteries. Dr. Cresswell’s calculations relied on a different Bulgarian lottery with numbers ranging from 1 to 49. Mr. Smith and others made their calculations assuming the possible numbers went up to 42, the correct range for this particular lottery. As for the 1-in-4.2 million estimate, the Bulgarian mathematician didn’t respond to requests for comment.
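Two of those figures can be reproduced with a single binomial coefficient. The sketch below assumes six balls are drawn per game, which the excerpt does not state but which is consistent with the numbers it quotes; the function name is my own.

```python
from math import comb


def jackpot_odds(pool_size, balls_drawn=6):
    """Number of equally likely unordered draws, i.e. the N in '1 in N'.
    balls_drawn=6 is an assumption, not something the excerpt states."""
    return comb(pool_size, balls_drawn)


print(f"Numbers 1-42: 1 in {jackpot_odds(42):,}")  # 1 in 5,245,786
print(f"Numbers 1-49: 1 in {jackpot_odds(49):,}")  # 1 in 13,983,816
```

These are the odds that one particular week’s draw matches a given set of six numbers, such as the previous week’s result. The 1-in-4.2 million figure does not fall out of either calculation.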

The blog post in particular is full of really interesting links – I especially enjoyed Professor Leonard Stefanski’s account (pdf) of trying to reconcile accurate statistics with the media’s desire for sensationalism.

{ 0 comments }

Adventures in probability

September 17, 2009

A funny thing happened in the Bulgarian national lottery this week: the same numbers were drawn as last week.
The BBC and the AP both report the odds at 1 in 4 million; ABC Australia calls it 1 in 14 million. People are demanding that the Bulgarian lottery perform an investigation because no one can believe the [...]

4 comments Read the full post →

Junk Maths

September 10, 2009

Via Andrew Gelman, I’ve learned that the BBC has a radio programme (as they would write) called More or Less which is dedicated to statistics. The first bit of the most recent one is called “Junk Maths” (and again, I wish I could have taken a class called “maths”) with the following synopsis:
Spurious formulae [...]

0 comments Read the full post →

Modelling interactions

August 18, 2009

Andrew Gelman’s latest post highlights the importance of interactions. He includes this breakdown of where people fall depending on political party, ideology, and income:
Consider the income dimension. Among liberals, the income curve is flat no matter whether the person is a Democrat, Independent or Republican. For conservatives, however, income has a large effect – in [...]

0 comments Read the full post →

Deconstructing the Gaussian copula, part III

August 11, 2009

The intuition behind copula models: dependence, correlation, single factors and more.

0 comments Read the full post →

Statistics: desired and feared?

August 10, 2009

My former department chair, Xiao-Li Meng, has published an excellent article on the emergent role of statistics and the challenge of teaching the science to non-statisticians. He addresses the negative perception of the field, often ingrained by a poor high school experience and summed up in a dismissive scoff that “the best speaker in statistics” [...]

0 comments Read the full post →

Dronish number nerds

August 6, 2009

It’s still not too late for Stats 101: The NYTimes published an article this morning titled “For Today’s Graduate, Just One Word: Statistics.” Of course I love to see articles like this, cognizant of the massive amounts of data we are faced with and acknowledging the efforts of the people trying to sort it all out:
In field [...]

0 comments Read the full post →

Photo finish: the Netflix prize

July 28, 2009

A month ago, the million dollar Netflix prize was finally won by a coalition of leading teams called Bellkor’s Pragmatic Chaos, who blended their respective methods into a super-algorithm that finally crossed the 10% improvement barrier.
…or was it?
The 10% mark sent the competition into a final, 30-day countdown, during which time other teams could submit scores. [...]

0 comments Read the full post →

Evaluating returns to social media

July 21, 2009

A collusion by wetpaint and the Altimeter Group has resulted in a fanciful study on social media. Normally, a paper like this wouldn’t be worth addressing, but the amount of attention being paid to its questionable conclusion warrants a closer look. And that conclusion is:
[T]his landmark study has found that the most valuable brands in the [...]

5 comments Read the full post →

On teaching math

June 29, 2009

Arthur Benjamin gives a short (3 minute) TED talk on the problems with how math is taught to high school students in America. He notes that the current curriculum is a sequence beginning with arithmetic and leading to the ultimate goal of calculus. But calculus isn’t something most people use once they graduate – how [...]

0 comments Read the full post →

Inferred ratings and modelling teacher comments

June 24, 2009

Another aspect of my conversation dealt with inferred ratings, a problem I’ve crossed before in other areas. There are two primary cases in which this arises: censored data and self-selection bias.
In the first case of censored data, a problem is caused by the ratings system not eliciting useful responses. An example is a system in [...]

0 comments Read the full post →

Personalized Yelp ratings

June 24, 2009

I had a great conversation last night which at one point verged into the pros and cons of various ratings systems. In particular, we discussed the “star+comment” system used by Yelp, in which between 1 and 5 stars can be assigned in addition to a text comment of arbitrary length.
Yelp does some clever things with [...]

1 comment Read the full post →

When worlds collide

June 17, 2009

I just learned from Andrew Gelman that Mandelbrot wrote a paper on taxonomies… in 1955.

0 comments Read the full post →

Random forecasts (with echoes!)

June 15, 2009

And speaking of forecasts, I’m reminded today of one of my favorite forecasting errors: the echo. This morning, the manufacturing survey missed the forecasted amount, and many pundits commented that it contributed heavily to the market’s fall.
Here is a plot of the manufacturing survey level as reported each month in red (prior to any revisions, [...]

0 comments Read the full post →

Truth in advertising?

June 15, 2009

I find this graph very interesting, not just because of any implied political statements, but for how it highlights the absurdity of economic forecasting and the potentially misguided trust we place in such numbers.
The blue lines were circulated by Obama’s economic team when they were pitching the stimulus bill in order to illustrate its beneficial [...]

0 comments Read the full post →

Illustrating the importance of data visualization

June 12, 2009

Andrew Gelman discusses research on attitudes toward gay marriage, by state, and notes this graph in particular, which shows the change in opinion over the last 15 years:

Critically, he points out that the states which experienced the greatest change in attitude were the ones that already were most receptive. A naive analysis of the data [...]

0 comments Read the full post →

Critiquing the Crimson

June 9, 2009

The Harvard Crimson has published its annual senior survey, which is making headlines in part because very few seniors are going into finance. Selected results were presented in an interesting visualization (the image below links to a full size pdf):

Now that my brother has graduated after successfully steering the Crimson’s business operations to one of [...]

0 comments Read the full post →

Lies, damn lies…

June 2, 2009

A fascinating look at the politics of government statistics from Carl Bialik’s WSJ blog.

0 comments Read the full post →

LARS and the lasso

May 28, 2009

I just came across a paper on LARS, the linear model selection algorithm that’s sweeping the nation. The mathematically and/or masochistically inclined may view it here.*
Ok, so it’s not quite that popular, but it is being heralded as one of the biggest advances in linear modelling in a few decades – and that’s saying a lot [...]

1 comment Read the full post →

Urban mathematics

May 20, 2009

Zipf’s law is another mathematical phenomenon not entirely unrelated to Benford’s law (in fact, some think that Benford is a special case of Zipf). (Aside, it’s funny how after you discuss something, it seems to pop up everywhere – Kahneman and Tversky would have a lot to say on that, I’m sure.) Zipf’s law is [...]

0 comments Read the full post →

Visualizing randomness

May 19, 2009

Daniel Becker’s diploma dissertation was on the visualization of randomness – finding concrete ways to map the highly abstract idea of random behaviors and patterns. The resulting portfolio is fascinating, even for someone without a statistical background, in particular for the way in which it lends a semblance of order to these inherently chaotic processes.
The [...]

2 comments Read the full post →