Photo finish: the Netflix prize

July 28, 2009 in Math

A month ago, the million dollar Netflix prize was finally won by a coalition of leading teams called Bellkor's Pragmatic Chaos, who blended their respective methods into a super-algorithm that finally crossed the 10% improvement barrier.

...or was it?

The 10% mark sent the competition into a final, 30-day countdown, during which time other teams could submit scores. After 30 days, the contest would be over. With just one day remaining before the contest ended, a bunch of other teams turned the tables on the leaders and formed their own blended coalition, The Ensemble. The Ensemble eeked out a 0.01% improvement over BPC's score, landing them squarely in first place.

A scant 20 minutes before the deadline, BPC submitted a new entry which tied The Ensemble for first place - but with just 4 minutes remaining in the years-long competition, The Ensemble put up a new, marginally-higher score, sealing first place.

...or did they? (yes, two plot twists!)

It looks preliminarily like BPC may have won the competition. The contest was structured to avoid a common problem in statistics: overfitting. Overfitting occurs when a model is trained to a dataset in such a way that it is not able to describe similar data outside the original set. Overfit models are useless for forecasting.

The Netflix prize avoided overfitting by providing a two datasets, a training set and a test dataset: competitors used the training set to build and optimize their model, but scores were based on the model's fit to the test set. Thus, an overfit model would fail. But it didn't end there - scores were actually based on only half of the test set. These scores determined when the contest ended, but not the winning team; the winners would be the team that had the best score on the hidden half of the test set once the evaluation period ended.

So, The Ensemble managed to get the highest score on the public half of the test set but it seems that BPC may actually have the marginally higher score on the hidden half, which is the one that really matters. The implication in BPC's blog post (which has a much better summary than I) is that The Ensemble overfit the test set - I don't think that will end up as the most accurate assessment, however. The Ensemble will surely score highly on the hidden half; overfitting would mean they couldn't describe it well at all.

I still find it somewhat amazing that in a contest that lasted for three years, the conclusion boils down to a few minutes and mere decimals. But I guess that's what happens when you tell a bunch of nerds you'll give them \$1mm to do what they love to do anyway...

Previous post:

Next post: