Autoencoders go mainstream

June 27, 2012 in Data,Math,Technology

My inbox has been buzzing with links to an interesting new research paper from a team at Google led by Andrew Ng (of Stanford AI fame) and Jeff Dean. However, I'm receiving far more links to an NYT piece covering the research. It's great that the work is getting mainstream coverage, but somewhat unfortunate because the NYT has managed to invent spurious details about the research while diminishing its real importance.

So I'm stuck wondering: why has the NYT chosen this moment, this research, this team to finally chime in on my little (but rapidly expanding) research area?

This is hardly the first unsupervised image recognition model, nor the first such model built by Google, and certainly not the first one that can recognize cats! (In fact, you can try such a system yourself, right here.)

Is it possible that we have reached a point where anything related to artificial intelligence -- which in this case I define loosely as "computers doing things that the average person didn't know they could do" -- gets prime billing? Is that a good thing? In the case of the ongoing Big Data trend, such spotlighting has produced reams of "experts" with marginally more knowledge than a high school statistician and a widespread misperception within the tech community about 1) what Big Data is and 2) what can be done with it. I absolutely, under no circumstances, want machine learning to go down the same path.

We are close enough as it is. The response to the NYT article on Twitter -- millions of people spouting nonsense about "brains" and "singularities" -- demonstrates that while most people clearly have no idea what the research represents, they have strong views about it and, more worrisome, think they understand it (or believe that they should appear to do so). The simple proof, to me, is that most people are linking to the NYT article and not the research itself. Normally, I'd have no problem with that, but the article so grossly distorts the research that it is impossible for anyone who is familiar with the paper to recognize the subject of the article at all.

To get a sense of what I'm talking about, simply turn to page 2 (a suspiciously broken page 2, at that) of the NYT article, where Dr. Ng states, "A loose and frankly awful analogy is that our numerical parameters correspond to synapses [in the human brain]." The article's author apparently chose to disregard this wisdom, as the first paragraph of the article describes the system as nothing less than "a model of the human brain." In fact, the most accurate description is probably that the model represents an abstraction of a certain type of process which likely exists in certain parts of the neocortex.

Moreover, the NYT describes the model as being "turned loose on the internet" where it "looked for cats." I never thought I'd accuse any newspaper of sensationalism with regard to math, of all things, but here it is! The truth is that the model was fed a steady diet of curated YouTube stills, and brought immense processing power to bear in order to recognize common features in those images -- cats being one of many thousands of categories it recognized. And to be clear, this isn't a classification network per se. Each "neuron" becomes sensitive to a certain combination of patterns and shapes. The researchers went looking for a neuron that was particularly sensitive to humans, and another partial to cats -- the firing of that neuron corresponds to classification of an image. A classifier trained on top of this network would likely outperform even these results.
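To make the "went looking for a neuron" idea concrete, here is a minimal sketch of how one might pick out the most selective unit after the fact. It is my own illustration, not the researchers' code: the function names, the tiny activation table and the labels are all invented, and I'm assuming the simplest possible rule (the neuron "fires" when its response exceeds a threshold).

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of the rule 'value > t means positive' over candidate thresholds."""
    best_acc, best_t = 0.0, None
    candidates = sorted(set(values))
    # Try one threshold below every value, then each observed value.
    for t in [candidates[0] - 1.0] + candidates:
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(values)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_acc, best_t


def most_selective_neuron(activations, labels):
    """activations[i][j] is the response of neuron j to image i (hypothetical layout)."""
    results = []
    for j in range(len(activations[0])):
        column = [row[j] for row in activations]
        acc, t = best_threshold_accuracy(column, labels)
        results.append((acc, j, t))
    acc, j, t = max(results, key=lambda r: r[0])
    return j, t, acc


# Invented responses of 3 neurons to 4 images; label 1 = "contains a cat".
activations = [
    [0.1, 0.9, 0.4],
    [0.2, 0.8, 0.6],
    [0.3, 0.2, 0.5],
    [0.1, 0.1, 0.7],
]
labels = [1, 1, 0, 0]

neuron, threshold, accuracy = most_selective_neuron(activations, labels)
print(neuron, threshold, accuracy)
```

Nothing here is trained to classify: the labels are only used afterwards, to ask which already-learned unit happens to line up with a category.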

I'd be as guilty as anyone if I didn't address the most exciting aspects of this new research. First let's separate what's new and what isn't, though. The model itself isn't entirely new. It's an autoencoder, an unsupervised model that builds a representation of its inputs. It has been the subject of much research in the last half-decade, particularly by Marc'Aurelio Ranzato -- first at NYU with LeCun, then at Toronto under Hinton -- whose name I was very happy to see attached to this research. I was worried that after his hiring by Google we'd never hear from him again, which would have been a loss, since his research and writing are consistently clear and impressive. More than that, it's a deep autoencoder, meaning an autoencoder builds a representation of the data, which is then fed into a second autoencoder, whose output is passed to yet a third autoencoder... all the way through eight layers in total. This provides a mechanism to aggregate detail from the hyperlocal to global scale, in practice passing on only the most salient features to the next level. This is also not new; deep networks have been closely watched since Hinton's critical work in 2006, and layered networks were introduced by LeCun back in 1989 (with restrictions appropriate for technology of the time).
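For readers who have never seen one, a toy version of the stacking idea looks roughly like this. It is a sketch under heavy assumptions and nothing like the scale or architecture of the Google model: the class name TinyAutoencoder, the two-prototype dataset, the layer sizes and the learning rate are all mine, and I train each layer on its own with plain SGD purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyAutoencoder:
    """One hidden layer: sigmoid encoder, linear decoder, squared reconstruction error."""

    def __init__(self, n_in, n_hidden, rng):
        self.W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b_enc = [0.0] * n_hidden
        self.W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b_dec = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W_enc, self.b_enc)]

    def decode(self, h):
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.W_dec, self.b_dec)]

    def loss(self, data):
        """Mean squared reconstruction error (summed over dimensions) per example."""
        total = 0.0
        for x in data:
            r = self.decode(self.encode(x))
            total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
        return total / len(data)

    def sgd_step(self, x, lr):
        h = self.encode(x)
        r = self.decode(h)
        err = [ri - xi for ri, xi in zip(r, x)]  # gradient of 0.5 * ||r - x||^2 w.r.t. r
        # Backpropagate to the hidden layer before touching the decoder weights.
        grad_h = [sum(err[i] * self.W_dec[i][j] for i in range(len(err)))
                  for j in range(len(h))]
        grad_pre = [g * hj * (1.0 - hj) for g, hj in zip(grad_h, h)]
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.W_dec[i][j] -= lr * e * hj
            self.b_dec[i] -= lr * e
        for j, g in enumerate(grad_pre):
            for k, xk in enumerate(x):
                self.W_enc[j][k] -= lr * g * xk
            self.b_enc[j] -= lr * g


def train(ae, data, epochs, lr, rng):
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            ae.sgd_step(x, lr)


rng = random.Random(0)
# Unlabeled toy data: noisy copies of two made-up 6-dimensional prototypes.
prototypes = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
data = [[v + rng.gauss(0, 0.05) for v in prototypes[i % 2]] for i in range(20)]

# First autoencoder learns a 3-dimensional code for the raw inputs.
layer1 = TinyAutoencoder(n_in=6, n_hidden=3, rng=rng)
loss1_before = layer1.loss(data)
train(layer1, data, epochs=200, lr=0.1, rng=rng)
loss1_after = layer1.loss(data)

# Second autoencoder is fed the first one's codes, not the raw inputs.
codes1 = [layer1.encode(x) for x in data]
layer2 = TinyAutoencoder(n_in=3, n_hidden=2, rng=rng)
train(layer2, codes1, epochs=200, lr=0.1, rng=rng)
deep_codes = [layer2.encode(h) for h in codes1]

print(loss1_before, loss1_after)
```

Note that no labels appear anywhere: each layer is only ever asked to reproduce its own input, and the second layer sees nothing but the first layer's summary.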

What is new -- and exciting -- is that unlike LeCun's convolutional networks, the weights of this autoencoder did not have to be tied. This allowed spatial invariance to develop to a greater degree than previously possible: the model was able to recognize pictures which were rotated, skewed or inverted from other examples it had seen. This was only possible because of the incredible amount of resources spent on the project -- untying the weights increases the number of parameters by a few orders of magnitude, resulting in a billion free parameters in this case. That result -- and the method by which it was implemented -- is the single greatest advance the research represents.
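The arithmetic behind "a few orders of magnitude" is easy to check for yourself. The sketch below counts the weights in a single layer of local receptive fields, once with the filter shared across every position (tied, as in a convolutional network) and once with a separate filter at every position (untied). The layer sizes are illustrative numbers I am assuming, since this post does not give the real ones, and the function name is made up.

```python
def count_filter_weights(image_side, field_side, channels, n_maps, stride, tied):
    """Count weights in one layer of local receptive fields over a square image."""
    positions_per_side = (image_side - field_side) // stride + 1
    n_positions = positions_per_side ** 2
    weights_per_filter = field_side * field_side * channels
    if tied:
        # Convolutional: one filter per map, reused at every position.
        return n_maps * weights_per_filter
    # Untied (locally connected): every position gets its own filter per map.
    return n_maps * weights_per_filter * n_positions


# Assumed sizes: 200x200 RGB image, 18x18 receptive fields, 8 maps, stride 1.
tied = count_filter_weights(200, 18, 3, 8, 1, tied=True)
untied = count_filter_weights(200, 18, 3, 8, 1, tied=False)
print(tied, untied, untied // tied)
```

With these assumed sizes the tied layer has 7,776 weights and the untied one about 260 million, a factor of 33,489, which is simply the number of positions the filter can occupy.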

Students of this field will recognize that there is nothing revolutionary about a network that trained itself, nor one that learned to recognize cats, nor one trained as an autoencoder with SGD. And that brings me back to my original question: why has the NYT decided to write about this model at this time? Maybe it's the Google factor -- decades of research from the finest academic institutions are all very well, but they don't have the allure of this company's efforts. If so, I'm sure that would come as a sad realization to many of the Grey Lady's target audience. Maybe we've already crossed the "Big Data" event horizon, where writing about trendy nonsense attracts trendy readers, who in turn reinforce the network effects the news is so dependent on.

Or perhaps I'm being too harsh. Shouldn't I be ecstatic to see mainstream coverage of something I spend so much time working on? I don't know. I really feel that misinformation is one of the most dangerous things in the world, and a misinformed public will manage to make worse decisions than an uninformed one. My excitement about seeing this article and its Twitter response rapidly faded when I saw that the people tweeting it either hadn't read it, hadn't understood it, or hadn't bothered to try to do either. I don't think this is a case where someone has to go through years of literature to get comfortable with the results, either (though I certainly recommend it!). As I mentioned, the page 1 characterizations of the model were at odds with the researchers' own page 2 descriptions -- that should be enough to make anyone suspicious.

There's a fantastic SNL skit in which Steve Martin is told, repeatedly, "Don't buy stuff you can't afford." I'd like to institute a similar rule for Twitter: Don't tweet things you don't understand. Sadly, I'm informed that would eliminate most of its traffic.

