Speech recognition (is more prevalent than you think)

June 25, 2010 in Data, Technology

The NYT has published the second article in their "Smarter Than You Think" series on artificial intelligence (TGR covered the first here and again here). This time, the focus is on speech recognition and natural language processing.

A couple of passages really stood out to me in this more abbreviated overview of the technology:

Computers with artificial intelligence can be thought of as the machine equivalent of idiot savants. They can be extremely good at skills that challenge the smartest humans, playing chess like a grandmaster or answering “Jeopardy!” questions like a champion. Yet those skills are in narrow domains of knowledge. What is far harder for a computer is common-sense skills like understanding the context of language and social situations when talking — taking turns in conversation, for example.

Today's artificial intelligences are extremely narrow in scope. That's not a bad thing; it's part of the development process. To draw a hardware analogy, we don't yet have a "complete" robot, but we do have lots of robots that are very good at small tasks: walking, running, grasping, lifting, expressions, recognition, speech, etc. The challenge in both spheres will be to construct a gestalt device capable of doing all things well. Until then, I'm afraid C-3PO will remain fiction.

A machine capable of complete interaction with our world will draw from a host of intelligence systems -- and will have to incorporate some form of meta-intelligence in order to make sense of them all. Sony's PlayStation 3 has a "Reality Synthesizer" chip, and though the current tech doesn't quite live up to its name (marketing is what marketing is, after all), future generations of smart machines will indeed need processors that can produce complete characterizations of the real world.

There's also a note in line with my observation yesterday that AI is very literally in its infancy:

The AT&T researchers worked with thousands of hours of recorded calls to the Panasonic center, in Chesapeake, Va., to build statistical models of words and phrases that callers used to describe products and problems, and to create a database that is constantly updated. “It’s a baby, and the more data you give it, the smarter it becomes,” said Mazin Gilbert, a speech technology expert at AT&T Labs.
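
To put my own gloss on that (this is just my sketch, not AT&T's actual system): at its simplest, "smarter with more data" can mean little more than a growing table of counts.

```python
# A toy sketch (mine, not AT&T's) of the "more data makes it smarter" idea:
# nothing fancy, just word counts that grow as new transcribed calls arrive.
# The example calls and product words below are invented for illustration.
from collections import Counter

phrase_counts = Counter()  # the "constantly updated" database, in miniature

def update_model(transcribed_call: str) -> None:
    """Fold a new call transcript into the running word statistics."""
    phrase_counts.update(transcribed_call.lower().split())

def likely_topic(utterance: str) -> str:
    """Guess the caller's topic by picking the most frequently seen word."""
    candidates = [w for w in utterance.lower().split() if w in phrase_counts]
    return max(candidates, key=lambda w: phrase_counts[w], default="unknown")

update_model("my plasma screen is flickering")
update_model("the plasma tv will not turn on")
print(likely_topic("something is wrong with my plasma"))  # -> plasma
```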

Finally, there's mention of people adjusting their speech to address the computers:

Some callers, especially younger ones, also make things easier for the computer by uttering a key phrase like “plasma help,” Mr. Szczepaniak said. “I call it the Google-ization of the customer,” he said.

This is really interesting. While it is no doubt important for speech recognition software to handle everyday speech, I believe that in the future we will interact with computers "differently" than we do with people. This will be for convenience more than anything else: partly speed, partly efficient phrasing. I don't type natural-language queries into Google; I type a series of keywords that best represent my query. I've learned through experience what sort of keywords get the best search results. In a sense, I do Google's parsing for it -- I choose the most statistically interesting words and present those (no need for "the" or "is" or other words unlikely to enhance my results). Can I imagine a fully natural-language Google? Of course. But I'd still (if possible) just give it the fragmented keywords. Why waste the time and risk confusion?
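
For the sake of concreteness, here's roughly the transformation I perform in my head, sketched in Python (the stopword list and example query are made up, not anyone's real list):

```python
# A rough sketch of the "parsing" I do in my own head before I hit Google:
# drop the words that won't improve the results and keep the keywords.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "for", "to", "what", "how", "do", "i", "my"}

def keywordize(query: str) -> str:
    """Reduce a natural-language query to a string of likely-useful keywords."""
    words = (w.strip("?.,!") for w in query.lower().split())
    return " ".join(w for w in words if w and w not in STOPWORDS)

print(keywordize("What is the best price for a plasma TV in Seattle?"))
# -> "best price plasma tv seattle"
```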

I know that I look like an idiot - I write these posts about how amazing artificial intelligence is and how it's going to change everything, and then I insist that we will still treat it as if it were "stupid," using keywords instead of complete sentences. It's a matter of efficiency. Until the gestalt computer is born (and I think that's a long way away), we will have to continue to subsidize each AI's weaknesses with our own intelligence. I think Google does a great job of retrieving search results; I'm not that impressed with its natural-language parsing. Therefore, I do the parsing myself. This is why I think it's silly that the NYT article mentions programming virtual assistants to ask about the Mariners game -- that conversation is doomed to be unsatisfying. The assistant is primed for speech recognition, not speech generation -- it can only respond with a handful of predetermined phrases. Unless I'm asking for Ichiro's batting average with runners in scoring position in the second half of the game, I'm not going to get much utility out of a speech recognition device. A machine capable of holding a conversation is yet a step further away. And a machine capable of faithfully executing spoken instructions (without a set of preprogrammed directives - thank you very much, iPhone voice control) has yet to be conceived.
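
To make the "predetermined phrases" point concrete, here's a deliberately crude Python sketch; the key phrases and canned replies are invented, not anything AT&T or Apple actually ships:

```python
# A toy illustration of why these "conversations" dead-end: recognition maps
# an utterance onto one of a handful of predetermined responses, and anything
# outside that table fails.
RESPONSES = {
    "plasma help": "Connecting you to plasma television support.",
    "mariners score": "I can look up the score once the game begins.",
    "store hours": "Our stores are open from 9 a.m. to 9 p.m.",
}

def respond(utterance: str) -> str:
    """Return a predetermined phrase if a known key phrase is heard."""
    heard = utterance.lower()
    for key_phrase, canned_reply in RESPONSES.items():
        if key_phrase in heard:
            return canned_reply
    return "Sorry, I didn't understand that."  # where the conversation ends

print(respond("Uh, I think I need some plasma help"))        # canned reply
print(respond("How did Ichiro look at the plate tonight?"))  # falls through
```

Anything off the table falls straight to the apology line, which is exactly the unsatisfying conversation I mean.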

But lest I sound like an AI bear - I couldn't be happier that the NYT is running this series and I'm looking forward to part three.

P.S. The comments on the article read like a collection of the most paranoid, tin-hat, anti-machine delusions I've ever had the displeasure of reading. The educational push it's going to take to get society to embrace artificial intelligence is significant... and we thought CDOs were a tough pill to swallow!
