Inside Google's Age of Augmented Humanity: Part 1, New Frontiers of Speech Recognition
Editor’s Note: This is Part 1 of a three-part story that we originally published on January 3, 5, and 6, 2011. We’re highlighting it today because the series was just named by Longform.org as one of its top technology stories of 2011.
Already, it’s hard for anyone with a computer to get through a day without encountering Google, whether that means doing a traditional Web search, visiting your Gmail inbox, calling up a Google map, or just noticing an ad served up by Google AdSense. And as time goes on, it’s going to get a lot harder.
That’s in part because the Mountain View, CA-based search and advertising giant has spent years building and acquiring technologies that extend its understanding beyond Web pages to other genres of information. I’m not just talking about the obvious, high-profile Google product areas such as browsers and operating systems (Chrome, Android), video (YouTube and the nascent Google TV), books (Google Book Search, Google eBooks), maps (Google Maps and Google Earth), images (Google Images, Picasa, Picnik), and cloud utilities (Google Docs). One layer below all of that, Google has also been pouring resources into fundamental technologies that make meaning more machine-tractable—including software that recognizes human speech, translates written text from one language to another, and identifies objects in images. Taken together, these new capabilities promise to make all of Google’s other products more powerful.
The other reason Google will become harder to avoid is that many of the company’s newest capabilities are now being introduced and perfected first on mobile devices rather than the desktop Web. Already, our mobile gadgets are usually closest at hand when we need to find something out. And their ubiquity will only increase: it’s believed that 2011 will be the year when sales of smartphones and tablet devices finally surpass sales of PCs, with many of those new devices running Android.
That means you’ll be able to tap Google’s services in many more situations, from the streets of a foreign city, where Google might keep you oriented and feed you a stream of factoids about the surrounding landmarks, to the restaurant you pick for lunch, where your phone might translate your menu (or even your waiter’s remarks) into English.
Google CEO Eric Schmidt says the company has adopted a “mobile first” strategy. And indeed, many Googlers seem to think of mobile devices and the cameras, microphones, touchscreens, and sensors they carry as extensions of our own awareness. “We like to say a phone has eyes, ears, skin, and a sense of location,” says Katie Watson, head of Google’s communications team for mobile technologies. “It’s always with you in your pocket or purse. It’s next to you when you’re sleeping. We really want to leverage that.”
This is no small vision, no tactical marketing ploy—it’s becoming a key part of Google’s picture of the future. In a speech last September at the IFA consumer electronics fair in Berlin, Schmidt talked about “the age of augmented humanity,” a time when computers remember things for us, when they save us from getting lost, lonely, or bored, and when “you really do have all the world’s information at your fingertips in any language”—finally fulfilling Bill Gates’ famous 1990 forecast. This future, Schmidt says, will soon be accessible to everyone who can afford a smartphone—one billion people now, and as many as four billion by 2020, in his view.
It’s not that phones themselves are all that powerful, at least compared to laptop or desktop machines. But more and more of them are backed up by broadband networks that, in turn, connect to massively distributed computing clouds (some of which, of course, are operated by Google). “It’s like having a supercomputer in your pocket,” Schmidt said in Berlin. “When we do voice translation, when we do picture identification, all [the smartphone] does is send a request to the supercomputers that then do all the work.”
And the key thing about those supercomputers—though Schmidt alluded to it only briefly—is that they’re stuffed with data, petabytes of data about what humans say and write and where they go and what they like. This data is drawn from the real world, generated by the same people who use all of Google’s services. And the company’s agility when it comes to collecting, storing, and analyzing it is perhaps its greatest but least appreciated capability.
The power of this data was the one consistent theme in a series of interviews I conducted in late 2010 with Google research directors in the fundamental areas of speech recognition, machine translation, and computer vision. It turns out that many of the problems that have stymied researchers in cognitive science and artificial intelligence for decades—understanding the rules behind grammar, for instance, or building models of perception in the visual cortex—give way before great volumes of data, which can simply be mined for statistical connections.
Unlike the large, structured language corpuses used by the speech-recognition or machine-translation experts of yesteryear, this data doesn’t have to be transcribed or annotated to yield insights. The structure and the patterns arise from the way the data was generated, and the contexts in which Google collects it. It turns out, for example, that meaningful relationships can be extracted from search logs—the more people who search for “IBM stock price” or “Apple Computer stock price,” the clearer it becomes that there is a class of things, i.e. companies, with an attribute called “stock price.” Google’s algorithms glean this from Google’s own users in a process computer scientists call “unsupervised learning.”
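To make the search-log idea concrete, here is a toy sketch in Python. It is an illustration of the counting intuition only, not Google's method: the query list is invented, and the list of candidate attribute phrases is supplied by hand, which a genuinely unsupervised system would have to discover for itself.

```python
from collections import defaultdict

# Hypothetical, made-up query log; real search logs are far larger and noisier.
QUERIES = [
    "ibm stock price",
    "apple computer stock price",
    "ibm headquarters",
    "apple computer headquarters",
    "paris weather",
    "boston weather",
    "ibm stock price",
]

# Assumption: candidate attribute phrases are given up front for simplicity.
ATTRIBUTES = ["stock price", "headquarters", "weather"]


def group_by_attribute(queries, attributes):
    """Map each attribute phrase to the set of entities it was asked about."""
    entities = defaultdict(set)
    for query in queries:
        for attr in attributes:
            suffix = " " + attr
            if query.endswith(suffix):
                # Everything before the attribute phrase is treated as the entity.
                entities[attr].add(query[: -len(suffix)])
    return dict(entities)


def shared_attributes(entities_by_attr, a, b):
    """Attributes both entities were queried with: weak evidence of a shared class."""
    return sorted(
        attr for attr, ents in entities_by_attr.items() if a in ents and b in ents
    )


groups = group_by_attribute(QUERIES, ATTRIBUTES)
print(groups["stock price"])
print(shared_attributes(groups, "ibm", "apple computer"))
```

In this toy log, "ibm" and "apple computer" turn up with the same attribute phrases while "paris" shares none of them, which is the kind of signal that hints at a class of things called companies.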
“This is a form of artificial intelligence,” Schmidt observed in Berlin. “It’s intelligence where the computer does what it does well and it helps us think better…The computer and the human, together, each does something better because the other is helping.”
In a series of three articles this week, I’ll look more closely at this human-computer symbiosis and how Google is exploiting it, starting with the area of speech recognition. (Subsequent articles will examine machine translation and computer vision.) Research in these areas is advancing so fast that the outlines of Schmidt’s vision of augmented humanity are already becoming clear, especially for owners of Android phones, where Google deploys its new mobile technologies first and most deeply.
Obviously, Google has competition in the market for mobile information services. Over time, its biggest competitor in this area is likely to be Apple, which controls one of the world’s most popular smartphone platforms and recently acquired, in the form of a startup called Siri, a search and personal-assistant technology built on many of the same machine-learning principles espoused by Google’s researchers.
But Google has substantial assets in its favor: a large and talented research staff, one of the world’s largest distributed computing infrastructures, and most importantly, a vast trove of data for unsupervised learning. It seems likely, therefore, that much of the innovation making our phones more powerful over the coming years will emerge from Mountain View.
The Linguists and the Engineers
Today Michael Cohen leads Google’s speech technology efforts. But he actually started out as a composer and guitarist, making a living for seven years writing music for piano, violin, orchestra, and jazz bands. As a musician, he says, he was always interested in the mechanics of auditory perception: why certain kinds of sound make musical sense to the human brain, while others are just noise.
A side interest in computer music eventually led him into computer science proper. “That very naturally led me, first of all, to wanting to work on something relating to perception, and second, related to sounds,” Cohen says today. “And the natural thing was speech recognition.”
Cohen started studying speech at Menlo Park’s SRI International in 1984, as the principal investigator in a series of DARPA-funded studies in acoustic modeling. By that time, a fundamental change in the science of speech was already underway, he says. For decades, early speech researchers had hoped that it would be possible to teach computers to understand speech by giving them linguistic knowledge—general rules about word usage and pronunciation. But starting in the 1970s, an engineering-oriented camp had emerged that rejected this approach as impractical. “These engineers came along, saying, ‘We will never know everything about those details, so let’s just write algorithms that can learn from data,'” Cohen recounts. “There was friction between the linguists and the engineers, and the engineers were winning by quite a bit.”
But around the mid-1980s, Cohen says, “the linguists and the engineers started talking to each other.” The linguists realized that their rules-based approach was too complex and inflexible, while the engineers realized their statistical models needed more structure. One result was the creation of context-dependent statistical models of speech that, for the first time, could take “co-articulation” into account—the fact that the pronunciation of each phoneme, or sound unit, in a word is influenced by the preceding and following phonemes. There would no longer be just one statistical profile for the sound waves constituting a long “a” sound, for example; there would be different models for “a” for all of the contexts in which it occurs.
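One common way to express this kind of context dependence is the "triphone," a label that names a phoneme together with its left and right neighbors. The Python sketch below is only an illustration: the "left-center+right" notation is a widely used convention, the "sil" padding for word edges and the ARPAbet-style transcriptions are assumptions made here, and nothing in it is meant to describe the specific models Cohen's group built.

```python
def triphones(phonemes):
    """Label each phoneme with its left and right neighbors ('sil' at the edges)."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Assumed ARPAbet-style transcriptions; both words contain the long "a" (EY).
cake = triphones(["K", "EY", "K"])
late = triphones(["L", "EY", "T"])
print(cake)  # ['sil-K+EY', 'K-EY+K', 'EY-K+sil']
print(late)  # ['sil-L+EY', 'L-EY+T', 'EY-T+sil']
```

The long "a" in "cake" becomes K-EY+K and the one in "late" becomes L-EY+T, so a recognizer built on such units would keep separate statistics for each rather than a single profile for the sound.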
“The engineers, to this day, still follow the fundamental statistical, machine-learning, data-driven approaches,” Cohen says. “But by learning a bit about linguistic structure—that words are built in phonemes and that particular realizations of these phonemes are context-dependent—they were able to build richer models that could learn much more of the fine details about speech than they had before.”
Cohen took much of that learning with him when he co-founded Nuance, a Menlo Park, CA-based spinoff of SRI International, in 1994. (Much later, SRI would also spin off Siri, the personal assistant startup bought last year by Apple.) He spent a decade building up the company’s strength in telephone-based voice-response systems for corporate call centers—the kind of technology that lets customers get flight status updates from airlines by speaking the flight numbers, for example.
The Burlington, MA-based company now called Nuance Communications was formerly a Nuance competitor called ScanSoft, and it adopted the Nuance name after it acquired the Menlo Park startup in 2005. But by that time Cohen had left Nuance for Google. He says several factors lured him in. One was the fact that statistical speech-recognition models were inherently limited by computing speed and memory, and by the amount of training data available. “Google had way more compute power than anybody had, and over time, the ability to have way more data than anybody had,” Cohen says. “The biggest bottleneck in the research being, ‘How can we build a much bigger model?,’ it was definitely an opportunity.”
But there were other aspects to this opportunity. After 10 years working on speech recognition for landline telephone systems at Nuance, Cohen wanted to try something different, and “mobile was looking more and more important as a platform, as a place where speech technology would be very important,” he says. That’s mainly because of the user-interface problem: phones are small and it’s inconvenient to type on them.
“At the time, Google had barely any effort in mobile, maybe four people doing part-time stuff,” Cohen says. “In my interviews, I said, ‘I realize you can’t tell me what your next plans are, but if you are not going to be serious about mobile, don’t make me an offer, because I won’t be interested in staying.’ I felt at the time that mobile was going to be a really important area for Google.”
As it turned out, of course, Cohen wasn’t the only one who felt that way. Schmidt and Google co-founders Larry Page and Sergey Brin also believed mobile phones would become key platforms for browsing and other search-related activities, which helped lead to the company’s purchase of mobile operating system startup Android in 2005.
Cohen built a whole R&D group around speech technology. Its first product was GOOG-411, a voice-driven directory assistance service that debuted in 2007. Callers to 1-800-GOOG-411 could request business listings for all of the United States and Canada simply by speaking to Google’s computers. The main reason for building the service, Cohen says, was to make Google’s local search service available over the phone. But the company also logged all calls to GOOG-411, which made it “a source of valuable training data,” Cohen says: “Even though GOOG-411 was a subset of voice search, between the city names and the company names we covered a great deal of phonetic diversity.”
And there was a built-in validation mechanism: if Google’s algorithms correctly interpreted the caller’s prompt, the caller would go ahead and place an actual call. It’s in many such unobtrusive ways (as Schmidt pointed out in his Berlin speech) that Google recruits users themselves to help its algorithms learn.
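In code, that kind of implicit validation amounts to treating the caller's behavior as a label. The Python sketch below is hypothetical throughout: the record fields, the sample log, and the rule "a connected call means the transcript was right" are assumptions for illustration, not a description of how Google's logs were actually structured or filtered.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    audio_id: str         # hypothetical handle to the recorded utterance
    recognized_text: str  # what the recognizer believed the caller said
    call_connected: bool  # did the caller accept the result and place the call?


def implicit_training_pairs(records):
    """Keep (audio, transcript) pairs that the caller implicitly confirmed."""
    return [(r.audio_id, r.recognized_text) for r in records if r.call_connected]


# Invented sample log.
LOG = [
    CallRecord("utt-001", "pizza palo alto", True),
    CallRecord("utt-002", "piece of pallet", False),  # caller hung up and retried
    CallRecord("utt-003", "hardware store boston", True),
]

pairs = implicit_training_pairs(LOG)
print(pairs)
```

The appeal of the approach is that nobody has to transcribe the audio by hand; the cost is that the signal is noisy, since a caller may hang up for reasons that have nothing to do with recognition errors.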
Google shut down GOOG-411 in November 2010, but only because it had largely been supplanted by newer products from Cohen’s team such as Voice Search, Voice Input, and Voice Actions. Voice Search made its first appearance in November 2008 as part of the Google Mobile app for the Apple iPhone. (It’s now available on Android phones, BlackBerry devices, and Nokia S60 phones as well.) It allows mobile phone users to enter Google search queries by speaking them into the phone. It’s startlingly accurate, in part because it learns from users. “The initial models were based on GOOG-411 data and they performed very well,” Cohen says. “Over time, we’ve been able to train with more Voice Search data and get improvements.”
Google isn’t the only company building statistical speech-recognition models that learn from data; Cambridge, MA, startup Vlingo, for example, has built a data-driven virtual assistant for iPhone, Android, BlackBerry, Nokia, and Windows Phone platforms that uses voice recognition to help users with mobile search, text messaging, and other tasks.
But Google has a big advantage: it’s also a search company. Before Cohen joined Google, he says, “they hadn’t done voice search before—but they had done search before, in a big way.” That meant Cohen’s team could use the logs of traditional Web searches at Google.com to help fine-tune its own language models. “If the last two words I saw were ‘the dog’ and I have a little ambiguity about the next word, it’s more likely to be ‘ran’ than ‘pan,'” Cohen explains. “The language models tell you the probabilities of all possible next words. We have been able to train enormous language models for Voice Search because we have so much textual data from Google.com.”
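Cohen's "the dog ran" example describes what is usually called a trigram language model: count which words follow each pair of words in a body of text, then turn the counts into probabilities. The Python sketch below is a minimal version under stated assumptions. The four-sentence corpus is invented, and there is no smoothing, so any word never seen after a given pair gets a probability of exactly zero; production models are trained on vastly more text and do not work this crudely.

```python
from collections import Counter, defaultdict

# Invented toy corpus standing in for a very large collection of query text.
CORPUS = [
    "the dog ran home",
    "the dog ran away",
    "the dog barked",
    "the cook washed the pan",
]


def train_trigrams(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def next_word_probability(counts, w1, w2, candidate):
    """Estimate P(candidate | w1 w2) by relative frequency, with no smoothing."""
    following = counts.get((w1, w2), Counter())
    total = sum(following.values())
    return following[candidate] / total if total else 0.0


model = train_trigrams(CORPUS)
p_ran = next_word_probability(model, "the", "dog", "ran")
p_pan = next_word_probability(model, "the", "dog", "pan")
print(p_ran, p_pan)
```

With this corpus, "ran" follows "the dog" in two of three cases and "pan" never does, so an acoustically ambiguous word after "the dog" would be resolved in favor of "ran." The more text the counts are drawn from, the more word pairs the model has seen, which is where a large archive of typed queries helps.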
Over time, speech recognition capabilities have popped up in more and more Google products. When Google Voice went public in the spring of 2009, it included a voicemail transcription feature courtesy of Cohen’s team. Early in 2010, YouTube began using Google’s transcription engine to publish written transcripts alongside every YouTube video, and YouTube viewers now have the option of seeing the transcribed text on screen, just like closed-captioning on television.
But mobile is still where most of the action is. Google’s Voice Actions service, introduced last August, lets Android users control their phones via voice—for instance, they can initiate calls, send e-mail and text messages, call up music, or search maps on the Web. (This feature is called Voice Commands on some phones.) And the Voice Input feature on certain Android phones adds a microphone button to the virtual keypad, allowing users to speak within any app where text entry is required.
“In general, our vision for [speech recognition on] mobile is complete ubiquity,” says Cohen. “That’s not where we are now, but it is where we are trying to get to. Anytime the user wants to interact by voice, they should be able to.” That even includes interacting with speakers of other languages: Cohen says Google’s speech recognition researchers work closely with their colleagues in machine translation—the subject of the next article in this series—and that the day isn’t far off when the two teams will be able to release a “speech in, speech out” application that combines speech recognition, machine translation, and speech synthesis for near-real-time translation between people speaking different languages.
“The speech effort could be viewed as something that enhances almost all of Google’s services,” says Cohen. “We can organize your voice mails, we can show you the information on the audio track of a YouTube video, you can do searches by voice. A large portion of the world’s information is spoken—that’s the bottom line. It was a big missing piece of the puzzle, and it needs to be included. It’s an enabler of a much wider array of usage scenarios, and I think that what we’ll see over time is all kinds of new applications that people would never have thought of before,” all of them powered by user-provided training data. Which is precisely what Schmidt had in mind in Berlin when he quoted sci-fi author William Gibson: “Google is made of us, a sort of coral reef of human minds and their products.”
Coming in Part 2: A look at the role of big data in Google’s machine translation effort, led by Franz Josef Och.