Inside Google’s Age of Augmented Humanity, Part 3: Computer Vision Puts a “Bird on Your Shoulder”
It’s a staple of every film depiction of killer androids since Terminator: the moment when the audience watches through the robot’s eyes as it scans a human face, compares the person to a photo stored in its memory, and targets its unlucky victim for elimination.
That’s computer vision in action—but it’s actually one of the easiest examples, from a computational point of view. It’s a simple case of testing whether an acquired image matches a stored one. What if the android doesn’t know whether its target is a human or an animal or a rock, and it has to compare everything it sees against the whole universe of digital images? That’s the more general problem in computer vision, and it’s very, very hard.
But just as we saw with the case of statistical machine translation in Part 2 of this series, real computer science is catching up with, and in some cases outpacing, science fiction. And here, again, Google’s software engineers are helping to push the boundaries of what’s possible. Google made its name helping people find textual data on the Web, and it makes nearly all of its money selling text-based ads. But the company also has a deep interest in programming machines to comprehend the visual world—not so that they can terminate people more easily (not until Skynet takes over, anyway) but so that they can supply us with more information about all the unidentified or under-described objects we come across in our daily lives.
I’ve already described how Google’s speech recognition tools help you initiate searches by speaking to your smartphone rather than pecking away at its tiny keyboard. With Google Goggles, a visual search tool that debuted on Android mobile phones in December 2009 and on the Apple iPhone in October 2010, your phone’s built-in camera becomes the input channel, and the images you capture become the search queries. For limited categories of things—bar codes, text on signs or restaurant menus, book covers, famous paintings, wine labels, company logos—Goggles already works extremely well. And Google’s computer vision team is training its software to recognize many more types of things. In the near future, according to Hartmut Neven, the company’s technical lead manager for image recognition, Goggles might be able to tell a maple leaf from an oak leaf, or look at a chess board and suggest your next move.
Goggles is the most experimental, and the most audacious, of the technologies that Google CEO Eric Schmidt described in a recent speech in Berlin as the harbingers of an age of “augmented humanity.” Even more than the company’s speech recognition or machine translation tools, the software that Neven’s team is building—which is naturally tailored for smartphones and other sensor-laden mobile platforms—points toward a future where Google may be at hand to mediate nearly every instance of human curiosity.
“It is indeed not many years out where you can have this little bird looking over your shoulder, interpreting the scenes that you are seeing and pretty much for every piece in the scene—art, buildings, the people around you,” Neven told me in an interview late last year. “You can see that we will soon approach the point where the artificial system knows much more about what you are looking at than you know yourself.”
Neven, like most of the polymaths at Google, started out studying subjects completely unrelated to search. In his case, it was classical physics, followed by a stint in theoretical neurobiology, where he applied methods from statistical physics to understanding how the brain makes sense of information from the nervous system.
“One of the most fascinating objects of study in nature is the human brain, understanding how we learn, how we perceive,” Neven says. “Conscious experience is one of the big riddles in science. I am less and less optimistic that we will ever solve them—they’re probably not even amenable to the scientific method. But any step toward illuminating those questions, I find extremely fascinating.”
He sees computer vision as one of the steps. “If you have a theory about how the brain may recognize something, it’s surely nice if you can write a software program that does something similar,” he says. “That by no means proves that the brain does it the same way, but at least you have reached an understanding of how, in principle, it could be done.”
It’s pretty clear that the brain doesn’t interpret optical signals by starting from abstract definitions of what constitutes an edge, a curve, an angle, or a color. Nor does it have the benefit of captions or other metadata. The point—which I won’t belabor again here, since we’ve already seen it at work in the cases of Google’s efforts in speech recognition and machine translation—is that Neven’s approach to image recognition was data-driven from the start, relying on computers to sift through the huge piles of 1s and 0s that make up digital images and sniff out the statistical similarities between them. “We have, early on, and sooner than other groups, banked very heavily on machine learning as opposed to model-based vision,” he says.
Trained in Germany, Neven spent the late 1990s and early 2000s at the University of Southern California, in labs devoted to computational vision and human-machine interfaces. After tiring of the grant-writing treadmill, he struck out on his own, co-founding a company called Eyematic around a unique and very specific application of computer vision: using video from a standard camcorder to “drive” computer-generated characters in 3D. When that technology failed to pay off, Neven started Neven Vision, which began from the same foundation—facial feature tracking—but wound up exploring areas as diverse as biometric tools for law enforcement and visual searches for mobile commerce. “What Goggles is today, we started out working on at Neven Vision on a much smaller scale,” he says. “Take an image of a Coke can, and be entered in a sweepstakes. Simple, early applications that would generate revenue.”
How much revenue Neven Vision actually generated isn’t on record—but the company did have a reputation for building some of the most accurate face recognition software on the market, which was Google’s stated reason for acquiring the company in 2006. The team’s first assignment, Neven says, was to put face recognition into Picasa—the photo management system Google had purchased a couple of years before.
Given how far his team’s computer vision tools have evolved since then, Neven Vision probably should have held out for more money in the acquisition, Neven jokes today. “We said, ‘We can do more than face recognition—one of our main products is visual mobile search.’ They knew it, but they kept a poker face and said, ‘All we want is the face recognition, we are just going to pay for that.’”
Once the Picasa project was done, Neven’s team had to figure out what to do next. His initial pitch to his managers was to build a visual search app for packaged consumer goods. That was when Google’s poker face came off. “We said, ‘Let’s do a verticalized app that supports users in finding information about products.’ And then one of our very senior engineers, Udi Manber, came to the meeting and said, ‘No, no, it shouldn’t be vertical. It’s in Google’s DNA to go universal. We understand if you can’t quite do it yet, but that should be the ambition.’” The team was being told, in other words, to build a visual search tool that could identify anything.
That was “a little bit of a scary prospect,” Neven says. But on the other hand, the team had already developed modules or “engines” that were pretty good at recognizing things within a few categories, such as famous structures (the Eiffel Tower, the Golden Gate Bridge). And it had seen the benefits of doing things at Google scale. Neven Vision’s original face recognition algorithm had achieved a “significant jump in performance” simply because the team was now able to train it using tens of millions of images, instead of tens of thousands, and to parallelize the work across thousands of computers.
“Data is the key for pretty much everything we do,” Neven says. “It’s often more critical than the innovation on the algorithmic side. A dumb algorithm with more data beats a smart algorithm with less data.”
In practice, Neven’s team has been throwing both algorithms and data at the general computer vision problem. Goggles isn’t built around a single statistical model, but a variety of them. “A modern computer vision algorithm is a complex building with many stories and little towers on the side,” Neven says. “Whenever I visit a university and I see a piece that I could add, we try to find an arrangement with the researchers to bring third-party recognition software into Goggles as we go. We have the opposite of ‘Not Invented Here’ syndrome. If we find something good, we will add it.”
Goggles is really good at reading text (and translating it, if asked); it can work wonders with a business card or a wine label. If it has a good, close-up image to work with, it’s not bad at identifying random objects—California license plates, for example. And if it can’t figure out what it’s looking at, it can, at the very least, direct you to a collection of images with similar colors and layouts. “We call that internally the Fail Page, but it gives the user something, and over time this will show up less and less,” Neven says.
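To make the idea of a similar-colors fallback concrete, here is a minimal sketch of one classic way to rank images by color similarity: histogram intersection over coarse RGB buckets. This is an illustration only, not a description of how Goggles works; the function names, the four-buckets-per-channel setting, and the toy "images" (plain lists of RGB tuples) are all assumptions made for the example, and the sketch ignores layout entirely.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Return the share of pixels falling into each coarse RGB bucket."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms: near 1 for alike, 0 for disjoint."""
    return sum(min(share, h2.get(bucket, 0.0)) for bucket, share in h1.items())


def most_similar(query_pixels, library, top_k=3):
    """Rank library images (name -> pixel list) by color overlap with the query."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in library.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    return [name for _, name in scored[:top_k]]


# Hypothetical toy data: each "image" is just a list of (R, G, B) pixels.
library = {
    "sunset": [(250, 120, 30)] * 80 + [(40, 20, 60)] * 20,
    "forest": [(20, 140, 40)] * 90 + [(90, 60, 30)] * 10,
    "beach": [(240, 200, 140)] * 60 + [(30, 120, 220)] * 40,
}
query = [(245, 110, 20)] * 70 + [(50, 30, 50)] * 30

print(most_similar(query, library, top_k=1))
```

In this toy case the orange-and-dark query lands closest to the "sunset" image, because most of its pixels fall into the same coarse buckets. A production system would presumably combine many more signals than color alone.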
As even Neven acknowledges, Goggles isn’t yet a universal visual search tool; that’s why it’s still labeled as a Google Labs project, not an officially supported Google product. Its ability to identify nearly 200,000 works by famous painters, for example, is a computational parlor trick that, in truth, doesn’t add much to its everyday utility. The really hard work—getting good at identifying random objects that don’t have their own Wikipedia entries—is still ahead. “What keeps me awake at night is, ‘What are the honest-to-God use cases that we can deliver,’ where it’s not just an ‘Oh, wow,’” Neven says. “We call it the bar of daily engagement. Can we make it useful enough that every day you will take out Goggles and do something with it?”
But given the huge amount of learning material Google collects from the Web every day, the company’s image recognition algorithms are likely to clear that bar more and more often. They have savant-like skill in some areas: they can tell Amur leopards from clouded leopards, based on their spot patterns. They can round up images not just of tulips but of white tulips. The day isn’t all that far away, it seems clear, when Goggles will come close to fulfilling Neven’s image of the bird looking over your shoulder, always ready to tell you what you’re seeing.
The Next Great Stage of Search
What reaching this point might mean on a sociocultural level—in areas like travel and commerce, learning and education, surveillance and privacy—is a question that we’ll probably have to confront sooner than we expected. Why? Because it’s very clear that this is where Google wants to go.
Here’s how Schmidt put it in his speech: “When I walk down the streets of Berlin, I love history, [and] what I want is, I want the computer, my smartphone, to be doing searches constantly. ‘Did you know this occurred here, this occurred there?’ Because it knows who I am, it knows what I care about, and it knows roughly where I am.” And, as Schmidt might have added, the smartphone will know what he’s seeing. “So this notion of autonomous search, the ability to tell me things that I didn’t know but I probably am very interested in, is the next great stage, in my view, of search.”
This type of always-on, always-there search is, by definition, mobile. Indeed, Schmidt says Google search traffic from mobile devices grew by 50 percent in the first half of 2010, faster than every other kind of search. And by sometime between 2013 and 2015, analysts agree, the number of people accessing the Web from their phones and tablet devices will surpass the number using desktop and laptop PCs.
By pursuing a data-driven, cloud-based, “mobile first” strategy, therefore, Google is staking its claim in a near-future world where nearly every computing device will have its own eyes and ears, and where the boundaries of the searchable will be much broader. “Google works on the visual information in the world, the spoken and textual and document information in the world,” says Michael Cohen, Google’s speech technology leader. So in the long run, he says, technologies like speech recognition, machine translation, and computer vision “help flesh out the whole long-term vision of organizing literally all the world’s information and making it accessible. We never want you to be in a situation where you wish you could get at some of this information, but you can’t.”
Whatever you’re looking for, in other words, Google wants to help you find it—in any language, via text, sound, or pictures. (And if it can serve up a few ads in the process, so much the better.) That’s the real promise of having a “supercomputer in your pocket,” as Schmidt put it. But what we do with these new superpowers is up to us.
[Update, 2/28/11: Click here for a convenient single-page version of all three parts of “Inside Google’s Age of Augmented Humanity.”]