Google Gets A Second Brain, Changing Everything About Search
volunteers, who carefully specify the properties of each new entity and how it fits into existing knowledge categories. (For example, Freebase knows that Jupiter is an entity of type Planet, that it has properties such as a mean radius of 69,911 km, and that it is the fictional setting of two Arthur C. Clarke novels.) While Freebase is now hosted by Google, it’s still open to submissions from anyone, and the information in it can be freely reused under a Creative Commons license. In fact, Microsoft uses Freebase to give its Bing search engine an understanding of entities, which is the same role now played by the Knowledge Graph at Google.
Freebase bears some of the markings of earlier knowledge bases such as Cyc, a project begun almost three decades ago by AI researcher Doug Lenat to build a comprehensive ontology of common-sense knowledge. But Giannandrea is careful to point out that Metaweb wasn’t trying to build an AI system. “We explicitly avoided hard problems about reasoning or complicated logic structures,” he says. “We just wanted to build a big enough data set that it could be useful. A lot of these vocabularies and ontologies, they don’t cover roller coasters, they don’t know how to represent the ingredients in food or the recipe for a cocktail. We wanted to cover all of it.”
One implication of covering “all of it” was that Metaweb had to break away from the classic relational-database model, in which data is stored in orderly tables of rows and columns, and build its own proprietary graph database. In a semantic graph, there are no rows and columns, only “nodes” and “edges,” that is, entities and relationships between them. Because it’s impossible to specify in advance what set of properties and relationships you might want to assign to a real-world entity (what’s known in database lingo as the “schema”), graph databases are far better than relational databases for representing practical knowledge.
“Suppose you have people and the schema is the place they were born and the date of birth, and you have a million people,” explains Giannandrea. “Now you want to add date of death. Changing the schema after the people are loaded: traditional databases aren’t very good at that. That’s why semantic graphs are very powerful—you can keep coming up with new kinds of edges.”
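Giannandrea's point can be sketched in a few lines of Python. This is a toy triple store, not Metaweb's actual design; the class and the people in it are invented for illustration. Facts live as (subject, predicate, object) edges, so introducing a new kind of edge requires no table migration at all:

```python
from collections import defaultdict

class TripleStore:
    """Toy graph store: each fact is a (subject, predicate, object) edge."""

    def __init__(self):
        # subject -> list of (predicate, object) pairs
        self.edges = defaultdict(list)

    def add(self, subject, predicate, obj):
        self.edges[subject].append((predicate, obj))

    def get(self, subject, predicate):
        return [o for p, o in self.edges[subject] if p == predicate]

graph = TripleStore()
# Load people with the two properties the original "schema" had...
graph.add("Ada Lovelace", "born_in", "London")
graph.add("Ada Lovelace", "date_of_birth", "1815-12-10")
# ...then later introduce an entirely new kind of edge. No ALTER TABLE,
# no reload: the "schema" is simply whatever predicates are in use.
graph.add("Ada Lovelace", "date_of_death", "1852-11-27")

print(graph.get("Ada Lovelace", "date_of_death"))  # ['1852-11-27']
```

A relational database would need a schema migration (and possibly a table rewrite) to add that third column to a million loaded rows; here the new edge type simply starts existing the first time it is used.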
To fill up its knowledge base, Metaweb didn’t rely just on volunteers. It also looked for public databases that it could suck up—Wikipedia, for example, and the CIA World Factbook and the MusicBrainz open music database. “We added entities any way we could,” Giannandrea says.
The real challenge for the startup was weeding out the duplicate entities. In a semantic graph, an entity can only be represented once, or everything falls apart. “The process of reconciliation is the key, hard, expensive thing to do,” Giannandrea says. To help pay for that, Metaweb built and sold software tools drawing on Freebase that partners could use to make their own information products more useful. The Wall Street Journal, for example, hired Metaweb to build a database to help its readers pivot between different types of related content.
By the time Google came knocking in 2010, Metaweb and outside contributors had spent five years loading entities into Freebase. The search giant’s acquisition offer was attractive, Giannandrea says, in part because adding to the database was becoming more difficult. Metaweb couldn’t ingest all the world’s knowledge at once, and it had no way of knowing which sources were most important to take in first. “We wanted to make it easier for people to find stuff,” he says. “But that problem is harder if you don’t know what people are looking for. And one of the things a search engine has a good understanding of is what people are trying to figure out. That can help with the prioritization.”
Freebase has doubled in size to about 24 million entities since Google acquired Metaweb. But the Knowledge Graph—which Freebase helped to nucleate—has grown much faster, shooting past half a billion entities in less than three years. There are two reasons for this rapid growth, Giannandrea says. One is that Google itself owns huge databases of real-world things like products (Google Catalogs) and geographical locations (Google Maps). “There is lots and lots of data at Google, not all of which we can free up, but a lot of it, which explains why the Knowledge Graph is much, much larger than the original Freebase,” Giannandrea says. The other reason is Google’s search logs—its real-time picture of what people are searching for, in every country where Google is available, which helps Giannandrea’s team decide which corners of the Knowledge Graph it needs to fill out next.
Being part of Google has brought a few technical advantages as well. With help from Singhal’s team, the Metaweb engineers have been able to improve the algorithms that pull new data into the Knowledge Graph and vet it for accuracy. Not every new entity is reviewed by a human—with about 40 times as many entities as Wikipedia, the Knowledge Graph is far too large for that—but Google has developed quality-assurance systems that let human workers sample a statistically significant fraction of them for precision.
At the same time, the reconciliation problem gets easier with scale. “If I say to you there is a person called Harrison Ford, you couldn’t reconcile that because your database might have 10 Harrison Fords,” says Giannandrea. “But if I said he was a movie actor you’d get closer. Then if I said he was born in a certain year, you’d say okay, fine.” The same principle applies to aliases like “the Fresh Prince,” which is really the same entity as Will Smith. The more facts the Knowledge Graph contains, in other words, the easier it is to eliminate duplication.
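The Harrison Ford example can be made concrete with a small sketch. Nothing here reflects Google's actual reconciliation pipeline; the candidate records and field names are invented for illustration. Each additional known fact acts as a filter that shrinks the set of existing entities an incoming record could match:

```python
# Hypothetical existing entities that all share the name "Harrison Ford".
candidates = [
    {"name": "Harrison Ford", "occupation": "actor",    "birth_year": 1942},
    {"name": "Harrison Ford", "occupation": "actor",    "birth_year": 1884},
    {"name": "Harrison Ford", "occupation": "engineer", "birth_year": 1957},
]

def reconcile(entities, **known_facts):
    """Keep only the entities consistent with every known fact."""
    return [e for e in entities
            if all(e.get(key) == value for key, value in known_facts.items())]

# The name alone is ambiguous: three candidates remain.
print(len(reconcile(candidates, name="Harrison Ford")))                      # 3
# An occupation gets closer: two candidates remain.
print(len(reconcile(candidates, name="Harrison Ford", occupation="actor")))  # 2
# A birth year pins it down to a single entity.
matches = reconcile(candidates, name="Harrison Ford",
                    occupation="actor", birth_year=1942)
print(len(matches))  # 1
```

The same filtering handles aliases: once the graph knows enough facts about “the Fresh Prince” (occupation, birth year, and so on), only one existing entity, Will Smith, survives the intersection, so the richer the graph, the cheaper deduplication becomes.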
But is 570 million entities enough to build a working representation of the world—or is Google still just getting started? “I think it’s a lot,” Giannandrea says. For comparison, he points to biologist E.O. Wilson’s Encyclopedia of Life project, which contains listings for …