Could a Little Startup Called Diffbot Be the Next Google?
In tech journalism, it’s inadvisable to call any company “the next Google.” It’s almost always breathless hype or marked naïveté.
After all, people have been predicting the search giant’s demise for nearly as long as the company has existed. I wrote a Technology Review cover story called “Search Beyond Google” nearly 10 years ago. But with unlimited brainpower and money at its disposal, the company has managed to stay at the forefront in search, while also getting very good at other things, like mobile hardware.
So when I tell you that a seven-employee company called Diffbot really could be the next Google, I need to be very specific about what I mean.
I don’t mean that the tiny Palo Alto, CA-based startup is going to put Google out of business. In fact, Diffbot may already be partnering with Google. And there’s a good chance Google will just acqui-hire the startup at some point, thereby preempting the very interesting branch of the timeline where Diffbot gets big on its own.
And I don’t mean that Diffbot is going to redefine the search business. Not the search business as we’ve known it, anyway.
What I do mean is that Diffbot is poised to help the consumer and business worlds make sense of today’s more diverse Internet—one that takes many more forms, and is being put to many more uses, than the Web as it looked back in the 1990s, when Google was born.
Diffbot’s business is to use a combination of crawling software, computer vision, and machine learning to classify documents on the Web and break down each page type into its component parts. (The startup thinks there are about 20 of these types.) This allows people or programs to ask very specific questions about those parts—questions that can’t be answered very well using traditional search technology.
In other words, Diffbot is to today’s Internet as Google was to the Web of 1998. It’s a tool that can impose structure and meaning on resources that are currently disorganized and inaccessible, for a price that many businesses are willing to pay. And so far, that’s a game that Google itself doesn’t seem to want to play.
After writing my first story about Diffbot back in July 2012, I wanted to know about the latest progress at the Stanford-born startup, so I paid a visit to Diffbot’s new headquarters—a quiet backyard bungalow that feels insulated from all the nearby traffic on El Camino and Embarcadero Road. There, the Diffbot crew put aside their laptops for an hour to update me about the company’s ambitious vision. It hasn’t changed much since 2012, but it’s been fleshed out in key respects.
Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.
But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.
Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.
Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.
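Diffbot’s actual models are proprietary, but the idea sketched above can be made concrete with a toy example: reduce each rendered page element to a few geometric features (position, size, relative font scale) and label it by its nearest human-labeled training rectangle. Everything here, including the feature set and the training examples, is hypothetical and only illustrates the general technique.

```python
import math

# Illustrative sketch of geometry-based page labeling (not Diffbot's code).
# Each element is (x, y, width, height, font_scale), normalized to the page.
# Training data stands in for rectangles a human has drawn on example pages.
TRAINING = [
    ((0.05, 0.02, 0.90, 0.08, 2.4), "headline"),
    ((0.05, 0.12, 0.60, 0.70, 1.0), "body_text"),
    ((0.70, 0.12, 0.25, 0.25, 1.0), "image"),
    ((0.05, 0.85, 0.60, 0.10, 0.9), "comments"),
    ((0.70, 0.40, 0.25, 0.05, 1.3), "price"),
]

def classify(element):
    """Label an element by its nearest labeled training rectangle."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(TRAINING, key=lambda ex: dist(ex[0], element))[1]

# A wide box near the top of the page with large text looks like a headline.
print(classify((0.05, 0.03, 0.85, 0.07, 2.2)))  # headline
```

A production system would use richer features and real learning algorithms, but the core move is the same: labels come from how elements look and where they sit, not from what the HTML calls them.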
The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.
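To see why tag-based scraping is so brittle, consider two hypothetical sites presenting the same headline with completely different markup; an extractor keyed to one site’s class names finds nothing on the other. The markup and class names below are invented for illustration.

```python
from html.parser import HTMLParser

# Two made-up sites marking up the same headline in different ways.
SITE_A = '<h1 class="entry-title">Diffbot Raises the Bar</h1>'
SITE_B = '<div id="hd"><span class="big">Diffbot Raises the Bar</span></div>'

class ClassFinder(HTMLParser):
    """Collect text inside elements carrying a given class attribute."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.capture = False
        self.found = []
    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.wanted:
            self.capture = True
    def handle_endtag(self, tag):
        self.capture = False
    def handle_data(self, data):
        if self.capture:
            self.found.append(data)

finder_a = ClassFinder("entry-title")
finder_a.feed(SITE_A)   # finds the headline on site A...
finder_b = ClassFinder("entry-title")
finder_b.feed(SITE_B)   # ...but nothing on site B

print(finder_a.found)   # ['Diffbot Raises the Bar']
print(finder_b.found)   # []
```

Multiply that mismatch across millions of sites, each with its own templates, and per-site scraper rules stop scaling.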
Hence the computer-vision approach. “What we’re trying to do is reverse-engineer the Web presentation and turn it back into structured relations,” Diffbot chief scientist Scott Waterman explains.
The first type of page that Diffbot mastered was the news story. By the time I first met Tung and Diffbot vice president John Davi in 2012, they’d already gotten very good at parsing articles on the Web, and today the Diffbot “Article API,” or application programming interface, is used by hundreds of companies to extract text and reformat it for presentation in Web or mobile news readers. Digg, Instapaper, Onswipe, and Reverb are among Diffbot’s customers. “They’re saying, ‘Holy crap, how am I going to get clean text out of this sea of [Web pages]?’ and in that case, we’re a developer’s best friend,” Davi says. “We turn an insurmountable problem into one you can solve with an API integration.”
But articles were just the first page type that Tung’s crew wanted to make Diffbot understand. Today the company offers four APIs—for articles, images, products, and home pages—as well as a classifier that can automatically determine the page type for any URL, and a “Crawlbot” that can comb through entire sites, rather than just specific URLs.
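From a developer’s perspective, using one of these APIs amounts to a single authenticated GET request that returns structured JSON. The sketch below follows the general shape of Diffbot’s v2-era Article API, but treat the endpoint, parameters, and response fields as illustrative rather than authoritative, and note that the response here is sample data, not a real API reply.

```python
import json
from urllib.parse import urlencode

# Illustrative request shape for an article-extraction API call.
API_BASE = "http://api.diffbot.com/v2/article"

def article_request_url(token, page_url):
    """Build the GET URL for extracting clean article data from page_url."""
    return API_BASE + "?" + urlencode({"token": token, "url": page_url})

# A response resembling what such an API returns: the article's parts,
# pulled out of the raw HTML. (Sample data, not a real reply.)
sample_response = json.loads("""{
    "title": "Example headline",
    "author": "Example byline",
    "text": "Clean body text, stripped of navigation and ads..."
}""")

url = article_request_url("YOUR_TOKEN", "http://example.com/story.html")
print(url)
print(sample_response["title"])
```

The point of the JSON shape is the business model in miniature: the customer never parses HTML at all; they pay for the structured record.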
Davi says the startup rushed to finish the images API after it studied a few days’ worth of Twitter posts from mid-2012 and realized that images made up a whopping 36 percent of the material being shared on the microblogging network. That was a tipoff that understanding image pages would allow the company to parse a huge chunk of the human-readable Web.
But finishing the product API was a much more strategic and potentially lucrative move. The reason is simple: anyone who sells or promotes anything on the Web wants to be able to show the price, and wants to know how competitors are pricing the same wares. “All of the various product-discovery startups—pinning, bookmarking, search, e-commerce companies—want pricing information,” Tung says.
Pinterest is a client, for example. Tung and Davi say Diffbot analyzes the entire “firehose” of data that Pinterest users are putting on their pinboards, including the pages that pins link to, mainly in order to figure out which pins represent products on e-commerce sites.
“The ability to turn on user-facing features based on product data is a potential future revenue stream for these bookmarking sites,” Davi explains. “Say 15 percent of pins are products. They can say, ‘Let’s find out the pricing and availability, then let’s tell the user that this product they just pinned is available at Amazon for $5 less,’ or that it’s just gone on sale somewhere.” If the tip leads to a transaction, the pinning or bookmarking site is then in line for an affiliate commission.
Another user of the product API builds Facebook ads from pages on e-commerce sites. “They end up using our Crawlbot, combined with the product API, to extract data from entire retail sites like Target or QVC, drop the product data into their backend, and generate ads on the fly,” Davi says.
There’s a real business here. Customers who tap the Diffbot APIs up to 250,000 times per month are expected to pay a $300 monthly fee. If your calls are closer to the 5-million-a-month mark, you’ll pay $5,000, and at higher volumes, “custom” pricing goes into effect. One of the major search engines (the startup isn’t allowed to say which one) is paying Diffbot “to improve the richness of their search interface,” Tung says. Almost all of the company’s deals result from inbound inquiries, he says, which means he hasn’t yet needed to hire a sales director.
And there are many page types left to tackle. There will eventually be APIs for things like comment pages, discussion forums, product reviews, social-media status updates, and pages with embedded audio and video (though the startup doesn’t plan to analyze the actual content of media files). Add in the less common kinds of pages such as documents, charts, FAQs, locations, event listings, personal profiles, recipes, games, and error messages, and there are about 20 important page types altogether, Tung says. “Once we have all 20, we will essentially be able to cover the gamut, and convert most of the Web into a database structure,” he says.
Okay—why is that important, and how could it lead to a Google-scale opportunity?
As I’ve been trying to hint, the Web is a far richer place today than it was in 1998, when most pages were limited to text and images. Moreover, Web data is being tapped in new ways—and the majority of the entities using it aren’t even human.
That’s both alarming and intriguing. A study released last month by Silicon Valley Web security firm Incapsula showed that only 38.5 percent of all website traffic comes from real people. Another 29.5 percent comes from malicious bots, including scrapers, spammers, and impersonators—which is, of course, a serious problem. But on the up side, the remaining traffic, roughly 31 percent, comes from search engines and “good bots.”
This includes all of the services, from Instapaper to Flipboard to Pinterest, that extract data for presentation in other forms, leading, ultimately, to more page views for the original publisher. And it includes the growing category of specialized search engines and virtual personal assistants, from Wolfram Alpha to Siri and Google Now, that scour the Web to perform specific tasks for their human masters.
These bots are doing complicated things, which means they thrive on structure. And for them, Diffbot makes the Web a more welcoming place. For one thing, it gives them the ability to launch far more detailed searches against the raw data. “If we have Nike’s entire catalog as a database, you can select queries like ‘Show me x from Nike.com where the price is less than $70,’ and the things you get back aren’t Web pages optimized for viewing on a screen, but the actual record,” Tung says.
For that kind of search—which is more akin to a database query in SQL, the Structured Query Language, than to a keyword-based search—Google just won’t cut it. “Text search gets you only so far,” Waterman says. “When you start to understand the meaning of aspects of the page and glue them together, then you can do all kinds of other things.”
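Once product pages have been reduced to records, the query Tung describes really is an ordinary database lookup. A minimal sketch with SQLite and an invented catalog (the table layout and data are hypothetical, not Diffbot’s schema):

```python
import sqlite3

# Hypothetical catalog built from pages a crawler has already structured.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (site TEXT, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("nike.com", "Running shoe", 65.00),
        ("nike.com", "Track jacket", 120.00),
        ("target.com", "Water bottle", 9.99),
    ],
)

# The structured-search analogue of Tung's example: the answer is the
# record itself, not a page optimized for viewing on a screen.
rows = conn.execute(
    "SELECT name, price FROM products WHERE site = ? AND price < ?",
    ("nike.com", 70),
).fetchall()
print(rows)  # [('Running shoe', 65.0)]
```

No keyword index can answer “price is less than $70” reliably; only a database of extracted fields can.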
The first company that figures out how to map today’s more complex Web, and open it fully to automated traffic, stands to occupy a central place in tomorrow’s Internet economy. For as soon as data is readable by machines, Tung points out, Tim Berners-Lee’s vision of the Semantic Web will finally begin to take concrete shape. “New knowledge can be created with old knowledge,” he says. “Apps become like mini-AIs that take information, do some value-add with it, and produce other information.”
In September, Diffbot announced that it had brought on Matt Wells, the creator of an open-source search engine called Gigablast. Alongside Google, Bing, and Blekko, Gigablast is one of the few U.S.-based search engines to maintain its own index of the Web; at one time, its index of 12 billion pages was second only to Google’s.
“I believe in Mike’s vision, I see what he’s trying to do, and I thought it would be good to team up with a lot of smart people,” Wells told me. The hire is a sign that Diffbot’s ambitions extend beyond selling access to its APIs to something potentially much bigger: constructing a new kind of search engine, built around new types of queries and new ways of formulating intent. And to do that, Diffbot will obviously need its own global index. “We want to convert the entire Web into a structured database,” Tung says. “Matt is one person who has done that Web-scale crawling before. Most of his competitors were teams of thousands of people with millions of dollars.”
So, in the end, Diffbot is a small group of super-talented engineers and machine-learning experts who want to analyze and structure the Web on a huge scale. Yet the Googleplex is just five miles away—and what would be a life-altering amount of money for any of Diffbot’s team members would be pocket change for Google.
So it’s probably silly to imagine a future where Diffbot grows to 10,000 employees and becomes the substrate for a community of AIs, working to make us all happier, more comfortable, and more informed; that is to say, where our online existence isn’t ruled solely by Google and the NSA. But it’s nice to think that it’s possible.