Diffbot Is Using Computer Vision to Reinvent the Semantic Web

7/25/12

You know how the Picturephone, a half-billion-dollar project at AT&T back in the 1960s and 1970s, turned out to be a huge commercial flop, but two-way video communication eventually came back with a vengeance in the form of Skype and FaceTime and Google Hangouts? Well, something similar is going on with the Semantic Web.

That’s the proposal, dating back almost to the invention of the Web in the 1990s, that the various parts of Web pages should be tagged so that machines, as well as people, can make inferences based on the information they contain. The idea has never gotten very far, mainly because the burden of tagging all that content would fall to humans, which makes it expensive and tedious. But now it looks like the original goal of making digital content more comprehensible to computers might be achievable at far lower cost, thanks to better software.

Diffbot is building that software. This unusual startup (the first ever to emerge from the Stanford-based accelerator StartX, back in 2009) is using computer vision technology similar to that used for robotics applications such as self-driving cars to classify the parts of Web pages so that they can be reassembled in other forms. AOL is one of the startup’s first big customers and its landlord. It’s using Diffbot’s technology to assemble Editions by AOL, the personalized, iPad-based magazine composed of content culled from AOL properties like the Huffington Post, TechCrunch, and Engadget.

NPR's top news page as interpreted by Diffbot

I went down to AOL’s Palo Alto campus last month to meet Diffbot’s founder and CEO Mike Tung and its vice president of products John Davi. They didn’t deliberately set out to solve the Semantic Web problem, any more than the founders of Skype set out to build an affordable Picturephone. But their venture, which has attracted about $2 million in backing from Andy Bechtolsheim and a raft of other angel investing stars, is already on its way to creating one of the world’s largest structured indexes of unstructured Web content.

Without relying on HTML tags (which can actually be used to trick traditional Web crawling software), Diffbot can look at a news page and tell what’s a headline, what’s a byline, where the article text begins and ends, what’s an advertisement, and so forth. What practical use can companies make of that, and where’s the profit in it for Diffbot? Well, aside from AOL, the startup’s software is already being used in some interesting places: reading app maker Pocket (formerly Read It Later) uses it to extract article text from websites, and content discovery service StumbleUpon employs it to screen out spam.

In fact, companies pay Diffbot to analyze more than 100 million unique URLs per month. And that’s just the beginning. Building outward from its early focus on news articles, the startup is creating new algorithms that could make sense of many kinds of sites, such as e-commerce catalogs. The individual elements of those sites could then be served up in almost any context. Imagine a Siri for shopping, to take just one example. “We’re building a series of wedges that will add up to a complete view of the Web,” says Davi. “We are excited about having them all under our belt, so there can be a fully indexed, reverse-engineered Semantic Web.”

What follows is a highly compressed version of my conversation with Tung and Davi.

Xconomy: Where did you guys meet, and how did you end up working on Diffbot?

Mike Tung: I worked at Microsoft on Windows Vista right out of high school, then went to college at Cal and studied electrical engineering for two years, then went to Stanford to start a PhD in computer science, specializing in AI. When I first moved to Silicon Valley, I also worked at a bunch of startups. I was engineer number four at TheFind, which was a product search company that built the world’s largest product index. I worked on search at Yahoo and eBay, and also did a bunch of contract work. I took the patent bar and worked as a patent lawyer for a couple of years, writing 3G and 4G patents for Panasonic and Matsushita. I first met John when we were working at a startup called ClickTV, which was a video-player-search-engine thing. It was pretty advanced for its time.

Diffbot began when I was in grad school at Stanford [in 2005]. There was this one quarter where I was taking a lot of classes, so I made this tool for myself to keep track of all of them. I would put in the URL for the class website, and whenever a professor would upload new slides or content, Diffbot would find that and download it to my phone. I always felt like I knew what was going on in my classes without having to attend every single one.

It was useful, and my friends started asking me whether they could use it. So I turned it into a Web service and started running it out of a dorm at Stanford. And people started adding a bunch of different kinds of URLs to Diffbot outside of classes, like they might add Craigslist if they were searching for a job or a product, or Facebook if they wanted to see if their ex’s profile had changed.

X: So I assume the name “Diffbot” related to comparing the old and new versions of a website and detecting the differences?

MT: Yes, but just doing deltas on Web pages doesn’t work too well. It turns out that on the modern Web, every page refresh changes the ads and the counters. You have to be a little more intelligent.

That’s where understanding the page comes into play. I was studying machine learning at Stanford, and in particular one project I had worked on was the vision system for the self-driving car [Stanford’s entry in the 2007 DARPA Urban Challenge]. This was the stereo camera system that would compute the depth of a scene and say, ‘This is a cactus, this is drivable dirt, this is not drivable dirt, this is a cliff, this is a very narrow passageway.’ I realized that one way of making Diffbot generalizable was to apply computer vision to Web pages. Not to say, ‘This is a cactus and this is a pedestrian,’ but to say, ‘This is an advertisement and this is a footer and this is a product.’

A human being can look at a Web page and very easily tell what type of page it is without even looking at the text, and that is what we are teaching Diffbot to do. The goal is to build a machine-readable version of the entire Web.

X: Isn’t that what Tim Berners-Lee has been talking about for years—building a Semantic Web that’s machine-readable?

MT: It seems that every three years or so a new Semantic Web technology gets hyped up again. There was RSS, RDF, OWL, and now it’s Open Graph and the Knowledge Graph. The central problem—why none of these have really gone mainstream—is that you are requiring humans to tag the content twice, once for the machine’s benefit and once for the actual humans. Because you are placing so much onus on the content creators, you are never going to have all of the content in any given system. So it will be fragmented into different Semantic Web file formats, and because of that you will never have an app that allows you to search and evaluate all that information.

But what if you analyze the page itself? That is where we have an opportunity, by applying computer vision to eliminate the problem of manual tagging. And we have reached a certain point in the technology continuum where it is actually possible—where the CPUs are fast enough and the machine learning technology is good enough that we have a good shot of doing it with high accuracy.

X: Why are you so convinced that a human-tagged Semantic Web would never work?

MT: The number one point is that people are lazy. The second is that people lie. Google used to read the meta tags and keywords at the top of a Web page, and so people would start stuffing those areas with everything. It didn’t correspond to what actual humans saw. The same thing holds for Semantic Web formats. Whenever you have things indexed separately, you start to see spam. By using a robot to look at the page, you are keeping it above that.

X: Talk about the computer vision aspect of Diffbot. How literal is the comparison to the cameras and radar on robot cars?

MT: We use the very same techniques used in computer vision, for example object detection and edge detection. If you are a customer, you give us a URL to analyze. We render the page using a virtual WebKit browser in the cloud. It will render the page, run the JavaScript, and lay everything out with the CSS rules and everything. Then we have these hooks into WebKit that allow us to get all of the visual and geometric information out of the page. For every rectangle, we pull out things like the x and y coordinates, the heights and widths, the positioning relative to everything else, the font sizes, the colors, and other visual cues. In much the same way, when I was working on the self-driving car, we would look at a patch and do edge detection to determine the shape of a thing or find the horizon.
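To make the idea concrete, here is a minimal sketch of how one rendered rectangle might be turned into numbers using only geometry and styling. It is an illustration, not Diffbot's code: the `RenderedBox` structure, the `visual_features` function, and the particular features chosen are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RenderedBox:
    """One rectangle from a rendered page (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float
    font_size: float
    color: Tuple[int, int, int]  # text color as RGB, 0-255 per channel


def visual_features(box: RenderedBox, page_width: float, page_height: float) -> List[float]:
    """Describe a box by position, size, and styling, never by its text."""
    r, g, b = box.color
    return [
        box.x / page_width,        # horizontal position, relative to the page
        box.y / page_height,       # vertical position, relative to the page
        box.width / page_width,    # relative width
        box.height / page_height,  # relative height
        box.font_size,             # font size in pixels
        (0.299 * r + 0.587 * g + 0.114 * b) / 255.0,  # perceived brightness
    ]


# A wide box near the top of the page, set in large black type.
headline = RenderedBox(x=100, y=50, width=800, height=60, font_size=32, color=(0, 0, 0))
features = visual_features(headline, page_width=1000, page_height=2000)
print(features)
```

Dividing by the page dimensions keeps the numbers comparable across pages of different sizes, which matters if the same model is meant to generalize from one site to another.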

X: Once you identify those shapes and other elements, how do you say, “This is a headline, this is an article,” et cetera?

MT: We have an ontology. Other people have done good work defining what those ontologies should be—there are many of them at schema.org, which reflects what the search engines have proposed as ontologies. We also have human beings who draw rectangles on the pages and teach Diffbot “this is what an author field looks like, this is what a product looks like, this is what a price looks like,” and from those rectangles we can generalize. It’s a machine learning system, so it lives and breathes on the training data that is fed into it.
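A toy version of learning from labeled rectangles might look like the sketch below. The interview does not say which learning method Diffbot uses, so this substitutes a simple nearest-centroid classifier, and the two features and all of the training values are invented for illustration.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Tuple

Features = List[float]


def train_centroids(examples: List[Tuple[Features, str]]) -> Dict[str, Features]:
    """Average the feature vectors of the hand-labeled boxes for each label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(features: Features, centroids: Dict[str, Features]) -> str:
    """Pick the label whose average box is closest to this one."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))


# Hypothetical training data: [relative vertical position, scaled font size].
labeled_boxes = [
    ([0.05, 0.90], "headline"),
    ([0.08, 0.80], "headline"),
    ([0.12, 0.30], "byline"),
    ([0.15, 0.28], "byline"),
    ([0.50, 0.35], "body"),
    ([0.60, 0.35], "body"),
]

centroids = train_centroids(labeled_boxes)
prediction = classify([0.06, 0.85], centroids)
print(prediction)
```

Adding more labeled rectangles shifts the averages, which is the sense in which a system like this depends on the training data it is fed.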

X: Do you actually do all the training work yourselves, or do you crowdsource it out somehow?

John Davi: We have done a combination of things. We always have a cold-start problem firing up new types of pages: products versus articles, or a new algorithm for press releases, for example. We leverage both grunt work internally (just grinding out our own examples, which has the side benefit of keeping us informed about the real world) and, yeah, also crowdsourcing, which gives us a much broader variety of input and opinion. We have used everything, including off-the-shelf crowdsourcing tools like Mechanical Turk and CrowdFlower, and we have built up our own group of quasi-contract crowdsourcers.

Our basic effort is to cold-start it ourselves, then get an alpha-level product into the hands of our customer, which will then drastically increase the amount of training data we have. Sometimes we look at the stream of content and eyeball it and manually tweak and correct. In a lot of cases our customer gets involved. If they have an interest in helping to train the algorithm—it not only makes it better for them, but if they are first out of the gate they can tailor the algorithm to their very particular needs.

X: How much can your algorithms tell about a Web page just from the way it looks? Are you also analyzing the actual text?

MT: First we take a URL and determine what type of page it is. We’ve identified roughly 20 types of pages that all the Web can fall into. Article pages, people pages, product pages, photos, videos, and so on. So one of the fields we return will be what is the type of this thing. Then, depending on the type, there are other fields. For the article API [application programming interface], which is one we have out publicly, we can tell you the title, the author, the images, the videos, and the text that go with that article. And we not only identify where the text is, but we can tell you the topics. We do some natural language processing on the text and we can tell you “This is about Apple,” and we can tell it’s about Apple Computer and not the fruit.

JD: Another opportunity we are excited about is how Diffbot can help augment what is natively on the page. Just by dint of following so many pages through our system, we can augment [the existing formatting] and increase the value for whoever is reading. In the case of an article, the fact that we see so many articles means it’s relatively easy for us to generate tags for any given text.

X: How do you turn this all into a business?

MT: We are actually selling something. We are trying to build the Semantic Web, but in a profitable way. We analyze the pages that people pay us to analyze. That’s currently over 100 million URLs per month, which is a good slice of the Web. Other startups have taken the approach of starting by crawling and indexing the Web, and that is very capital-intensive. By doing it this way, another benefit is that people only send us the best parts of the Web. Most of the stuff a typical Web crawler goes through never appears in any search results. Most of the Web is crap.

X: Are people finding uses for the technology that you may not have thought of?

MT: We had a hackathon last year where a guy came in and built an app for his father, who is blind. It runs Diffbot on a page and makes it into a radio station. For someone who is blind, browsing a news site is usually a really poor experience. The usual screen readers will read the entire page, including the nav bars and the ads and the text. The screen readers have no context about what is important on the page. Using Diffbot to be his father’s eyes, this guy could parse the page and read it in a way that is much more natural.

JD: AOL’s Editions app is one of the more interesting use cases that I’ve seen. It’s an iPad app that features both their own content as well as snippets from across the Web, in a daily issue. I spent five years running engineering for the media solutions group at Cisco, selling a Web platform for media companies, and the biggest problem we faced was dealing with the excess of content management systems that all media companies have. In the case of Editions, AOL has myriad properties that they want to merge into this single app. But rather than consolidate TechCrunch and Engadget and the Huffington Post and a half dozen other sites, they use Diffbot to build a kind of content management system on the fly from the rendered Web pages. They extract the content and deliver it on the fly as if it came from a CMS right to the iPad magazine.

StumbleUpon is another interesting one. They use Diffbot as their moderation queue. Whenever a new website is submitted to their index, they want to make sure it’s legitimate before it’s available for stumbling. They have to rule out people who stumble a page, then swap it out for spam. So they run Diffbot on the source page, pipe that into their moderation queue, and if it looks like a legitimate page they can monitor that and keep checking on a regular basis to see how much it changes. If it has changed much between day 1 and day 10, it might warrant human intervention.
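The kind of check Davi describes could be sketched roughly as follows, assuming the main text of the page has already been extracted on each visit. The function names and the 0.5 threshold are assumptions for illustration, not details from StumbleUpon or Diffbot; the comparison uses `difflib.SequenceMatcher` from Python's standard library.

```python
from difflib import SequenceMatcher


def change_fraction(old_text: str, new_text: str) -> float:
    """Return 0.0 for identical text and 1.0 for nothing in common, word by word."""
    matcher = SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def needs_review(old_text: str, new_text: str, threshold: float = 0.5) -> bool:
    """Flag the page for a human moderator if its content changed too much."""
    return change_fraction(old_text, new_text) > threshold


day_1 = "Local bakery wins regional award for its sourdough bread"
day_10_edited = "Local bakery wins regional award for its sourdough loaf"
day_10_swapped = "Buy cheap watches now limited offer click here today"

print(needs_review(day_1, day_10_edited))   # a small edit
print(needs_review(day_1, day_10_swapped))  # content replaced wholesale
```

Comparing extracted article text rather than raw HTML sidesteps the problem Tung raised earlier: ads and counters change on every refresh, so a diff of the whole page would flag nearly everything.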

X: Aren’t there a lot of news reader apps these days that are doing the same thing you’re doing when it comes to identifying and isolating the text of a news article? That’s what Instapaper and Pocket and Readability and Zite are all doing.

MT: We power a lot of those apps. Our audience is the developers who work at those companies, who use our API to create their experience.

JD: We make it a lot more affordable to make those kinds of forays. When you look at building your own customized extraction tools, you are talking about multiple developers over weeks or months, to build something that is more brittle than what we offer out of the gate. Our ultimate goal is to be not only better but a lot cheaper than what you could build.

X: It’s not totally clear yet, though, whether publications or apps that aggregate lots of content from elsewhere, like Editions or even Flipboard, are going to be profitable in the long term, and where publishing is going as a business. Don’t you guys feel there’s some risk in tying your fortunes to such a troubled industry?

MT: The more interesting question is how do you monetize the Semantic Web, and where is the money in building the structured information. Articles are only one page type. Another that I mentioned is products. If you could show products on a cell phone, and people could buy the product and we could make that transaction happen, that is one very tangible way of making money. I think there is a lot of value in having structured information, because you can connect people more directly to what they want. Once we have the entire Web in machine-readable format, anybody who wants to use any sort of data can use the Diffbot view of it, and I think a lot of those apps can make money. Look at Siri—it’s great but it only works with the 10 or so sources that it’s hard-coded to work with. If you were able to combine Siri with Diffbot, Siri could operate on the Web and take a query and actually do it for you.

X: What page types will you move on to next? Did you start with articles because those are easiest?

MT: I wouldn’t say they were easiest, but they are pretty prevalent on the Web. A variety of factors help us prioritize what we should do next. One signal is what is the prevalence of that type of page on the Web. If doing one page type lets us knock out 30 percent of the Web, maybe we will go for it.

X: Will there always be a need for Diffbot, or with the transition to HTML 5, will Web pages gradually get more structure on their own?

MT: If you look at the ratio of unstructured pages to structured, it’s actually going in the opposite direction. I think human beings are creative, and they design pages for other humans. No matter what, people will find a way to create documents that lie outside of the well-defined tags, whether it’s HTML 5 or Flash or PDF or Xbox. What they all have in common is that they are just vessels that we can easily train and adapt Diffbot to work with.

Wade Roush is a contributing editor at Xconomy.
