“Everything’s becoming intelligent, but the limiting factor of intelligence is access to structured data,” Tung says.
Diffbot, an artificial intelligence company that helps clients extract and combine data from multiple Web sources, wants to scrape all the data on the Web (all of it) and put it into a structured format, making it useful for all sorts of business purposes and making money in the process. The company says its technology “uses computer vision and NLP algorithms to extract and structure any web page into the world’s largest structured database… with no human curation or oversight.”
Founded in 2009, the Palo Alto, CA-based startup announced today that it has raised $10 million from investors to expand its “knowledge-as-a-service” offerings to businesses and consumer apps. It has raised close to $13 million since its seed round in 2012. Diffbot’s plan is to catalog trillions of facts across the Web, many of them drawn from page elements such as comment forums, which can’t be mined by traditional search engines.
Web-mining can be a competitive advantage for apps as well as the proliferating devices of the Internet of Things, Tung says.
The startup says it has made a significant start on that goal, having indexed 1.2 billion entities such as people, products, and places since the middle of last year. Its Global Index also encompasses 10 to 20 times that number of facts, says Diffbot founder and CEO Mike Tung. Last June, the company said its database had surpassed the size of Google’s Knowledge Graph.
Diffbot does more than search what people are saying on their Twitter and Facebook feeds. It also looks at comment threads on Reddit and in customer support forums, and basically everywhere else on the Web. By structuring all that wildly unstructured data, Diffbot makes it searchable and thus useful. Small companies can get started for free. Big companies pay based on the volume of data they need to access.
The startup’s key early innovation was to extend the search function into previously uncharted territory by teaching computers how to recognize the various sub-sections of Web pages, including headlines, ad boxes, pictures, and discussion threads. Diffbot could then classify each page by type, such as news article or product page. That knowledge allows the computers to find and assemble related information, such as product prices across various retailers and consumer opinions across many social media platforms and comment sections. The technology creates “structured data” that machines can read and interpret, Diffbot says.
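To make the idea of “structured data” concrete, here is a minimal Python sketch of what a downstream application might do once pages have been classified and their fields extracted. It is an illustration only, not Diffbot’s API or output format: the `PageRecord` type, its field names, the helper functions, and the sample records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical structured record for one classified Web page."""
    url: str
    page_type: str                 # assumed labels, e.g. "product" or "article"
    title: str
    site: str
    price: Optional[float] = None  # only meaningful for product pages


def prices_by_product(records):
    """Group product-page prices by product title, keyed by retailer site."""
    grouped = defaultdict(dict)
    for rec in records:
        if rec.page_type == "product" and rec.price is not None:
            grouped[rec.title][rec.site] = rec.price
    return dict(grouped)


def cheapest_site(prices):
    """Return the (site, price) pair with the lowest price."""
    return min(prices.items(), key=lambda item: item[1])


# Made-up sample records standing in for extracted page data.
records = [
    PageRecord("https://shop-a.example/kettle", "product", "Acme Kettle", "shop-a.example", 24.99),
    PageRecord("https://shop-b.example/kettle", "product", "Acme Kettle", "shop-b.example", 22.50),
    PageRecord("https://news.example/kettles", "article", "Kettle roundup", "news.example"),
]

by_product = prices_by_product(records)
best = cheapest_site(by_product["Acme Kettle"])
```

Once each page carries a type and named fields, comparing prices across retailers becomes a simple grouping step rather than a fresh scraping problem for every site.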
Diffbot has been scaling up its data center, adding to its bank of proprietary servers with specialized hardware, and integrating Web-based processing power into the system to meet surges of demand. The company’s new money will accelerate the scale-up and fund an expansion of its R&D team, Tung says. Diffbot works in any language, Tung says. “It can tell you who the speakers are, and what they’re saying,” he says. The company’s technology is “sufficiently powerful to reduce information asymmetry.”
“We’ve proven it’s possible to build a profitable AI business model,” Tung says.
With more than 250 customers, including Amazon, CBS Interactive, eBay, Microsoft, and Salesforce, Diffbot became profitable at the end of 2015, Tung says. The research groups at Google and Facebook are Diffbot’s closest rivals in the development of methods to gather and synthesize Web data using artificial intelligence technology, Tung says. But rather than keeping the knowledge in-house, Diffbot is making it available to outside companies.
“We’re sort of like Switzerland in the AI wars,” Tung says.
It’s worth noting that far larger companies are struggling to find a good business model for AI, cognitive computing, or whatever the next name for this self-teaching technology will be. Diffbot’s operating expenses are low because its automated data collection and analysis technology requires no human curation, Tung says.
The second goal for Diffbot’s $10 million Series A financing round was to make alliances with investors experienced in artificial intelligence, Tung says. The round was led by Tencent, China’s leading Internet service provider, and Felicis Ventures. Tencent is not a customer of Diffbot’s, Tung says. He adds the word “now.”
Other startups are pursuing a similar AI-as-a-service model, recognizing that while the Internet giants have the resources to push the envelope in things like computer vision and natural language understanding, lots of companies can benefit from these technologies.
Believe it or not, Diffbot still has a tiny staff of 14 people.