Diffbot’s AI model doesn’t guess — it knows, thanks to a trillion-fact knowledge graph

January 9, 2025 7:00 AM



Diffbot, a small Silicon Valley company best known for maintaining one of the world’s largest indexes of web knowledge, announced today the release of a new AI model that promises to address one of the biggest challenges in the field: factual accuracy.

The new model, a fine-tuned version of Meta’s Llama 3.3, is the first open-source implementation of a system known as graph retrieval-augmented generation, or GraphRAG.

Unlike conventional AI models, which rely solely on vast amounts of preloaded training data, Diffbot’s LLM draws on real-time information from the company’s Knowledge Graph, a constantly updated database containing more than a trillion interconnected facts.

“We have a thesis: that eventually general-purpose reasoning will get distilled down into about 1 billion parameters,” said Mike Tung, Diffbot’s founder and CEO, in an interview with VentureBeat. “You don’t actually want the knowledge in the model. You want the model to be good at just using tools so that it can query knowledge externally.”

How it works

Diffbot’s Knowledge Graph is a sprawling, automated database that has been crawling the public web since 2016. It categorizes web pages into entities such as people, companies, products and articles, extracting structured information using a combination of computer vision and natural language processing.
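As a rough illustration of what typed entities and interconnected, source-attributed facts might look like in code, consider the Python sketch below. The class names, fields and sample values are hypothetical and are not Diffbot's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One subject-predicate-object statement plus where it was found."""
    subject: str
    predicate: str
    obj: str
    source_url: str  # provenance: the page the fact was extracted from


@dataclass
class Entity:
    """A typed node (person, company, product, article) with its facts."""
    name: str
    entity_type: str
    facts: list[Fact] = field(default_factory=list)

    def add_fact(self, predicate: str, obj: str, source_url: str) -> Fact:
        fact = Fact(self.name, predicate, obj, source_url)
        self.facts.append(fact)
        return fact


# Illustrative data only; the names and URL are placeholders.
company = Entity(name="Example Corp", entity_type="company")
company.add_fact("founder", "Jane Doe", "https://example.com/about")
company.add_fact("industry", "software", "https://example.com/about")

founders = [f.obj for f in company.facts if f.predicate == "founder"]
print(founders)
```

Keeping the source URL on every fact is what would make later citation and correction possible in a design like this one.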

Every four to five days, the Knowledge Graph is refreshed with millions of new facts, keeping it up to date. Diffbot’s AI model leverages this resource by querying the graph in real time to retrieve information, rather than relying on static knowledge encoded in its training data.

For example, when asked about a recent news event, the model can search the web for the latest updates, extract relevant facts, and cite the original sources. This process is designed to make the system more accurate and transparent than traditional LLMs.
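The retrieve-then-cite pattern described here can be sketched in a few lines of Python. This is a minimal toy under stated assumptions: the in-memory `FACTS` list stands in for an external knowledge source, and the `lookup` and `answer` functions are hypothetical helpers, not part of Diffbot's API.

```python
# Toy fact store standing in for an external knowledge graph (made-up data).
FACTS = [
    {"entity": "Example Corp", "attribute": "ceo", "value": "Jane Doe",
     "source": "https://example.com/leadership"},
    {"entity": "Example Corp", "attribute": "headquarters",
     "value": "Springfield", "source": "https://example.com/contact"},
]


def lookup(entity: str, attribute: str) -> list[dict]:
    """Return every stored fact matching the entity and attribute."""
    return [f for f in FACTS
            if f["entity"] == entity and f["attribute"] == attribute]


def answer(entity: str, attribute: str) -> str:
    """Build an answer from retrieved facts only, citing each source."""
    hits = lookup(entity, attribute)
    if not hits:
        # Decline rather than guess when nothing was retrieved.
        return f"No sourced fact found for {attribute} of {entity}."
    parts = [f"{h['value']} [{h['source']}]" for h in hits]
    return f"{entity} {attribute}: " + "; ".join(parts)


print(answer("Example Corp", "ceo"))
print(answer("Example Corp", "revenue"))
```

The point of the design is that the answer text is assembled from what was retrieved, so every claim carries a source, and a missing fact produces an explicit "not found" instead of a fabricated value.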

“Imagine asking an AI about the weather,” Tung said. “Instead of generating an answer based on outdated training data, our model queries a live weather service and provides a response grounded in real-time information.”

How Diffbot’s Knowledge Graph beats traditional AI at finding facts

In benchmark tests, Diffbot’s approach appears to be paying off. The company reports its model achieves an 81% accuracy score on FreshQA, a Google-created benchmark for testing real-time factual knowledge, surpassing both ChatGPT and Gemini. It also scored 70.36% on MMLU-Pro, a more difficult version of a standard test of academic knowledge.

Perhaps most significantly, Diffbot is making its model fully open-source, allowing companies to run it on their own hardware and customize it for their needs. This addresses growing concerns about data privacy and vendor lock-in with major AI providers.

“You can run it locally on your machine,” Tung noted. “There’s no way you can run Google Gemini without sending your data over to Google and shipping it outside of your premises.”

Open-source AI could transform how enterprises handle sensitive data

The release comes at a pivotal moment in AI development. Recent months have seen mounting criticism of large language models’ tendency to “hallucinate” or generate false information, even as companies continue to scale up model sizes. Diffbot’s approach suggests an alternative path forward, one focused on grounding AI systems in verifiable facts rather than attempting to encode all human knowledge in neural networks.

“Not everyone’s going after just bigger and bigger models,” Tung said. “You can have a model that has more capability than a big model with kind of a non-intuitive approach like ours.”

Industry experts note that Diffbot’s Knowledge Graph-based approach could be particularly valuable for enterprise applications where accuracy and auditability are crucial. The company already provides data services to major firms including Cisco, DuckDuckGo and Snapchat.

The model is available immediately through an open-source release on GitHub and can be tested through a public demo at diffy.chat. For organizations wanting to deploy it internally, Diffbot says the smaller 8-billion-parameter version can run on a single Nvidia A100 GPU, while the full 70-billion-parameter version requires two H100 GPUs.
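A back-of-the-envelope check of those hardware figures, assuming 16-bit weights (2 bytes per parameter) and counting only the weights, not activations or the inference cache. The precision is an assumption; the article does not say how the models are served.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * bytes_per_param


small = weight_memory_gb(8)    # 8 billion parameters at 2 bytes each
large = weight_memory_gb(70)   # 70 billion parameters at 2 bytes each

print(small, large)
```

Under that assumption the smaller model needs roughly 16 GB for weights, which fits on one A100 (sold in 40 GB and 80 GB variants), while the larger needs roughly 140 GB, more than a single 80 GB H100 but within the 160 GB that two provide. That is consistent with the requirements Diffbot cites.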

Looking ahead, Tung believes the future of AI lies not in ever-larger models, but in better ways of organizing and accessing human knowledge: “Facts get stale. A lot of these facts will be moved out into explicit places where you can actually modify the knowledge and where you can have data provenance.”

As the AI industry grapples with challenges around factual accuracy and transparency, Diffbot’s release offers a compelling alternative to the dominant bigger-is-better paradigm. Whether it succeeds in shifting the field’s direction remains to be seen, but it has certainly demonstrated that when it comes to AI, size isn’t everything.
