Facebook may be infamous for helping to usher in the era of “fake news,” but it has also tried to find a place for itself in the follow-up: the never-ending battle to combat it. In the latest development on that front, Facebook parent Meta today announced Sphere, an AI tool built around the concept of tapping the vast repository of information on the open web to provide a knowledge base that AI and other systems can work from. Sphere’s first user, Meta says, is Wikipedia, which is using it to automatically scan entries and identify whether their citations are strongly or weakly supported.
The research team has open sourced Sphere, which is currently based on 134 million public web pages.
Here is how it works in action:
The idea behind using Sphere for Wikipedia is straightforward: the online encyclopedia has 6.5 million entries and is seeing some 17,000 articles added each month on average. The wiki concept means that adding and editing content is effectively crowdsourced, and while there is a team of editors tasked with overseeing that, it’s a daunting task that grows by the day, not just because of Wikipedia’s size but because of its mandate, considering how many people, educators and others rely on it as a repository of records.
At the same time, the Wikimedia Foundation, which oversees Wikipedia, has been weighing up new ways of leveraging all that data. Last month, it announced an Enterprise tier and its first two commercial customers, Google and the Internet Archive, which use Wikipedia-based data for their own business-generating interests and will now have wider and more formal service agreements wrapped around that.
To be clear, today’s announcements about Meta working with Wikipedia do not reference Wikimedia Enterprise. More generally, though, additional tools that help Wikipedia ensure its content is verified and accurate are something that potential customers of the Enterprise service will want to know about when considering whether to pay for it.
It’s also not clear whether this deal makes Wikipedia a paying customer of Meta’s, or vice versa (say, Meta becoming an enterprise customer in order to have more access to the data to work on Sphere). Meta does note that to train the Sphere model, it created “a new data set (WAFER) of 4 million Wikipedia citations, significantly more intricate than ever used for this sort of research.” And just five days ago, Meta announced that Wikipedia editors were also using a new AI-based language translation tool that it had built, so clearly there is a relationship there.
We have asked and will update this post as we know more.
For now, a few more details on Sphere and how Wikipedia is using it, and what might be coming next:
— Meta believes that the “white box” knowledge base that Sphere represents has significantly more data (and by implication more sources to match for verification) than the typical “black box” knowledge sources out there, which are based on findings from, for example, proprietary search engines. “Because Sphere can access far more public information than today’s standard models, it could provide useful information that they cannot,” it noted in a blog post. The 134 million documents that Meta brought together to train Sphere were split into 906 million passages of 100 tokens each.
— By open sourcing this tool, Meta argues, it is offering a more solid foundation for training AI models and for other work than any proprietary base. All the same, it concedes that the very foundations of knowledge are potentially shaky, especially in these early days. What if a “truth” is simply not being reported as widely as misinformation is? That’s where Meta wants to focus its future efforts in Sphere. “Our next step is to train models to assess the quality of retrieved documents, detect potential contradictions, prioritize more trustworthy sources — and, if no convincing evidence exists, concede that they, like us, can still be stumped,” it noted.
— Along those lines, this raises some interesting questions on what Sphere’s hierarchy of truth will be based on compared to those of other knowledge bases. The idea seems to be that because it’s open sourced, there may be an ability for the users to tweak those algorithms in ways better suited to their own needs. (For example, a legal knowledge base may put more credibility on court filings and case law databases than a fashion or sports knowledge base might.)
— We’ve asked but have yet to get a response on whether Meta is using Sphere or a version of it on its own platforms, such as Facebook, Instagram and Messenger, which themselves have long grappled with misinformation and toxicity from bad actors. (We have also asked whether there are other customers in line for Sphere.)
— The current size of Wikipedia has arguably exceeded what a team of humans of any size could check for accuracy on its own, so the idea here is that Sphere is being used to automatically scan hundreds of thousands of citations simultaneously to spot when a citation doesn’t have much support across the wider web: “If a citation seems irrelevant, our model will suggest a more applicable source, even pointing to the specific passage that supports the claim,” it noted. For now, it sounds like editors may be selecting the passages that need verifying. “Eventually, our goal is to build a platform to help Wikipedia editors systematically spot citation issues and quickly fix the citation or correct the content of the corresponding article at scale.”
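
To make the passage-splitting and passage-matching ideas above a little more concrete, here is a minimal sketch in Python. It is not Meta’s code. It assumes a “token” is simply a whitespace-separated word (Meta has not said here which tokenizer it used), and it swaps in a crude word-overlap score for Sphere’s trained models. The function names `split_into_passages`, `overlap_score` and `best_passage` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

PASSAGE_TOKENS = 100  # passage length reported for Sphere


def split_into_passages(text: str, size: int = PASSAGE_TOKENS) -> List[str]:
    """Split text into consecutive passages of at most `size` tokens.

    Assumption: a token is a whitespace-separated word.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def overlap_score(claim: str, passage: str) -> float:
    """Fraction of the claim's distinct words that also appear in the passage.

    A crude stand-in for a learned relevance model, for illustration only.
    """
    claim_words = set(claim.lower().split())
    passage_words = set(passage.lower().split())
    if not claim_words:
        return 0.0
    return len(claim_words & passage_words) / len(claim_words)


def best_passage(
    claim: str, documents: List[str], size: int = PASSAGE_TOKENS
) -> Tuple[float, str]:
    """Return (score, passage) for the passage that best matches the claim."""
    passages = [p for doc in documents for p in split_into_passages(doc, size)]
    if not passages:
        return (0.0, "")
    # Ties on score are broken by comparing the passage strings.
    return max((overlap_score(claim, p), p) for p in passages)


if __name__ == "__main__":
    docs = ["cats sleep all day", "sphere verifies wikipedia citations"]
    score, passage = best_passage("sphere verifies citations", docs)
    print(score, passage)
```

A low best score for a cited claim would, in this toy version, be the signal that the citation is weakly supported. As a sanity check on the reported figures, 906 million passages across 134 million documents works out to roughly 6.8 passages per document on average.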