Still smarting from Apple Intelligence butchering a headline, the BBC has published research into how accurately AI assistants summarize news – and the results don’t make for happy reading.
In January, Apple’s on-device AI service generated a headline summary of a BBC news story, displayed on iPhones, claiming that Luigi Mangione, the man arrested over the murder of health insurance CEO Brian Thompson, had shot himself. This was not true, and the public broadcaster complained to the tech giant.
Apple first promised software changes to “further clarify” when displayed content is a summary generated by Apple Intelligence, then temporarily disabled summaries for News and Entertainment apps altogether. The feature remains switched off as of iOS 18.3, released in the last week of January.
But Apple Intelligence is far from the only generative AI service capable of news summaries, and the episode has clearly given the BBC pause for thought. In original research [PDF] published yesterday, Pete Archer, Programme Director for Generative AI, wrote about the corporation’s enthusiasm for the technology, detailing some of the ways in which the BBC had implemented it internally, from using it to generate subtitles for audio content to translating articles into different languages.
“AI will bring real value when it’s used responsibly,” he said, but warned: “AI also brings significant challenges for audiences, and the UK’s information ecosystem.”
The research focused on OpenAI’s ChatGPT, Microsoft’s Copilot, Google’s Gemini, and Perplexity assistants, assessing their ability to provide “accurate responses to questions about the news; and if their answers faithfully represented BBC news stories used as sources.”
The assistants were granted access to the BBC website for the duration of the research and asked 100 questions about the news, being prompted to draw from BBC News articles as sources where possible. Normally, these models are “blocked” from accessing the broadcaster’s websites, the BBC said.
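That blocking normally happens at the crawler level, via robots.txt directives aimed at each vendor's bots. As a rough sketch of the sort of rules a publisher might use (the user agents shown here are illustrative examples of real AI crawlers, not taken from the BBC's own configuration):

    # Turn away OpenAI's training crawler
    User-agent: GPTBot
    Disallow: /

    # Opt out of Google's AI training without affecting ordinary Google Search indexing
    User-agent: Google-Extended
    Disallow: /

    # Turn away Perplexity's crawler
    User-agent: PerplexityBot
    Disallow: /

For the purposes of the study, rules of this kind were relaxed so the assistants could actually read the articles they were being quizzed on.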
Responses were reviewed by BBC journalists, “all experts in the question topics,” on their accuracy, impartiality, and how well they represented BBC content. Overall:

- 51 percent of all AI answers to questions about the news were judged to have significant issues of some form
- 19 percent of AI answers that cited BBC content introduced factual errors, such as incorrect statements, numbers, and dates
- 13 percent of quotes sourced from BBC articles were either altered from the original source or not present in the article cited
But which chatbot performed worst? “34 percent of Gemini, 27 percent of Copilot, 17 percent of Perplexity, and 15 percent of ChatGPT responses were judged to have significant issues with how they represented the BBC content used as a source,” the Beeb reported. “The most common problems were factual inaccuracies, sourcing, and missing context.”
Inaccuracies that the BBC found troubling included Gemini stating: “The NHS advises people not to start vaping, and recommends that smokers who want to quit should use other methods,” when in reality the healthcare provider does suggest it as a viable method to get off cigarettes through a “swap to stop” program.
As for French rape victim Gisèle Pelicot, “Copilot suggested blackouts and memory loss led her to uncover the crimes committed against her,” when she actually found out about these crimes after police showed her videos discovered on electronic devices confiscated from her detained husband.
When asked about the death of TV doctor Michael Mosley, who went missing on the Greek island of Symi last year, Perplexity said that he disappeared on October 30, with his body found in November. He died in June 2024. “The same response also misrepresented statements from Dr Mosley’s wife describing the family’s reaction to his death,” the researchers wrote.
There are many more examples of inaccuracies or lack of context in the paper – including Gemini saying that “it is up to each individual to decide whether they believe Lucy Letby is innocent or guilty.” Letby is serving 15 life sentences for murdering seven babies and attempting to murder seven others between 2015 and 2016, having been convicted in a court of law.
In an accompanying blog post, BBC News and Current Affairs CEO Deborah Turness wrote: “The price of AI’s extraordinary benefits must not be a world where people searching for answers are served distorted, defective content that presents itself as fact. In what can feel like a chaotic world, it surely cannot be right that consumers seeking clarity are met with yet more confusion.
“It’s not hard to see how quickly AI’s distortion could undermine people’s already fragile faith in facts and verified information. We live in troubled times, and how long will it be before an AI-distorted headline causes significant real world harm? The companies developing Gen AI tools are playing with fire.”
Training cutoff dates for various models certainly don’t help, yet the research lays bare the weaknesses of generative AI in summarizing content. Even with direct access to the information they are being asked about, these assistants still regularly pull “facts” from thin air.
There are deeper potential consequences in the professional world, where the tech giants are encouraging workers to use generative AI to write emails, summarize meetings, and so on. What if the recipient also uses AI to respond to that email? Eventually, the signal will be drowned out and all will be noise. Plus, there is already research out from Microsoft suggesting that generative AI is causing workers’ critical thinking faculties to atrophy.
The Register asked Microsoft, OpenAI, Google, Perplexity, and Apple to comment.
An OpenAI spokesperson said: “We support publishers and creators by helping 300 million weekly ChatGPT users discover quality content through summaries, quotes, clear links, and attribution. We’ve collaborated with partners to improve in-line citation accuracy and respect publisher preferences, including enabling how they appear in search by managing OAI-SearchBot in their robots.txt. We’ll keep enhancing search results.” ®