Stanford Study Uncovers Flaws in AI Chatbots Answering News Questions

A Stanford study found that while leading AI chatbots scored above 90% on many real-time news questions, those results masked significant weaknesses in regional performance, source attribution and responses to flawed prompts.

June 8, 2026 12:55 AM IST | Written by Supriya Singh | Edited by Pratima O Pareek

A new Stanford study assessed six commercial AI chatbots using 2,100 same-day news questions, generating 12,600 model responses across six regions and languages.

Researchers conducted the evaluation over 14 days (February 9–22, 2026) using same-day BBC News questions from six regional services: U.S. & Canada, Afrique, Arabic, Hindi, Russian and Turkish.

The study, titled “Reading Today’s Headlines Through AI: A Real-Time Audit of Six Commercial Chatbots”, examined how leading AI systems perform when answering questions about current news events.

According to the study, top-performing models such as Gemini 3 Flash, Grok 4 and Gemini 3 Pro answered correctly more than nine times out of ten, an accuracy rate above 90%.

However, researchers said the strong overall scores obscured three key patterns: a significant performance gap on Hindi-language content, differences in citation and source-selection patterns across chatbots, and weaker performance when questions contained inaccurate assumptions.

The study also identified significant disparities across languages and regions. Hindi-language questions recorded the lowest average accuracy at 79.3%, nearly 10 percentage points below the next-lowest region.

“Every model tested performed the worst in Hindi,” the study said.

Researchers attributed the gap not to language comprehension but to failures in retrieving relevant Hindi-language sources. Retrieval failure accounted for 38.8% of errors, while source divergence, in which models retrieved a thematically related but factually different source, accounted for 32.7%.

“The failure is not one of language comprehension. These systems read Hindi fluently and reason competently in it. It is a failure of evidence binding,” the study said. It added that chatbots often relied on English-language sources covering similar news topics, which led to inaccurate answers.

The researchers also analyzed every URL cited across all 12,600 model responses and found significant differences in citation patterns across chatbots. The study said these differences likely reflect a combination of retrieval systems, licensing arrangements and source-access policies, potentially influencing which information reaches users.

Models also relied heavily on English-language sources even when answering questions about non-English news, the study found. “Of the six BBC regional services we evaluated, only the U.S. & Canada publishes in English,” the study said.

The study also found that chatbot performance deteriorated when questions contained misleading or inaccurate premises. In adversarial testing, Grok 4 retained 70% accuracy while GPT-5 fell to 19%, highlighting significant differences in how models handled flawed user queries.

Citing a 2026 Reuters Institute survey, the study noted that news executives expect a 43% decline in Google search traffic to publishers over the next three years.

“As more users encounter journalism through AI lenses rather than directly through publishers’ sites, differences in retrieval, attribution, and source selection will increasingly shape whose reporting reaches the general public, under what terms, and how,” the study stated.

According to the study, around 10% of Americans already use AI chatbots for news at least occasionally, while the figure approaches 15% among news consumers under 25 globally.


