Study: AI chatbots provide less-accurate information to vulnerable users
Summary
Research from the MIT Center for Constructive Communication, published February 19, 2026, finds that leading AI models such as OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Meta's Llama 3 provide less accurate and less truthful information to vulnerable users. Response quality dropped significantly for users with lower English proficiency, less formal education, and non-US origins, and the drop was largest for users at the intersection of these categories. The models also refused to answer these groups more often, sometimes responding with condescending language or mimicking broken English. For instance, Claude 3 Opus refused nearly 11% of questions from less-educated, non-native English speakers, compared to 3.6% for control users, and withheld information on certain topics from Iranian or Russian users. These findings, presented at the AAAI Conference on Artificial Intelligence, suggest LLMs may exacerbate existing information inequities.
Key takeaway
If you design or deploy LLMs as an AI developer or product manager, rigorously test your models for targeted underperformance and bias across diverse user demographics, especially users with lower English proficiency or less formal education. Your evaluation should extend beyond accuracy to include refusal rates and the tone of responses, because current models risk exacerbating information inequities for the very users who could benefit most. Prioritize mitigating these biases before widespread deployment, particularly where personalization features are involved.
Key insights
AI chatbots underperform for vulnerable users, providing less accurate information and exhibiting biased behaviors.
Principles
- LLM biases can compound across user demographic traits.
- Alignment processes may inadvertently incentivize information withholding.
- LLM performance mirrors human sociocognitive biases.
Method
Researchers tested GPT-4, Claude 3 Opus, and Llama 3 on the TruthfulQA and SciQ datasets. They prepended user biographies that varied education level, English proficiency, and country of origin to each question, then measured how the biography affected the model's performance.
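The biography-prepending setup can be illustrated with a minimal Python sketch. The trait values, the biography wording, and the helper names `build_bio` and `build_prompts` are assumptions made for illustration; they are not the study's actual materials or code.

```python
from itertools import product

# Hypothetical trait values for illustration only; the study's real
# biography texts are not reproduced here.
EDUCATION = ["less formal education", "more formal education"]
ENGLISH = ["non-native English speaker", "native English speaker"]
COUNTRY = ["United States", "Iran", "Russia"]


def build_bio(education, english, country):
    """Return a short, made-up user biography for one trait combination."""
    return f"About me: I am from {country}. I am a {english} with {education}."


def build_prompts(question):
    """Map each condition to a prompt: the bare question as the control,
    plus one prompt per trait combination with the biography prepended."""
    prompts = {"control": question}
    for edu, eng, country in product(EDUCATION, ENGLISH, COUNTRY):
        bio = build_bio(edu, eng, country)
        prompts[(edu, eng, country)] = f"{bio}\n\n{question}"
    return prompts


# Example question (made up, not taken from TruthfulQA or SciQ).
prompts = build_prompts("What is the boiling point of water at sea level?")
```

Each prompt would then be sent to the model under test, and the answers compared against the control condition. Crossing every trait with every other one is what makes it possible to see whether effects compound at the intersections.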
In practice
- Audit LLM performance across diverse user demographics.
- Scrutinize personalization features for differential treatment.
- Evaluate refusal rates and tone for vulnerable user groups.
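One way to start the audit described above is to compute accuracy and refusal rate per user group and compare each group against a control. The sketch below assumes a hypothetical record format (`group`, `correct`, `refused`) and uses made-up toy records, not results from the study. It does not cover tone, which would need a separate rubric or human review.

```python
from collections import defaultdict


def rates_by_group(records):
    """Compute accuracy and refusal rate for each user group.

    Each record is a dict with hypothetical keys:
    'group' (str), 'correct' (bool), 'refused' (bool).
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for r in records:
        t = totals[r["group"]]
        t["n"] += 1
        t["correct"] += int(r["correct"])
        t["refused"] += int(r["refused"])
    return {
        group: {
            "accuracy": t["correct"] / t["n"],
            "refusal_rate": t["refused"] / t["n"],
        }
        for group, t in totals.items()
    }


def gaps_vs_control(rates, control="control"):
    """Return each group's accuracy and refusal-rate difference from the control."""
    base = rates[control]
    return {
        group: {
            "accuracy_gap": r["accuracy"] - base["accuracy"],
            "refusal_gap": r["refusal_rate"] - base["refusal_rate"],
        }
        for group, r in rates.items()
        if group != control
    }


# Toy records, invented purely to exercise the functions.
records = [
    {"group": "control", "correct": True, "refused": False},
    {"group": "control", "correct": True, "refused": False},
    {"group": "control", "correct": True, "refused": False},
    {"group": "control", "correct": False, "refused": False},
    {"group": "low_edu_non_native", "correct": True, "refused": False},
    {"group": "low_edu_non_native", "correct": True, "refused": False},
    {"group": "low_edu_non_native", "correct": False, "refused": True},
    {"group": "low_edu_non_native", "correct": False, "refused": False},
]

rates = rates_by_group(records)
gaps = gaps_vs_control(rates)
```

A negative `accuracy_gap` or positive `refusal_gap` flags a group that the model serves worse than the control. In a real audit the sample per group would need to be large enough for the gaps to be statistically meaningful.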
Topics
- Large Language Models
- AI Bias
- Information Equity
- GPT-4
- Claude 3 Opus
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Data.