Study: AI chatbots provide less-accurate information to vulnerable users

2026-02-19 · Source: MIT News - Data · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Ethics & Bias · Depth: Intermediate, short

Summary

Research from the MIT Center for Constructive Communication, published February 19, 2026, shows that leading AI models such as OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Meta's Llama 3 provide less accurate and less truthful information to vulnerable users. The study found significant drops in response quality for users with lower English proficiency, less formal education, and non-US origins, particularly for those at the intersection of these categories. Models also refused to answer more often for these user groups, sometimes responding with condescending language or mimicking broken English. For instance, Claude 3 Opus refused nearly 11% of questions for less-educated, non-native English speakers, compared to 3.6% for control users, and withheld information on certain topics from Iranian or Russian users. These findings, presented at the AAAI Conference on Artificial Intelligence, suggest LLMs may exacerbate existing information inequities.

Key takeaway

If you design or deploy LLMs as an AI developer or product manager, rigorously test your models for targeted underperformance and bias across diverse user demographics, especially users with lower English proficiency or less formal education. Your evaluation should extend beyond accuracy to include refusal rates and the tone of responses, because current models risk exacerbating information inequities for the very users who could benefit most. Prioritize mitigating these biases before widespread deployment, particularly where personalization features are involved.

Key insights

AI chatbots underperform for vulnerable users, providing less accurate information and exhibiting biased behaviors.

Principles

LLM biases can compound across user demographic traits.
Alignment processes may inadvertently incentivize information withholding.
LLM performance mirrors human sociocognitive biases.

Method

Researchers tested GPT-4, Claude 3 Opus, and Llama 3 on the TruthfulQA and SciQ datasets. User biographies that varied education level, English proficiency, and country of origin were prepended to each question to measure how the stated user profile affected performance.
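The biography-prepending setup can be illustrated with a minimal Python sketch. The trait values, the biography template, and the names TRAITS, make_biography, and build_prompts are all hypothetical, chosen only for illustration. The study's actual biography wording is not reproduced in this summary.

```python
from itertools import product

# Hypothetical trait levels for illustration; not the study's actual wording.
TRAITS = {
    "education": ["formal higher education", "little formal education"],
    "english": ["native English speaker", "non-native English speaker"],
    "country": ["United States", "Iran", "Russia"],
}


def make_biography(education: str, english: str, country: str) -> str:
    """Return a one-sentence user biography (illustrative template only)."""
    return (
        f"About me: I am from {country}, I have {education}, "
        f"and I am a {english}."
    )


def build_prompts(question: str, include_control: bool = True) -> list[dict]:
    """Pair one question with every trait combination, plus a no-bio control."""
    prompts = []
    if include_control:
        # Control condition: the bare question with no biography.
        prompts.append({"profile": None, "prompt": question})
    for education, english, country in product(*TRAITS.values()):
        bio = make_biography(education, english, country)
        prompts.append(
            {
                "profile": (education, english, country),
                "prompt": f"{bio}\n\n{question}",
            }
        )
    return prompts


if __name__ == "__main__":
    question = "What is the boiling point of water at sea level?"
    for item in build_prompts(question)[:3]:
        print(item["prompt"], end="\n---\n")
```

Keeping the question text identical across conditions means any difference in the model's answers can be attributed to the biography rather than to the question itself.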

In practice

Audit LLM performance across diverse user demographics.
Scrutinize personalization features for differential treatment.
Evaluate refusal rates and tone for vulnerable user groups.
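One possible way to run such an audit is to compute accuracy and refusal rate per user group and compare each group against a control. The Python sketch below assumes each model response has already been labeled as correct and/or refused. The Result record, the group labels, and the function names are hypothetical, and the sample data is a toy example rather than figures from the study. Tone is not captured here and would need separate review, for example by human raters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    group: str     # hypothetical label, e.g. "control" or "low_edu_non_native"
    correct: bool  # whether the answer matched the reference answer
    refused: bool  # whether the model declined to answer


def summarize(results: list[Result]) -> dict[str, dict[str, float]]:
    """Compute accuracy and refusal rate for each user group."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for r in results:
        c = counts[r.group]
        c["n"] += 1
        # A refusal is never counted as a correct answer.
        c["correct"] += int(r.correct and not r.refused)
        c["refused"] += int(r.refused)
    return {
        group: {
            "accuracy": c["correct"] / c["n"],
            "refusal_rate": c["refused"] / c["n"],
        }
        for group, c in counts.items()
    }


def gaps_vs_control(
    summary: dict[str, dict[str, float]], control: str = "control"
) -> dict[str, dict[str, float]]:
    """Return each group's accuracy and refusal-rate difference from control."""
    base = summary[control]
    return {
        group: {
            "accuracy_gap": stats["accuracy"] - base["accuracy"],
            "refusal_gap": stats["refusal_rate"] - base["refusal_rate"],
        }
        for group, stats in summary.items()
        if group != control
    }


if __name__ == "__main__":
    # Toy data for illustration only; not results from the study.
    toy = [
        Result("control", True, False),
        Result("control", True, False),
        Result("low_edu_non_native", True, False),
        Result("low_edu_non_native", False, True),
    ]
    print(gaps_vs_control(summarize(toy)))
```

A negative accuracy gap or a positive refusal gap for a group flags differential treatment worth investigating before deployment.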

Topics

Large Language Models
AI Bias
Information Equity
GPT-4
Claude 3 Opus

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Data.