Argument Collapse: LLMs Flatten Long-Form Public Debate

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study of "argument collapse" finds that essays generated by large language models (LLMs) converge on a smaller, less diverse set of main arguments, sub-arguments, and structural patterns than human-written responses do. The researchers analyzed 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 Boston Review (BR) forums, and 23,384 LLM-generated essays from five frontier models (GPT, Claude, Gemini, DeepSeek, Minimax), and found significant homogenization. In the NYT corpus, 65.3% of human main arguments were unique, versus only 3.4% of main arguments from vanilla LLM generation. Even with "diversified" prompting, LLMs recovered only 50-55% of the distinct human main arguments. Sub-arguments collapsed as well: 41.0% of human sub-arguments were unique, compared with 9.1% of LLM sub-arguments. Qualitatively, LLMs favored generalized, hedged arguments, while humans preferred concrete, topic-specific ones. Structurally, LLM essays followed a more fixed arc, often moving from direct claims to proposals more quickly than human essays did.

Key takeaway

NLP engineers developing LLM applications should prioritize explicit diversity mechanisms beyond simple prompting. LLM-generated arguments tend toward generalized, hedged claims and fixed structures, which can narrow public discourse. Implement evaluation metrics for argument uniqueness and structural variation to detect this "argument collapse" and to help your models contribute to a richer, more varied argumentative landscape.
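One way to start measuring argument uniqueness is sketched below. It is a minimal illustration, not the study's pipeline: the definition of "unique" (an argument that overlaps with no other argument in the same set) is an assumption, the function names `unique_fraction` and `token_jaccard_overlap` are hypothetical, and the token-level Jaccard check is only a toy stand-in for an LLM or human overlap judge.

```python
from itertools import combinations
from typing import Callable, Sequence


def token_jaccard_overlap(a: str, b: str, threshold: float = 0.5) -> bool:
    """Toy overlap judge (assumption): Jaccard similarity on lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return False
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) >= threshold


def unique_fraction(
    arguments: Sequence[str],
    overlaps: Callable[[str, str], bool] = token_jaccard_overlap,
) -> float:
    """Share of arguments that overlap with no other argument in the set."""
    if not arguments:
        return 0.0
    has_match = [False] * len(arguments)
    # Compare every unordered pair once and mark both members on overlap.
    for i, j in combinations(range(len(arguments)), 2):
        if overlaps(arguments[i], arguments[j]):
            has_match[i] = has_match[j] = True
    return has_match.count(False) / len(arguments)


if __name__ == "__main__":
    sample = [
        "ban cars downtown now",
        "ban cars downtown today",
        "fund rural broadband",
    ]
    print(f"unique fraction: {unique_fraction(sample):.2f}")
```

In a real evaluation, the `overlaps` callable would wrap whatever judge you trust (for example, a prompted model or human annotation), and the metric would be computed per debate before averaging.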

Key insights

LLMs flatten public debate by converging on fewer, more generalized arguments and fixed structures.

Method

The study compared human responses to NYT and Boston Review debates with LLM-generated essays produced under vanilla, diversified, and position-guided generation conditions, using LLM judges to extract arguments and to label pairwise overlap between them.
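Given pairwise overlap labels of this kind, a coverage score (how many human arguments are matched by at least one LLM argument) can be computed as in the sketch below. The label format, the function name `human_coverage`, and the "at least one match" rule are assumptions made for illustration; the summary does not specify how the study aggregates its labels.

```python
from typing import Dict, Tuple


def human_coverage(n_human: int, labels: Dict[Tuple[int, int], bool]) -> float:
    """Fraction of human arguments matched by at least one LLM argument.

    `labels` maps (human_index, llm_index) to True when the judge marked
    the pair as overlapping. This layout is a hypothetical convention.
    """
    if n_human <= 0:
        return 0.0
    # Collect human indices with at least one positive label, ignoring
    # any index outside the declared range.
    covered = {
        human_idx
        for (human_idx, _llm_idx), is_overlap in labels.items()
        if is_overlap and 0 <= human_idx < n_human
    }
    return len(covered) / n_human


if __name__ == "__main__":
    judge_labels = {
        (0, 0): True,
        (0, 1): False,
        (1, 0): False,
        (1, 1): False,
        (2, 1): True,
    }
    print(f"coverage: {human_coverage(3, judge_labels):.2f}")
```

Computing this separately for each generation condition makes it possible to compare, for instance, vanilla against diversified prompting on the same set of human arguments.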

Best for: Research Scientist, AI Product Manager, AI Scientist, AI Ethicist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.