CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Cavewoman is a two-channel evaluation protocol for measuring how large language models behave when their input or output is linguistically compressed. It scores every generation on task accuracy, realized per-item cost, and agreement with the model's own unconstrained generation. Across eight models, five datasets, and five reduction levels, the study found that output compression cut realized cost on most API models (1.4-2.4x, up to 3x) and on all four open-weight models under public-tier pricing. Input compression was counterproductive: it raised net cost (~1.15x on average, up to 1.8x on the worst dataset, and 2.7x under stronger compression) while collapsing accuracy. Surface text also often diverged from the unconstrained reference: roughly half of the correct generations from non-reasoning models no longer entailed their own unconstrained baseline.

Key takeaway

Machine Learning Engineers optimizing LLM inference costs should prioritize output compression, which cut realized per-item cost by 1.4-2.4x on most API models in the study. Avoid input compression, which raised net cost by ~1.15x on average and degraded accuracy. Always check that compressed outputs stay semantically faithful to the uncompressed baseline, especially for non-reasoning models, since a correct answer can still diverge in content.

Key insights

Linguistic compression affects LLM cost and accuracy differently depending on whether it is applied to the input channel or the output channel.

Principles

Output compression reduces realized inference cost on most of the models evaluated.
Input compression increases net cost and reduces accuracy.
Compressed LLM outputs can diverge semantically from uncompressed baselines.

Method

Cavewoman is a two-channel evaluation protocol scoring LLM generations on task accuracy, per-item cost, and reference-text agreement against unconstrained baselines across varied compression levels.
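
The three per-generation scores described above can be pictured as a small record plus an aggregation step. This is a minimal sketch only: the names `ItemScore` and `summarize` are hypothetical and are not taken from the danielle34/cavewoman repository, and it assumes accuracy and agreement have already been judged as yes/no per item.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class ItemScore:
    """Hypothetical per-generation record; field names are assumptions."""
    correct: bool                # task accuracy for this item
    cost_usd: float              # realized cost of this generation
    agrees_with_reference: bool  # agreement with the unconstrained generation


def summarize(scores: list[ItemScore]) -> dict[str, Optional[float]]:
    """Aggregate item scores into accuracy, mean cost, and agreement.

    Agreement is reported over correct items only, so that a correct
    answer whose text no longer matches the reference stays visible.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    correct_items = [s for s in scores if s.correct]
    agreement = (
        mean(1.0 if s.agrees_with_reference else 0.0 for s in correct_items)
        if correct_items
        else None  # undefined when no item is correct
    )
    return {
        "accuracy": mean(1.0 if s.correct else 0.0 for s in scores),
        "mean_cost_usd": mean(s.cost_usd for s in scores),
        "agreement_among_correct": agreement,
    }
```

Running `summarize` once per model, dataset, channel, and reduction level would give one row per configuration to compare against the unconstrained run.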

In practice

Prioritize output compression for cost savings.
Avoid input compression to maintain accuracy and control costs.
Verify semantic equivalence of compressed outputs.
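
Before adopting a compression strategy, the cost side of this advice can be checked with simple arithmetic on token counts. The sketch below assumes per-million-token pricing with separate input and output rates; the function names and any prices used with them are illustrative assumptions, not values or code from the study.

```python
def item_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Cost of one request, counting both channels.

    Prices are in currency units per one million tokens (an assumed
    pricing scheme; adjust to the provider actually used).
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    return total / 1_000_000


def cost_reduction(baseline_cost: float, compressed_cost: float) -> float:
    """Ratio of baseline to compressed cost.

    A value above 1 means the compressed run is cheaper; a value below 1
    means compression made the request more expensive overall.
    """
    if compressed_cost <= 0:
        raise ValueError("compressed_cost must be positive")
    return baseline_cost / compressed_cost
```

As a usage example with made-up numbers: at 1.0 per million input tokens and 4.0 per million output tokens, a request with 500 input and 400 output tokens costs 0.0021, and the same request with output shortened to 150 tokens costs 0.0011, a reduction of about 1.9x. Because the ratio is computed from both channels, shrinking the prompt does not guarantee a ratio above 1 if the response grows.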

Topics

Large Language Models
Linguistic Compression
Inference Cost
Model Evaluation
Output Quality
Semantic Divergence

Code references

danielle34/cavewoman

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.