CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression
Summary
Cavewoman is a two-channel evaluation protocol that assesses how Large Language Models behave when their input or output is linguistically compressed. It scores every generation on task accuracy, realized per-item cost, and agreement with the text of the model's own unconstrained output. Across eight models, five datasets, and five reduction levels, the study found that output compression cut realized cost on most API models (1.4-2.4x, up to 3x) and on all four open-weight models under public-tier pricing. Input compression was counterproductive: it raised net cost (~1.15x on average, up to 1.8x on the worst dataset, and 2.7x under stronger compression) while collapsing accuracy. Surface text also often diverged from the unconstrained reference: roughly half of the correct generations from non-reasoning models no longer entailed their own unconstrained baseline.
Key takeaway
Machine learning engineers optimizing LLM inference costs should prioritize output compression, which cut realized per-item cost by 1.4-2.4x on most API models in the study. Avoid input compression, which raised net cost by about 1.15x on average and degraded accuracy. Validate that compressed outputs stay semantically faithful to uncompressed baselines, especially for non-reasoning models, so that content divergence does not go unnoticed.
Key insights
Linguistic compression affects LLM cost and accuracy differently depending on whether it is applied to the input channel or the output channel.
Principles
- Output compression reduced realized inference cost on most API models and on all four open-weight models tested.
- Input compression raised net cost and reduced accuracy.
- Compressed outputs can diverge semantically from uncompressed baselines even when the answer is still correct.
Method
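Cavewoman is a two-channel evaluation protocol: it compresses either the input or the output, then scores each generation on task accuracy, realized per-item cost, and text agreement with the model's unconstrained baseline. The study applied it to eight models and five datasets at five reduction levels.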
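To make the per-item cost comparison concrete, here is a minimal Python sketch. It assumes cost is just token counts multiplied by per-token prices; the study may define realized cost differently, for example by counting reasoning tokens or the cost of the compression step. The names `Pricing`, `item_cost`, and `cost_reduction` are hypothetical, and the prices and token counts are illustrative placeholders, not figures from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_mtok: float
    output_per_mtok: float


def item_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request under the simple tokens-times-price assumption."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def cost_reduction(baseline_cost: float, compressed_cost: float) -> float:
    """Ratio above 1 means compression saved money; below 1 means it cost more."""
    if compressed_cost <= 0:
        raise ValueError("compressed_cost must be positive")
    return baseline_cost / compressed_cost


if __name__ == "__main__":
    pricing = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)  # illustrative only
    baseline = item_cost(input_tokens=500, output_tokens=400, pricing=pricing)
    compressed = item_cost(input_tokens=500, output_tokens=100, pricing=pricing)
    print(f"reduction: {cost_reduction(baseline, compressed):.2f}x")
```

A ratio computed this way is only meaningful alongside accuracy on the same items, since a cheaper generation that gets the task wrong is not a saving.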
In practice
- Prioritize output compression for cost savings.
- Avoid input compression to maintain accuracy and control costs.
- Verify semantic equivalence of compressed outputs.
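The verification step can be sketched as a small harness that measures how often correct compressed outputs drift from their unconstrained baselines. This is an assumption-laden illustration, not the study's metric: `Item`, `token_overlap`, and `divergence_rate` are hypothetical names, the 0.5 threshold is arbitrary, and the default agreement function is a crude word-overlap placeholder rather than an entailment model. In practice you would pass your own entailment or similarity scorer as `agree`.

```python
from typing import Callable, Iterable, NamedTuple


class Item(NamedTuple):
    baseline: str    # unconstrained model output
    compressed: str  # output produced under compression
    correct: bool    # whether the compressed output passed the task check


def token_overlap(baseline: str, compressed: str) -> float:
    """Jaccard overlap of lowercase word sets. Placeholder only, NOT entailment."""
    a = set(baseline.lower().split())
    b = set(compressed.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def divergence_rate(
    items: Iterable[Item],
    agree: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,
) -> float:
    """Fraction of *correct* items whose agreement score falls below threshold."""
    correct = [it for it in items if it.correct]
    if not correct:
        return 0.0
    diverged = sum(
        1 for it in correct if agree(it.baseline, it.compressed) < threshold
    )
    return diverged / len(correct)


if __name__ == "__main__":
    sample = [
        Item("the answer is 42", "the answer is 42", True),
        Item("the answer is 42 because six times seven", "42", True),
        Item("x", "y", False),  # incorrect items are excluded from the rate
    ]
    print(divergence_rate(sample))
```

Restricting the rate to correct items mirrors the concern raised above: an output can pass the accuracy check and still say something different from what the model would have said without compression.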
Topics
- Large Language Models
- Linguistic Compression
- Inference Cost
- Model Evaluation
- Output Quality
- Semantic Divergence
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.