Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Summary
A new experimental framework systematically probes Large Language Model (LLM) sensitivity to subtle semantic changes in pairwise document comparison, framing the task as a "needle-in-a-haystack" problem. Researchers embedded a single semantically altered sentence (the needle) within surrounding context (the hay), varying perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length. In tests of five LLMs on tens of thousands of document pairs, most models penalized semantic differences more harshly when they occurred earlier in a document. Topically unrelated context systematically lowered similarity scores and induced bipolarized scores. Each LLM produced a distinct scoring distribution, a stable "fingerprint" invariant to perturbation type, yet all models shared a universal hierarchy in how leniently they treated different perturbation types.
Key takeaway
AI engineers evaluating LLMs for document similarity tasks should be aware that model scores are highly sensitive to the position of semantic changes and to the topical coherence of the surrounding text. The choice of LLM also introduces a distinct scoring "fingerprint." Consider using the proposed multifactorial framework to systematically audit and compare models, especially when subtle semantic differences are critical.
Key insights
LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity.
Principles
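- LLMs exhibit within-document positional bias: most models penalize a semantic difference more harshly when it appears earlier in the document.
- Context coherence impacts similarity scoring: topically unrelated context systematically lowers scores.
- Each LLM has a distinct scoring "fingerprint" that is invariant to perturbation type.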
Method
The framework embeds a semantically altered sentence ("needle") within context ("hay"), varying perturbation type, context type, needle position, and document length to test LLM sensitivity.
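The construction described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `DocumentPair`, `build_pair`, and `build_grid` are hypothetical, documents are assumed to be lists of sentences, and the altered needle is assumed to be supplied by the caller rather than generated.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DocumentPair:
    """One test case: two documents that differ only in the needle sentence."""
    original: str
    perturbed: str
    perturbation: str  # e.g. "negation", "conjunction_swap", "entity_replacement"
    context: str       # e.g. "original" or "unrelated"
    position: int      # 0-based sentence index of the needle
    length: int        # total number of sentences, needle included


def build_pair(hay, needle, altered_needle, position, perturbation, context):
    """Place the needle (and its altered version) at `position` among the hay sentences."""
    if not 0 <= position <= len(hay):
        raise ValueError("position must lie between 0 and len(hay)")
    original = hay[:position] + [needle] + hay[position:]
    perturbed = hay[:position] + [altered_needle] + hay[position:]
    return DocumentPair(
        original=" ".join(original),
        perturbed=" ".join(perturbed),
        perturbation=perturbation,
        context=context,
        position=position,
        length=len(original),
    )


def build_grid(hay_by_context, needle, altered_by_perturbation, lengths):
    """Cross every context type, perturbation type, document length, and needle position."""
    pairs = []
    for (context, hay), (perturbation, altered), n in product(
        hay_by_context.items(), altered_by_perturbation.items(), lengths
    ):
        if n < 1:
            raise ValueError("document length must be at least 1 sentence")
        trimmed = hay[: n - 1]  # leave room for the needle
        for position in range(len(trimmed) + 1):
            pairs.append(
                build_pair(trimmed, needle, altered, position, perturbation, context)
            )
    return pairs
```

Each resulting pair would then be sent to the judge model for a similarity score. Keeping the factor labels on the pair makes it straightforward to group scores by perturbation type, context type, position, or length afterwards.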
In practice
- Audit LLM scoring behavior with varied context.
- Compare different LLMs using this framework.
- Evaluate positional bias in similarity tasks.
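One way to check for positional bias is to group judge scores by where the needle sat in the document and compare early against late positions. The sketch below assumes scores have already been collected as `(position, length, score)` tuples; the function names, the three-bin split, and the score scale are illustrative assumptions, and no particular judge API is implied.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_position(records, n_bins=3):
    """Average similarity score per relative-position bin.

    records: iterable of (position, length, score), where position is the
    0-based needle index and length is the number of sentences.
    """
    buckets = defaultdict(list)
    for position, length, score in records:
        relative = position / max(length - 1, 1)  # 0.0 = first sentence, 1.0 = last
        bin_index = min(int(relative * n_bins), n_bins - 1)
        buckets[bin_index].append(score)
    return {b: mean(scores) for b, scores in sorted(buckets.items())}


def early_late_gap(records, n_bins=3):
    """Mean score in the last bin minus mean score in the first bin."""
    by_bin = mean_score_by_position(records, n_bins)
    if 0 not in by_bin or (n_bins - 1) not in by_bin:
        raise ValueError("need at least one record in both the first and last bin")
    return by_bin[n_bins - 1] - by_bin[0]
```

A clearly positive gap would mean the model rates pairs as more similar when the altered sentence comes late, which is the direction of the bias reported for most models in the summary. Running the same check separately per model and per context type keeps the factors from being confounded.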
Topics
- LLM-as-a-Judge
- Semantic Similarity Scoring
- Sensitivity Testing
- Positional Bias
- Context Coherence
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.