Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, extended

Summary

Researchers at Pacific Northwest National Laboratory developed a scalable, multifactorial experimental framework for testing how sensitive Large Language Models (LLMs) are to subtle semantic changes in pairwise document comparison. Framing the task as a "needle-in-a-haystack" problem, they embedded a single semantically altered sentence (the needle) within surrounding context (the hay). The study crossed four factors in all combinations: perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length. Five LLMs (GPT-4o, GPT-5, Claude, Gemini, o4-mini) were tested on tens of thousands of document pairs. Two key findings concern position and context: most models penalize semantic differences more harshly when they appear earlier in a document, and topically unrelated context systematically lowers similarity scores and induces bipolarized scores. Each LLM also exhibited a distinct scoring-distribution "fingerprint" that was invariant to perturbation type, yet all models shared a universal hierarchy in how leniently they treated the different perturbation types.
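To make the needle-and-hay construction concrete, here is a minimal sketch of how one document pair could be assembled. It is an illustration under stated assumptions, not the authors' code: `DocumentPair`, `build_pair`, and `negate` are hypothetical names, the negation rule is a deliberately naive toy, and mapping a fractional position to a sentence index is an assumed convention.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentPair:
    """One comparison item: two documents that differ in a single sentence."""
    original: str
    perturbed: str
    needle_index: int  # sentence index at which the needle was inserted


def negate(sentence: str) -> str:
    """Toy negation perturbation (assumed rule): turn the first ' is ' into ' is not '."""
    if " is " not in sentence:
        raise ValueError("toy negation rule needs a sentence containing ' is '")
    return sentence.replace(" is ", " is not ", 1)


def build_pair(hay: list[str], needle: str, perturbed_needle: str,
               position: float) -> DocumentPair:
    """Insert the needle and its perturbed variant at the same place in the hay.

    `position` is a fraction in [0, 1]: 0.0 puts the needle first, 1.0 last.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    index = round(position * len(hay))
    original = hay[:index] + [needle] + hay[index:]
    perturbed = hay[:index] + [perturbed_needle] + hay[index:]
    return DocumentPair(" ".join(original), " ".join(perturbed), index)


hay_sentences = ["A.", "B.", "C.", "D."]
needle_sentence = "The valve is open."
pair = build_pair(hay_sentences, needle_sentence, negate(needle_sentence), 0.5)
print(pair.original)
print(pair.perturbed)
```

Each pair would then be passed to the judge model with a similarity-scoring prompt. To mimic the topically unrelated context condition, the same function could be called with hay sentences drawn from a different source document.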

Key takeaway

AI engineers and research scientists evaluating LLM-as-a-judge systems should implement fine-grained sensitivity testing beyond standard benchmarks. Evaluations should account for within-document positional bias and for context coherence, since both significantly influence LLM similarity scoring. Because each LLM has its own scoring "fingerprint", even though all models treat perturbation types with a consistent hierarchy of leniency, reliable cross-model comparison requires model-specific calibration.
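One way to act on this is sketched below: group judge scores by needle position to look for positional bias, and standardize scores within each model before comparing across models. This is an assumption-laden illustration, not the procedure from the paper. The `(model, position, score)` record layout and both function names are hypothetical, and per-model z-scoring is just one simple calibration choice among several.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical record layout: (model name, needle position in [0, 1], similarity score)
Record = tuple[str, float, float]


def mean_score_by_position(records: list[Record]) -> dict[tuple[str, float], float]:
    """Average score per (model, position); a trend across positions suggests positional bias."""
    grouped: dict[tuple[str, float], list[float]] = defaultdict(list)
    for model, position, score in records:
        grouped[(model, position)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


def standardize_per_model(records: list[Record]) -> list[Record]:
    """Z-score each score against its own model's distribution (assumed calibration method)."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for model, _, score in records:
        by_model[model].append(score)
    stats = {m: (mean(s), pstdev(s)) for m, s in by_model.items()}

    calibrated = []
    for model, position, score in records:
        mu, sigma = stats[model]
        # A model that always returns the same score has no spread to scale by.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        calibrated.append((model, position, z))
    return calibrated
```

After standardization, a score reads as "how unusual is this rating for this particular judge", which separates a model's overall scoring habit from its reaction to a specific perturbation.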

Key insights

LLM semantic similarity scores are highly sensitive to document structure, context coherence, and model identity.

Principles

Method

A scalable, factorial experimental design systematically varies semantic perturbation type, context relevance, needle position, and document length to probe LLM-as-a-judge sensitivity in pairwise document similarity.
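The factorial design described above can be enumerated as the Cartesian product of the factor levels. In this minimal sketch the perturbation and context levels come from the summary, while the specific position fractions and document lengths are placeholder assumptions (the summary does not list them), and `factorial_conditions` is a hypothetical helper name.

```python
from itertools import product

# Levels named in the summary
PERTURBATIONS = ("negation", "conjunction_swap", "named_entity_replacement")
CONTEXTS = ("original", "unrelated")

# Assumed placeholder levels; the actual values used in the study are not given here
POSITIONS = (0.0, 0.25, 0.5, 0.75, 1.0)  # needle position as a fraction of the document
LENGTHS = (5, 10, 20)                    # document length in sentences


def factorial_conditions(perturbations=PERTURBATIONS, contexts=CONTEXTS,
                         positions=POSITIONS, lengths=LENGTHS):
    """Return every combination of factor levels as a list of dicts."""
    return [
        {"perturbation": p, "context": c, "position": pos, "length": n}
        for p, c, pos, n in product(perturbations, contexts, positions, lengths)
    ]


conditions = factorial_conditions()
print(len(conditions))  # 3 * 2 * 5 * 3 = 90 conditions with these placeholder levels
```

Crossing every level with every other is what lets the effect of one factor (for example, needle position) be separated from the others, at the cost of a condition count that grows multiplicatively with each added level.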

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.