Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new experimental framework systematically probes the sensitivity of Large Language Models (LLMs) to subtle semantic changes in pairwise document comparison, framing the task as a "needle-in-a-haystack" problem. The researchers embedded a single semantically altered sentence (the needle) within surrounding context (the hay) and varied four factors: perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length. Across five LLMs and tens of thousands of document pairs, the analysis found that most models penalize semantic differences more harshly when they occur earlier in a document. Topically unrelated context systematically lowers similarity scores and bipolarizes them, pushing scores toward two poles. Each LLM produces a distinct scoring distribution, a stable "fingerprint" that is invariant to perturbation type, yet all models share the same hierarchy in how leniently they treat the different perturbation types.

Key takeaway

AI engineers evaluating LLMs for document similarity tasks should be aware that model scores are highly sensitive to the position of semantic changes and to the topical coherence of the surrounding text. The choice of LLM also introduces its own scoring "fingerprint." Consider using the proposed multifactorial framework to audit and compare models systematically, especially when subtle semantic differences are critical.
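One way such an audit could be summarized is sketched below. It is a minimal illustration, not the paper's analysis: the record layout (`model`, `position`, `score` keys), the helper names, and the early-versus-late split are all assumptions made for this example, and the score scale is whatever the judge prompt returns.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(records, factor):
    """Average judge score per (model, factor value).

    `records` is a list of dicts with hypothetical keys "model", "score",
    and the chosen `factor` (for example "position" or "context").
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["model"], r[factor])].append(r["score"])
    return {key: mean(values) for key, values in groups.items()}


def early_late_gap(records, model, n_positions):
    """Mean score for needles in the later half minus the earlier half.

    A positive gap means early changes received lower similarity scores,
    i.e. they were penalized more harshly. The half-way split is an
    assumption of this sketch, not a definition from the paper.
    """
    mine = [r for r in records if r["model"] == model]
    half = n_positions / 2
    early = [r["score"] for r in mine if r["position"] < half]
    late = [r["score"] for r in mine if r["position"] >= half]
    return mean(late) - mean(early)
```

Running `mean_score_by` once per factor (position, context type, perturbation type) gives a compact table for comparing models side by side, and `early_late_gap` reduces the positional effect to a single number per model.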

Key insights

LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity.

Method

The framework embeds a semantically altered sentence ("needle") within context ("hay"), varying perturbation type, context type, needle position, and document length to test LLM sensitivity.
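The construction described above can be sketched in a few lines. This is a hedged illustration under stated assumptions: the function names, the sentence-index definition of needle position, and the toy `negate` rule are hypothetical and are not taken from the paper, whose actual perturbation and sampling procedures are not given in this summary.

```python
from itertools import product


def negate(sentence):
    """Toy negation perturbation (assumption: handles only ' is ' / ' are ')."""
    for verb in (" is ", " are "):
        if verb in sentence:
            return sentence.replace(verb, verb.rstrip() + " not ", 1)
    raise ValueError("no supported verb to negate")


def build_pair(hay, needle, perturbed_needle, position):
    """Return (reference, candidate) documents that differ only in the needle.

    `hay` is a list of context sentences; `position` is the sentence index
    at which the needle is inserted (0 = start of the document).
    """
    if not 0 <= position <= len(hay):
        raise ValueError("position out of range")
    reference = hay[:position] + [needle] + hay[position:]
    candidate = hay[:position] + [perturbed_needle] + hay[position:]
    return " ".join(reference), " ".join(candidate)


def experiment_grid(hay_by_context, needle, perturbations, lengths):
    """Yield one condition per combination of context type, perturbation
    type, document length (in hay sentences), and needle position."""
    conditions = product(hay_by_context.items(), perturbations.items(), lengths)
    for (context, hay), (name, perturb), length in conditions:
        trimmed = hay[:length]
        for position in range(len(trimmed) + 1):
            ref, cand = build_pair(trimmed, needle, perturb(needle), position)
            yield {
                "context": context,
                "perturbation": name,
                "length": length,
                "position": position,
                "reference": ref,
                "candidate": cand,
            }
```

Each yielded condition would then be sent to the judge model with a similarity prompt. Because the reference and candidate share the same hay, any difference in score across conditions can be attributed to the varied factors and not to the surrounding text.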

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.