When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding
Summary
The study investigates the predictive performance and behavioral reliability of Large Language Models (LLMs) in political event coding, a complex source–target relation classification task. Researchers tested whether expert codebooks, when operationalized into LLM-friendly forms with clearer definitions, examples, and rules, improve performance. Using four open-source instruction-tuned models (Gemma2:9B, Qwen2.5:7B, Mistral-7B-Instruct, Llama-3.1-8B-Instruct) on PLOVER-based PLV and AW datasets, they found that "Enriched" codebooks substantially improve fine-grained classification, with root-level macro-F1 scores highest under this condition. However, these predictive gains did not fully translate into behavioral reliability. Models often failed tests involving controlled changes to label names, codebook order, or label–definition mappings, indicating reliance on familiar semantics over explicit coding rules.
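Macro-F1, the metric quoted above, is the unweighted mean of per-class F1, so a rare event category counts as much as a frequent one. The sketch below is a minimal pure-Python version for illustration; the label names and predictions are made-up placeholders, not data from the PLV or AW datasets.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels and predictions, for illustration only.
gold = ["PROTEST", "PROTEST", "PROTEST", "AID", "THREATEN"]
pred = ["PROTEST", "PROTEST", "PROTEST", "PROTEST", "THREATEN"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(gold, pred):.3f}")
```

In this toy case a single missed minority class pulls macro-F1 well below accuracy, which is why macro-F1 is the more informative number for fine-grained coding schemes.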
Key takeaway
If you are a research scientist developing LLM-based text measurement systems, evaluate behavioral reliability alongside predictive accuracy. Enriched codebooks improve classification, but your models may still rely on semantic shortcuts rather than strictly following coding rules. Implement controlled codebook perturbation tests, such as generic-label or swapped-mapping probes, to check that your LLM preserves the intended coding logic and to protect the validity of your downstream social-science data.
Key insights
LLM accuracy in codebook-guided tasks does not guarantee behavioral reliability or adherence to underlying coding logic.
Principles
- Richer codebook operationalization improves fine-grained classification.
- Predictive performance and behavioral reliability can decouple in LLMs.
- LLMs may prioritize familiar semantics over explicit codebook rules.
Method
Convert expert codebooks into LLM-usable prompts with definitions, examples, boundary rules, and retrieved context. Evaluate predictive performance and behavioral reliability using controlled codebook perturbations.
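The controlled perturbations described above can be sketched as simple transformations of a label-to-definition mapping. This is a minimal illustration under stated assumptions: the toy codebook, the function names (`generic_labels`, `shuffled_order`, `swapped_mapping`, `render_prompt`), and the prompt wording are all hypothetical and are not taken from the paper or its codebooks.

```python
import random

# Hypothetical toy codebook (label -> definition), for illustration only.
CODEBOOK = {
    "PROTEST": "Source publicly demonstrates against the target.",
    "AID": "Source provides material or financial help to the target.",
    "THREATEN": "Source warns the target of future harm.",
}


def generic_labels(codebook):
    """Swap meaningful label names for opaque ones (L1, L2, ...); definitions stay put."""
    mapping = {label: f"L{i}" for i, label in enumerate(codebook, start=1)}
    perturbed = {mapping[label]: text for label, text in codebook.items()}
    return perturbed, mapping


def shuffled_order(codebook, seed=0):
    """Present the same entries in a different, reproducible order."""
    items = list(codebook.items())
    random.Random(seed).shuffle(items)
    return dict(items)


def swapped_mapping(codebook):
    """Rotate definitions by one so every label carries another label's definition."""
    labels = list(codebook)
    definitions = list(codebook.values())
    return dict(zip(labels, definitions[1:] + definitions[:1]))


def render_prompt(codebook, event_text):
    """Build a plain-text classification prompt from a (possibly perturbed) codebook."""
    lines = ["Assign exactly one label to the event, using only the codebook below.", ""]
    lines += [f"- {label}: {definition}" for label, definition in codebook.items()]
    lines += ["", f"Event: {event_text}", "Label:"]
    return "\n".join(lines)


print(render_prompt(swapped_mapping(CODEBOOK), "Officials warned of retaliation."))
```

A model that follows the codebook should change its answer when the mapping is swapped and keep its answer when only the order or the label names change; a model that leans on familiar label semantics will tend to do the opposite.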
In practice
- Operationalize codebooks with detailed examples and boundary notes.
- Test LLMs with generic labels and swapped label-definition mappings.
- Use the Rule-following Score (Rule-S) to measure behavioral reliability.
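This summary names Rule-S but does not give its formula, so the sketch below uses a stand-in measure rather than the paper's definition. It assumes a swapped-mapping probe: for each item, the rule-following answer is the label that now carries the gold category's definition, and the shortcut answer is the original, familiar label name. The function name, codebooks, and predictions are hypothetical.

```python
def rule_following_rate(gold, pred_swapped, original, swapped):
    """Return (rule-following rate, shortcut rate) under a swapped codebook.

    Assumed stand-in measure, not the paper's Rule-S formula.
    Definitions must be unique so the swapped codebook can be inverted.
    """
    label_for_definition = {definition: label for label, definition in swapped.items()}
    follows = shortcut = 0
    for g, p in zip(gold, pred_swapped):
        expected = label_for_definition[original[g]]  # label now holding g's definition
        follows += p == expected
        shortcut += p == g  # model kept the familiar label name
    n = len(gold)
    return follows / n, shortcut / n


# Hypothetical codebooks and model outputs, for illustration only.
original = {"A": "definition of A", "B": "definition of B", "C": "definition of C"}
swapped = {"A": "definition of B", "B": "definition of C", "C": "definition of A"}
gold = ["A", "B", "C", "A"]
pred_swapped = ["C", "B", "C", "A"]

follow_rate, shortcut_rate = rule_following_rate(gold, pred_swapped, original, swapped)
print(f"rule-following={follow_rate:.2f} shortcut={shortcut_rate:.2f}")
```

Reporting both rates side by side with macro-F1 on the unperturbed codebook makes it visible when high predictive performance coexists with low rule following.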
Topics
- Large Language Models
- Political Event Coding
- Codebook Operationalization
- Behavioral Reliability
- Text Classification
- Social Science Research
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.