When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Social Sciences & Behavioral Studies, Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, extended

Summary

The study investigates the predictive performance and behavioral reliability of Large Language Models (LLMs) in political event coding, a complex source–target relation classification task. Researchers tested whether expert codebooks, when operationalized into LLM-friendly forms with clearer definitions, examples, and rules, improve performance. Using four open-source instruction-tuned models (Gemma2:9B, Qwen2.5:7B, Mistral-7B-Instruct, Llama-3.1-8B-Instruct) on PLOVER-based PLV and AW datasets, they found that "Enriched" codebooks substantially improve fine-grained classification, with root-level macro-F1 scores highest under this condition. However, these predictive gains did not fully translate into behavioral reliability. Models often failed tests involving controlled changes to label names, codebook order, or label–definition mappings, indicating reliance on familiar semantics over explicit coding rules.

Key takeaway

If you are developing LLM-based text measurement systems, evaluate behavioral reliability alongside predictive accuracy. Enriched codebooks improve classification, but your models may still rely on semantic shortcuts rather than following the coding rules. Run controlled codebook perturbation tests, such as generic-label or swapped-mapping probes, to check that your LLM preserves the intended coding logic and to protect the validity of your downstream social-science data.

Key insights

LLM accuracy in codebook-guided tasks does not guarantee behavioral reliability or adherence to underlying coding logic.

Principles

Richer codebook operationalization improves fine-grained classification.
Predictive performance and behavioral reliability can decouple in LLMs.
LLMs may prioritize familiar semantics over explicit codebook rules.

Method

Convert expert codebooks into LLM-usable prompts with definitions, examples, boundary rules, and retrieved context. Evaluate predictive performance and behavioral reliability using controlled codebook perturbations.
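
The operationalization step can be sketched as follows. This is a minimal illustration rather than the authors' prompt format: the `CodebookEntry` fields, the `render_codebook` and `build_prompt` helpers, and the prompt wording are all assumptions, and the sample definitions are toy text written for this example, not quoted from PLOVER. Retrieved context is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One codebook category (hypothetical structure, not from the paper)."""
    label: str
    definition: str
    examples: list[str] = field(default_factory=list)
    boundary_notes: list[str] = field(default_factory=list)


def render_codebook(entries):
    """Render each entry as a labeled block: definition, examples, boundary rules."""
    blocks = []
    for entry in entries:
        lines = [f"Label: {entry.label}", f"Definition: {entry.definition}"]
        lines += [f"Example: {ex}" for ex in entry.examples]
        lines += [f"Boundary rule: {note}" for note in entry.boundary_notes]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_prompt(entries, text, source, target):
    """Assemble a source-target relation classification prompt (wording is assumed)."""
    return (
        "Classify the relation from Source to Target using only the codebook below.\n\n"
        f"{render_codebook(entries)}\n\n"
        f"Text: {text}\nSource: {source}\nTarget: {target}\n"
        "Answer with exactly one label."
    )


# Toy entries for illustration only.
toy_codebook = [
    CodebookEntry(
        label="PROTEST",
        definition="Civilian collective action expressing opposition.",
        examples=["Students marched against the ministry."],
        boundary_notes=["If the source only announces a future action, do not use this label."],
    ),
    CodebookEntry(
        label="THREATEN",
        definition="A statement announcing possible future coercive action.",
    ),
]

prompt = build_prompt(
    toy_codebook,
    text="Union members marched on the capital to oppose the new labor law.",
    source="Union members",
    target="Government",
)
```

Keeping the codebook as structured data rather than a fixed prompt string makes it straightforward to generate perturbed variants of the same codebook later.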

In practice

Operationalize codebooks with detailed examples and boundary notes.
Test LLMs with generic labels and swapped label-definition mappings.
Use the Rule-following Score (Rule-S) to measure behavioral reliability.
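
The probes listed above can be sketched as simple transformations of a label-to-definition mapping. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the codebook is reduced to a plain dict, and `rule_following_rate` is a simple agreement rate used as a stand-in, since this summary does not give the paper's Rule-S formula.

```python
import random


def generic_label_probe(codebook):
    """Replace label names with neutral IDs while keeping definitions unchanged."""
    mapping = {label: f"CATEGORY_{i + 1}" for i, label in enumerate(codebook)}
    perturbed = {mapping[label]: definition for label, definition in codebook.items()}
    return perturbed, mapping


def swapped_mapping_probe(codebook, label_a, label_b):
    """Swap the definitions of two labels.

    A model that follows the written rules should now answer label_b for
    items that were originally label_a, and vice versa.
    """
    perturbed = dict(codebook)
    perturbed[label_a], perturbed[label_b] = codebook[label_b], codebook[label_a]
    gold_map = {label: label for label in codebook}
    gold_map[label_a], gold_map[label_b] = label_b, label_a
    return perturbed, gold_map


def shuffled_order_probe(codebook, seed=0):
    """Reorder codebook entries; content and expected answers stay the same."""
    items = list(codebook.items())
    random.Random(seed).shuffle(items)
    return dict(items)


def rule_following_rate(predictions, original_gold, gold_map):
    """Share of predictions matching the label the perturbed codebook implies.

    Assumed stand-in metric, not the paper's Rule-S definition.
    """
    if not original_gold:
        raise ValueError("original_gold must not be empty")
    expected = [gold_map[g] for g in original_gold]
    hits = sum(p == e for p, e in zip(predictions, expected))
    return hits / len(expected)


# Toy codebook with placeholder definitions.
codebook = {
    "PROTEST": "definition of protest",
    "THREATEN": "definition of threaten",
    "AID": "definition of aid",
}

generic, id_map = generic_label_probe(codebook)
swapped, gold_map = swapped_mapping_probe(codebook, "PROTEST", "THREATEN")
shuffled = shuffled_order_probe(codebook, seed=1)

# Hypothetical model outputs under the swapped codebook.
rate = rule_following_rate(
    predictions=["THREATEN", "PROTEST", "AID", "PROTEST"],
    original_gold=["PROTEST", "THREATEN", "AID", "PROTEST"],
    gold_map=gold_map,
)
```

In this toy run the last prediction keeps the familiar label `PROTEST` even though the swapped codebook now assigns that definition to `THREATEN`, which is the kind of semantic shortcut the probes are meant to expose.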

Topics

Large Language Models
Political Event Coding
Codebook Operationalization
Behavioral Reliability
Text Classification
Social Science Research

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.