Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, AI in Psychology · Depth: Expert, extended

Summary

A study empirically assessed prompt engineering strategies for improving large language model (LLM) performance in identifying psychological constructs in text. The researchers evaluated five strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) in both zero-shot and few-shot classification, across three constructs (gratitude, negative core beliefs, positive meaning making) and two models (GPT-4, Llama-3.3). The findings indicate that construct definition, task framing, and examples are the most influential prompt features, while persona, chain-of-thought, and explanations offer limited additional value. The highest alignment with expert judgments came from a few-shot prompt combining codebook-guided empirical selection with automatic prompt engineering, which reached F1 scores of up to 0.89 for gratitude and 0.76 for positive meaning making with GPT-4.

Key takeaway

Research scientists and ML engineers optimizing LLM text classification in specialized domains like psychology should prioritize systematic prompt engineering. Empirically test diverse baseline prompts that vary the construct definition and task framing, then add few-shot examples. In the study, combining human-written and automatically generated prompts produced the highest alignment with expert judgments, while additive techniques such as personas or chain-of-thought contributed little.

Key insights

Empirically selecting prompts, with attention to the construct definition, task framing, and examples, improves the alignment of LLM text classification with human expert judgment.

Principles

Construct definition and task framing are paramount.
Empirical prompt selection drives performance gains.

Method

The article proposes a systematic process: generate initial codebook-guided prompts, create diverse variants (human-written and automatically generated), experiment with few-shot examples, optionally test additive techniques, and conduct a final evaluation on a hold-out set.
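The first and last steps of this process can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `{text}` placeholder, the prompt layout, and the 20% hold-out fraction are all assumptions made for the example.

```python
import itertools
import random

PLACEHOLDER = "{text}"  # hypothetical slot for the text to be coded


def build_prompt(framing, definition, examples=()):
    """Join task framing, construct definition, and optional few-shot examples."""
    parts = [framing, "Definition: " + definition]
    for example_text, label in examples:
        parts.append("Text: " + example_text + "\nLabel: " + label)
    parts.append("Text: " + PLACEHOLDER + "\nLabel:")
    return "\n\n".join(parts)


def make_variants(framings, definitions, example_sets):
    """Cross every framing with every definition and every example set."""
    return [
        build_prompt(f, d, e)
        for f, d, e in itertools.product(framings, definitions, example_sets)
    ]


def split_dev_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and reserve a hold-out portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # Development items come first, hold-out items second.
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Passing an empty tuple as one of the example sets yields the zero-shot variant of each prompt, so zero-shot and few-shot versions can be compared on the same development data. The hold-out portion should stay untouched until a single prompt has been chosen.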

In practice

Generate diverse baseline prompt variants.
Empirically select prompts based on F1 scores.
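As a rough sketch of what F1-based selection could look like, the code below scores each candidate prompt against human codes and keeps the best one. It assumes binary coding with 1 as the positive label, and `classify` is a hypothetical callable standing in for whatever LLM call is used; neither detail comes from the article.

```python
def f1_score(human, machine, positive=1):
    """F1 for the positive class, treating human codes as ground truth."""
    pairs = list(zip(human, machine))
    tp = sum(1 for h, m in pairs if h == positive and m == positive)
    fp = sum(1 for h, m in pairs if h != positive and m == positive)
    fn = sum(1 for h, m in pairs if h == positive and m != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_prompt(prompts, texts, human_codes, classify):
    """Score every prompt on the same texts and return the top scorer.

    `classify(prompt, text)` is a hypothetical stand-in for an LLM call
    that returns a machine code for one text.
    """
    scores = {}
    for prompt in prompts:
        machine_codes = [classify(prompt, text) for text in texts]
        scores[prompt] = f1_score(human_codes, machine_codes)
    best = max(scores, key=scores.get)
    return best, scores
```

Selection of this kind belongs on development data only; the F1 reported for the chosen prompt should come from a separate hold-out set so that the estimate is not inflated by the selection itself.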

Topics

Prompt Engineering
LLM Text Classification
Psychological Constructs
Few-shot Learning
GPT-4
Llama-3.3

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.