Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
Summary
A study empirically assessed prompt engineering strategies for improving large language model (LLM) performance in identifying psychological constructs in text. The researchers evaluated five strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) in both zero-shot and few-shot classification. They tested these strategies on three constructs (gratitude, negative core beliefs, positive meaning making) with two models (GPT-4, Llama-3.3). The findings indicate that construct definition, task framing, and examples are the most influential prompt features, while persona, chain-of-thought, and explanations offer limited additional value. The highest alignment with expert judgments came from a few-shot prompt that combined codebook-guided empirical selection with automatic prompt engineering, reaching F1 scores of up to 0.89 for gratitude and 0.76 for positive meaning making with GPT-4.
Key takeaway
Research scientists and ML engineers optimizing LLM text classification in specialized domains such as psychology should prioritize systematic prompt engineering. Empirically test diverse baseline prompts that include precise construct definitions and task framing, and add few-shot examples. In the study, combining human-written and automatically generated prompts gave the best alignment with expert judgments, while additive techniques such as personas or chain-of-thought contributed little.
Key insights
Empirical prompt engineering, focusing on construct definition and examples, significantly improves LLM text classification alignment with human expert judgment.
Principles
- Construct definition and task framing are paramount.
- Empirical prompt selection drives performance gains.
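As a rough illustration of these principles, the sketch below assembles a prompt from the three features the summary identifies as most influential: a construct definition, a task framing, and optional few-shot examples. The function name `build_prompt`, the prompt layout, and the sample definition are illustrative assumptions, not the wording or codebook used in the study.

```python
from typing import Sequence, Tuple


def build_prompt(
    construct: str,
    definition: str,
    text: str,
    examples: Sequence[Tuple[str, int]] = (),
) -> str:
    """Assemble a zero-shot (no examples) or few-shot classification prompt.

    The layout is a hypothetical template, not the study's prompt.
    """
    parts = [
        f"Construct: {construct}",
        f"Definition: {definition}",
        # Task framing: state the decision and the expected answer format.
        f"Task: Decide whether the text expresses {construct}. "
        "Answer 1 for yes or 0 for no.",
    ]
    # Few-shot examples: labeled texts shown before the item to classify.
    for example_text, label in examples:
        parts.append(f"Text: {example_text}\nAnswer: {label}")
    # The item to classify comes last, with the answer left open.
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical definition and examples, for illustration only.
prompt = build_prompt(
    construct="gratitude",
    definition="The writer expresses thankfulness toward someone or something.",
    text="I am so thankful for my sister's help.",
    examples=[
        ("Thanks to my coach, I made it.", 1),
        ("The bus was late again.", 0),
    ],
)
```

Leaving `examples` empty yields the zero-shot variant of the same prompt, which makes it straightforward to compare the two settings with otherwise identical wording.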
Method
The article proposes a systematic process: generate initial codebook-guided prompts, create diverse variants (human-written and automatically generated), experiment with few-shot examples, optionally test additive techniques, and run a final evaluation on a hold-out set.
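The selection and hold-out steps of this process can be sketched as follows. This is a minimal sketch under stated assumptions: `classify` is a hypothetical stand-in for an LLM call, the helper names are invented for illustration, and the toy data is made up. It only shows the shape of the procedure: score every candidate prompt by F1 on a development set, keep the best one, and report its F1 once on held-out data.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]            # (text, expert label: 1 or 0)
Classifier = Callable[[str, str], int]  # (prompt, text) -> predicted label


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(prompt: str, data: Sequence[Example], classify: Classifier) -> float:
    """F1 of one prompt against expert labels."""
    gold = [label for _, label in data]
    pred = [classify(prompt, text) for text, _ in data]
    return f1_score(gold, pred)


def select_prompt(
    candidates: List[str],
    dev: Sequence[Example],
    holdout: Sequence[Example],
    classify: Classifier,
) -> Tuple[str, float, float]:
    """Pick the candidate with the best dev F1, then score it once on hold-out."""
    dev_scores = {p: evaluate(p, dev, classify) for p in candidates}
    best = max(dev_scores, key=dev_scores.get)
    return best, dev_scores[best], evaluate(best, holdout, classify)


def toy_classify(prompt: str, text: str) -> int:
    # Stand-in for a model call: here the "prompt" is just a keyword to match.
    return int(prompt in text.lower())


dev_set = [
    ("I am thankful for you", 1),
    ("thank you so much", 1),
    ("the bus was late", 0),
    ("no thanks to the weather", 0),
]
holdout_set = [("thank you, friend", 1), ("what a dull day", 0)]

best, dev_f1, holdout_f1 = select_prompt(
    ["thank", "thankful"], dev_set, holdout_set, toy_classify
)
```

Keeping the hold-out set out of the selection loop matters because the dev-set F1 of the winning prompt is optimistically biased by the selection itself.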
In practice
- Generate diverse baseline prompt variants.
- Empirically select prompts based on F1 scores.
Topics
- Prompt Engineering
- LLM Text Classification
- Psychological Constructs
- Few-shot Learning
- GPT-4
- Llama-3.3
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.