Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors
Summary
A study assessed the effectiveness of open-source Large Language Models (LLMs) in inductively coding interviews with 21 Black men who survived community firearm violence. Researchers from the University of Maryland, College Park, including Jessica H. Zhu and Joseph B. Richardson Jr., developed a machine coding pipeline using Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct models. The goal was to automate the labor-intensive qualitative analysis process, which is crucial for understanding trauma and designing interventions, especially given underfunding in firearm violence research. The findings indicate that while some LLM configurations can identify important codes, overall relevance remains low and is highly sensitive to data processing techniques. Critically, LLM guardrails led to substantial "narrative erasure," with up to 65% of interview data being ignored due to content deemed too graphic or related to sensitive topics like sexual activity, race, or African American English (AAE). The study highlights both the potential for time savings and significant ethical limitations of applying AI in research involving marginalized communities.
Key takeaway
AI scientists and research scientists working on qualitative data analysis should exercise extreme caution when applying LLMs to sensitive, long-form interviews, especially those from marginalized communities. Automated pipelines risk significant narrative erasure and biased outputs because of LLM guardrails and insensitivity to dialects such as AAE. Prioritize human-in-the-loop validation and invest in difference-aware, low-resource AI tools that genuinely represent diverse experiences, rather than relying on current models for fully automated inductive coding.
Key insights
LLMs show promise for qualitative coding, but in trauma research they struggle with code relevance and with sensitive data, and their guardrails cause narrative erasure that raises ethical concerns.
Principles
- LLM performance in qualitative coding is highly sensitive to data processing.
- LLM guardrails can lead to significant narrative erasure, especially for sensitive topics.
- Larger LLMs do not guarantee substantial performance improvement over smaller models.
Method
A machine coding pipeline used open-source LLMs (Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct) for zero-shot inductive coding of interview transcripts, then clustered the LLM-generated codes with BERTopic to produce formal codes. Evaluation relied on two metrics, "Percent Captured" and "Percent Relevant".
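The summary names the two metrics without defining them. The sketch below assumes that "Percent Captured" is the share of human-assigned codes matched by at least one machine code, and that "Percent Relevant" is the share of machine codes that match some human code. Both definitions, the token-overlap matcher, the function names, and the example labels are illustrative assumptions, not the study's actual procedure.

```python
import re


def tokens(label: str) -> set[str]:
    """Lowercase word tokens of a code label."""
    return set(re.findall(r"[a-z]+", label.lower()))


def is_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Assumed matcher: Jaccard token overlap, a crude stand-in for expert judgment."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def percent_captured(human_codes, machine_codes, threshold=0.5):
    """Assumed definition: % of human codes matched by at least one machine code."""
    if not human_codes:
        return 0.0
    hits = sum(
        any(is_match(h, m, threshold) for m in machine_codes) for h in human_codes
    )
    return 100.0 * hits / len(human_codes)


def percent_relevant(human_codes, machine_codes, threshold=0.5):
    """Assumed definition: % of machine codes that match at least one human code."""
    if not machine_codes:
        return 0.0
    hits = sum(
        any(is_match(m, h, threshold) for h in human_codes) for m in machine_codes
    )
    return 100.0 * hits / len(machine_codes)


# Hypothetical labels, for illustration only.
human = ["distrust of police", "chronic pain", "family support"]
machine = ["police distrust", "family support", "weather", "sports"]

captured = percent_captured(human, machine)
relevant = percent_relevant(human, machine)
print(f"Percent Captured: {captured:.1f}")
print(f"Percent Relevant: {relevant:.1f}")
```

Under these assumptions the two metrics behave like recall and precision over code labels, so a pipeline can recover important codes while still producing many irrelevant ones. In practice an expert reviewer would replace the token-overlap matcher.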
In practice
- Use BERTopic for clustering LLM-generated codes to reduce volume.
- Validate LLM-generated codes with human experts to mitigate hallucinations.
- Be aware of LLM biases against AAE and traumatic content.
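One way to act on the points above is to audit how much of a transcript the model declined to code, and to route those segments to a human coder. The sketch below is a rough heuristic under stated assumptions: the refusal phrases, function names, and sample outputs are hypothetical, and the study may have measured narrative erasure differently.

```python
import re

# Assumed refusal phrasings; a real audit would tune these to the model in use.
REFUSAL_PATTERNS = [
    r"\bi can(?:not|'t)\b",
    r"\bi(?: am|'m) (?:unable|not able)\b",
    r"\bi won't\b",
]


def is_refusal(output: str) -> bool:
    """Treat empty outputs and outputs containing a refusal phrase as erased."""
    text = output.lower().replace("\u2019", "'").strip()
    if not text:
        return True
    return any(re.search(pattern, text) for pattern in REFUSAL_PATTERNS)


def percent_erased(outputs: list[str]) -> float:
    """Share of transcript segments whose model output is a refusal or empty."""
    if not outputs:
        return 0.0
    return 100.0 * sum(is_refusal(o) for o in outputs) / len(outputs)


def flag_for_review(outputs: list[str]) -> list[int]:
    """Indices of segments that should go to a human coder."""
    return [i for i, o in enumerate(outputs) if is_refusal(o)]


# Hypothetical model outputs, one per transcript segment.
sample_outputs = [
    "Code: fear of retaliation",
    "I can't help with that request.",
    "",
    "I cannot assist with this content.",
]

erased = percent_erased(sample_outputs)
flagged = flag_for_review(sample_outputs)
print(f"Percent erased: {erased:.1f}")
print(f"Segments for human review: {flagged}")
```

Keyword matching will misfire when a generated code quotes a participant saying something like "I can't", so the flagged list is a starting point for review rather than a final count.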
Topics
- Large Language Models
- Qualitative Coding Automation
- Firearm Violence Survivors
- Trauma Narrative Analysis
- LLM Bias and Guardrails
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.