Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

2025-12-20 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Health & Medical Research, Social Sciences & Behavioral Studies, Research Methodology & Innovation · Depth: Advanced, extended

Summary

A study assessed how well open-source Large Language Models (LLMs) can inductively code interviews with 21 Black men who survived community firearm violence. Researchers from the University of Maryland, College Park, including Jessica H. Zhu and Joseph B. Richardson Jr., developed a machine coding pipeline using the Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct models. The goal was to automate labor-intensive qualitative analysis, which is crucial for understanding trauma and designing interventions, especially given how underfunded firearm violence research is. The findings indicate that while some LLM configurations can identify important codes, overall relevance remains low and is highly sensitive to data processing techniques. Critically, LLM guardrails led to substantial "narrative erasure": up to 65% of interview data was ignored because its content was deemed too graphic or involved sensitive topics such as sexual activity, race, or African American English (AAE). The study highlights both the potential time savings and the significant ethical limitations of applying AI in research involving marginalized communities.

Key takeaway

AI scientists and research scientists working on qualitative data analysis should exercise extreme caution when applying LLMs to sensitive, long-form interviews, especially those from marginalized communities. Automated pipelines risk significant narrative erasure and biased outputs because of LLM guardrails and insensitivity to dialects such as AAE. Prioritize human-in-the-loop validation and invest in developing difference-aware, low-resourced AI tools that genuinely represent diverse experiences, rather than relying on current models for fully automated inductive coding.

Key insights

LLMs show promise for qualitative coding in trauma research, but they struggle with low relevance, sensitivity to data processing, and ethically troubling narrative erasure.

Principles

LLM performance in qualitative coding is highly sensitive to data processing.
LLM guardrails can lead to significant narrative erasure, especially for sensitive topics.
Larger LLMs do not guarantee substantial performance improvement over smaller models.

Method

A machine coding pipeline used open-source LLMs (Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct) for zero-shot inductive coding of interview transcripts, followed by BERTopic clustering to consolidate the LLM-generated codes into formal codes. Evaluation relied on two metrics, "Percent Captured" and "Percent Relevant".
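This summary does not define the two metrics, so the following is only a minimal sketch under an assumed reading: "Percent Captured" as the share of human codes matched by at least one machine code, and "Percent Relevant" as the share of machine codes that match at least one human code. The token-overlap matcher, the 0.5 threshold, the function names, and the toy code labels are all hypothetical and are not taken from the paper or its repository.

```python
import re


def tokens(code):
    """Lowercased word set of a code label."""
    return set(re.findall(r"[a-z]+", code.lower()))


def jaccard(a, b):
    """Token-overlap similarity between two code labels (assumed matcher)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def percent_captured(human_codes, machine_codes, threshold=0.5):
    """Assumed definition: % of human codes matched by any machine code."""
    if not human_codes:
        return 0.0
    hits = sum(
        any(jaccard(h, m) >= threshold for m in machine_codes)
        for h in human_codes
    )
    return 100.0 * hits / len(human_codes)


def percent_relevant(human_codes, machine_codes, threshold=0.5):
    """Assumed definition: % of machine codes matching any human code."""
    if not machine_codes:
        return 0.0
    hits = sum(
        any(jaccard(m, h) >= threshold for h in human_codes)
        for m in machine_codes
    )
    return 100.0 * hits / len(machine_codes)


# Toy labels for illustration only; not data from the study.
human = ["fear of retaliation", "distrust of police", "chronic pain"]
machine = ["fear of retaliation", "police distrust", "weather", "sleep problems"]

captured = percent_captured(human, machine)
relevant = percent_relevant(human, machine)
print(f"Percent Captured: {captured:.1f}")
print(f"Percent Relevant: {relevant:.1f}")
```

In a real evaluation the string matcher would likely be replaced by expert judgment or an embedding-based similarity, since paraphrased codes rarely share enough tokens to match.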

In practice

Use BERTopic for clustering LLM-generated codes to reduce volume.
Validate LLM-generated codes with human experts to mitigate hallucinations.
Be aware of LLM biases against AAE and traumatic content.
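One way to act on these points is to audit how much of a transcript the model silently skipped and route those segments to a human coder. The sketch below is illustrative only: the refusal markers, the function names, and the word-weighted percentage are assumptions, not the study's procedure, and the segments are placeholders rather than interview text.

```python
# Hypothetical prefixes that suggest a guardrail refusal; a real audit
# would need markers tuned to the specific model's refusal style.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def is_ignored(output):
    """True when the model returned nothing or an apparent refusal."""
    text = output.strip().lower()
    return not text or text.startswith(REFUSAL_MARKERS)


def audit_erasure(segments, outputs):
    """Return (percent of words ignored, indices needing human review)."""
    if len(segments) != len(outputs):
        raise ValueError("segments and outputs must be aligned one-to-one")
    flagged = [i for i, out in enumerate(outputs) if is_ignored(out)]
    total_words = sum(len(s.split()) for s in segments)
    if total_words == 0:
        return 0.0, flagged
    ignored_words = sum(len(segments[i].split()) for i in flagged)
    return 100.0 * ignored_words / total_words, flagged


# Placeholder segments and outputs for illustration only.
segments = ["a b c d", "e f", "g h i j"]
outputs = ["coping with pain", "I cannot help with that.", ""]

pct_ignored, review_queue = audit_erasure(segments, outputs)
print(f"{pct_ignored:.1f}% of words ignored; review segments {review_queue}")
```

Weighting by words rather than by segment count keeps long skipped passages from being hidden behind many short coded ones; either choice is a judgment call.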

Topics

Large Language Models
Qualitative Coding Automation
Firearm Violence Survivors
Trauma Narrative Analysis
LLM Bias and Guardrails

Code references

jhzsquared/AIvsHumanCoding

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.