Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

2026-04-10 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Health & Medical Research, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study developed a large language model (LLM)-based tool to identify HIV-related stigma in clinical narratives from people living with HIV (PLWH) at University of Florida Health between 2012 and 2022. Researchers identified candidate sentences using expert-curated keywords and clinical word embeddings, then manually annotated 1,332 sentences across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. The study compared encoder-based models such as GatorTron-large and BERT with generative LLMs including GPT-OSS-20B, LLaMA-8B, and MedGemma-27B. GatorTron-large achieved the highest overall performance, with a Micro-F1 score of 0.62. Few-shot prompting significantly improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B reaching Micro-F1 scores of 0.57 and 0.59, respectively. Negative Self-Image was the most predictable subscale, while Personalized Stigma proved the most challenging.

Key takeaway

For NLP engineers building tools for sensitive clinical data, the results suggest that a fine-tuned encoder model such as GatorTron-large (Micro-F1 0.62) outperforms generative LLMs on this stigma detection task, including 5-shot GPT-OSS-20B (0.57) and LLaMA-8B (0.59). If you opt for a generative model, use few-shot rather than zero-shot prompting, and expect accuracy to vary by stigma category: Negative Self-Image was the easiest subscale to detect and Personalized Stigma the hardest.

Key insights

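LLMs can detect HIV-related stigma in clinical notes with moderate accuracy (best Micro-F1 0.62). The encoder model GatorTron-large scored highest overall, ahead of generative models in zero-shot settings and of 5-shot GPT-OSS-20B and LLaMA-8B.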

Principles

Few-shot prompting enhances generative LLM performance.
Stigma subscales vary in detection difficulty.
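
To make the difference between per-subscale scores and the pooled Micro-F1 concrete, here is a minimal sketch in plain Python. It assumes a multi-label setup in which each sentence carries zero or more subscale labels; the summary does not say how the study framed the task. The names `f1` and `score_subscales` and the toy labels are illustrative and not taken from the paper.

```python
from typing import Dict, List, Set

# Subscale names are from the summary; the multi-label framing is an assumption.
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def score_subscales(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Per-subscale F1 plus Micro-F1 pooled over all subscales."""
    counts = {s: [0, 0, 0] for s in SUBSCALES}  # [tp, fp, fn] per subscale
    for g, p in zip(gold, pred):
        for s in SUBSCALES:
            if s in g and s in p:
                counts[s][0] += 1
            elif s in p:
                counts[s][1] += 1
            elif s in g:
                counts[s][2] += 1
    scores = {s: f1(*c) for s, c in counts.items()}
    # Micro-averaging sums the counts first, then computes a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    scores["micro"] = f1(tp, fp, fn)
    return scores


# Toy labels for four hypothetical sentences (not study data).
gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}, set()]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Disclosure Concerns"}, set()]
scores = score_subscales(gold, pred)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Because Micro-F1 pools counts across subscales, frequent categories dominate it, so a hard subscale can score poorly without moving the pooled figure much.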

Method

Candidate sentences were identified via expert-curated keywords and clinical word embeddings, then manually annotated (1,332 sentences) across four stigma subscales. Encoder models were compared with generative LLMs, which were evaluated using zero-shot and few-shot prompting.
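
One plausible reading of the candidate-selection step is sketched below: seed keywords are expanded with their nearest neighbors in an embedding space, and sentences containing any expanded term are kept for annotation. This is an assumption about how keywords and embeddings were combined, not the authors' documented pipeline. The two-dimensional vectors, the 0.9 threshold, the function names, and the example sentences are all invented for illustration.

```python
import math
import re
from typing import Dict, List, Set


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_keywords(seeds: Set[str], vectors: Dict[str, List[float]],
                    threshold: float = 0.9) -> Set[str]:
    """Add every vocabulary word whose similarity to a seed meets the threshold."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue
        for word, vec in vectors.items():
            if cosine(vectors[seed], vec) >= threshold:
                expanded.add(word)
    return expanded


def candidate_sentences(sentences: List[str], keywords: Set[str]) -> List[str]:
    """Keep sentences that contain at least one keyword as a whole token."""
    hits = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & keywords:
            hits.append(sentence)
    return hits


# Hypothetical toy embeddings; real clinical embeddings have hundreds of dimensions.
vectors = {
    "stigma": [1.0, 0.0],
    "shame": [0.9, 0.1],
    "judged": [0.8, 0.6],
    "glucose": [0.0, 1.0],
}
# Invented placeholder sentences, not real clinical text.
sentences = [
    "Patient reports shame about the diagnosis.",
    "Glucose within normal limits.",
    "Discussed stigma at work.",
    "Feels judged by coworkers.",
]
expanded = expand_keywords({"stigma"}, vectors)
hits = candidate_sentences(sentences, expanded)
print(sorted(expanded))
print(hits)
```

In this toy setup "judged" falls below the threshold and its sentence is missed, which illustrates why keyword retrieval is followed by manual annotation and why threshold choice affects recall.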

In practice

Prefer a fine-tuned encoder such as GatorTron-large for HIV stigma detection.
Apply few-shot prompting (for example, 5-shot) when using generative LLMs.
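
As a rough illustration of few-shot prompting for this task, the sketch below assembles a k-shot classification prompt as plain text. The prompt wording, the `build_prompt` helper, the single-label answer format, and the placeholder examples are assumptions made for illustration; the summary does not reproduce the prompts used in the study. No model API is called here, and the resulting string would be passed to whichever generative model you use.

```python
from typing import List, Tuple

# Subscale names are from the summary; everything else is illustrative.
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]


def build_prompt(examples: List[Tuple[str, str]], sentence: str, k: int = 5) -> str:
    """Build a k-shot prompt; k=0 yields a zero-shot prompt."""
    lines = ["Assign the clinical note text to one HIV-related stigma subscale, or answer None."]
    lines.append("Subscales:")
    lines += [f"- {name}" for name in SUBSCALES]
    lines.append("")
    # Use at most k labeled demonstrations before the query.
    for text, label in examples[:k]:
        lines.append(f"Sentence: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Label:")
    return "\n".join(lines)


# Placeholder demonstrations; real ones would come from the annotated training split.
examples = [
    ("<annotated example 1>", "Disclosure Concerns"),
    ("<annotated example 2>", "Negative Self-Image"),
    ("<annotated example 3>", "None"),
]
zero_shot = build_prompt(examples, "<sentence to classify>", k=0)
two_shot = build_prompt(examples, "<sentence to classify>", k=2)
print(two_shot)
```

Demonstrations should be drawn only from training data so that evaluation sentences never appear in the prompt.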

Topics

HIV Stigma Detection
Clinical Note Analysis
Large Language Models
Natural Language Processing
GatorTron-large

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.