Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

VISHC at PsyDefDetect introduces a framework for classifying Psychological Defense Mechanisms (PDMs) from text, targeting the data scarcity and class imbalance typical of low-resource clinical settings. The approach combines two parts. The first is context-aware synthetic data augmentation, which uses Llama-3-8B-Instruct with theory-driven prompts based on the Defense Mechanisms Rating Scales (DMRS). The second is a hybrid classification model that integrates contextual language representations from MentalRoBERTa with 150 structured clinical features derived from DMRS indicators and basic linguistic heuristics. On the PsyDefConv blind-test set, the method achieved an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), outperforming the DMRS Co-Pilot baseline. A key finding is that the quality of the definitions used in prompting directly influences generation fidelity and downstream performance. A "Label 7 sink effect", a bias toward predicting the majority class, remains a limitation.
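The gap between accuracy and macro-F1 reported above is what a majority-class bias typically looks like in the metrics. The sketch below illustrates this with made-up toy labels (not data from the paper): a classifier that always predicts label 7 still scores well on accuracy, while macro-F1, which averages per-class F1 with equal weight, collapses. The function names and the label values are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels: class 7 dominates and the model predicts 7 everywhere.
y_true = [7] * 6 + [1, 2, 3, 4]
y_pred = [7] * 10

acc = accuracy(y_true, y_pred)   # 0.6
mf1 = macro_f1(y_true, y_pred)   # 0.15
print(f"accuracy={acc:.2f} macro_f1={mf1:.2f}")
```

In this toy case every minority class contributes an F1 of zero, so macro-F1 is one fifth of the majority-class F1 even though six of ten predictions are correct.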

Key takeaway

For NLP engineers building clinical text classifiers, choices in data augmentation and feature engineering directly affect model reliability. Prioritize high-quality, theory-driven prompts for synthetic data generation, and combine domain-specific clinical features with contextual language models. Even with augmentation, severe class imbalance and semantic overlap between classes may call for techniques such as loss re-weighting or contrastive learning to prevent majority-class bias.
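One common form of loss re-weighting is a class-weighted focal loss, which scales each example's cross-entropy by a per-class weight and by how poorly the model currently predicts it. The sketch below is a minimal, framework-free illustration and not the authors' training objective; the inverse-frequency weighting scheme, the gamma value, and the class counts are all assumptions chosen for the example.

```python
import math


def inverse_frequency_weights(class_counts):
    """Per-class weight w_c = N / (K * n_c); rarer classes get larger weights."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: total / (k * n) for c, n in class_counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, target, class_weights=None, gamma=2.0):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    alpha = 1.0 if class_weights is None else class_weights[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical class counts: label 7 is the majority class.
weights = inverse_frequency_weights({0: 50, 1: 50, 7: 600})

easy = focal_loss([5.0, 0.0, 0.0], target=0)  # confident and correct
hard = focal_loss([0.0, 5.0, 0.0], target=0)  # confident and wrong
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With gamma set to 0 and no class weights, the function reduces to ordinary cross-entropy, which makes it straightforward to ablate the two re-weighting terms separately.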

Key insights

Context-aware synthetic data augmentation with theory-driven prompts significantly improves psychological defense mechanism classification in low-resource settings.

Principles

Definition quality in prompting governs generation fidelity.
Hybrid models fusing clinical features and language representations enhance classification.
Naive data balancing is insufficient for highly overlapping classes.

Method

The method has four components: LLM-based stressor identification, context-aware synthetic data augmentation using DMRS definitions, dual-domain feature extraction (linguistic heuristics plus DMRS profiles), and a hybrid fusion architecture for classification.
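The fusion step can be pictured as joining a text embedding with the structured feature vector before a classification head. The summary does not say how the two are combined, so the sketch below assumes plain concatenation followed by a single linear layer. The 768-dimensional embedding size, the class count, the untrained random weights, and every name in the code are illustrative assumptions; only the 150-feature figure comes from the summary.

```python
import math
import random

TEXT_DIM = 768      # assumed hidden size of a RoBERTa-base style encoder
CLINICAL_DIM = 150  # structured clinical features, as stated in the summary
NUM_CLASSES = 9     # placeholder; set to the task's actual label count


def fuse(text_embedding, clinical_features):
    """Concatenate the two views into one input vector (assumed fusion)."""
    if len(text_embedding) != TEXT_DIM:
        raise ValueError("unexpected text embedding size")
    if len(clinical_features) != CLINICAL_DIM:
        raise ValueError("unexpected clinical feature size")
    return list(text_embedding) + list(clinical_features)


class LinearHead:
    """Untrained linear classifier over the fused vector, for illustration."""

    def __init__(self, in_dim, num_classes, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def logits(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]


# Stand-in inputs: a real pipeline would take these from the encoder
# and from the feature extractor instead of a random generator.
rng = random.Random(42)
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
clinical_features = [rng.random() for _ in range(CLINICAL_DIM)]

fused = fuse(text_embedding, clinical_features)
head = LinearHead(len(fused), NUM_CLASSES)
scores = head.logits(fused)
label = max(range(NUM_CLASSES), key=scores.__getitem__)
print(len(fused), label)
```

Concatenation is the simplest choice because it lets the head learn how much weight each view deserves. Gated or attention-based fusion would be alternatives if the structured features turn out to be noisy.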

In practice

Use DMRS-based definitions for psychologically grounded LLM prompting.
Implement secondary classifiers for synthetic data quality control.
Consider focal loss or hard-negative mining for class imbalance.
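The first two recommendations above can be sketched as a prompt builder plus a filter that keeps a synthetic example only when a secondary classifier agrees with the label it was generated for. This is a minimal sketch under stated assumptions, not the authors' pipeline. The prompt wording, the confidence threshold, the label ids, and the keyword-based `toy_verifier` are hypothetical, and the definition string is a placeholder to be replaced with the actual DMRS text.

```python
from dataclasses import dataclass


@dataclass
class SyntheticExample:
    text: str
    intended_label: int  # label the generator was asked to produce


def build_prompt(label_name, definition, dialogue_context, stressor):
    """Assemble a definition-grounded generation prompt (hypothetical wording)."""
    return (
        f"Defense mechanism: {label_name}\n"
        f"Definition: {definition}\n"
        f"Identified stressor: {stressor}\n"
        f"Dialogue so far:\n{dialogue_context}\n"
        "Write the speaker's next utterance so that it clearly shows this "
        "defense mechanism in response to the stressor."
    )


def filter_synthetic(examples, verifier, min_confidence=0.7):
    """Keep examples whose verifier prediction matches the intended label."""
    kept = []
    for example in examples:
        predicted_label, confidence = verifier(example.text)
        if predicted_label == example.intended_label and confidence >= min_confidence:
            kept.append(example)
    return kept


def toy_verifier(text):
    """Stand-in for a trained secondary classifier; returns (label, confidence)."""
    if "joke" in text.lower():
        return 3, 0.9
    return 7, 0.6


prompt = build_prompt(
    label_name="<label name>",
    definition="<DMRS definition text goes here>",
    dialogue_context="A: My review is tomorrow.",
    stressor="upcoming performance review",
)
candidates = [
    SyntheticExample("I just made a joke about it and moved on.", 3),
    SyntheticExample("I went along with whatever they said.", 3),
]
kept = filter_synthetic(candidates, toy_verifier)
print(len(kept))
```

Raising `min_confidence` trades synthetic data volume for label fidelity, so the threshold is worth tuning against a held-out set of real examples.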

Topics

Psychological Defense Mechanisms
Context-Aware Data Augmentation
Hybrid Classification Model
PsyDefDetect Shared Task
Defense Mechanisms Rating Scales

Code references

htdgv/CASA-PDC

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.