Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
Summary
VISHC at PsyDefDetect introduces a framework for classifying Psychological Defense Mechanisms (PDMs) from text, targeting the data scarcity and class imbalance typical of low-resource clinical settings. The approach combines two components. The first is context-aware synthetic data augmentation, which uses Llama-3-8B-Instruct with theory-driven prompts based on the Defense Mechanisms Rating Scales (DMRS). The second is a hybrid classification model that integrates contextual language representations from MentalRoBERTa with 150 structured clinical features derived from DMRS indicators and basic linguistic heuristics. On the PsyDefConv blind-test set, the method reached 58.26% accuracy and 24.62% macro-F1, reported gains of +40.25% and +15.99% over the DMRS Co-Pilot baseline. A key finding is that the quality of the definitions used in prompting directly influences generation fidelity and downstream performance. A "Label 7 sink effect" (a bias toward the majority class) remains a limitation.
Key takeaway
For NLP engineers building clinical text classification systems, choices in data augmentation and feature engineering strongly affect model reliability. Prioritize high-quality, theory-driven prompts for synthetic data generation, and integrate domain-specific clinical features with contextual language models. Even with augmentation, severe class imbalance and semantic overlap between classes may call for techniques such as loss re-weighting or contrastive learning to prevent majority-class bias.
Key insights
Context-aware synthetic data augmentation with theory-driven prompts significantly improves psychological defense mechanism classification in low-resource settings.
Principles
- Definition quality in prompting governs generation fidelity.
- Hybrid models fusing clinical features and language representations enhance classification.
- Naive data balancing is insufficient for highly overlapping classes.
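The fusion principle can be illustrated with a minimal sketch. It assumes the simplest possible design, concatenating a text embedding with the 150 clinical features and applying one linear layer; the summary does not specify the actual fusion architecture. The 768-dimensional embedding size, the label count, and the class name `ConcatFusionClassifier` are hypothetical, and random weights stand in for trained parameters.

```python
import math
import random

TEXT_DIM = 768       # assumed encoder embedding size (not stated in the summary)
CLINICAL_DIM = 150   # structured clinical features, as stated in the summary
NUM_CLASSES = 9      # hypothetical label count, for illustration only


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ConcatFusionClassifier:
    """Linear classifier over the concatenation [text embedding ; clinical features]."""

    def __init__(self, text_dim, clinical_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.text_dim = text_dim
        self.clinical_dim = clinical_dim
        in_dim = text_dim + clinical_dim
        # Small random weights stand in for trained parameters.
        self.weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def predict_proba(self, text_embedding, clinical_features):
        if len(text_embedding) != self.text_dim:
            raise ValueError("unexpected text embedding size")
        if len(clinical_features) != self.clinical_dim:
            raise ValueError("unexpected clinical feature size")
        # Late fusion: both views are joined into one input vector.
        fused = list(text_embedding) + list(clinical_features)
        logits = [sum(w * x for w, x in zip(row, fused)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)


# Random vectors stand in for a real encoder output and real feature values.
rng = random.Random(1)
embedding = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
clinical = [rng.uniform(0, 1) for _ in range(CLINICAL_DIM)]
model = ConcatFusionClassifier(TEXT_DIM, CLINICAL_DIM, NUM_CLASSES)
probs = model.predict_proba(embedding, clinical)
```

In a real system the embedding would come from the language model and the weights would be learned jointly or on top of frozen features; the sketch only shows how the two feature views meet in a single classifier input.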
Method
The method involves LLM-based stressor identification, context-aware synthetic data augmentation using DMRS definitions, dual-domain feature extraction (linguistic heuristics + DMRS profiles), and a hybrid fusion architecture for classification.
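The augmentation stage can be pictured as prompt assembly: a label, its definition, the identified stressor, and the surrounding dialogue are combined into one generation request. The following is a rough sketch under stated assumptions; the prompt wording, the function name `build_augmentation_prompt`, and the example inputs are all hypothetical, and the definition is a placeholder rather than actual DMRS text.

```python
def build_augmentation_prompt(label, definition, stressor, context_turns, n_samples=3):
    """Assemble a definition-grounded prompt for generating synthetic utterances."""
    if not definition.strip():
        raise ValueError("a non-empty definition is required")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Each prior turn becomes one bullet so the generator sees the dialogue context.
    context = "\n".join(f"- {turn}" for turn in context_turns)
    lines = [
        f"Target defense mechanism: {label}",
        f"Definition: {definition}",
        f"Identified stressor: {stressor}",
        "Conversation context:",
        context,
        f"Write {n_samples} new utterances that express this defense mechanism "
        "in response to the stressor, consistent with the context above.",
    ]
    return "\n".join(lines)


# Hypothetical inputs; the definition is a placeholder, not DMRS wording.
prompt = build_augmentation_prompt(
    label="example_defense",
    definition="<definition text taken from the rating scale>",
    stressor="upcoming job interview",
    context_turns=["I have an interview tomorrow.", "How are you feeling about it?"],
    n_samples=2,
)
```

Keeping the definition as an explicit, required argument reflects the finding that definition quality drives generation fidelity: swapping in a better definition changes the prompt without touching the rest of the pipeline.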
In practice
- Use DMRS-based definitions for psychologically grounded LLM prompting.
- Implement secondary classifiers for synthetic data quality control.
- Consider focal loss or hard-negative mining for class imbalance.
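As a minimal sketch of the focal loss suggestion, the standard form is FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The defaults `gamma=2.0` and `alpha=1.0` and the example probabilities below are illustrative assumptions, not settings taken from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given class probabilities and the true class index."""
    # Clamp to avoid log(0) when the model assigns zero probability to the true class.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy example (true class predicted confidently) and a hard one (true class missed).
easy_probs = [0.9, 0.05, 0.05]
hard_probs = [0.1, 0.6, 0.3]

easy_focal = focal_loss(easy_probs, target=0)
hard_focal = focal_loss(hard_probs, target=0)

# With gamma=0 the expression reduces to ordinary cross-entropy.
easy_ce = focal_loss(easy_probs, target=0, gamma=0.0)
hard_ce = focal_loss(hard_probs, target=0, gamma=0.0)
```

The modulating factor (1 - p_t)^gamma shrinks the loss on confidently correct examples far more than on misclassified ones, so frequent, easy majority-class examples contribute less to the gradient. Whether this is enough to counter the majority-class sink described above would need to be checked empirically.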
Topics
- Psychological Defense Mechanisms
- Context-Aware Data Augmentation
- Hybrid Classification Model
- PsyDefDetect Shared Task
- Defense Mechanisms Rating Scales
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.