An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection
Summary
COVA-X is an expanded synthetic dataset of multi-turn smishing conversations, now comprising 10,985 conversations across eight elder-targeted scam categories. It was generated with an improved pipeline that addresses the contamination, label mismatch, stage-direction bleed, and prompt-design failures identified in its predecessor, COVA. When classifiers are retrained on COVA-X, Longformer outperforms XGBoost, reaching 79.71% accuracy and 0.7786 macro F1 against XGBoost's 78.43% accuracy and 0.7563 macro F1. This shift supports the view that transformer models need larger conversational corpora to exploit their contextual advantages. The dataset's quality lifecycle also shows a 12.7x reduction in the label correction rate, from 49.8% to 3.9%, and a drop in the virtual-kidnapping artifact rate from 67.1% to 46.5%. A pre/post-cleanup sensitivity analysis further indicates that dataset refinement strengthens label-relevant signal across all classifier architectures.
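The headline comparisons can be reproduced from the reported figures with a few lines of arithmetic. This is only a sketch over the rounded numbers quoted in the summary; the variable names are illustrative and nothing here comes from the authors' code.

```python
# Figures as quoted in the summary (rounded values, not raw experiment output).
longformer = {"accuracy": 79.71, "macro_f1": 0.7786}
xgboost = {"accuracy": 78.43, "macro_f1": 0.7563}

# Gap between the two classifiers.
acc_gap = longformer["accuracy"] - xgboost["accuracy"]    # percentage points
f1_gap = longformer["macro_f1"] - xgboost["macro_f1"]     # absolute macro F1

# Label correction rate before and after the improved pipeline (percent).
correction_before, correction_after = 49.8, 3.9
correction_ratio = correction_before / correction_after

# Virtual-kidnapping artifact rate before and after (percent).
artifact_before, artifact_after = 67.1, 46.5
artifact_drop = artifact_before - artifact_after          # percentage points

print(f"accuracy gap: {acc_gap:.2f} points")
print(f"macro F1 gap: {f1_gap:.4f}")
print(f"correction-rate ratio: {correction_ratio:.2f}x")
print(f"artifact-rate drop: {artifact_drop:.1f} points")
```

From the rounded inputs, the accuracy gap is 1.28 points, the macro F1 gap is 0.0223, and the artifact rate falls by 20.6 points. The correction-rate ratio comes out near 12.77, so the quoted 12.7x was presumably computed from unrounded rates (an assumption; the summary does not say).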
Key takeaway
For NLP engineers developing smishing detection systems, this work highlights the need for large, high-quality conversational datasets. Prioritize expanding your training corpora, especially for transformer-based models, so that they can make use of their contextual advantages. A rigorous dataset quality lifecycle, including artifact reduction and label correction, should improve your model's ability to recover genuine signal and raise overall detection accuracy.
Key insights
Expanded synthetic datasets significantly improve transformer model performance in multi-turn smishing detection.
Principles
- Transformer models need large conversational corpora.
- Dataset quality directly impacts model signal recovery.
- Scam categories modulate detection outcomes.
Method
An improved generation pipeline addressed contamination, label mismatch, stage-direction bleed, and prompt-design failures to create a higher-quality synthetic dataset.
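One way to picture a check for stage-direction bleed is a simple pattern filter over generated messages. The sketch below is not the authors' pipeline: it assumes that stage directions surface as short bracketed cues such as "[pauses]" or asterisk-wrapped actions such as "*sighs*", and the function names and sample conversations are hypothetical.

```python
from __future__ import annotations

import re

# Assumption: leaked stage directions look like "[pauses]" or "*sighs*".
# The real artifact patterns in COVA-X are not described in this summary.
STAGE_DIRECTION = re.compile(r"\[[^\]\n]{1,60}\]|\*[^*\n]{1,60}\*")


def find_stage_directions(message: str) -> list[str]:
    """Return every substring of the message that matches the assumed pattern."""
    return STAGE_DIRECTION.findall(message)


def artifact_rate(conversations: list[list[str]]) -> float:
    """Fraction of conversations with at least one flagged message."""
    if not conversations:
        return 0.0
    flagged = sum(
        1
        for turns in conversations
        if any(find_stage_directions(turn) for turn in turns)
    )
    return flagged / len(conversations)


# Hypothetical sample conversations, each a list of message strings.
sample = [
    ["Hi grandma, it's me.", "[pauses] I need help right away."],
    ["Your package is delayed.", "Reply STOP to opt out."],
    ["*sighs* Please send the gift cards today."],
    ["Call (555) 010-0000 now."],
]

print(find_stage_directions(sample[0][1]))
print(artifact_rate(sample))
```

A regex like this will miss unmarked narration and can flag legitimate bracketed text, so in practice it would serve as a first-pass screen ahead of manual or model-assisted review.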
In practice
- Expand synthetic datasets for transformer training.
- Implement a quality lifecycle for dataset refinement.
- Analyze scam-type specific detection mechanisms.
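To act on the last point, it helps to report macro F1 alongside a per-category accuracy breakdown rather than a single aggregate score. The following is a minimal dependency-free sketch; the label names, category names, and toy predictions are hypothetical and do not come from COVA-X.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy_by_category(categories, y_true, y_pred):
    """Accuracy computed separately for each scam category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, t, p in zip(categories, y_true, y_pred):
        totals[category] += 1
        hits[category] += int(t == p)
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical toy data: six conversations from two categories.
categories = ["virtual_kidnapping"] * 2 + ["other_category"] * 4
y_true = ["scam", "scam", "benign", "benign", "scam", "benign"]
y_pred = ["scam", "benign", "benign", "benign", "scam", "scam"]

print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")
print(accuracy_by_category(categories, y_true, y_pred))
```

Splitting accuracy by category makes it visible when one scam type lags the rest, which an aggregate score would hide.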
Topics
- Smishing Detection
- Synthetic Data Generation
- Transformer Models
- Conversational AI
- Dataset Quality
- Elder Fraud
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.