Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI
Summary
This paper introduces a method to enhance text classification for the UN's 17 Sustainable Development Goals (SDGs) by combining multiple machine learning models using Combinatorial Fusion Analysis (CFA). The approach addresses challenges like interconnected SDG concepts and data scarcity by employing generative AI, specifically ChatGPT, to create synthetic training data. Five base models were used: SDG Classy (LDA-based), LinkedSDG (semantic web), SDG Mapper (keyword-based), Convolutional Neural Network (CNN), and Random Forest. The CFA technique, which leverages rank-score characteristic functions and cognitive diversity, achieved an average precision@1 of 96.73% on a 306-document test set, outperforming the best individual model (Model A at 95.42%) and a fine-tuned BERT-base model (94.46%). The study also highlights how CFA can complement human expert judgment and mitigate risks associated with noisy synthetic data.
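The headline metric can be illustrated with a minimal sketch. It assumes one gold label per document and a matrix of per-class scores; the function name `precision_at_1` and the toy data are illustrative, and the paper's exact evaluation protocol (for example, how the average is taken) is not described in this summary.

```python
import numpy as np

def precision_at_1(score_matrix, gold_labels):
    """Fraction of documents whose top-scored class equals the gold label.

    score_matrix: shape (n_documents, n_classes), higher score = more likely.
    gold_labels:  shape (n_documents,), integer class indices.
    Assumption: exactly one gold label per document.
    """
    scores = np.asarray(score_matrix, dtype=float)
    gold = np.asarray(gold_labels)
    top1 = np.argmax(scores, axis=1)  # index of the best-scored class per document
    return float(np.mean(top1 == gold))

# Toy example with 4 documents and 3 classes (the real task has 17 SDG classes).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
toy_gold = [0, 1, 2, 1]  # the last document is misclassified
result = precision_at_1(toy_scores, toy_gold)
```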
Key takeaway
For NLP engineers building classifiers over complex, interconnected categories such as the UN SDGs, the paper's results support a two-step strategy: generate synthetic training data with a large language model such as ChatGPT to address data scarcity, then combine several diverse base models with Combinatorial Fusion Analysis (CFA). On the 306-document test set, this fusion outperformed both the best individual model and a fine-tuned BERT-base model. The authors also report that CFA can help mitigate the impact of noisy synthetic data.
Key insights
Combinatorial Fusion Analysis (CFA) improves SDG text classification by fusing the outputs of diverse base models, with synthetic data used to compensate for scarce labeled examples.
Principles
- Fusing several models can outperform even the best individually optimized model.
- Cognitive diversity quantifies dissimilarity between scoring systems.
- Generative AI can augment scarce labeled training data.
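The cognitive diversity principle above can be made concrete with a short sketch. It assumes a commonly used CFA formulation: each scoring system's rank-score characteristic (RSC) function is its normalized scores sorted from best to worst, and cognitive diversity is the root-mean-square gap between two RSC functions. The function names and the min-max normalization are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np

def rank_score_function(scores):
    """RSC function: f(i) = normalized score of the item at rank i (rank 1 = best)."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    # Min-max normalize to [0, 1]; a constant scorer maps to all zeros.
    normalized = (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return np.sort(normalized)[::-1]  # descending, so index 0 is rank 1

def cognitive_diversity(scores_a, scores_b):
    """Root-mean-square difference between the RSC functions of two scoring systems."""
    f_a = rank_score_function(scores_a)
    f_b = rank_score_function(scores_b)
    return float(np.sqrt(np.mean((f_a - f_b) ** 2)))

# Two hypothetical models scoring the same three candidate classes.
cd = cognitive_diversity([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
```

Because the RSC function depends only on the shape of the sorted scores, two models that assign the same set of scores to different classes have zero cognitive diversity under this definition; the measure captures how differently the systems score, not which items they prefer.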
Method
The method involves generating synthetic training data with ChatGPT, training diverse base classifiers, and then combining their outputs using CFA's rank-score and cognitive diversity functions to achieve superior classification accuracy.
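The fusion step can be sketched as follows, assuming the two standard CFA combination modes: score combination (average the normalized scores) and rank combination (average the ranks). The helper names, the optional `weights` argument, and the toy scores are hypothetical; how the paper weights models (for example, by performance or diversity strength) is not specified in this summary.

```python
import numpy as np

def normalize(scores):
    """Min-max normalize one model's scores to [0, 1]."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="stable")  # class indices from best to worst
    ranks = np.empty(len(s), dtype=float)
    ranks[order] = np.arange(1, len(s) + 1)
    return ranks

def score_combination(score_lists, weights=None):
    """Weighted average of normalized scores; higher is better."""
    stacked = np.array([normalize(s) for s in score_lists])
    return np.average(stacked, axis=0, weights=weights)

def rank_combination(score_lists, weights=None):
    """Weighted average of ranks; lower is better."""
    stacked = np.array([ranks_from_scores(s) for s in score_lists])
    return np.average(stacked, axis=0, weights=weights)

# Three hypothetical base models scoring 4 classes for one document
# (the real task has 17 SDG classes and five base models).
model_scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.4, 0.6, 0.5, 0.1],
    [0.8, 0.2, 0.7, 0.1],
]
fused_scores = score_combination(model_scores)
fused_ranks = rank_combination(model_scores)
top1_by_score = int(np.argmax(fused_scores))  # best class under score combination
top1_by_rank = int(np.argmin(fused_ranks))    # best class under rank combination
```

In this toy case the second model alone would pick class 1, but both combination modes select class 0, which is the kind of correction fusion is meant to provide.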
In practice
- Use ChatGPT for synthetic data generation to overcome data scarcity.
- Apply CFA to combine multiple classifiers for improved accuracy.
- Consider multi-label classification for interconnected SDG concepts.
Topics
- SDG Text Classification
- Combinatorial Fusion Analysis
- Generative AI
- Model Fusion
- Natural Language Processing
Best for: NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.