Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI

2026-02-13 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

This paper introduces a method to improve text classification for the UN's 17 Sustainable Development Goals (SDGs) by combining multiple machine learning models with Combinatorial Fusion Analysis (CFA). The approach addresses two challenges, the interconnected nature of SDG concepts and the scarcity of labeled data, by using generative AI (specifically ChatGPT) to create synthetic training data. Five base models were used: SDG Classy (LDA-based), LinkedSDG (semantic web), SDG Mapper (keyword-based), a Convolutional Neural Network (CNN), and a Random Forest. CFA, which relies on rank-score characteristic functions and cognitive diversity, achieved an average precision@1 of 96.73% on a 306-document test set, outperforming both the best individual model (referred to as Model A, 95.42%) and a fine-tuned BERT-base model (94.46%). The study also highlights how CFA can complement human expert judgment and mitigate risks associated with noisy synthetic data.
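To make the headline metric concrete, here is a minimal sketch of precision@1: the share of documents whose single top-scored SDG is among the correct labels. The function name and input layout are assumptions for illustration, and this summary does not state how the paper averages the metric, so treat it as a generic definition rather than the authors' evaluation script.

```python
from typing import Sequence, Set


def precision_at_1(scores: Sequence[Sequence[float]],
                   gold: Sequence[Set[int]]) -> float:
    """Fraction of documents whose top-scored label is a gold label.

    scores[d][k] is the score a classifier gives label k for document d.
    gold[d] is the set of correct label indices for document d.
    """
    hits = 0
    for doc_scores, gold_labels in zip(scores, gold):
        # Index of the highest-scoring label for this document.
        top = max(range(len(doc_scores)), key=lambda k: doc_scores[k])
        hits += top in gold_labels
    return hits / len(scores)


# Toy data (not from the paper): 3 documents, 2 labels.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_gold = [{1}, {1}, {1}]
toy_p_at_1 = precision_at_1(toy_scores, toy_gold)  # 2 of 3 documents are hits
```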

Key takeaway

For NLP engineers building text classifiers over complex, interconnected categories such as the UN SDGs, the paper's results favor a Combinatorial Fusion Analysis (CFA) approach. Consider generating synthetic training data with a large language model such as ChatGPT to address data scarcity, then combining several diverse base models with CFA. In the reported experiments the fused system reached 96.73% precision@1, ahead of a fine-tuned BERT-base model at 94.46%. The fusion step can also help mitigate the impact of noisy synthetic data.

Key insights

Combinatorial Fusion Analysis (CFA) raised precision@1 on SDG text classification to 96.73%, above the best single model (95.42%), by combining diverse base models supported by synthetic training data.

Principles

Fusing diverse models can outperform the best single optimized model.
Cognitive diversity quantifies dissimilarity between scoring systems.
Generative AI can augment scarce labeled training data.
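As an illustration of the second principle, the sketch below computes cognitive diversity the way it is commonly defined in the CFA literature: as the root-mean-square distance between two models' rank-score characteristic functions, where each function maps a rank position to the normalized score found at that rank. The function names and the min-max normalization are assumptions made for this example; the paper's exact formulation may differ.

```python
import numpy as np


def normalize(scores):
    """Min-max scale scores to [0, 1]; all-equal input maps to zeros."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span


def rank_score_function(scores):
    """Return f where f[i] is the normalized score at rank i + 1 (rank 1 = best)."""
    return np.sort(normalize(scores))[::-1]


def cognitive_diversity(scores_a, scores_b):
    """RMS distance between the rank-score functions of two scoring systems."""
    fa = rank_score_function(scores_a)
    fb = rank_score_function(scores_b)
    return float(np.sqrt(np.mean((fa - fb) ** 2)))


# Toy scores over three labels (not from the paper).
steady = [1.0, 0.5, 0.0]   # scores fall off evenly with rank
peaked = [1.0, 0.0, 0.0]   # one confident label, the rest at zero
cd = cognitive_diversity(steady, peaked)
```

Under this definition the measure compares how each model distributes its scores across ranks, not which label each model ranks first: two models that assign the same set of score values to different labels have a diversity of zero.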

Method

The method involves generating synthetic training data with ChatGPT, training diverse base classifiers, and then combining their outputs using CFA's rank-score and cognitive diversity functions to achieve superior classification accuracy.
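The fusion step can be sketched as follows, assuming each base model emits one normalized score per SDG for a document. The sketch shows the two basic CFA combinations, a weighted average of scores and a weighted average of ranks, and returns the top label under each. The `fuse` and `to_ranks` helpers and the optional `weights` argument are hypothetical names for illustration, not the paper's implementation, and the toy input uses three labels instead of 17.

```python
import numpy as np


def to_ranks(scores):
    """Convert scores to ranks (1 = highest); ties broken by label order."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


def fuse(model_scores, weights=None):
    """Fuse per-model label scores for ONE document.

    model_scores: array-like of shape (n_models, n_labels), scores in [0, 1].
    weights: optional per-model weights (assumption: could come from
             validation performance or diversity strength); uniform if None.
    Returns (top label by score combination, top label by rank combination).
    """
    S = np.asarray(model_scores, dtype=float)
    w = np.ones(len(S)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    score_comb = w @ S                          # weighted mean score per label
    R = np.vstack([to_ranks(row) for row in S])
    rank_comb = w @ R                           # weighted mean rank per label
    return int(np.argmax(score_comb)), int(np.argmin(rank_comb))


# Toy scores from three models over three labels (not from the paper).
toy = [
    [0.9, 0.1, 0.0],   # model 1 strongly prefers label 0
    [0.4, 0.5, 0.1],   # model 2 narrowly prefers label 1
    [0.2, 0.7, 0.1],   # model 3 prefers label 1
]
by_score, by_rank = fuse(toy)
```

On this toy input the score combination picks label 0 while the rank combination picks label 1, which shows why CFA treats the two as distinct candidates and selects among them (and among model subsets) rather than assuming one always wins.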

In practice

Use ChatGPT for synthetic data generation to overcome data scarcity.
Apply CFA to combine multiple classifiers for improved accuracy.
Consider multi-label classification for interconnected SDG concepts.

Topics

SDG Text Classification
Combinatorial Fusion Analysis
Generative AI
Model Fusion
Natural Language Processing

Best for: NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.