Indigenous Letters to Brazil: Multi-Label Classification
Summary
This study investigates automatic multi-label classification of 871 indigenous letters to Brazil from the "Cartas Indígenas ao Brasil" digital collection, annotated across 18 thematic categories. The researchers compared three approaches: a lexical model (TF-IDF + logistic regression), a contextual model (BERTimbau-base), and a large language model (LLM) classifier. To mitigate corpus imbalance, class balancing strategies were applied to the neural model. The results show a precision-recall trade-off: the lexical model achieved higher precision (0.65), while BERTimbau achieved higher recall (0.67), particularly for minority categories. Both models reached a macro-F1 of 0.42, which underlines how difficult multi-label classification is in this domain given corpus imbalance and semantic overlap. The LLM-based classifier also achieved high recall on minority categories but tended to assign too many labels per document. The analysis suggests that hybrid approaches could offset the limitations of individual models. The corpus and experimental scripts will be publicly released.
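The macro-averaged scores quoted in the summary can be illustrated with a small sketch. It assumes that each letter's annotation is a set of category names; the label names and the four toy documents are hypothetical, not taken from the corpus. Macro averaging gives every category the same weight, so rare categories affect the score as much as frequent ones.

```python
from typing import Dict, List, Set


def per_label_scores(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 separately for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores: Dict[str, Dict[str, float]], metric: str) -> float:
    """Unweighted mean of a metric over labels (each label counts equally)."""
    return sum(s[metric] for s in scores.values()) / len(scores)


# Hypothetical labels and toy annotations, for illustration only.
labels = ["land", "health", "education"]
gold = [{"land"}, {"land", "health"}, {"education"}, {"land"}]
pred = [{"land"}, {"land"}, {"land", "education"}, {"health"}]

scores = per_label_scores(gold, pred, labels)
macro_f1 = macro_average(scores, "f1")  # (2/3 + 0 + 1) / 3 = 5/9
```

In this toy case the rare "health" label is never predicted correctly, and that single category pulls the macro-F1 down by a third, which is the same mechanism that makes minority categories so influential in an imbalanced corpus.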
Key takeaway
If you develop multi-label classification systems, expect corpus imbalance and semantic overlap to limit model performance. Consider hybrid approaches that combine a lexical model (TF-IDF + logistic regression) with a contextual model (BERTimbau-base), since the two complement each other in precision and recall, especially on minority classes. Apply class balancing strategies as well to counter the imbalance.
Key insights
Multi-label classification of indigenous letters presents challenges due to corpus imbalance and semantic overlap.
Principles
- Precision and recall often exhibit a trade-off.
- Hybrid models can overcome individual classifier limitations.
Method
Three classification approaches (lexical, contextual, and LLM-based) were compared on a corpus of 871 letters annotated with 18 categories; class balancing strategies were applied to the neural model.
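The summary does not say which class balancing strategies were used. One common option, shown here purely as an assumption, is to weight the positive examples of each category by the ratio of negative to positive documents, which is the convention typically used for the `pos_weight` argument of a binary cross-entropy loss. The function name, the `max_weight` cap, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def positive_class_weights(
    label_sets: List[Set[str]], labels: List[str], max_weight: float = 50.0
) -> Dict[str, float]:
    """Per-label weight = (#negative docs) / (#positive docs), capped.

    Rare labels get large weights so their positives count more in the loss.
    The cap (an assumption) avoids extreme weights for very rare labels.
    """
    n_docs = len(label_sets)
    counts = Counter(label for labels_of_doc in label_sets for label in labels_of_doc)
    weights = {}
    for label in labels:
        positives = counts.get(label, 0)
        if positives == 0:
            weights[label] = max_weight  # label never observed in training
        else:
            weights[label] = min((n_docs - positives) / positives, max_weight)
    return weights


# Toy data: 10 documents, "land" is frequent, "health" is rare.
train = [{"land"}] * 7 + [{"land", "health"}] + [set(), set()]
weights = positive_class_weights(train, ["land", "health", "education"])
# weights == {"land": 0.25, "health": 9.0, "education": 50.0}
```

The resulting values would be passed, in label order, to whatever loss function the neural model uses; resampling or threshold tuning are alternative strategies that this sketch does not cover.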
In practice
- Consider hybrid models for complex classification.
- Address corpus imbalance in multi-label tasks.
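One simple way to combine a high-precision and a high-recall classifier is sketched below. This is not the method from the original study, which only suggests that hybrid approaches could help; the function name, the score dictionary, and the 0.8 threshold are illustrative assumptions. The idea is to trust every label from the precise model and add labels from the recall-oriented model only when it is very confident.

```python
from typing import Dict, Set


def hybrid_labels(
    precise_labels: Set[str],
    recall_scores: Dict[str, float],
    add_threshold: float = 0.8,
) -> Set[str]:
    """Union of the precise model's labels and the recall-oriented model's
    high-confidence labels (score >= add_threshold)."""
    confident_extra = {
        label for label, score in recall_scores.items() if score >= add_threshold
    }
    return set(precise_labels) | confident_extra


# Hypothetical outputs for one letter.
lexical_prediction = {"land"}                      # e.g. from the lexical model
contextual_scores = {"land": 0.90, "health": 0.85, "education": 0.55}

combined = hybrid_labels(lexical_prediction, contextual_scores)
# combined == {"land", "health"}
```

Raising `add_threshold` moves the combination toward the precise model's behavior, while lowering it moves toward the recall-oriented model, so the threshold would need to be tuned on held-out data.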
Topics
- Multi-label Classification
- Indigenous Letters
- BERTimbau
- Large Language Models
- TF-IDF
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.