Indigenous Letters to Brazil: Multi-Label Classification

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, medium

Summary

The study investigates automatic multi-label classification of 871 Indigenous letters to Brazil from the "Cartas Indígenas ao Brasil" digital collection, annotated across 18 thematic categories. The researchers compared three classification approaches: a lexical model (TF-IDF + logistic regression), a contextual model (BERTimbau-base), and a large language model (LLM) classifier. To mitigate corpus imbalance, class balancing strategies were applied to the neural model. The results show a precision-recall trade-off: the lexical model achieved higher precision (0.65), while BERTimbau achieved higher recall (0.67), particularly for minority categories. Both models reached the same macro-F1 of 0.42, which reflects how difficult multi-label classification is in this domain given the corpus imbalance and the semantic overlap between categories. The LLM-based classifier also achieved high recall in minority categories but tended to assign too many labels per document. The analysis suggests that hybrid approaches could address the limitations of the individual models. The corpus and experimental scripts will be publicly released.
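The macro-F1 figure above is easier to interpret with the metric spelled out. The sketch below assumes the common definition, in which F1 is computed per label and then averaged with equal weight per label (as scikit-learn does with average="macro"). The summary does not state exactly how the paper computed its scores, and the label names and toy data here are hypothetical.

```python
from typing import Dict, List, Set


def macro_scores(gold: List[Set[str]], pred: List[Set[str]],
                 labels: List[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall and F1 over a fixed label set."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        # Convention assumed here: undefined ratios count as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,  # mean of per-label F1, not F1 of the means
    }


# Hypothetical toy data: one frequent label and two rare ones.
toy_labels = ["land", "health", "education"]
toy_gold = [{"land"}, {"land", "health"}, {"education"}, {"land"}]
toy_pred = [{"land"}, {"land"}, {"land"}, {"land", "health"}]
toy_result = macro_scores(toy_gold, toy_pred, toy_labels)
```

Because every label counts equally, rare categories that a model misses pull the macro average down sharply, even when the frequent category is predicted well.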

Key takeaway

Research scientists building multi-label classification systems should expect corpus imbalance and semantic overlap between categories to limit model performance. Hybrid approaches that combine a lexical model (TF-IDF + logistic regression) with a contextual model (BERTimbau-base) are worth exploring, because the two are complementary: the former is stronger on precision, the latter on recall, especially for minority classes. Class balancing strategies are also worth applying to counter the imbalance.

Key insights

Multi-label classification of indigenous letters presents challenges due to corpus imbalance and semantic overlap.

Principles

Precision and recall often exhibit a trade-off.
Hybrid models may offset the limitations of individual classifiers.
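The precision-recall trade-off in the first principle can be illustrated with a single label and a decision threshold. This is a minimal sketch on made-up scores, not the paper's setup; the function name and the data are hypothetical.

```python
from typing import List, Tuple


def precision_recall_at(scores: List[float], gold: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall for one label at a given decision threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical classifier scores for six documents and their gold labels.
demo_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
demo_gold = [1, 1, 0, 1, 0, 0]

strict = precision_recall_at(demo_scores, demo_gold, threshold=0.7)
lenient = precision_recall_at(demo_scores, demo_gold, threshold=0.35)
```

On this toy data the strict threshold yields perfect precision but misses one positive document, while the lenient threshold recovers every positive at the cost of one false positive.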

Method

Three classification approaches (lexical, contextual, LLM) were compared on a corpus of 871 letters annotated with 18 categories, with class balancing strategies applied to the neural model.
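The summary does not say which class balancing strategies were used. One common option, shown here purely as an assumption, is to weight the positive examples of each label by the ratio of negative to positive documents, so that rare categories contribute more to the training loss. Weights of this kind could, for example, be passed as the pos_weight argument of PyTorch's BCEWithLogitsLoss. The function name and toy data are hypothetical.

```python
from typing import Dict, List, Set


def positive_weights(gold: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    """Per-label weight n_negative / n_positive for a weighted binary loss."""
    n_docs = len(gold)
    weights = {}
    for label in labels:
        n_pos = sum(1 for g in gold if label in g)
        n_neg = n_docs - n_pos
        # A label with no positive example falls back to a neutral weight.
        weights[label] = n_neg / n_pos if n_pos else 1.0
    return weights


# Hypothetical toy annotations: "land" dominates, the other labels are rare.
balance_gold = [{"land"}, {"land"}, {"land", "health"},
                {"land"}, {"education"}, {"land"}]
balance_labels = ["land", "health", "education"]
balance_weights = positive_weights(balance_gold, balance_labels)
```

In this example the frequent label receives a weight below 1 and each rare label a weight of 5, which is the intended effect of this kind of reweighting.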

In practice

Consider hybrid models for complex classification.
Address corpus imbalance in multi-label tasks.
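The source only suggests that hybrid approaches could help and does not describe a specific design. One simple possibility, sketched here as an assumption, is to average the per-label probabilities of a lexical and a contextual model and then apply a threshold. The function name, the weight and threshold parameters, and the scores are all hypothetical.

```python
from typing import Dict, Set


def hybrid_predict(lexical: Dict[str, float], contextual: Dict[str, float],
                   weight: float = 0.5, threshold: float = 0.5) -> Set[str]:
    """Blend two models' per-label probabilities and threshold the result.

    `weight` is the share given to the lexical model; the contextual
    model receives the remaining 1 - weight.
    """
    chosen = set()
    for label in set(lexical) | set(contextual):
        score = (weight * lexical.get(label, 0.0)
                 + (1 - weight) * contextual.get(label, 0.0))
        if score >= threshold:
            chosen.add(label)
    return chosen


# Hypothetical per-label probabilities for one letter.
lexical_scores = {"land": 0.9, "health": 0.2, "education": 0.1}
contextual_scores = {"land": 0.7, "health": 0.9, "education": 0.3}

balanced = hybrid_predict(lexical_scores, contextual_scores)
lexical_heavy = hybrid_predict(lexical_scores, contextual_scores, weight=0.9)
```

With equal weights the blend keeps the label that only the contextual model is confident about; shifting weight toward the lexical model drops it. The weight therefore acts as a dial between the recall-oriented and the precision-oriented behaviour, and would need to be tuned on held-out data.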

Topics

Multi-label Classification
Indigenous Letters
BERTimbau
Large Language Models
TF-IDF

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.