Field of Science and Technology Classification of Academic Documents in Portuguese
Summary
This study evaluates transformer-based models for automatically classifying academic documents in Portuguese into Field of Science and Technology (FOS) categories. Researchers compared four encoder models (two multilingual, two Portuguese-specific) and five larger decoder-based LLMs using a dataset of 9,696 Portuguese theses, each characterized by its title, keywords, and abstract. Fine-tuned encoder-based models achieved the highest performance, with an F1 score of 88%. This performance significantly surpassed that of general-purpose decoder models prompted for the same classification task, indicating the superior efficacy of task-specific fine-tuning for localized academic domains.
Key takeaway
Research scientists building automated metadata classification systems for academic repositories, especially those with non-English content, should prioritize fine-tuning encoder-based models over prompting larger, general-purpose LLMs. In this study, fine-tuned encoders reached an F1 score of 88% on Portuguese FOS classification and outperformed the prompted decoder models, which suggests the approach yields more accurate and reliable metadata.
Key insights
Fine-tuned encoder models outperform prompted general-purpose decoder LLMs for FOS classification of Portuguese academic documents.
Principles
- Task-specific fine-tuning of encoders beat prompting general-purpose decoder LLMs on this task.
- Localized domains benefit from specialized models.
Method
The study evaluated transformer-based models (four encoders and five decoder-based LLMs) on a dataset of 9,696 Portuguese theses, using each thesis's title, keywords, and abstract as input for FOS classification.
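As a rough illustration of the setup described above, the sketch below joins a thesis's title, keywords, and abstract into a single classifier input and scores predictions with F1. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `build_input` and `macro_f1`, the ` [SEP] ` separator, and the choice of macro averaging are all hypothetical, since the summary does not say how the fields were combined or which F1 variant produced the 88% figure.

```python
from collections import defaultdict


def build_input(title, keywords, abstract, sep=" [SEP] "):
    """Join the three thesis fields into one string for a text classifier.

    Assumption: plain concatenation with an illustrative separator; the
    original study may have combined the fields differently.
    """
    parts = [
        title.strip(),
        ", ".join(k.strip() for k in keywords),
        abstract.strip(),
    ]
    # Skip empty fields so a missing keyword list does not leave a stray separator.
    return sep.join(p for p in parts if p)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label seen in gold or predicted.

    Assumption: macro averaging; the summary does not state the variant used.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong for this document
            fn[g] += 1  # gold label was missed for this document

    scores = []
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Placeholder labels and text, not taken from the dataset.
    text = build_input("Um título", ["palavra", "chave"], "Um resumo curto.")
    print(text)
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

The string returned by `build_input` is what a fine-tuned encoder or a prompted decoder would receive, and `macro_f1` compares either model's predicted labels against the gold FOS labels on the same footing.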
In practice
- Use fine-tuned encoders for text classification.
- Prioritize domain-specific models for localized content.
Topics
- Field of Science and Technology Classification
- Academic Document Classification
- Transformer Models
- Encoder and Decoder Models
- Portuguese Language Processing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.