Field of Science and Technology Classification of Academic Documents in Portuguese

· Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

This study evaluates transformer-based models for automatically classifying academic documents in Portuguese into Field of Science and Technology (FOS) categories. The researchers compared four encoder models (two multilingual, two Portuguese-specific) and five larger decoder-based LLMs on a dataset of 9,696 Portuguese theses, each represented by its title, keywords, and abstract. The fine-tuned encoder-based models performed best, reaching an F1 score of 88% and significantly outperforming the general-purpose decoder models prompted for the same classification task. The result suggests that task-specific fine-tuning is the more effective approach for localized academic domains.
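Since the headline result is an F1 score, a minimal sketch of how such a score can be computed for a multi-class labeling task may help. The summary does not say which averaging scheme (macro, micro, or weighted) the authors report, so macro averaging is an assumption here, and the labels in the usage note are toy placeholders rather than the study's categories.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} from parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging; not confirmed by the source)."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)
```

For example, `macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])` averages the F1 of class A (2/3) and class B (0.8). Macro averaging weights every category equally, which matters when some fields of science have far fewer theses than others.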

Key takeaway

Research scientists building automated metadata classification systems for academic repositories, especially for non-English content, should prioritize fine-tuning encoder-based models over prompting larger, general-purpose LLMs. This approach is likely to yield significantly better F1 scores, as the 88% F1 achieved for Portuguese FOS classification suggests, and therefore more accurate and reliable metadata.

Key insights

Fine-tuned encoder models outperform general LLMs for FOS classification of Portuguese academic documents.

Method

The study evaluated transformer-based models (fine-tuned encoders vs. prompted decoder LLMs) on a dataset of 9,696 Portuguese theses, using each thesis's title, keywords, and abstract as input for FOS classification.
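The input representation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Thesis` record, the field order, the `" [SEP] "` separator, and the helper names `build_input` and `encode_labels` are all hypothetical, since the summary does not say how the three fields were combined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Thesis:
    """Hypothetical record holding the three text fields plus the FOS label."""
    title: str
    keywords: List[str]
    abstract: str
    fos_label: str


def build_input(thesis: Thesis, sep: str = " [SEP] ") -> str:
    """Join title, keywords, and abstract into one string, skipping empty fields.

    The separator and field order are assumptions made for illustration.
    """
    parts = [
        thesis.title.strip(),
        "; ".join(k.strip() for k in thesis.keywords),
        thesis.abstract.strip(),
    ]
    return sep.join(p for p in parts if p)


def encode_labels(theses: List[Thesis]) -> Dict[str, int]:
    """Map each distinct FOS label to an integer id, in sorted order."""
    labels = sorted({t.fos_label for t in theses})
    return {label: i for i, label in enumerate(labels)}
```

The resulting strings and integer ids are the kind of input a sequence classification fine-tuning setup typically expects. The same joined text could also be placed inside a prompt for a decoder LLM, which keeps the comparison between the two model families on identical inputs.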

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.