Automatic Question Classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies
Summary
Murilo Boccardo and Valéria D. Feltrim present a comparative evaluation of automatic classification strategies for Brazilian university entrance exam questions, categorizing them by subject and by fine-grained topic. Their central contribution is a new large-scale Portuguese-language dataset of approximately 17,000 questions collected from the Agatha.edu platform, then cleaned and normalized. The study explores two classification approaches: a single-step method that predicts the fine-grained topic directly, and a two-stage method that first predicts the subject and then applies a topic classifier specialized for that subject. Both strategies are assessed with classical machine learning techniques (Support Vector Machines, Naive Bayes, and Random Forest) and with transformer-based language models pre-trained for Portuguese. The experimental results confirm the viability of large-scale automatic question classification and underscore the utility of NLP-based methods for managing educational question banks.
Key takeaway
For research scientists developing educational technology in Portuguese-speaking markets, this work demonstrates the feasibility of large-scale automatic question classification. Consider evaluating a two-stage strategy (subject first, then a subject-specific topic classifier) against direct single-step topic prediction, with either classical machine learning or transformer-based language models, when categorizing exam questions by subject and fine-grained topic. Automatic labels of this kind can ease the curation and organization of educational question banks.
Key insights
A new dataset of approximately 17,000 Portuguese-language exam questions enables large-scale automatic classification by subject and fine-grained topic.
Principles
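- Two-stage classification narrows topic prediction: a subject classifier runs first, then a topic classifier specialized for that subject.
- NLP-based classification supports the management of educational question banks.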
Method
The study compared single-step fine-grained topic prediction against a two-stage approach: subject prediction followed by specialized topic classifiers, using both classical ML and transformer models.
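The two strategies can be sketched with a toy example. This is only an illustration under stated assumptions, not the authors' implementation: the `NaiveBayes` and `TwoStageClassifier` classes, the whitespace tokenizer, and the four training questions are all hypothetical, and a small pure-Python Naive Bayes stands in for the classical and transformer models the study evaluated.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens (add-one smoothing)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> token counts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence, so drop them.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total = sum(self.label_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.label_counts[label] / total)
            return prior + sum(math.log((counts[t] + 1) / denom) for t in tokens)

        return max(self.label_counts, key=log_score)


class TwoStageClassifier:
    """Stage 1 predicts the subject; stage 2 uses a per-subject topic model."""

    def fit(self, texts, subjects, topics):
        self.subject_model = NaiveBayes().fit(texts, subjects)
        by_subject = defaultdict(lambda: ([], []))
        for text, subject, topic in zip(texts, subjects, topics):
            by_subject[subject][0].append(text)
            by_subject[subject][1].append(topic)
        # One specialized topic classifier per subject.
        self.topic_models = {
            s: NaiveBayes().fit(x, y) for s, (x, y) in by_subject.items()
        }
        return self

    def predict(self, text):
        subject = self.subject_model.predict(text)
        return subject, self.topic_models[subject].predict(text)


# Hypothetical toy data: (question text, subject, fine-grained topic).
TRAIN = [
    ("calcule a derivada da funcao polinomial", "matematica", "calculo"),
    ("determine a area do triangulo retangulo", "matematica", "geometria"),
    ("calcule a velocidade media do carro", "fisica", "cinematica"),
    ("determine a corrente no circuito eletrico", "fisica", "eletricidade"),
]
texts, subjects, topics = zip(*TRAIN)

# Single-step baseline: predict the fine-grained topic directly.
single_step = NaiveBayes().fit(texts, topics)
# Two-stage: subject first, then a topic model specialized for that subject.
two_stage = TwoStageClassifier().fit(texts, subjects, topics)

question = "qual a area do triangulo"
print(single_step.predict(question))
print(two_stage.predict(question))
```

One design consequence worth noting: in the two-stage setup each topic model only has to separate topics within one subject, but a wrong subject prediction in stage 1 cannot be recovered in stage 2.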
In practice
- Consider transformer models pre-trained for Portuguese when classifying questions.
- Try two-stage classification (subject, then topic) when the topic labels are fine-grained.
Topics
- Automatic Question Classification
- Portuguese Language Dataset
- Educational Question Banks
- Transformer Language Models
- Machine Learning Classification
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.