Automated Essay Scoring for Brazilian Portuguese: Evidence from Cross-Prompt Evaluation of ENEM Essays
Summary
This study investigates Automated Essay Scoring (AES) for Brazilian Portuguese, targeting ENEM, a high-stakes assessment in which millions of student essays are scored each year. To address fragmented datasets and the complexity of ENEM's multi-trait rubric, the authors use a corpus of 385 essays spanning 38 prompts. Models score essays written for unseen prompts on five traits, each rated on a six-point ordinal scale. Three model classes are compared: feature-based methods using 72 features, encoder-only transformers with 109M to 1.5B parameters, and decoder architectures with 2.4B to 671B parameters, tested in both fine-tuned and zero-shot configurations. Encoder models perform well on mechanical traits such as fluency and cohesion, while decoder models given full context perform best on argumentation (quadratic weighted kappa, QWK, of 0.73) and writing style (QWK 0.60). Language-specific pretraining helps only with surface-level features, not complex reasoning. Overall, the best models reach QWK scores between 0.60 and 0.73.
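Since every result above is reported as QWK, a small worked implementation may help in reading the numbers. The sketch below computes quadratic weighted kappa between human and model scores using the standard definition. It assumes the six rubric levels have been mapped to the integers 0 to 5; that mapping, the function name, and the parameter names are illustrative choices and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, num_levels=6):
    """Quadratic weighted kappa between two equal-length lists of integer scores.

    Scores are assumed to lie in range(num_levels); the 0..5 encoding of the
    six-point scale is an assumption made for this sketch.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if any(s not in range(num_levels) for s in list(human) + list(model)):
        raise ValueError("scores must be integers in range(num_levels)")

    n = len(human)
    observed = Counter(zip(human, model))  # observed agreement counts
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)
    max_dist = (num_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / max_dist  # quadratic disagreement penalty
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * human_hist[i] * model_hist[j] / n

    if weighted_expected == 0:
        # Only possible when both raters used the same single level throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy usage with made-up scores for one trait.
human_scores = [0, 1, 2, 3, 4, 5, 3, 4]
model_scores = [0, 1, 2, 3, 4, 5, 4, 4]
kappa = quadratic_weighted_kappa(human_scores, model_scores)
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and disagreements are penalized by the square of their distance on the scale, so near misses cost far less than large errors.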
Key takeaway
Research scientists developing AES systems for high-stakes assessments like ENEM should consider a hybrid approach that uses encoder models for mechanical traits and decoder models for complex traits such as argumentation and writing style. The choice of architecture should follow the specific trait being scored and the contextual information available, since no single model type is best across all traits. Give decoder models full context to get the most out of them on higher-order reasoning.
Key insights
Cross-prompt AES for Brazilian Portuguese reveals distinct model strengths across essay traits and context availability.
Principles
- Encoder models suit mechanical traits.
- Decoder models excel in argumentation with full context.
- Language-specific pretraining aids surface features, not complex reasoning.
Method
The study evaluated feature-based, encoder-only (109M-1.5B params), and decoder (2.4B-671B params) models on 385 Brazilian Portuguese essays spanning 38 prompts, with trait-specific scoring tested on unseen prompts.
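To make the cross-prompt setup concrete, here is a minimal sketch of a prompt-disjoint split, in which every essay for a held-out prompt goes to the test set. The summary does not describe the paper's actual splitting protocol, so the held-out fraction, the seed, and the `prompt_id` field name are all assumptions made for illustration.

```python
import random


def cross_prompt_split(essays, test_fraction=0.2, seed=0):
    """Split essays so that no prompt appears in both train and test.

    `essays` is a list of dicts with a hypothetical "prompt_id" key.
    """
    prompt_ids = sorted({essay["prompt_id"] for essay in essays})
    rng = random.Random(seed)
    rng.shuffle(prompt_ids)

    # Hold out whole prompts, never individual essays.
    n_test = max(1, round(len(prompt_ids) * test_fraction))
    test_prompts = set(prompt_ids[:n_test])

    train = [e for e in essays if e["prompt_id"] not in test_prompts]
    test = [e for e in essays if e["prompt_id"] in test_prompts]
    return train, test


# Toy corpus with the same shape as the one described: 385 essays, 38 prompts.
toy_essays = [{"essay_id": i, "prompt_id": i % 38} for i in range(385)]
train_essays, test_essays = cross_prompt_split(toy_essays)
```

Splitting by prompt rather than by essay is what keeps test prompts unseen: a random essay-level split would leak prompt-specific vocabulary and topics into training and could inflate scores.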
In practice
- Use encoders for fluency/cohesion.
- Employ decoders for argumentation/style.
- Consider context needs for model choice.
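The guidance above could be encoded as a simple per-trait routing table. This is only a sketch: the trait keys, the `Route` type, and the wording of the caveat are hypothetical, and only the four traits named in this summary are included, since the fifth rubric trait is not identified here.

```python
from typing import NamedTuple


class Route(NamedTuple):
    family: str  # "encoder" or "decoder"
    note: str    # caveat to surface to the caller, empty if none


# Trait names are illustrative labels, not the official ENEM rubric wording.
TRAIT_TO_FAMILY = {
    "fluency": "encoder",
    "cohesion": "encoder",
    "argumentation": "decoder",
    "writing_style": "decoder",
}


def route_trait(trait: str, full_context: bool) -> Route:
    """Return the model family suggested for a trait, with a context caveat."""
    family = TRAIT_TO_FAMILY.get(trait)
    if family is None:
        raise ValueError(f"no recommendation recorded for trait {trait!r}")
    if family == "decoder" and not full_context:
        # The reported decoder advantage was observed with full context.
        return Route(family, "decoder results were reported with full context; validate first")
    return Route(family, "")


route = route_trait("argumentation", full_context=True)
```

Keeping the mapping in data rather than in branching logic makes it easy to revise as new per-trait results come in.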
Topics
- Automated Essay Scoring
- Brazilian Portuguese
- ENEM Assessment
- Cross-Prompt Evaluation
- Transformer Architectures
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.