Automated Essay Scoring for Brazilian Portuguese: Evidence from Cross-Prompt Evaluation of ENEM Essays

· Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

The study investigates Automated Essay Scoring (AES) for Brazilian Portuguese, targeting ENEM, a high-stakes assessment in which millions of student essays are scored each year. It addresses two obstacles, fragmented datasets and ENEM's multi-trait rubric, using a corpus of 385 essays across 38 prompts. Models score essays written for unseen prompts on five traits, each rated on a six-point ordinal scale. Three model classes are compared: feature-based methods using 72 features, encoder-only transformers with 109M to 1.5B parameters, and decoder architectures with 2.4B to 671B parameters, tested in both fine-tuned and zero-shot configurations. Encoder models perform well on mechanical traits such as fluency and cohesion, while decoder models given full context perform best on argumentation (QWK 0.73) and writing style (QWK 0.60), where QWK is Quadratic Weighted Kappa. Language-specific pretraining helped only with surface-level features, not complex reasoning. Across traits, the best models reached QWK between 0.60 and 0.73.
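Because every result above is reported as QWK on a six-point ordinal scale, a minimal sketch of the metric may help. It is the standard quadratic weighted kappa, not the study's evaluation code. The function name and the encoding of the six levels as integers 0 to 5 are assumptions made for illustration.

```python
import math


def quadratic_weighted_kappa(y_true, y_pred, num_levels=6):
    """Quadratic weighted kappa between two lists of ordinal labels.

    Assumption: labels are integers in [0, num_levels - 1]; six levels
    matches the six-point scale described in the summary.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    for label in list(y_true) + list(y_pred):
        if not 0 <= label < num_levels:
            raise ValueError(f"label {label} outside 0..{num_levels - 1}")

    n = len(y_true)
    # Observed confusion matrix: rows are reference scores, columns predictions.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_levels))
                 for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: grows with the squared distance between levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if reference and prediction were independent.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n

    if weighted_expected == 0.0:
        # Both label lists are constant and identical: kappa is undefined.
        return math.nan
    return 1.0 - weighted_observed / weighted_expected


# Identical scores give perfect agreement.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]))  # 1.0
```

The quadratic weights penalize a prediction that is three levels off nine times as much as one that is a single level off, which is why the metric suits ordinal rubric scores better than plain accuracy.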

Key takeaway

Research scientists developing AES systems for high-stakes assessments like ENEM should consider a hybrid approach: encoder models for mechanical traits and decoder models for complex traits such as argumentation and writing style. Choose the architecture according to the trait being evaluated and the contextual information available, since no single model type is best for every trait. Give decoder models full context to maximize performance on higher-order reasoning.
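The hybrid approach can be sketched as a per-trait dispatch table. This is an illustration under stated assumptions, not a design from the study: the trait keys, the `Scorer` signature, and `make_hybrid_scorer` are hypothetical names, and only the four traits named in the summary are routed.

```python
from typing import Callable, Dict

# Hypothetical scorer signature: (prompt_text, essay_text) -> level in 0..5.
Scorer = Callable[[str, str], int]


def make_hybrid_scorer(encoder_scorer: Scorer, decoder_scorer: Scorer):
    """Return a function that routes each trait to one model family."""
    routes: Dict[str, Scorer] = {
        # Mechanical traits go to the encoder model.
        "fluency": encoder_scorer,
        "cohesion": encoder_scorer,
        # Higher-order traits go to the decoder model, which receives
        # the prompt text as well as the essay (full context).
        "argumentation": decoder_scorer,
        "writing_style": decoder_scorer,
    }

    def score(trait: str, prompt_text: str, essay_text: str) -> int:
        if trait not in routes:
            raise KeyError(f"no route configured for trait {trait!r}")
        return routes[trait](prompt_text, essay_text)

    return score


# Usage with stub scorers standing in for real models.
hybrid = make_hybrid_scorer(lambda prompt, essay: 4, lambda prompt, essay: 3)
print(hybrid("cohesion", "prompt text", "essay text"))
```

Keeping the routing in one table makes it easy to re-assign a trait if validation results on held-out prompts favor the other model family.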

Key insights

Cross-prompt AES for Brazilian Portuguese shows that model strengths differ by essay trait and by how much context the model receives.

Method

Evaluated feature-based, encoder-only (109M-1.5B params), and decoder (2.4B-671B params) models on 385 Brazilian Portuguese essays spanning 38 prompts, scoring each trait on essays written for unseen prompts.
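Cross-prompt evaluation requires that no prompt contributes essays to both training and test data. The sketch below shows one way to build such splits. The summary does not state the study's actual split scheme, so the fold count, the seed, and the function name `cross_prompt_folds` are assumptions.

```python
import random


def cross_prompt_folds(prompt_ids, n_folds=5, seed=0):
    """Split essay indices into folds whose prompts never overlap.

    prompt_ids[i] is the prompt identifier of essay i. Returns a list of
    (train_indices, test_indices) pairs, one per fold.
    """
    prompts = sorted(set(prompt_ids))
    if n_folds < 2 or n_folds > len(prompts):
        raise ValueError("n_folds must be between 2 and the number of prompts")

    # Shuffle prompts reproducibly, then deal them round-robin into folds.
    rng = random.Random(seed)
    rng.shuffle(prompts)
    fold_of = {prompt: k % n_folds for k, prompt in enumerate(prompts)}

    folds = []
    for fold in range(n_folds):
        test = [i for i, p in enumerate(prompt_ids) if fold_of[p] == fold]
        train = [i for i, p in enumerate(prompt_ids) if fold_of[p] != fold]
        folds.append((train, test))
    return folds


# Synthetic example with the corpus dimensions from the summary:
# 385 essays spread over 38 prompts (the assignment here is made up).
prompt_ids = [i % 38 for i in range(385)]
for train, test in cross_prompt_folds(prompt_ids):
    train_prompts = {prompt_ids[i] for i in train}
    test_prompts = {prompt_ids[i] for i in test}
    assert train_prompts.isdisjoint(test_prompts)
```

Splitting by prompt rather than by essay is what makes the test prompts genuinely unseen. A random essay-level split would leak prompt-specific vocabulary into training.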

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.