Extending an Ensemble Baseline with Corpus-Based Graph Features for Portuguese Pun Detection
Summary
A study presented at PROPOR 2026 investigates whether corpus-based graph features can improve Portuguese pun detection when added to an existing TF-IDF ensemble. The researchers constructed three graph representations from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph (positive pointwise mutual information), and a Pun-Context graph. Each graph was converted into low-dimensional node embeddings using TruncatedSVD, aggregated into document-level features, and concatenated with the TF-IDF representations inside a soft-voting ensemble. Results on the test set indicate that graph-based enrichment does not consistently improve performance. Among the augmented configurations, the Pun-Context and PPMI graphs produced the strongest results, while combining all graph types degraded overall performance. These findings suggest that the value of graph-based information depends heavily on how lexical relations are encoded and aggregated at the document level.
Key takeaway
Research scientists building NLP models for nuanced lexical tasks such as pun detection can consider graph-based features as a way to capture contextual interactions, but should evaluate graph representations and aggregation strategies carefully, because not every combination improves performance. In this study, the Pun-Context and PPMI-weighted graphs gave the strongest augmented results, whereas combining all graph types degraded performance, so start with a single graph type rather than naively stacking several.
Key insights
Graph-based features can augment pun detection, but their utility depends on specific lexical relation encoding.
Principles
- Lexical ambiguity challenges linear text representations.
- Graph features can capture contextual interactions.
- Feature aggregation impacts model performance.
Method
Construct Co-occurrence, PPMI-weighted, and Pun-Context graphs. Convert graphs to low-dimensional node embeddings via TruncatedSVD. Aggregate embeddings into document-level features. Concatenate with TF-IDF in a soft-voting ensemble.
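The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the document-level co-occurrence window, the mean pooling used for aggregation, the embedding size, the helper names, and the toy tokens are all assumptions, since the summary does not specify them. The Pun-Context graph is left out because its construction is not described here.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD


def cooccurrence_matrix(docs, vocab):
    """Symmetric counts of word pairs that share a document (assumed window)."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for tokens in docs:
        ids = sorted({index[t] for t in tokens if t in index})
        for i, j in combinations(ids, 2):
            counts[(i, j)] += 1
            counts[(j, i)] += 1
    rows = [i for i, _ in counts]
    cols = [j for _, j in counts]
    data = [counts[key] for key in counts]
    n = len(vocab)
    return csr_matrix((data, (rows, cols)), shape=(n, n), dtype=float)


def ppmi(cooc):
    """Positive PMI: max(0, log(p(i, j) / (p(i) * p(j))))."""
    dense = cooc.toarray()
    total = dense.sum()
    row = dense.sum(axis=1, keepdims=True)
    col = dense.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(dense * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts contribute no edge weight
    return np.maximum(pmi, 0.0)


def node_embeddings(weights, dim=2, seed=0):
    """Low-dimensional node embeddings from the graph's weight matrix."""
    svd = TruncatedSVD(n_components=dim, random_state=seed)
    return svd.fit_transform(weights)


def document_features(docs, vocab, embeddings):
    """Mean-pool node embeddings over each document's tokens (assumed)."""
    index = {w: i for i, w in enumerate(vocab)}
    feats = np.zeros((len(docs), embeddings.shape[1]))
    for d, tokens in enumerate(docs):
        ids = [index[t] for t in tokens if t in index]
        if ids:
            feats[d] = embeddings[ids].mean(axis=0)
    return feats


# Toy tokenized documents, invented for illustration (not from Puntuguese).
docs = [
    ["banco", "rio", "peixe"],
    ["banco", "dinheiro", "juros"],
    ["rio", "peixe", "agua"],
]
vocab = sorted({t for tokens in docs for t in tokens})

weights = ppmi(cooccurrence_matrix(docs, vocab))
embeddings = node_embeddings(weights, dim=2)
doc_feats = document_features(docs, vocab, embeddings)
```

Each row of `doc_feats` is one document's graph feature vector, ready to be concatenated with that document's TF-IDF vector. Mean pooling is only one aggregation choice; the study's conclusion that aggregation matters suggests trying alternatives such as max pooling or TF-IDF-weighted averaging.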
In practice
- Experiment with Pun-Context graphs for lexical ambiguity.
- Evaluate PPMI-weighted graphs for semantic relations.
- Avoid combining all graph types indiscriminately.
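One way to act on these recommendations is to compare feature configurations one at a time inside the same soft-voting ensemble. The sketch below is hypothetical: the base classifiers (logistic regression and random forest), the configuration names, the toy texts, and the random placeholder graph features are assumptions made for illustration, and it reports no results from the study.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def combine(tfidf, graph_blocks):
    """Concatenate sparse TF-IDF columns with dense graph feature blocks."""
    blocks = [tfidf] + [csr_matrix(b) for b in graph_blocks]
    return hstack(blocks).tocsr()


def soft_voting_ensemble(seed=0):
    """Soft voting averages the predicted probabilities of its members."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ],
        voting="soft",
    )


# Toy inputs, invented for illustration only.
texts = [
    "texto com trocadilho um",
    "texto literal um",
    "texto com trocadilho dois",
    "texto literal dois",
    "texto com trocadilho tres",
    "texto literal tres",
]
labels = np.array([1, 0, 1, 0, 1, 0])

# Placeholder graph features: random values standing in for real
# document-level embeddings from each graph type.
rng = np.random.default_rng(0)
graph_feats = {
    "ppmi": rng.normal(size=(len(texts), 2)),
    "pun_context": rng.normal(size=(len(texts), 2)),
}

tfidf = TfidfVectorizer().fit_transform(texts)

# Evaluate each graph type separately before trying any combination.
configs = {
    "tfidf_only": [],
    "tfidf+ppmi": ["ppmi"],
    "tfidf+pun_context": ["pun_context"],
    "tfidf+all": ["ppmi", "pun_context"],
}

probas = {}
for name, keys in configs.items():
    X = combine(tfidf, [graph_feats[k] for k in keys])
    model = soft_voting_ensemble().fit(X, labels)
    probas[name] = model.predict_proba(X)
```

In a real experiment each configuration would be scored on a held-out split rather than on the training data, and the comparison against `tfidf_only` is what shows whether a given graph type helps.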
Topics
- Pun Detection
- Portuguese NLP
- Graph Features
- Ensemble Learning
- Lexical Ambiguity
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.