Extending an Ensemble Baseline with Corpus-Based Graph Features for Portuguese Pun Detection

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study presented at PROPOR 2026 investigates enhancing Portuguese pun detection by integrating corpus-based graph features with existing TF-IDF ensemble methods. Researchers constructed three graph representations from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph, and a Pun-Context graph. Each graph was converted into low-dimensional node embeddings using TruncatedSVD, aggregated into document-level features, and then concatenated with TF-IDF representations within a soft-voting ensemble. Experimental results on the test set indicate that graph-based enrichment does not consistently improve performance. Specifically, Pun-Context and PPMI graphs yielded the strongest augmented results, while combining all graph types degraded overall performance. These findings suggest that the effectiveness of graph-based information is highly dependent on the encoding and aggregation methods for lexical relations at the document level.

Key takeaway

Research scientists developing natural language processing models for nuanced lexical tasks such as pun detection should consider integrating graph-based features to capture contextual interactions, but should evaluate each graph representation and aggregation strategy separately, since not every combination improves performance. In this study, the Pun-Context and PPMI-weighted graphs gave the strongest augmented results, while naively combining all graph types degraded overall performance.

Key insights

Graph-based features can augment pun detection, but their utility depends on how lexical relations are encoded and aggregated at the document level.

Principles

Lexical ambiguity challenges linear text representations.
Graph features can capture contextual interactions.
Feature aggregation impacts model performance.

Method

Construct Co-occurrence, PPMI-weighted, and Pun-Context graphs. Convert graphs to low-dimensional node embeddings via TruncatedSVD. Aggregate embeddings into document-level features. Concatenate with TF-IDF in a soft-voting ensemble.
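The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the document-level co-occurrence definition, mean pooling as the aggregation step, the embedding size, the two ensemble members, and the toy sentences are all assumptions made here for demonstration. The Pun-Context graph is left out because the summary does not specify how it is constructed.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def cooccurrence_graph(X_bin):
    """Word-word adjacency: weight = number of documents containing both words.
    (Assumed definition; the paper may use a sliding window instead.)"""
    A = (X_bin.T @ X_bin).astype(np.float64)
    A = (A - sparse.diags(A.diagonal())).tocsr()  # drop self-loops
    A.eliminate_zeros()
    return A


def ppmi_graph(A):
    """Reweight edges by positive pointwise mutual information."""
    A = A.tocoo()
    total = A.sum()
    row = np.asarray(A.sum(axis=1)).ravel()
    col = np.asarray(A.sum(axis=0)).ravel()
    pmi = np.log(A.data * total / (row[A.row] * col[A.col]))
    return sparse.csr_matrix((np.maximum(pmi, 0.0), (A.row, A.col)), shape=A.shape)


def node_embeddings(graph, dim, seed=0):
    """One low-dimensional vector per vocabulary word (graph node)."""
    return TruncatedSVD(n_components=dim, random_state=seed).fit_transform(graph)


def document_features(X_bin, emb):
    """Mean-pool the embeddings of the distinct words in each document (assumed aggregation)."""
    n_words = np.asarray(X_bin.sum(axis=1)).ravel().astype(float)
    n_words[n_words == 0] = 1.0  # avoid division by zero for empty documents
    return (X_bin @ emb) / n_words[:, None]


# Invented toy sentences and labels, for shape-checking only (not from Puntuguese).
train_docs = [
    "o banco do rio estava cheio",
    "o banco fechou a conta cedo",
    "a manga da camisa rasgou",
    "a manga caiu do pé",
    "o rio estava cheio de peixe",
    "a conta do banco fechou",
]
train_labels = [1, 0, 1, 0, 0, 1]

# Build the graph from training documents only, to avoid leaking test data.
binarizer = CountVectorizer(binary=True)
X_bin = binarizer.fit_transform(train_docs)
graph = ppmi_graph(cooccurrence_graph(X_bin))
emb = node_embeddings(graph, dim=4)
graph_feats = document_features(X_bin, emb)

# Concatenate TF-IDF with the pooled graph features.
tfidf = TfidfVectorizer(vocabulary=binarizer.vocabulary_)
X_tfidf = tfidf.fit_transform(train_docs)
features = sparse.hstack([X_tfidf, sparse.csr_matrix(graph_feats)]).tocsr()

# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(features, train_labels)
proba = ensemble.predict_proba(features)
```

For unseen documents, the same fitted `binarizer`, `tfidf`, and `emb` would be reused through `transform`, so that test-time features live in the same space as the training features.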

In practice

Experiment with Pun-Context graphs for lexical ambiguity.
Evaluate PPMI-weighted graphs for semantic relations.
Avoid combining all graph types indiscriminately.
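One way to act on these recommendations is a small ablation that scores each graph feature block separately before trying the full combination. The sketch below is hypothetical: the helper name `ablate_feature_blocks`, the logistic-regression classifier, the F1 metric, and the random placeholder matrices are assumptions for illustration, so the scores it produces say nothing about the paper's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def ablate_feature_blocks(base, blocks, y, seed=0):
    """Score the base features alone, with each block added, and with all blocks added."""
    configs = {"tfidf_only": []}
    configs.update({f"tfidf+{name}": [name] for name in blocks})
    configs["tfidf+all"] = list(blocks)

    # Compare variants on a held-out development split, not on the test set.
    idx_train, idx_dev = train_test_split(
        np.arange(len(y)), test_size=0.3, random_state=seed, stratify=y
    )
    scores = {}
    for label, names in configs.items():
        X = np.hstack([base] + [blocks[n] for n in names])
        clf = LogisticRegression(max_iter=1000).fit(X[idx_train], y[idx_train])
        scores[label] = f1_score(y[idx_dev], clf.predict(X[idx_dev]))
    return scores


# Random placeholders standing in for real TF-IDF and pooled graph features.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
base = rng.normal(size=(40, 8))
blocks = {
    name: rng.normal(size=(40, 4))
    for name in ("cooccurrence", "ppmi", "pun_context")
}
scores = ablate_feature_blocks(base, blocks, y)
```

Keeping the split fixed across configurations means any difference in score comes from the added feature block rather than from a different partition of the data.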

Topics

Pun Detection
Portuguese NLP
Graph Features
Ensemble Learning
Lexical Ambiguity

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.