Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study presents a manual annotation of lexical and orthographic variation in DANTEStocks, a corpus of Brazilian Portuguese financial tweets from Twitter/X. The researchers used a hierarchical typology to categorize both creative language use and deviations from standard Portuguese norms. The analysis showed that orthographic variation is shaped largely by creative forms, many of them specific to the platform and the financial domain. Deviations from standard norms were systematic, for example predictable omissions of diacritics and cedillas. Most tokens exhibited only a single variation phenomenon, which points to stable, independent patterns within this Twitter subgenre. The variant forms were compiled into a lexicon, which was then used to evaluate BERTimbau, Word2Vec, and FastText embedding models on raw, unnormalized data. With the lexicon, out-of-vocabulary rates dropped and coverage improved for these models.

Key takeaway

If you develop NLP tools for financial social media, account for lexical and orthographic variation explicitly. In the study, a curated lexicon of variant forms reduced out-of-vocabulary rates and improved coverage for BERTimbau, Word2Vec, and FastText on unnormalized data. Consider building a domain-specific lexicon to make your models more robust to non-canonical language.

Key insights

Lexical and orthographic variation in financial tweets is systematic. It limits how much of the text embedding models cover, and a curated lexicon of variant forms reduces that gap.

Principles

Orthographic variation is shaped largely by creative forms, often specific to the platform and the financial domain.
Deviations from standard norms are systematic, such as predictable omissions of diacritics and cedillas.
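
Because diacritic and cedilla omissions are predictable, they can be detected mechanically. The following is a minimal sketch, not the authors' procedure: it assumes a small reference vocabulary of standard forms (the words below are illustrative, not taken from DANTEStocks) and uses Unicode decomposition to find standard words that a bare token could stand for. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks (accents, tilde, cedilla) via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def build_omission_index(vocabulary):
    """Map each diacritic-free spelling to the standard words it may stand for."""
    index = {}
    for word in vocabulary:
        bare = strip_diacritics(word)
        if bare != word:  # only index words that actually carry a diacritic
            index.setdefault(bare, set()).add(word)
    return index


def restore_candidates(token: str, index) -> list:
    """Return standard forms whose diacritic-free spelling matches the token."""
    return sorted(index.get(token.lower(), set()))


# Illustrative vocabulary of standard forms (an assumption for this sketch).
vocabulary = {"ação", "ações", "não", "preço", "mercado"}
index = build_omission_index(vocabulary)

print(restore_candidates("acoes", index))    # ['ações']
print(restore_candidates("mercado", index))  # []
```

A lookup like this only covers omissions. Creative, domain-specific forms have no such regular mapping, which is one reason a manually curated lexicon is still needed.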

Method

Manual annotation of financial tweets using a hierarchical typology to categorize lexical and orthographic phenomena, followed by lexicon creation and embedding model evaluation.
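
The evaluation step can be illustrated with a small out-of-vocabulary calculation. This is a sketch under stated assumptions, not the study's evaluation code: it treats the model as a whole-word vocabulary (closer to Word2Vec than to a subword model such as BERTimbau), and the tokens, vocabulary, and lexicon entries are made up for illustration. The name `oov_rate` is hypothetical.

```python
def oov_rate(tokens, vocabulary, lexicon=None):
    """Fraction of tokens not found in the vocabulary.

    If a lexicon (variant form -> standard form) is given, each token is
    mapped through it before the vocabulary lookup.
    """
    if not tokens:
        return 0.0
    lexicon = lexicon or {}
    missing = 0
    for token in tokens:
        form = token.lower()
        form = lexicon.get(form, form)  # fall back to the raw form
        if form not in vocabulary:
            missing += 1
    return missing / len(tokens)


# Illustrative data (assumptions, not drawn from the corpus).
tokens = ["acoes", "da", "petr4", "nao", "caem", "hj"]
vocabulary = {"ações", "da", "não", "caem", "hoje"}
lexicon = {"acoes": "ações", "nao": "não", "hj": "hoje"}

print(f"{oov_rate(tokens, vocabulary):.2f}")           # 0.67
print(f"{oov_rate(tokens, vocabulary, lexicon):.2f}")  # 0.17
```

In this toy example the remaining out-of-vocabulary token is a ticker-like form that neither the vocabulary nor the lexicon covers, so comparing the two rates shows what the lexicon recovers and what it leaves open.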

In practice

Curate lexicons for domain-specific social media data.
Evaluate embedding models with unnormalized, real-world text.
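
As a starting point for curating such a lexicon, annotated tokens can be collected into a simple mapping. The sketch below is an assumption about how this might be organized, not the format the authors used: the record layout and the phenomenon labels are hypothetical, and the example rows are illustrative. It also tallies how many phenomena each token carries, which is the kind of count behind the observation that most tokens show a single phenomenon.

```python
from collections import Counter

# Hypothetical annotation records: (surface form, standard form, phenomenon label).
annotations = [
    ("acoes", "ações", "norm_deviation/diacritic_omission"),
    ("acoes", "ações", "norm_deviation/cedilla_omission"),
    ("nao", "não", "norm_deviation/diacritic_omission"),
    ("hj", "hoje", "creative/abbreviation"),
]


def build_lexicon(records):
    """Group records by surface form, keeping the standard form and all labels."""
    lexicon = {}
    for surface, standard, label in records:
        entry = lexicon.setdefault(
            surface, {"standard": standard, "phenomena": set()}
        )
        if entry["standard"] != standard:
            # Flag ambiguous variants for manual review instead of guessing.
            raise ValueError(f"conflicting standard forms for {surface!r}")
        entry["phenomena"].add(label)
    return lexicon


def phenomena_per_token(lexicon):
    """Count how many lexicon entries carry 1, 2, ... distinct phenomena."""
    return Counter(len(entry["phenomena"]) for entry in lexicon.values())


lexicon = build_lexicon(annotations)
counts = phenomena_per_token(lexicon)
print(lexicon["hj"]["standard"])  # hoje
print(counts[1], counts[2])       # 2 1
```

Raising on conflicting standard forms is a deliberate choice for this sketch: a variant that maps to more than one standard word needs a human decision or context-sensitive handling.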

Topics

Lexical Variation
Orthographic Variation
Financial Tweets
Brazilian Portuguese
DANTEStocks Corpus

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.