ConsumerBR: A Large-Scale Corpus of Consumer Complaints in Brazilian Portuguese
Summary
ConsumerBR is a new, large-scale corpus of over 3.1 million consumer complaints and company responses in Brazilian Portuguese, collected from the Consumidor.gov.br platform between 2021 and 2025. The publicly available dataset combines anonymized text with structured metadata, including temporal information, complaint outcomes, and consumer satisfaction indicators. The authors developed a data collection strategy tailored to the platform's dynamic interface, a preprocessing pipeline that clusters responses to identify template-based replies, and a hybrid anonymization method to protect privacy. They characterize the corpus in terms of scale, coverage, and distributional properties, and position it for research on complaint analysis, sentiment modeling, dialogue generation, and preference-based evaluation.
Key takeaway
For research scientists working on natural language processing in Portuguese, ConsumerBR is a large public resource for developing and evaluating models. Consider it for projects on complaint analysis, sentiment modeling, or dialogue generation, where its scale and structured metadata (temporal information, complaint outcomes, satisfaction indicators) support both training and evaluation.
Key insights
- ConsumerBR offers a large, anonymized corpus of Brazilian Portuguese consumer complaints for NLP research.
Principles
- Anonymization is crucial for public datasets.
- Dynamic interfaces require tailored data collection.
- Metadata enriches textual corpora for diverse tasks.
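To make the anonymization principle concrete, here is a minimal Python sketch of rule-based masking. It is not the corpus's actual method: this summary does not say what the "hybrid" approach combines, so the sketch assumes one half is pattern-based and covers only that half. The patterns, the placeholder tags, and the `mask_pii` function name are all hypothetical.

```python
import re

# Hypothetical masking rules. The real ConsumerBR rules are not described in
# this summary, and these patterns are illustrative rather than exhaustive.
# Order matters: CPF is masked before phone so its digits are not re-matched.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Brazilian CPF: 11 digits, optionally written as 000.000.000-00.
    ("[CPF]", re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")),
    # Brazilian phone: 2-digit area code, optional leading 9, 8 digits.
    ("[PHONE]", re.compile(r"\(?\b\d{2}\)?\s?9?\d{4}-?\d{4}\b")),
]


def mask_pii(text: str) -> str:
    """Replace each pattern match with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text


example = (
    "Meu CPF é 123.456.789-09, email maria.silva@example.com "
    "e telefone (11) 91234-5678."
)
masked = mask_pii(example)
```

Rules like these catch only well-formatted identifiers. Names, addresses, and free-form references need a different mechanism, which is presumably why a hybrid approach is used.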
Method
The method has three parts: a data collection strategy tailored to the platform's dynamic interface, a preprocessing pipeline that clusters company responses to identify template-based replies, and a hybrid anonymization approach to mitigate privacy risks.
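The response-clustering step can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' pipeline: the summary does not name the clustering algorithm, so the sketch swaps in a simple greedy grouping based on string similarity from the standard library. The function names, the 0.85 threshold, and the minimum cluster size are hypothetical choices.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, mask digit runs, and collapse whitespace so that replies
    differing only in protocol numbers or spacing compare as equal."""
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)
    return re.sub(r"\s+", " ", text).strip()


def cluster_responses(responses, threshold=0.85):
    """Greedy clustering: each reply joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one.
    Returns a list of clusters, each a list of indices into `responses`."""
    normalized = [normalize(r) for r in responses]
    clusters = []
    for i, norm in enumerate(normalized):
        for cluster in clusters:
            rep = normalized[cluster[0]]
            # autojunk=False keeps the ratio stable on long strings.
            if SequenceMatcher(None, rep, norm, autojunk=False).ratio() >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def template_indices(responses, threshold=0.85, min_size=2):
    """Indices of replies in clusters large enough to look like a template."""
    flagged = set()
    for cluster in cluster_responses(responses, threshold):
        if len(cluster) >= min_size:
            flagged.update(cluster)
    return flagged


responses = [
    "Prezado cliente, seu protocolo 12345 foi registrado. Retornaremos em breve.",
    "Prezado cliente, seu protocolo 67890 foi registrado. Retornaremos em breve.",
    "Olá Ana, verificamos que a cobrança duplicada foi estornada na fatura de maio.",
]
templates = template_indices(responses)
```

Greedy pairwise comparison is quadratic in the worst case, so at the scale of millions of responses a real pipeline would need something cheaper, such as hashing or embedding-based clustering. The sketch only shows the idea of flagging near-identical replies.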
In practice
- Analyze complaint patterns in Brazilian Portuguese.
- Develop sentiment models for consumer feedback.
- Train dialogue systems for customer service.
Topics
- ConsumerBR Corpus
- Brazilian Portuguese NLP
- Consumer Complaints Data
- Data Anonymization
- Dialogue Generation
Best for: Research Scientist, AI Scientist, NLP Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.