I spent 4 days training an AI model to understand my country. It barely works
Summary
Over four days, the author trained an AI model to analyze Nepali public sentiment about Prime Minister Balendra Shah's remarks on the Nepal-India border dispute. Facebook and Twitter API restrictions and a lack of comments on news sites ruled out those sources, so the dataset came from 2,266 YouTube comments scraped from 17 channels. The comments were translated from Nepali to English with Google Translate, which lost significant nuance and produced grammatical errors, especially for Romanized Nepali. Auto-labeling with `cardiffnlp/twitter-xlm-roberta-base-sentiment` was wrong on an estimated 20-30% of comments, so 300 comments were manually re-labeled to form a "gold standard" evaluation set, while ~1,800 noisy auto-labeled comments served as a "silver standard" training set. A critical data leakage bug, caused by duplicate comments appearing in both the train and eval splits, initially inflated accuracy to 78%. Once the duplicates were removed, the fine-tuned XLM-RoBERTa model reached an honest 50% accuracy on the 3-class sentiment task. The result is modest, but it establishes a baseline for low-resource Nepali NLP.
Key takeaway
If you build NLP models for low-resource languages, expect significant data acquisition and quality challenges. Prioritize robust data pipeline design, including text-based deduplication, so that duplicates cannot leak across splits and inflate performance metrics. Your initial accuracy may be modest because of translation noise and limited data, but it still gives you a baseline to improve on. Consider contributing to open-source efforts to improve foundational NLP infrastructure for underrepresented languages.
Key insights
Low-resource NLP projects face significant challenges in data collection, translation, and accurate labeling, often yielding modest baselines.
Principles
- Public API restrictions complicate social media data scraping.
- Machine translation can distort sentiment in idiomatic languages.
- Data leakage between train and eval sets inflates measured model performance.
Method
Scrape public comments, translate them, auto-label them, manually correct a subset into a gold standard, train on the noisy silver standard, and deduplicate by text so that no comment appears in both the training and evaluation sets.
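The deduplication step can be sketched as follows. This is a minimal illustration and not the author's code: the `normalize` and `leak_free_split` helpers, the normalization rules (Unicode NFKC, case folding, whitespace collapsing), and the sample comments are all assumptions made for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Canonical form used only for duplicate detection (hypothetical rules)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def leak_free_split(gold, silver):
    """Return (gold, silver) lists of (text, label) pairs with no shared text.

    gold:   manually labeled pairs reserved for evaluation.
    silver: auto-labeled pairs intended for training.
    """
    seen = set()

    clean_gold = []
    for text, label in gold:
        key = normalize(text)
        if key and key not in seen:
            seen.add(key)
            clean_gold.append((text, label))

    clean_silver = []
    for text, label in silver:
        key = normalize(text)
        # Skip empty strings, repeats, and anything already in the gold set.
        if key and key not in seen:
            seen.add(key)
            clean_silver.append((text, label))

    return clean_gold, clean_silver


# Made-up comments for illustration only.
gold = [("Good decision!", "positive"), ("good   decision!", "positive")]
silver = [
    ("GOOD DECISION!", "neutral"),
    ("This is a bad idea", "negative"),
    ("This is a bad idea ", "negative"),
]
eval_set, train_set = leak_free_split(gold, silver)
print(len(eval_set), len(train_set))
```

Matching on normalized text rather than on a comment ID is the point of the exercise: the same comment pasted under several videos gets a different ID each time, so ID-based checks would not catch it. Exact matching after normalization still misses near-duplicates such as a comment with one extra emoji, which would need fuzzy matching.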
In practice
- Use `youtube-comment-downloader` for public YouTube comments.
- Implement text-based deduplication for robust train/eval splits.
- Establish a small, human-verified "gold standard" for honest evaluation.
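To show why a human-verified evaluation set matters, the sketch below scores predictions against gold labels and compares the result with a majority-class baseline, which is the accuracy a model would get by always predicting the most common label. The function names and the label lists are invented for illustration and are not data from the original article.

```python
from collections import Counter


def accuracy(predicted, reference):
    """Fraction of predictions that match the human-verified labels."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def majority_baseline(reference):
    """Accuracy of always predicting the most frequent gold label."""
    if not reference:
        raise ValueError("cannot score an empty evaluation set")
    most_common_count = Counter(reference).most_common(1)[0][1]
    return most_common_count / len(reference)


# Made-up labels for illustration only.
gold_labels = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
model_preds = ["negative", "neutral", "neutral", "positive", "negative", "negative"]

print(f"model accuracy:    {accuracy(model_preds, gold_labels):.2f}")
print(f"majority baseline: {majority_baseline(gold_labels):.2f}")
```

The same `accuracy` function can be run on the auto-generated labels instead of model predictions to estimate how noisy the silver set is. Reporting the baseline next to the model score keeps a modest number in context when the classes are imbalanced.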
Topics
- Low-Resource NLP
- Sentiment Analysis
- Data Scraping
- Machine Translation
- XLM-RoBERTa
- Data Leakage
- Nepali Language
Best for: AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.