I spent 4 days training an AI model to understand my country. It barely works

2026-06-21 · Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

Over four days, the author trained a model to analyze Nepali public sentiment about Prime Minister Balendra Shah's remarks on the Nepal-India border dispute. After running into Facebook and Twitter API restrictions and finding few comments on news sites, the author scraped 2,266 YouTube comments from 17 channels. The comments were translated from Nepali to English with Google Translate, which lost nuance and introduced grammatical errors, especially for Romanized Nepali. Auto-labeling with `cardiffnlp/twitter-xlm-roberta-base-sentiment` was wrong on an estimated 20-30% of comments, so 300 comments were manually re-labeled as a "gold standard" evaluation set, while ~1,800 noisy auto-labeled comments served as a "silver standard" training set. A data leakage bug, caused by duplicate comments appearing in both the train and eval splits, initially inflated accuracy to 78%. After the fix, the fine-tuned XLM-RoBERTa model reached an honest 50% accuracy on the 3-class sentiment task. The result is modest, but it establishes a baseline for low-resource Nepali NLP.

Key takeaway

If you build NLP models for low-resource languages, expect significant data acquisition and quality challenges. Prioritize robust data pipeline design, including text-based deduplication, so that duplicates cannot inflate performance metrics. Initial accuracy may be modest because of translation noise and limited data, but it still gives you a baseline to improve on. Consider contributing to open-source efforts that improve foundational NLP infrastructure for underrepresented languages.

Key insights

Low-resource NLP projects face significant challenges in data collection, translation, and accurate labeling, often yielding modest baselines.

Principles

Public API restrictions complicate social media data scraping.
Machine translation can distort sentiment, especially in idiomatic or Romanized text.
Data leakage between train/eval sets inflates model performance.
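
The leakage principle above can be checked mechanically before any training run. The following is a minimal sketch, not the author's code: the helper names `normalize` and `leaked_texts` are hypothetical, the sample comments are invented, and the normalization rules (Unicode NFKC, case folding, whitespace collapsing) are an assumption about what counts as "the same comment".

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold Unicode form, case, and whitespace so near-identical comments match."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def leaked_texts(train: list[str], evaluation: list[str]) -> set[str]:
    """Return the normalized texts that appear in both splits."""
    return {normalize(t) for t in train} & {normalize(t) for t in evaluation}


# Invented sample data for illustration only.
train = ["Jay Nepal!", "We want peace", "jay   nepal!"]
evaluation = ["JAY NEPAL!", "Border talks now"]

overlap = leaked_texts(train, evaluation)
print(f"{len(overlap)} leaked text(s): {sorted(overlap)}")
```

A non-empty overlap means the eval score partly measures memorization, which is the kind of inflation described in the summary.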

Method

Scrape public comments, translate them, auto-label them, manually correct a gold standard set, train on the noisy silver standard set, and deduplicate by text so that no comment appears in both sets.
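
One way to keep the gold and silver sets disjoint is to deduplicate before splitting. The sketch below assumes that ordering; the function `dedupe_then_split`, its parameters, and the sample comments are hypothetical, and the source does not show how the author implemented this step. In the project described here, `gold_size` would correspond to the 300 manually labeled comments.

```python
import random
import unicodedata


def normalize(text: str) -> str:
    """Fold Unicode form, case, and whitespace so near-identical comments match."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def dedupe_then_split(comments, gold_size, seed=0):
    """Drop duplicate texts first, then carve out a gold set for manual labeling."""
    seen = set()
    unique = []
    for comment in comments:
        key = normalize(comment)
        if key and key not in seen:  # skip empty comments and repeats
            seen.add(key)
            unique.append(comment)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    return unique[:gold_size], unique[gold_size:]


# Invented sample data for illustration only.
comments = [
    "Good decision", "good  decision", "Bad move",
    "No comment", "Bad move", "Talks first",
]
gold, silver = dedupe_then_split(comments, gold_size=2)
print(len(gold), "gold,", len(silver), "silver")
```

Because duplicates are removed before the shuffle, a repeated comment cannot land on both sides of the split.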

In practice

Use `youtube-comment-downloader` for public YouTube comments.
Implement text-based deduplication for robust train/eval splits.
Establish a small, human-verified "gold standard" for honest evaluation.
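
To read a gold-set score honestly, it helps to compare it with a trivial baseline. Uniform random guessing over three classes is expected to score about 33%, but a classifier that always predicts the most frequent class can do better when classes are imbalanced. The sketch below computes both accuracy and that majority-class baseline; the labels are invented, the function names are hypothetical, and the source does not report the gold set's class distribution.

```python
from collections import Counter


def accuracy(gold_labels, predicted):
    """Fraction of predictions that match the human-verified labels."""
    if not gold_labels or len(gold_labels) != len(predicted):
        raise ValueError("label lists must be non-empty and the same length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted))
    return correct / len(gold_labels)


def majority_baseline(gold_labels):
    """Accuracy of always predicting the most frequent gold label."""
    if not gold_labels:
        raise ValueError("gold_labels must be non-empty")
    top_count = Counter(gold_labels).most_common(1)[0][1]
    return top_count / len(gold_labels)


# Invented labels for illustration only.
gold_labels = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
predicted = ["negative", "neutral", "neutral", "positive", "positive", "negative"]

model_acc = accuracy(gold_labels, predicted)
baseline_acc = majority_baseline(gold_labels)
print(f"model: {model_acc:.2f}, majority baseline: {baseline_acc:.2f}")
```

If the model's accuracy is not clearly above the majority baseline, the reported number says little about what the model has learned.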

Topics

Low-Resource NLP
Sentiment Analysis
Data Scraping
Machine Translation
XLM-RoBERTa
Data Leakage
Nepali Language

Code references

samirasharma/nepal-border-sentiment

Best for: AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.