Machine learning and digital pragmatics: Which word category influences emoji use most?

2026-04-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, short

Summary

A study investigated whether Machine Learning (ML) can predict emoji usage in Arabic tweets, focusing on the influence of word categories. The researchers used Python to collect a corpus of 11,379 colloquial Arabic tweets from X.com and refined it to a net dataset of 8,695 tweets for analysis. Each tweet was assigned one of 14 numerically encoded categories, which served as labels. A preprocessing pipeline was established as an interpretable baseline for examining the relationship between lexical features and emoji categories. The MARBERT model was then fine-tuned to predict emojis from textual input and achieved an overall accuracy of 0.75. The results are promising, but they also underline the need to keep improving ML models, MARBERT included, for low-resource and multidialectal languages such as Arabic.

Key takeaway

Research scientists developing natural language processing models for low-resource or multidialectal languages should consider fine-tuning existing models such as MARBERT, but should expect substantial dataset curation and model refinement to be necessary for higher accuracy. Focus on capturing dialectal nuances and expanding lexical feature analysis to improve emoji prediction and broader language understanding.

Key insights

The MARBERT model shows promise in predicting Arabic emoji use but needs further refinement to handle dialectal nuances.

Principles

Lexical features influence emoji use.
Multidialectal languages pose ML challenges.

Method

A preprocessing pipeline classifies the 8,695 Arabic tweets into 14 categories. MARBERT is then fine-tuned to predict emoji use from textual input, and its performance is evaluated with precision, recall, and F1-score.
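
As a rough illustration of the evaluation step, the sketch below computes overall accuracy and per-class precision, recall, and F1 from integer labels such as the 14 numerically encoded categories. It is not the authors' evaluation code: the function name `per_class_report` and the toy label lists are assumptions made for this example, and the numbers are invented rather than taken from the study.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, per_class) for two equal-length label sequences.

    per_class maps each label to a dict with precision, recall and f1.
    Hypothetical helper written for this example only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class

    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, per_class


# Toy data: three emoji categories encoded as 0, 1, 2 (invented values).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, report = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
print("accuracy:", round(accuracy, 3))
```

Per-class scores matter here because a single accuracy figure can hide categories that the model rarely predicts correctly, which is a common situation when some emoji categories have few examples.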

In practice

Use MARBERT for Arabic text analysis.
Collect dialect-specific datasets.
Classify text into word categories.
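
One simple way to explore how word categories relate to emoji labels is to tally which categories appear in tweets carrying each label. The sketch below is an assumption-laden toy, not the study's pipeline: the lexicon `WORD_CATEGORIES`, the category names, the transliterated tokens, and the sample tweets are all hypothetical, and real dialectal Arabic text would need proper tokenization and normalization first.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon: transliterated token -> word category.
WORD_CATEGORIES = {
    "mabsoot": "emotion",
    "zaalan": "emotion",
    "qahwa": "food",
    "akl": "food",
}


def tally_categories(samples, lexicon):
    """Count word categories per emoji label.

    samples: iterable of (text, emoji_label) pairs.
    Returns a dict mapping emoji_label -> Counter of word categories.
    Tokens missing from the lexicon are ignored.
    """
    tally = defaultdict(Counter)
    for text, emoji_label in samples:
        for token in text.lower().split():
            category = lexicon.get(token)
            if category is not None:
                tally[emoji_label][category] += 1
    return dict(tally)


# Invented sample tweets paired with numerically encoded emoji labels.
samples = [
    ("ana mabsoot el yom", 1),
    ("qahwa w akl", 2),
    ("ana zaalan", 1),
    ("mabsoot b el qahwa", 1),
]

tally = tally_categories(samples, WORD_CATEGORIES)
for emoji_label, counts in sorted(tally.items()):
    print(emoji_label, counts.most_common())
```

A tally like this is only a descriptive starting point. It shows co-occurrence, not influence, so any claim about which category matters most would still need a proper model or statistical test.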

Topics

Machine Learning
Emoji Prediction
Arabic Dialects
MARBERT Model
Natural Language Processing

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.