JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Social Sciences & Behavioral Studies · Depth: Advanced, medium

Summary

JobArabi is a new, large-scale Arabic corpus of 20,528 job announcements gathered from public posts on X between January 2024 and October 2025. The dataset captures nearly two years of employment discourse across Arabic-speaking online communities. It was compiled using a linguistically informed query framework built on 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal recruitment language. Each post carries metadata such as timestamps, engagement indicators, and geolocation, which enables detailed temporal and regional analysis. Quantitative analysis of JobArabi revealed significant sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional differences in occupational demand, and the emotional framing of recruitment messages. The corpus, along with its documentation and collection scripts, will be publicly released to support research in Arabic NLP, computational social science, and digital labor studies.
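To make the temporal and regional analysis concrete, here is a minimal sketch of how posts with timestamp and location metadata could be aggregated. The field names (`created_at`, `region`, `likes`) and the three toy records are assumptions for illustration only; the summary does not describe the released schema.

```python
from collections import Counter
from datetime import datetime

# Toy records with hypothetical field names; not taken from the corpus.
posts = [
    {"created_at": "2024-01-15T09:30:00", "region": "SA", "likes": 12},
    {"created_at": "2024-01-20T14:00:00", "region": "EG", "likes": 3},
    {"created_at": "2024-02-02T08:10:00", "region": "SA", "likes": 7},
]

def count_by_month(records):
    """Count posts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(
        datetime.fromisoformat(r["created_at"]).strftime("%Y-%m")
        for r in records
    )

def count_by_region(records):
    """Count posts per region code."""
    return Counter(r["region"] for r in records)

monthly = count_by_month(posts)
regional = count_by_region(posts)
```

The same grouping keys could be combined (month and region together) to look at how regional demand shifts over time.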

Key takeaway

Arabic NLP researchers and computational social scientists building language resources should consider integrating the JobArabi corpus into their projects once it is released. The dataset offers a way to analyze real-world Arabic employment discourse, including sociolinguistic patterns such as gendered hiring language and regional differences in occupational demand. Working with JobArabi can improve how models handle nuanced recruitment language and can inform studies of labor market communication.

Key insights

JobArabi provides a unique Arabic social media corpus for analyzing labor market communication and sociolinguistic patterns.

Method

The corpus was compiled using a linguistically informed query framework built on 21 Arabic keyword families that capture gendered, plural, formal, and dialectal expressions.
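The idea of a keyword family can be sketched as follows: each family groups the morphological or dialectal variants of one recruitment term, and all variants are joined into a single boolean search query. The three families below are hypothetical examples chosen for illustration; the summary does not list the paper's actual 21 families, and the function names are not from the released scripts.

```python
# Hypothetical keyword families; the paper's actual 21 families
# are not listed in this summary.
KEYWORD_FAMILIES = {
    "wanted": ["مطلوب", "مطلوبة", "مطلوبين"],  # masculine, feminine, plural
    "job": ["وظيفة", "وظائف"],                 # singular, plural
    "work_dialectal": ["شغل"],                 # colloquial word for "work"
}

def build_query(families):
    """Join every variant of every family into one OR-group."""
    terms = [term for variants in families.values() for term in variants]
    return "(" + " OR ".join(terms) + ")"

def matched_families(text, families):
    """Return the sorted names of families with a variant occurring in text."""
    return sorted(
        name
        for name, variants in families.items()
        if any(variant in text for variant in variants)
    )

query = build_query(KEYWORD_FAMILIES)
hits = matched_families("مطلوب موظفين للعمل", KEYWORD_FAMILIES)
```

Tagging each collected post with the families it matched, as `matched_families` does, would also make it possible to compare how often gendered or dialectal variants appear. Plain substring matching is used here for brevity; a real pipeline would likely normalize Arabic text first.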

Best for: NLP Engineer, AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.