SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
Summary
Sahm is presented as the first Arabic financial NLP benchmark designed to evaluate Large Language Models (LLMs) on tasks that combine modern finance and Islamic jurisprudence. It comprises 14,380 expert-verified instances across seven tasks: AAOIFI standards QA, fatwa-based QA, fatwa-based MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning. The benchmark targets a significant gap in Arabic financial NLP: there are 422 million Arabic speakers, Gulf sovereign wealth funds hold $4.9 trillion, and the Islamic finance industry is worth $4-5 trillion. An evaluation of 20 LLMs showed that Arabic fluency does not guarantee financial reasoning. Models scored up to 91% on recognition tasks but dropped sharply on generation tasks, particularly event-cause reasoning (1.89–9.84/10). Fine-tuning on Sahm improved performance substantially: domain-adapted 7-8B models surpassed GPT-5 on some financial reasoning tasks and matched 72B open-source baselines.
Key takeaway
AI engineers building financial NLP solutions for Arabic-speaking markets should evaluate their models on Sahm. Even models with strong Arabic fluency are likely to underperform on complex financial and Shari'ah-compliant reasoning tasks unless they are adapted to the domain. Fine-tuning on datasets like Sahm can bring smaller models to competitive performance against frontier models and make Arabic financial assistants more reliable, especially on generative tasks, where models tend to become verbose when they are uncertain.
Key insights
Arabic LLMs require specialized financial and Shari'ah-compliant reasoning benchmarks, as fluency alone does not ensure domain competence.
Principles
- Arabic fluency does not imply financial reasoning.
- Domain adaptation rivals scale for Arabic financial NLP.
- Recognition and generation tasks tap different competencies.
Method
Sahm is built with a hybrid LLM-human pipeline: LLMs help generate the data, and experts verify it, to ensure linguistic accuracy and legal fidelity across diverse financial and juristic sources.
In practice
- Fine-tune Arabic LLMs on Sahm to improve financial reasoning.
- Distinguish recognition vs. generation in model evaluation.
- Treat unusually long responses as a possible sign of low answer confidence.
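The last two points can be sketched in a few lines of Python. This is a minimal illustration, not Sahm's evaluation code: the task names, the record layout (`task`, `score`, `response`), and the `ratio` threshold are all assumptions made for the example. It assumes recognition tasks are scored 0/1 and generation tasks on a 0-10 scale, as in the summary above.

```python
from statistics import mean, median

# Hypothetical task groupings; these names are illustrative,
# not Sahm's official task identifiers.
RECOGNITION_TASKS = {"fatwa_mcq", "sentiment"}
GENERATION_TASKS = {"fatwa_qa", "event_cause"}


def summarize(results):
    """Report recognition accuracy and generation score separately.

    Each result is a dict with keys "task", "score", and "response".
    Recognition scores are assumed to be 0/1; generation scores 0-10.
    """
    recog = [r["score"] for r in results if r["task"] in RECOGNITION_TASKS]
    gen = [r["score"] for r in results if r["task"] in GENERATION_TASKS]
    return {
        "recognition_accuracy": mean(recog) if recog else None,
        "generation_mean_score": mean(gen) if gen else None,
    }


def flag_verbose(results, ratio=2.0):
    """Return generation results longer than ratio x the median word count.

    The threshold is an assumed heuristic; length is only a weak
    proxy for uncertainty and should be checked against real outputs.
    """
    gen = [r for r in results if r["task"] in GENERATION_TASKS]
    if not gen:
        return []
    lengths = [len(r["response"].split()) for r in gen]
    cutoff = ratio * median(lengths)
    return [r for r, n in zip(gen, lengths) if n > cutoff]
```

Keeping the two metrics in separate fields avoids averaging a high recognition accuracy with a low generation score into a single misleading number. Whitespace word counts are a rough length measure for Arabic text, so a tokenizer-based count may be a better fit in practice.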
Topics
- Sahm Benchmark
- Arabic Financial NLP
- Islamic Finance
- Shari'ah Reasoning
- LLM Evaluation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.