SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

2025-10-06 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Financial Natural Language Processing · Depth: Expert, extended

Summary

Sahm is presented as the first Arabic financial NLP benchmark designed to evaluate Large Language Models (LLMs) on tasks that combine modern finance and Islamic jurisprudence. It comprises 14,380 expert-verified instances across seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning. The benchmark addresses a significant gap in Arabic financial NLP: there are 422 million Arabic speakers, roughly $4.9 trillion in Gulf sovereign wealth fund assets, and a $4–5 trillion Islamic finance industry. An evaluation of 20 LLMs showed that Arabic fluency does not guarantee financial reasoning. Models scored up to 91% on recognition tasks but dropped sharply on generation, particularly event-cause reasoning, where scores ranged from 1.89 to 9.84 out of 10. Fine-tuning on Sahm significantly improved performance: domain-adapted 7-8B models surpassed GPT-5 on some financial reasoning tasks and matched 72B open-source baselines.

Key takeaway

AI engineers building financial NLP solutions for Arabic-speaking markets should evaluate their models on Sahm. Even models with strong Arabic fluency are likely to underperform on complex financial and Shari'ah-compliant reasoning tasks without domain adaptation. Prioritize fine-tuning on datasets like Sahm to stay competitive with frontier models and to make Arabic financial assistants more reliable, especially on generative tasks, where models tend to become verbose when they are uncertain.

Key insights

Arabic LLMs require specialized financial and Shari'ah-compliant reasoning benchmarks, as fluency alone does not ensure domain competence.

Principles

Arabic fluency does not imply financial reasoning.
Domain adaptation rivals scale for Arabic financial NLP.
Recognition and generation tasks tap different competencies.
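
One way to respect the third principle in an evaluation report is to keep the two score scales apart instead of averaging them together. The following is a minimal sketch, not the paper's evaluation code: the record layout, the task labels, and the scores are hypothetical, with recognition scored as 0/1 accuracy and generation scored on a 0-10 scale as in the summary above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-example results; task names and scores are illustrative only.
results = [
    {"task": "fatwa_mcq", "kind": "recognition", "score": 1.0},
    {"task": "fatwa_mcq", "kind": "recognition", "score": 0.0},
    {"task": "sentiment", "kind": "recognition", "score": 1.0},
    {"task": "event_cause", "kind": "generation", "score": 4.0},
    {"task": "event_cause", "kind": "generation", "score": 6.0},
]


def summarize_by_kind(rows):
    """Average scores per task kind so accuracy and 0-10 ratings are never mixed."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["kind"]].append(row["score"])
    return {kind: mean(scores) for kind, scores in buckets.items()}


summary = summarize_by_kind(results)
print(summary)
```

Reporting the two numbers side by side makes a gap between recognition and generation visible, which a single blended average would hide.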

Method

Sahm is built with a hybrid LLM-human pipeline: LLMs assist with data generation, and experts verify the results to ensure linguistic accuracy and legal fidelity across diverse financial and juristic sources.

In practice

Fine-tune Arabic LLMs on Sahm to improve financial reasoning.
Distinguish recognition vs. generation in model evaluation.
Use response length as a signal for answer confidence.
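
The last point can be sketched as a simple length check. This is an illustrative heuristic under stated assumptions, not a method from the paper: the function name `flag_verbose`, the whitespace word count, and the 2x-median cutoff are all hypothetical choices, and the English strings stand in for Arabic model outputs.

```python
from statistics import median


def flag_verbose(responses, ratio=2.0):
    """Return indices of responses whose word count exceeds ratio * median length."""
    lengths = [len(r.split()) for r in responses]  # crude whitespace word count
    cutoff = ratio * median(lengths)  # assumed threshold, tune on real data
    return [i for i, n in enumerate(lengths) if n > cutoff]


# Placeholder outputs; in practice these would be Arabic model responses.
responses = [
    "Profit is permissible.",
    "The contract is valid under its conditions.",
    "It could be rising prices, or perhaps a policy change, or possibly "
    "other factors that are not entirely clear from the text.",
]

flagged = flag_verbose(responses)
print(flagged)
```

Flagged responses are candidates for manual review or a lower confidence weight. Length alone does not establish that an answer is wrong.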

Topics

Sahm Benchmark
Arabic Financial NLP
Islamic Finance
Shari'ah Reasoning
LLM Evaluation

Code references

rania-hossam/SAHM

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.