SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Financial Natural Language Processing · Depth: Advanced, quick

Summary

Researchers have introduced SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial Natural Language Processing (NLP) and Shari'ah-compliant reasoning. It addresses the relative scarcity of Arabic financial NLP resources despite strong demand for reliable finance and Islamic-finance AI assistants. SAHM comprises 14,380 expert-verified instances across seven tasks, including AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, all sourced from authentic regulatory, juristic, and corporate documents. Evaluations of 19 open and proprietary large language models (LLMs) showed that Arabic fluency does not consistently translate into strong evidence-grounded financial reasoning. Significant performance gaps appeared in generation and causal reasoning tasks, particularly event-cause reasoning.

Key takeaway

Research scientists developing Arabic large language models should prioritize evidence-grounded financial and Shari'ah-compliant reasoning, particularly for generative and causal tasks. The SAHM benchmark offers a tool for evaluating and instruction-tuning models, and its results show that linguistic fluency alone is insufficient for complex financial applications. Focus on improving performance on tasks such as event-cause reasoning to build trustworthy financial AI assistants.

Key insights

Arabic LLMs struggle with evidence-grounded financial reasoning despite fluency, especially in generation and causal tasks.

Principles

Arabic fluency does not guarantee financial reasoning.
Recognition tasks are easier for LLMs than generation.

Method

SAHM curates 14,380 expert-verified instances from regulatory, juristic, and corporate sources, spanning seven tasks, to benchmark Arabic financial NLP and Shari'ah-compliant reasoning.

In practice

Use SAHM for Arabic financial NLP evaluation.
Focus LLM development on causal reasoning.
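
As a rough illustration of the first point, per-task scoring is one way to surface the kind of gap described above, where a model does well on some tasks and poorly on others. The sketch below is a minimal example under stated assumptions: the record fields (task, gold, prediction), the task names, and the toy records are hypothetical and do not reflect SAHM's actual data format or results. Exact-match accuracy only suits label-style tasks such as MCQ or sentiment; generation tasks such as summarization would need a different metric.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: fraction of records whose prediction equals the gold label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Exact match after trimming whitespace; suitable for label-style answers only.
        if rec["prediction"].strip() == rec["gold"].strip():
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy, made-up records for illustration only; not SAHM data or results.
# Field names and task names are hypothetical.
toy_records = [
    {"task": "fatwa_mcq", "gold": "B", "prediction": "B"},
    {"task": "fatwa_mcq", "gold": "A", "prediction": "C"},
    {"task": "sentiment", "gold": "positive", "prediction": "positive"},
]

scores = per_task_accuracy(toy_records)
for task, acc in sorted(scores.items()):
    print(f"{task}: {acc:.2f}")
```

Reporting scores per task rather than as a single average keeps weak areas, such as causal reasoning, from being hidden by stronger results elsewhere.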

Topics

SAHM Benchmark
Arabic Financial NLP
Shari'ah Reasoning
Large Language Models
Financial Question Answering

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.