IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Financial Regulatory AI · Depth: Expert, quick

Summary

IndiaFinBench is introduced as the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. This benchmark addresses a gap in existing financial NLP benchmarks, which primarily use Western financial corpora. IndiaFinBench comprises 406 expert-annotated question-answer pairs derived from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). It covers four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality was validated with a model-based secondary pass (kappa=0.918) and human inter-annotator agreement (kappa=0.611; 76.7% agreement). Twelve models were evaluated under zero-shot conditions, with accuracies ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash), all significantly outperforming a 60.0% human baseline. Numerical reasoning proved the most discriminative task.
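The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below sums the per-task item counts and back-solves the chance-agreement rate implied by the human agreement numbers. It assumes the reported kappa is Cohen's kappa, defined as (p_o - p_e) / (1 - p_e); the summary does not say which variant was used, and the function and variable names are illustrative only.

```python
# Task composition as reported in the summary.
TASK_COUNTS = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}


def task_shares(counts):
    """Return each task's fraction of the full benchmark."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


def implied_chance_agreement(p_o, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e.

    Assumption: the reported statistic is Cohen's kappa.
    """
    return (p_o - kappa) / (1.0 - kappa)


total_items = sum(TASK_COUNTS.values())
shares = task_shares(TASK_COUNTS)

# Human inter-annotator figures from the summary: 76.7% raw agreement, kappa = 0.611.
p_e = implied_chance_agreement(p_o=0.767, kappa=0.611)

print(f"total items: {total_items}")
for task, share in shares.items():
    print(f"{task}: {share:.1%}")
print(f"implied chance agreement: {p_e:.3f}")
```

Under that assumption, the four task counts add up to the stated 406 items, and the human agreement figures imply that roughly 40% of agreement would be expected by chance.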

Key takeaway

For AI engineers and research scientists building LLMs for financial applications in India, IndiaFinBench offers a way to measure model performance on SEBI and RBI regulatory text. Evaluate your models against it to check accuracy on regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning in Indian financial contexts. Because the dataset is publicly available, it can also guide targeted fine-tuning, potentially leading to more robust and compliant financial AI solutions.

Key insights

IndiaFinBench provides the first public benchmark for LLMs on Indian financial regulatory text.

Method

The benchmark was created by expert-annotating 406 Q&A pairs from SEBI/RBI documents across four task types, then validating annotations via model-based and human agreement checks.
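To make the evaluation side concrete, here is a minimal sketch of how per-task and overall accuracy might be tallied for one model's zero-shot outputs. The record keys (`task`, `gold`, `pred`), the `normalize` helper, and the exact-match scoring rule are all assumptions made for illustration; the summary does not describe the dataset schema or the scoring procedure the authors used.

```python
from collections import defaultdict


def normalize(answer):
    """Hypothetical normalization: trim, lowercase, and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def accuracy_by_task(records):
    """Compute per-task and overall exact-match accuracy.

    records: iterable of dicts with the assumed keys 'task', 'gold', 'pred'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if normalize(record["pred"]) == normalize(record["gold"]):
            correct[record["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall


# Toy records, invented purely to exercise the function.
sample = [
    {"task": "numerical_reasoning", "gold": "15%", "pred": " 15% "},
    {"task": "numerical_reasoning", "gold": "30 days", "pred": "45 days"},
    {"task": "temporal_reasoning", "gold": "Yes", "pred": "yes"},
    {"task": "contradiction_detection", "gold": "No", "pred": "No"},
]

per_task, overall = accuracy_by_task(sample)
print(per_task)
print(overall)
```

Reporting accuracy per task rather than only in aggregate matters here, since the summary identifies numerical reasoning as the task that best separates models.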

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.