TW-LegalBench: Measuring Taiwanese Legal Understanding

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

TW-LegalBench is a new benchmark for evaluating how well large language models (LLMs) understand the Taiwanese legal system, addressing a gap in evaluations for civil-law jurisdictions and Traditional Chinese. It comprises three components: over 16,000 multiple-choice questions drawn from five years of official examinations across 18 professional domains; 117 open-ended essay questions for legal professionals, each with a scoring rubric; and more than 14,000 legal judgment prediction instances covering 107 crime categories. In an evaluation of 13 LLMs, top models such as claude-sonnet-4.5 and qwen3-235b exceeded the 11% passing threshold for qualified lawyers but fell short of the more selective 1-2% threshold for judges and prosecutors. Models predicted verdicts and sentences reasonably well but struggled with precise statutory citation, achieving less than 10% accuracy on that task. Notably, models trained on Traditional Chinese or Taiwan-specific legal corpora consistently outperformed larger general-purpose LLMs on the open-ended and judgment prediction tasks.
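To make the statutory-citation result concrete, here is a minimal sketch of how a predicted set of cited articles might be scored against the gold set. The summary does not say how the benchmark defines citation accuracy, so the exact-match and set-overlap metrics below are assumptions, as are the function names `normalize_citation` and `citation_scores` and the example citation strings.

```python
import re


def normalize_citation(citation: str) -> str:
    """Collapse whitespace and case so trivially different strings compare equal."""
    return re.sub(r"\s+", " ", citation).strip().lower()


def citation_scores(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Score predicted statute citations against gold citations.

    Assumed metrics (not taken from the paper): strict exact match on the
    whole set, plus set-level precision, recall, and F1.
    """
    pred = {normalize_citation(c) for c in predicted}
    ref = {normalize_citation(c) for c in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "exact_match": float(pred == ref),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative strings only; they are not instances from the benchmark.
gold = ["Criminal Code Art. 320", "Criminal Code Art. 321"]
pred = ["Criminal Code Art. 320", "Criminal Code Art. 339"]
print(citation_scores(pred, gold))
```

Under a strict exact-match criterion a single wrong article zeroes the score even when part of the citation is right, which is one plausible reason citation accuracy can sit far below verdict accuracy.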

Key takeaway

For AI Scientists and Machine Learning Engineers developing legal LLMs, this research highlights the need for specialized training data. Even top-tier models are likely to pass the lawyer qualification exam yet struggle with the precise statutory citation and complex reasoning required for judicial roles. Focus development on incorporating extensive, jurisdiction-specific legal corpora, particularly for civil-law systems, and prioritize accurate generation of statutory articles to narrow the performance gap with human experts.

Key insights

LLMs demonstrate proficiency in legal qualification exams but struggle with precise statutory citation and advanced reasoning in civil-law contexts.

Method

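TW-LegalBench evaluates LLMs with three kinds of measures: accuracy on the 16,000+ multiple-choice questions (MCQs), a decomposed LLM-as-Judge framework for the 117 open-ended questions (OEQs), and sentencing-accuracy and statute-citation metrics for the 14,000+ legal judgment prediction (LJP) instances.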
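The first two measures can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: "decomposed" is read here as the judge model scoring each rubric criterion separately, with the per-criterion points then summed. The judge call itself is omitted, and `RubricItem`, `mcq_accuracy`, and `aggregate_rubric` are hypothetical names.

```python
from dataclasses import dataclass


def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


@dataclass
class RubricItem:
    """One criterion of an essay rubric (hypothetical structure)."""
    description: str
    max_points: float


def aggregate_rubric(items: list[RubricItem], awarded: list[float]) -> float:
    """Combine per-criterion judge scores into a fraction of the total points.

    `awarded` is assumed to hold one score per criterion, produced by a judge
    model that graded each criterion on its own. Scores are clamped to
    [0, max_points] so an out-of-range judge output cannot skew the total.
    """
    if len(items) != len(awarded):
        raise ValueError("need exactly one awarded score per rubric item")
    total = sum(item.max_points for item in items)
    earned = sum(
        min(max(score, 0.0), item.max_points)
        for item, score in zip(items, awarded)
    )
    return earned / total if total else 0.0


# Made-up rubric and scores, for illustration only.
rubric = [RubricItem("identifies the issue", 4), RubricItem("applies the statute", 6)]
print(mcq_accuracy(["A", "b", "C"], ["A", "B", "D"]))
print(aggregate_rubric(rubric, [3, 7]))
```

Scoring each criterion separately keeps every judge decision small and auditable, at the cost of one judge call per criterion.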

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.