TW-LegalBench: Measuring Taiwanese Legal Understanding
Summary
TW-LegalBench is a benchmark for evaluating how well large language models (LLMs) understand the Taiwanese legal system, addressing a gap in evaluations that cover civil-law jurisdictions and Traditional Chinese. It has three components: over 16,000 multiple-choice questions drawn from five years of official examinations across 18 professional domains; 117 open-ended essay questions for legal professionals, accompanied by scoring rubrics; and more than 14,000 legal judgment prediction instances covering 107 crime categories. In an evaluation of 13 LLMs, top models such as claude-sonnet-4.5 and qwen3-235b cleared the 11% passing threshold for qualified lawyers but fell short of the far more selective 1-2% threshold for judges and prosecutors. Models predicted verdicts and sentences reasonably well but struggled with precise statutory citation, where accuracy stayed below 10%. Notably, models trained on Traditional Chinese or Taiwan-specific legal corpora consistently outperformed larger general-purpose LLMs on open-ended tasks and judgment prediction.
Key takeaway
For AI scientists and machine learning engineers developing legal LLMs, this research underscores the need for specialized training data. Even top-tier models are likely to pass basic legal qualification exams yet struggle with the precise statutory citation and complex reasoning required for judicial roles. Focus development on extensive, jurisdiction-specific legal corpora, particularly for civil-law systems, and prioritize accurate generation of statutory articles to narrow the significant performance gap with human experts.
Key insights
LLMs demonstrate proficiency in legal qualification exams but struggle with precise statutory citation and advanced reasoning in civil-law contexts.
Principles
- Jurisdiction-specific data improves LLM legal performance.
- Civil-law reasoning differs from common-law precedent matching.
- Open-ended legal tasks reveal deeper reasoning gaps.
Method
TW-LegalBench evaluates LLMs on three tasks: accuracy on the 16,000+ multiple-choice questions (MCQs), a decomposed LLM-as-Judge framework for the 117 open-ended questions (OEQs), and sentencing-accuracy and statute-citation metrics for the 14,000+ legal judgment prediction (LJP) instances.
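To make the multiple-choice and statute-citation metrics concrete, here is a minimal Python sketch. It is not the benchmark's official scorer: the function names (`mcq_accuracy`, `citation_prf`, `exact_citation_match`) are hypothetical, and treating citations as a set of article strings compared by exact match is an assumption, since the summary does not say how the paper normalizes or scores citations. The article strings in the usage note are illustrative placeholders only.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def mcq_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def citation_prf(
    predicted: Iterable[str], gold: Iterable[str]
) -> tuple[float, float, float]:
    """Set-level precision, recall, and F1 over cited articles for one case.

    Assumption: articles are already normalized to identical strings.
    """
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def exact_citation_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Strict variant: the cited set must equal the gold set exactly."""
    return set(predicted) == set(gold)
```

For example, `mcq_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])` gives 0.75, and `citation_prf(["Criminal Code Art. 320", "Criminal Code Art. 321"], ["Criminal Code Art. 320"])` gives a precision of 0.5 and a recall of 1.0. A strict exact-match criterion penalizes that one extra article fully, which is one plausible reason citation scores can sit far below verdict accuracy.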
In practice
- Prioritize fine-tuning LLMs with jurisdiction-specific legal corpora.
- Develop robust methods for precise statutory citation.
- Use open-ended tasks to assess complex legal reasoning.
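One way to picture rubric-based scoring of open-ended answers is the sketch below, which scores each rubric criterion separately and then combines the results by weight. This is an assumption about what a decomposed judge could look like, not the paper's implementation: `RubricItem`, `score_essay`, and `keyword_judge` are hypothetical names, and the keyword-matching judge is a toy stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str     # what the grader looks for in the answer
    max_points: float  # weight of this criterion in the total score


# A judge maps (essay, criterion) to a fraction of credit in [0, 1].
# In practice this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str, str], float]


def score_essay(essay: str, rubric: List[RubricItem], judge: Judge) -> float:
    """Score each criterion independently, then return the weighted fraction."""
    total = sum(item.max_points for item in rubric)
    if total == 0:
        return 0.0
    earned = 0.0
    for item in rubric:
        # Clamp the judge's output so a misbehaving judge cannot skew the total.
        credit = min(1.0, max(0.0, judge(essay, item.criterion)))
        earned += credit * item.max_points
    return earned / total


def keyword_judge(essay: str, criterion: str) -> float:
    """Toy stand-in judge: full credit if the criterion text appears verbatim."""
    return 1.0 if criterion.lower() in essay.lower() else 0.0
```

With a rubric of `RubricItem("good faith", 6)` and `RubricItem("Article 184", 4)`, an essay that mentions only Article 184 earns 4 of 10 points, a score of 0.4. Scoring criteria one at a time keeps each judgment narrow and auditable, at the cost of one judge call per criterion.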
Topics
- Large Language Models
- Legal AI Benchmarking
- Taiwanese Law
- Civil Law Systems
- Statutory Interpretation
- Legal Judgment Prediction
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.