TW-LegalBench: Measuring Taiwanese Legal Understanding
Summary
TW-LegalBench is a new benchmark for evaluating how well large language models (LLMs) understand the Taiwanese legal system, addressing a gap in jurisdiction-specific legal reasoning. It comprises three task types: over 16,000 multiple-choice questions (MCQs) drawn from five years of official examinations across 18 professional domains; 117 open-ended essay questions (OEQs) for legal professionals, with official scoring rubrics; and more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. The researchers evaluated 13 LLMs using accuracy for MCQs, an LLM-as-Judge framework for OEQs, and sentencing-accuracy and statute-citation metrics for LJP. Results indicate that top-performing models clear the bar set by the 11% pass rate for qualified lawyers but fall short of the bar set by the 1-2% pass rate for judges and prosecutors. In LJP, models predict verdicts and sentences reasonably well but struggle to cite the exact legal articles, which highlights how hard reliable legal text generation remains.
Key takeaway
Legal tech developers building LLM-powered tools for specific jurisdictions should recognize that current models, while able to pass lawyer-level exams, still lack the precision required for judge-level reasoning and exact legal article citation. Prioritize fine-tuning on comprehensive, jurisdiction-specific legal corpora, and build robust mechanisms for accurate statute referencing. Focus development on making legal text generation reliable enough to meet professional standards, not only on general comprehension.
Key insights
LLMs demonstrate strong Taiwanese legal exam performance but struggle with precise legal article citation and judge-level reasoning.
Principles
- LLM legal reasoning varies by jurisdiction.
- Exam performance varies by professional role: models clear the lawyer bar but not the judge and prosecutor bar.
- Exact legal citation remains a challenge for LLMs.
Method
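TW-LegalBench evaluates LLMs on three task types: multiple-choice questions scored by accuracy, open-ended essays scored against official rubrics with an LLM-as-Judge framework, and legal judgment prediction scored on sentencing accuracy and statute citation.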
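The summary does not spell out the exact metric definitions, so the following is only a minimal sketch of how two of these scores might be computed: plain accuracy for multiple-choice answers, and set-based precision, recall, and F1 for cited statutes. The function names, the article-label strings, and the choice of F1 are assumptions for illustration, not the benchmark's actual implementation.

```python
from typing import Iterable, Sequence


def mcq_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


def citation_prf(
    predicted: Iterable[str], gold: Iterable[str]
) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over cited statute articles.

    Assumption: a citation counts as correct only on an exact label match.
    """
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

Under this sketch, a model that cites one correct article plus one spurious article gets full recall but only half precision, which is one way the "reasonable verdict, imprecise citation" pattern could show up in the numbers.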
In practice
- Benchmark LLMs on jurisdiction-specific law.
- Use LLM-as-Judge for open-ended legal tasks.
- Focus LLM development on precise legal citation.
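As a rough illustration of the LLM-as-Judge item above, the sketch below scores an essay against a rubric by asking a judge for points per rubric item, clamping each award to the item's range, and normalizing by the total. Everything here is hypothetical: the `RubricItem` structure, the `score_essay` function, and the toy `keyword_judge` stand-in are assumptions, and a real setup would replace the stand-in with a call to a judge model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str   # what the rubric asks for (hypothetical wording)
    max_points: float  # points available for this item


# A judge takes (answer, rubric item) and returns awarded points.
# In practice this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, RubricItem], float]


def score_essay(answer: str, rubric: List[RubricItem], judge: Judge) -> float:
    """Return the essay score as a fraction of total rubric points."""
    total = sum(item.max_points for item in rubric)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    awarded = 0.0
    for item in rubric:
        raw = judge(answer, item)
        # Clamp so a misbehaving judge cannot leave the item's valid range.
        awarded += min(max(raw, 0.0), item.max_points)
    return awarded / total


def keyword_judge(answer: str, item: RubricItem) -> float:
    """Toy stand-in for an LLM judge: full points if the description appears verbatim."""
    return item.max_points if item.description.lower() in answer.lower() else 0.0
```

Keeping the judge as a plain callable makes it easy to swap models or to compare judge outputs against human graders on a small sample before trusting the scores.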
Topics
- Large Language Models
- Legal AI
- Taiwanese Law
- Benchmarking
- Legal Judgment Prediction
- Natural Language Understanding
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.