HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

The Hong Kong Judgment Discourse Dataset (HKJudge) is presented as the first sentence-level, expert-annotated legal discourse corpus for Hong Kong judgments. It covers criminal judgments from all five levels of the court hierarchy and contains approximately 290,000 sentences and 6.5 million tokens. A two-tier discourse schema assigns each sentence one of 26 rhetorical roles and annotates three sentencing elements (charge, imprisonment term, fine) at the span level. Ten legal linguistics annotators achieved an inter-annotator agreement of κ=0.8. The work formulates two tasks, rhetorical role classification and legal element extraction, and reports benchmark evaluations across four BERT-based models, two open-source LLMs, and four commercial LLMs.
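
To make the agreement figure concrete, here is a minimal sketch of how a κ score is computed from two annotators' sentence labels. It uses Cohen's kappa, the two-annotator form, purely for brevity; the summary does not say which κ variant the authors used for their ten annotators, and the role names and labels below are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of sentences given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels (not from HKJudge): the annotators disagree on one sentence.
annotator_1 = ["FACT", "FACT", "FACT", "FACT", "REASONING",
               "REASONING", "REASONING", "RULING", "RULING", "RULING"]
annotator_2 = ["FACT", "FACT", "FACT", "REASONING", "REASONING",
               "REASONING", "REASONING", "RULING", "RULING", "RULING"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Because κ discounts agreement expected by chance, it is lower than the raw share of matching labels, which is 0.9 in this toy example.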

Key takeaway

For NLP engineers and AI scientists building legal AI systems, this work underlines the value of expert-annotated discourse corpora. Consider adapting HKJudge's two-tier annotation schema to your own legal text analysis projects. Combining sentence-level rhetorical roles with span-level elements may help models that predict judgment outcomes or extract specific legal elements from complex court documents.

Key insights

Expert-annotated legal discourse corpora give models explicit sentence-level structure to learn from when interpreting court judgments.

Method

A two-tier discourse schema assigns 26 rhetorical roles at the sentence level and annotates three sentencing elements (charge, imprisonment term, fine) at the span level, applied by legal linguistics experts.
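
One way to picture the two tiers is as a small data structure: each sentence carries a single rhetorical role, plus zero or more typed spans. The sketch below is an assumption about how such annotations could be represented, not the dataset's actual file format. The class names, the role label `SENTENCE_IMPOSED`, and the example sentence are hypothetical; only the three element types come from the summary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three span-level element types named in the summary
# (the identifier spellings are an assumption).
ELEMENT_TYPES = {"charge", "imprisonment_term", "fine"}


@dataclass
class ElementSpan:
    """Tier 2: a sentencing element marked as a character span."""
    start: int  # inclusive character offset within the sentence
    end: int    # exclusive character offset within the sentence
    element_type: str

    def __post_init__(self) -> None:
        if self.element_type not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {self.element_type}")
        if not 0 <= self.start < self.end:
            raise ValueError("span offsets must satisfy 0 <= start < end")


@dataclass
class AnnotatedSentence:
    """Tier 1: one rhetorical role per sentence, plus any tier-2 spans."""
    text: str
    role: str  # one of the 26 roles; the summary does not list them
    spans: List[ElementSpan] = field(default_factory=list)

    def elements(self) -> List[Tuple[str, str]]:
        """Return (element_type, surface_text) pairs for this sentence."""
        return [(s.element_type, self.text[s.start:s.end]) for s in self.spans]


# Invented example sentence, not taken from the corpus.
text = "The defendant is sentenced to 18 months' imprisonment and fined $5,000."
term, fine = "18 months' imprisonment", "$5,000"
sentence = AnnotatedSentence(
    text=text,
    role="SENTENCE_IMPOSED",  # hypothetical role label
    spans=[
        ElementSpan(text.index(term), text.index(term) + len(term), "imprisonment_term"),
        ElementSpan(text.index(fine), text.index(fine) + len(fine), "fine"),
    ],
)
print(sentence.elements())
```

Keeping the role on the sentence and the elements as offsets into it mirrors the two tasks in the summary: rhetorical role classification reads `role`, while legal element extraction reads `spans`.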

Best for: Research Scientist, NLP Engineer, AI Scientist, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.