Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Summary
This work addresses an underexplored question: how well large language models (LLMs) capture cross-sentence paradigmatic patterns, specifically verb alternations. The authors introduce curated, paradigm-based datasets for English, German, Italian, and Hebrew, focusing on phenomena such as change-of-state and object-drop constructions and Hebrew binyanim. These datasets consist of thousands of Blackbird Language Matrices (BLM) problems, controlled linguistic puzzles designed to test syntactic and semantic rule completion. The work also presents three types of templates and applies linguistically informed data augmentation strategies to both synthetic and natural data. Simple baseline results are provided, demonstrating the datasets' diagnostic utility for evaluating LLMs' systematic cross-sentence knowledge.
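To make the puzzle format concrete, here is a minimal sketch of how one BLM-style problem and a simple accuracy score might be represented. The class name `BLMProblem`, its fields, the `accuracy` helper, and the toy sentences are all hypothetical illustrations, not the authors' data format or evaluation code. The sketch assumes only that a problem pairs a set of context sentences following a pattern with a set of candidate completions, one of which is correct.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BLMProblem:
    # Hypothetical layout, not the datasets' actual schema.
    context: List[str]      # sentences that follow an underlying pattern
    candidates: List[str]   # one correct completion plus contrastive distractors
    answer_index: int       # position of the correct completion in candidates


def accuracy(problems: List[BLMProblem],
             choose: Callable[[BLMProblem], int]) -> float:
    """Fraction of problems where choose(problem) returns the correct index."""
    if not problems:
        return 0.0
    correct = sum(1 for p in problems if choose(p) == p.answer_index)
    return correct / len(problems)


# Toy change-of-state alternation; the sentences are invented for illustration.
toy = BLMProblem(
    context=[
        "The chef melted the butter.",
        "The butter melted.",
        "The child broke the vase.",
    ],
    candidates=[
        "The vase broke.",            # completes the alternation pattern
        "The child broke.",           # wrong argument becomes the subject
        "The vase broke the child.",  # arguments swapped
    ],
    answer_index=0,
)

# A trivial baseline that always picks the first candidate.
print(accuracy([toy], lambda p: 0))
```

Any model under evaluation could be plugged in as the `choose` function, which keeps the scoring independent of how the model selects a completion.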
Key takeaway
This work introduces curated, paradigm-based datasets for English, German, Italian, and Hebrew to diagnose how well large language models (LLMs) capture cross-sentence verb alternations. The datasets comprise thousands of Blackbird Language Matrices (BLM) problems, linguistic puzzles in the style of Raven's Progressive Matrices (RPM) and the Abstraction and Reasoning Corpus (ARC), and incorporate three template types and linguistically informed data augmentation. Baseline results confirm their diagnostic utility for evaluating LLM understanding of complex linguistic patterns beyond single-sentence phenomena.
Topics
- Large Language Models
- Verb Alternations
- Linguistic Datasets
- Data Augmentation
- Cross-sentence Patterns
Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.