Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

2026-03-16 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

This work addresses an underexplored question: whether large language models (LLMs) capture paradigmatic patterns that hold across sentences, specifically verb alternations. The authors introduce curated, paradigm-based datasets for English, German, Italian, and Hebrew, covering change-of-state and object-drop constructions as well as Hebrew binyanim. The datasets consist of thousands of Blackbird Language Matrix (BLM) problems, controlled linguistic puzzles designed to test the completion of syntactic and semantic rules. The authors also present three types of templates and apply linguistically informed data augmentation strategies to both synthetic and natural data. Simple baseline results are provided to show that the datasets are useful for diagnosing LLMs' systematic cross-sentence knowledge.
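To make the puzzle format concrete, here is a minimal sketch of a matrix-style problem for a change-of-state alternation. It is an illustration only, not the authors' data format or generation code: it assumes a problem consists of a context sequence of sentences plus a set of candidate answers, and the names `MatrixProblem`, `LEXICON`, and `build_problem`, the example sentences, and the two distractor types are all hypothetical.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class MatrixProblem:
    """Hypothetical container: context sentences plus candidate answers."""
    context: list[str]
    candidates: list[str]
    answer_index: int


# Hypothetical lexical items for a change-of-state alternation.
LEXICON = [
    {"agent": "The chef", "verb": "melted", "theme": "the butter"},
    {"agent": "The child", "verb": "broke", "theme": "the vase"},
    {"agent": "The wind", "verb": "opened", "theme": "the door"},
]


def transitive(item: dict[str, str]) -> str:
    """Causative frame: agent verb theme."""
    return f"{item['agent']} {item['verb']} {item['theme']}."


def intransitive(item: dict[str, str]) -> str:
    """Inchoative frame: the theme surfaces as the subject."""
    theme = item["theme"]
    return f"{theme[0].upper()}{theme[1:]} {item['verb']}."


def agent_only(item: dict[str, str]) -> str:
    """Distractor: keeps the agent as subject and drops the theme."""
    return f"{item['agent']} {item['verb']}."


def build_problem(items: list[dict[str, str]], rng: random.Random) -> MatrixProblem:
    """Fill one template: alternation pairs as context, last pair left open."""
    *context_items, target = items
    context: list[str] = []
    for item in context_items:
        context += [transitive(item), intransitive(item)]
    # The final pair is incomplete; the solver must supply its second half.
    context.append(transitive(target))
    correct = intransitive(target)
    candidates = [correct, agent_only(target), transitive(target)]
    rng.shuffle(candidates)
    return MatrixProblem(context, candidates, candidates.index(correct))


if __name__ == "__main__":
    problem = build_problem(LEXICON, random.Random(0))
    for sentence in problem.context:
        print(sentence)
    print("Candidates:", problem.candidates)
    print("Correct:", problem.candidates[problem.answer_index])
```

In this sketch, choosing the right candidate requires noticing the pattern that links the sentences in the context, which is the kind of cross-sentence knowledge the datasets are meant to probe. Swapping in different lexical items under the same template is one simple way such problems could be multiplied; the augmentation strategies actually used in the paper are not described in this summary.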

Key takeaway

The paper contributes curated, paradigm-based datasets in English, German, Italian, and Hebrew for diagnosing whether large language models (LLMs) capture verb alternations across sentences. The datasets comprise thousands of Blackbird Language Matrix (BLM) problems, linguistic puzzles in the style of RPM and ARC, built from three template types and linguistically informed data augmentation. Baseline results support their use for evaluating LLM understanding of linguistic patterns that span more than a single sentence.

Topics

Large Language Models
Verb Alternations
Linguistic Datasets
Data Augmentation
Cross-sentence Patterns

Code references

lingo-iitgn/awesome-code-mixing

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.