Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
Summary
Turkish idiomatic light verb constructions (LVCs) are difficult for multiword expression processing because their surface form resembles literal verb-object combinations. This study frames Turkish LVC detection as a binary classification task and evaluates it on a manually created, controlled dataset (N=147) of LVC positives and matched negatives. It compares a supervised Turkish encoder baseline (BERTurk with a classifier head) against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Zero-shot LLMs showed low LVC recall; one-shot prompting improved detection but introduced model-specific biases. A richer few-shot prompt improved calibration and yielded robust performance for GPT-OSS-20B and Qwen 2.5-14B. The results show strong prompt sensitivity: the supervised baseline remains competitive, and prompted LLMs can match or exceed it when given carefully constructed demonstrations.
Key takeaway
NLP engineers developing multiword expression classifiers should evaluate LLM prompting strategies carefully. Zero-shot prompting may underperform on specific classes (here, low LVC recall), and one-shot prompting can introduce strong model-specific biases. Prioritize richer few-shot demonstrations to achieve calibrated and robust performance, especially when working with models like GPT-OSS-20B or Qwen 2.5-14B. Benchmark LLM solutions against a strong supervised baseline to confirm that the results are competitive.
Key insights
Prompt sensitivity and demonstration quality critically impact LLM performance in metalinguistic classification tasks.
Principles
- One-shot prompting can induce strong, model-specific biases.
- Richer few-shot prompts improve LLM calibration and robustness.
- Supervised baselines remain competitive against prompted LLMs.
Method
The study frames Turkish LVC detection as binary classification, comparing a supervised BERTurk model to instruction-tuned LLMs using zero-shot, one-shot, and few-shot prompting on a controlled dataset.
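The three prompting conditions differ only in how many labeled demonstrations precede the query sentence. The sketch below illustrates that idea under stated assumptions: the instruction wording, the label names `LVC` and `LITERAL`, and the `build_prompt` helper are hypothetical and are not taken from the study, and the demonstration sentences are placeholders rather than real data.

```python
from typing import Sequence, Tuple

# Hypothetical label names; the study's actual label wording is not given here.
LABELS = ("LVC", "LITERAL")

# Hypothetical instruction text, written for illustration only.
INSTRUCTION = (
    "Decide whether the verb-object combination in the sentence is an "
    "idiomatic light verb construction (LVC) or a literal combination. "
    "Answer with exactly one label: LVC or LITERAL."
)


def build_prompt(sentence: str, demos: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a classification prompt.

    No demos gives a zero-shot prompt, one demo a one-shot prompt,
    and several demos a few-shot prompt.
    """
    for _, label in demos:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")

    parts = [INSTRUCTION]
    # Each demonstration is a (sentence, label) pair shown before the query.
    for demo_sentence, label in demos:
        parts.append(f"Sentence: {demo_sentence}\nLabel: {label}")
    # The query ends with an open label slot for the model to complete.
    parts.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(parts)


# Placeholder demonstrations; real ones would be Turkish sentences
# drawn from data that does not overlap with the test set.
demos = [
    ("<sentence containing an LVC>", "LVC"),
    ("<sentence containing a literal combination>", "LITERAL"),
]

zero_shot = build_prompt("<query sentence>")
one_shot = build_prompt("<query sentence>", demos[:1])
few_shot = build_prompt("<query sentence>", demos)
```

Balancing positive and negative demonstrations in the few-shot case is one plausible way to reduce the label bias that a single demonstration can induce, though the study's own demonstration design may differ.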
In practice
- Prefer richer few-shot prompts over zero-shot or one-shot prompts for calibrated, robust LLM performance.
- Carefully construct demonstrations to mitigate model biases.
- Benchmark LLM solutions against strong supervised baselines.
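Benchmarking a prompted LLM against a supervised baseline is more informative with per-class scores, since low recall on one class or a bias toward one label can hide behind overall accuracy. The following is a minimal sketch in plain Python; the function names and the label strings are assumptions for illustration, and the toy predictions are invented to exercise the code, not results from the study.

```python
from typing import Dict, Sequence


def per_class_metrics(
    gold: Sequence[str], pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = [per_class_metrics(gold, pred, label)["f1"] for label in labels]
    return sum(scores) / len(scores)


# Toy example: a system that under-predicts the LVC class.
gold = ["LVC", "LVC", "LITERAL", "LITERAL"]
pred = ["LVC", "LITERAL", "LITERAL", "LITERAL"]

lvc = per_class_metrics(gold, pred, "LVC")
literal = per_class_metrics(gold, pred, "LITERAL")
overall = macro_f1(gold, pred, ["LVC", "LITERAL"])
```

Running the same functions on the baseline's predictions and on each prompting condition's predictions over the same test items gives directly comparable numbers per class.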
Topics
- Multiword Expressions
- Light Verb Constructions
- In-Context Learning
- Large Language Models
- Prompt Engineering
- Turkish NLP
- Binary Classification
Best for: AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.