Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
Summary
This study compares supervised BERTurk models against instruction-tuned LLMs (Llama 3.1:8b, GPT-OSS-20B, Qwen 2.5-14B) for Turkish Light Verb Construction (LVC) detection. Researchers framed LVC detection as a binary classification task, evaluating models on a manually created diagnostic dataset of 147 sentences, including LVC positives and matched literal negatives. Zero-shot LLMs exhibited a strong bias towards negative predictions, resulting in very low LVC recall. One-shot prompting improved LVC detection but introduced model-specific biases. Few-shot prompting, however, significantly enhanced calibration and yielded robust overall performance for GPT-OSS-20B and Qwen 2.5-14B, often matching or exceeding the competitive BERTurk baseline. The findings underscore substantial prompt sensitivity in Turkish metalinguistic classification.
Key takeaway
For NLP engineers working on Turkish, this research indicates that carefully constructed few-shot prompts are crucial for accurate Light Verb Construction (LVC) detection. Zero-shot inference is likely to produce a high false-negative rate on LVCs. Prioritize balanced demonstration sets to calibrate LLMs: with few-shot prompts, GPT-OSS-20B and Qwen 2.5-14B often matched or exceeded the fine-tuned BERTurk baseline on this metalinguistic task.
Key insights
In-context learning for Turkish LVCs is highly prompt-sensitive, with few-shot demonstrations improving LLM calibration to match supervised baselines.
Principles
- Zero-shot LLMs are biased towards negative predictions, which yields very low LVC recall.
- One-shot prompting can induce strong model-specific biases.
- Few-shot prompts improve LLM calibration and robustness.
Method
Turkish LVC detection is framed as binary classification. A supervised BERTurk baseline is compared to instruction-tuned LLMs (Llama 3.1:8b, GPT-OSS-20B, Qwen 2.5-14B) using zero-shot, one-shot, and few-shot prompting on a controlled 147-item diagnostic dataset.
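The three prompting regimes can be sketched as a single prompt builder in which the number of demonstrations per class is a parameter. This is a minimal illustration, not the paper's prompt: the instruction wording, the `LVC`/`LITERAL` label strings, the `Demo` and `build_prompt` names, and the Turkish example sentences are all assumptions made here for illustration and are not taken from the 147-item dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One labeled demonstration sentence (hypothetical structure)."""
    sentence: str
    is_lvc: bool


# Assumed instruction text; the paper's actual wording is not given in this summary.
INSTRUCTION = (
    "Decide whether the sentence contains a light verb construction (LVC). "
    "Answer with exactly one word: LVC or LITERAL."
)

# Illustrative pool only: two LVC uses and two literal uses of the same verbs.
POOL = [
    Demo("Ali sonunda karar verdi.", True),      # "karar vermek": make a decision
    Demo("Ali kitabı bana verdi.", False),       # literal "vermek": give
    Demo("Ayşe bize yardım etti.", True),        # "yardım etmek": help
    Demo("Ayşe ne etti?", False),                # "etmek" without a nominal complement
]


def select_balanced(pool, k_per_class):
    """Return k positives and k negatives, interleaved so neither class leads a block."""
    pos = [d for d in pool if d.is_lvc][:k_per_class]
    neg = [d for d in pool if not d.is_lvc][:k_per_class]
    if len(pos) < k_per_class or len(neg) < k_per_class:
        raise ValueError("pool has too few examples for a balanced selection")
    selected = []
    for p, n in zip(pos, neg):
        selected.extend([p, n])
    return selected


def build_prompt(sentence, pool, k_per_class=0):
    """k_per_class=0 gives a zero-shot prompt; larger values give balanced few-shot prompts."""
    lines = [INSTRUCTION, ""]
    for demo in select_balanced(pool, k_per_class):
        lines.append(f"Sentence: {demo.sentence}")
        lines.append(f"Answer: {'LVC' if demo.is_lvc else 'LITERAL'}")
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Öğrenciler soruya cevap verdi.", POOL, k_per_class=2))
```

A one-shot prompt necessarily shows a single class, so it does not fit this balanced builder; that asymmetry is one plausible reason a single demonstration can tilt predictions, though the summary does not state the cause.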
In practice
- Use few-shot prompting for LVC classification.
- Construct prompts with balanced positive/negative examples.
- Evaluate LLMs on controlled diagnostic sets.
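To make the negative-prediction bias visible on a diagnostic set, it helps to report per-class recall alongside the share of items predicted as LVC, since accuracy alone can hide a model that rarely predicts the positive class. The sketch below assumes boolean gold and predicted labels; the `evaluate` function and the choice of metrics are illustrative assumptions, not the paper's evaluation code, and the toy labels are made up rather than reported results.

```python
def evaluate(gold, pred):
    """Compute per-class recall, accuracy, and predicted-LVC rate for binary labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")

    total = len(gold)
    return {
        "lvc_recall": ratio(tp, tp + fn),
        "literal_recall": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, total),
        # A value far below the gold LVC share signals a bias towards negatives.
        "predicted_lvc_rate": ratio(tp + fp, total),
    }


if __name__ == "__main__":
    # Toy labels only: a model that almost always answers "not LVC".
    gold = [True, True, True, False, False, False]
    pred = [False, False, True, False, False, False]
    for name, value in evaluate(gold, pred).items():
        print(f"{name}: {value:.2f}")
```

In this toy case accuracy stays moderate while LVC recall is low, which is the pattern the summary describes for zero-shot prompting.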
Topics
- Multiword Expressions
- Light Verb Constructions
- In-Context Learning
- Prompt Engineering
- Turkish NLP
- BERTurk
- LLM Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.