Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

This study compares supervised BERTurk models against instruction-tuned LLMs (Llama 3.1:8b, GPT-OSS-20B, Qwen 2.5-14B) for Turkish Light Verb Construction (LVC) detection. The authors frame LVC detection as a binary classification task and evaluate the models on a manually created diagnostic dataset of 147 sentences that pairs LVC positives with matched literal negatives. Zero-shot LLMs show a strong bias towards negative predictions, which results in very low LVC recall. One-shot prompting improves LVC detection but introduces model-specific biases. Few-shot prompting, by contrast, significantly improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B, often matching or exceeding the competitive BERTurk baseline. The findings underscore substantial prompt sensitivity in Turkish metalinguistic classification.
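
The negative-prediction bias described above can be made visible with two simple quantities: recall per class and the share of predictions that fall on the negative label. The following is a minimal sketch, not the paper's evaluation code; the label names `LVC` and `LITERAL`, the function `lvc_report`, and the toy predictions are assumptions made for illustration.

```python
def lvc_report(gold, pred, positive="LVC"):
    """Return recall per class and the share of non-positive predictions."""
    labels = sorted(set(gold) | set(pred))
    recall = {}
    for label in labels:
        support = sum(1 for g in gold if g == label)
        hits = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        # Recall is undefined without gold items; report 0.0 in that case.
        recall[label] = hits / support if support else 0.0
    negative_rate = sum(1 for p in pred if p != positive) / len(pred)
    return {"recall": recall, "negative_rate": negative_rate}


# Toy data (hypothetical): a classifier that answers LITERAL almost always.
gold = ["LVC", "LVC", "LVC", "LITERAL", "LITERAL", "LITERAL"]
pred = ["LITERAL", "LITERAL", "LVC", "LITERAL", "LITERAL", "LITERAL"]
report = lvc_report(gold, pred)
print(report)
```

On this toy input, literal recall is perfect while LVC recall is one in three, which is the pattern a negative-biased classifier produces even when its overall accuracy looks acceptable.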

Key takeaway

For NLP engineers working on Turkish, this research indicates that carefully constructed few-shot prompts are crucial for accurate Light Verb Construction (LVC) detection. Zero-shot inference will likely produce a high false-negative rate on LVCs. Prioritize balanced demonstration sets, with both LVC and literal examples, to calibrate the LLM; on this kind of metalinguistic task, that can bring performance on par with or above fine-tuned supervised models like BERTurk.
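
One way to keep a demonstration set balanced is to sample the same number of examples per label before assembling the prompt. The sketch below assumes a simple "Sentence / Label" prompt layout and the label names `LVC` and `LITERAL`; the instruction wording, the helper names, and the Turkish example sentences are illustrative and are not taken from the paper's prompts or dataset.

```python
import random

# Hypothetical demonstration pool, for illustration only.
DEMO_POOL = [
    ("Ali sonunda karar verdi.", "LVC"),            # "karar vermek" = to decide
    ("Komşumuz bize yardım etti.", "LVC"),          # "yardım etmek" = to help
    ("Öğretmen öğrenciye kalem verdi.", "LITERAL"),  # literally gave a pen
    ("Annem bana bir elma verdi.", "LITERAL"),       # literally gave an apple
]

INSTRUCTION = (
    "Decide whether the verb in the Turkish sentence is part of a light verb "
    "construction. Answer with exactly one word, LVC or LITERAL."
)


def balanced_demos(pool, k_per_class, seed=0):
    """Sample k_per_class demonstrations for every label, then shuffle."""
    rng = random.Random(seed)
    by_label = {}
    for sentence, label in pool:
        by_label.setdefault(label, []).append((sentence, label))
    chosen = []
    for label in sorted(by_label):
        if len(by_label[label]) < k_per_class:
            raise ValueError(f"not enough demonstrations for label {label}")
        chosen.extend(rng.sample(by_label[label], k_per_class))
    rng.shuffle(chosen)  # avoid a fixed label order in the prompt
    return chosen


def build_prompt(demos, sentence):
    """Assemble instruction, demonstrations, and the target sentence."""
    blocks = [INSTRUCTION]
    for demo_sentence, label in demos:
        blocks.append(f"Sentence: {demo_sentence}\nLabel: {label}")
    blocks.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(blocks)


demos = balanced_demos(DEMO_POOL, k_per_class=1)
prompt = build_prompt(demos, "Toplantıda uzun bir konuşma yaptı.")
print(prompt)
```

Passing an empty demonstration list to `build_prompt` yields the zero-shot variant of the same prompt, so the only thing that changes between settings is the demonstration block.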

Key insights

In-context learning for Turkish LVCs is highly prompt-sensitive, with few-shot demonstrations improving LLM calibration to match supervised baselines.

Method

Turkish LVC detection is framed as binary classification. A supervised BERTurk baseline is compared to instruction-tuned LLMs (Llama 3.1:8b, GPT-OSS-20B, Qwen 2.5-14B) using zero-shot, one-shot, and few-shot prompting on a controlled 147-item diagnostic dataset.
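
A comparison across prompting regimes can be organized as a small loop that keeps the test items fixed and varies only the demonstrations. The sketch below is an assumption about how such a harness might look, not the authors' setup: `stub_classify` is a toy stand-in for a real LLM call, and the scores it produces are not results from the paper.

```python
def parse_label(response):
    """Map a free-text model response to LVC, LITERAL, or UNPARSED."""
    text = response.strip().upper()
    if text.startswith("LVC"):
        return "LVC"
    if text.startswith("LITERAL"):
        return "LITERAL"
    return "UNPARSED"  # kept separate so parse failures are not counted as negatives


def evaluate(classify, regimes, items):
    """Return accuracy per regime; classify(demos, sentence) returns raw text."""
    scores = {}
    for name, demos in regimes.items():
        correct = sum(
            1 for sentence, gold in items
            if parse_label(classify(demos, sentence)) == gold
        )
        scores[name] = correct / len(items)
    return scores


def stub_classify(demos, sentence):
    # Placeholder for a real model call; mimics a negative-biased zero-shot model.
    if not demos:
        return "LITERAL"
    return "LVC" if "karar" in sentence else "literal."


# Hypothetical items and demonstrations, for illustration only.
items = [
    ("Ali sonunda karar verdi.", "LVC"),
    ("Annem bana bir elma verdi.", "LITERAL"),
]
regimes = {
    "zero-shot": [],
    "few-shot": [
        ("Komşumuz bize yardım etti.", "LVC"),
        ("Öğretmen öğrenciye kalem verdi.", "LITERAL"),
    ],
}
scores = evaluate(stub_classify, regimes, items)
print(scores)
```

Keeping unparseable responses in their own bucket matters here, because silently mapping them to the negative label would inflate the very bias the comparison is meant to measure.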

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.