Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Turkish idiomatic light verb constructions (LVCs) are hard for multiword expression processing because their surface form is the same as that of literal verb-object combinations. This study frames Turkish LVC detection as binary classification and evaluates it on a manually constructed, controlled dataset (N=147) of LVC positives and matched negatives. It compares a supervised Turkish encoder baseline (BERTurk with a classifier head) against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Zero-shot LLMs showed low LVC recall; one-shot prompting improved detection but introduced model-specific biases. A richer few-shot prompt improved calibration and yielded robust performance for GPT-OSS-20B and Qwen 2.5-14B. The results show strong prompt sensitivity: the supervised baseline remains competitive, and prompted LLMs can match or exceed it when given carefully constructed demonstrations.

Key takeaway

NLP engineers building multiword expression classifiers should evaluate LLM prompting strategies carefully. Zero-shot prompting may underperform on specific classes, and one-shot prompting can introduce strong model-specific biases. Prefer richer few-shot demonstrations for calibrated, robust performance, especially with models such as GPT-OSS-20B or Qwen 2.5-14B, and benchmark LLM solutions against strong supervised baselines to confirm that the results are competitive.

Key insights

Prompt sensitivity and demonstration quality critically impact LLM performance in metalinguistic classification tasks.

Principles

One-shot prompting can induce strong, model-specific biases.
Richer few-shot prompts improve LLM calibration and robustness.
Supervised baselines remain competitive against prompted LLMs.

Method

The study frames Turkish LVC detection as binary classification, comparing a supervised BERTurk model to instruction-tuned LLMs using zero-shot, one-shot, and few-shot prompting on a controlled dataset.
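The three prompting conditions differ only in how many labeled demonstrations precede the query, which a small prompt builder can make concrete. The following is a minimal sketch, not the study's actual prompts: the instruction wording, the label names LVC and LITERAL, the Demo class, the build_prompt helper, and the Turkish example sentences are all illustrative assumptions and are not taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Demo:
    """One labeled demonstration (hypothetical structure)."""
    sentence: str
    label: str  # "LVC" or "LITERAL" (assumed label names)


# Assumed instruction text; the study's real wording is not reproduced here.
INSTRUCTION = (
    "Decide whether the verb phrase in the Turkish sentence is a light verb "
    "construction (LVC) or a literal verb-object combination. "
    "Answer with exactly one word: LVC or LITERAL."
)

# Illustrative demonstrations, not items from the paper's dataset.
DEMOS = [
    Demo("Ali dün önemli bir karar verdi.", "LVC"),      # "karar vermek": to decide
    Demo("Ali arkadaşına bir kitap verdi.", "LITERAL"),  # literally giving a book
]


def build_prompt(sentence: str, demos: Sequence[Demo] = ()) -> str:
    """Join the instruction, any demonstrations, and the unlabeled query."""
    blocks = [INSTRUCTION]
    blocks += [f"Sentence: {d.sentence}\nLabel: {d.label}" for d in demos]
    blocks.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(blocks)


query = "Ayşe bana söz verdi."               # illustrative query sentence
zero_shot = build_prompt(query)              # no demonstrations
one_shot = build_prompt(query, DEMOS[:1])    # a single demonstration
few_shot = build_prompt(query, DEMOS)        # several demonstrations
```

Keeping the instruction and query format fixed across the three conditions means that any change in model behavior can be attributed to the demonstrations themselves.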

In practice

Employ few-shot prompting for robust LLM performance.
Carefully construct demonstrations to mitigate model biases.
Benchmark LLM solutions against strong supervised baselines.
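When benchmarking a prompted LLM against a supervised baseline, accuracy alone can hide both low recall on the LVC class and a bias toward one label. A minimal sketch of a per-class report follows; the binary_report helper, the label names, and the toy gold and predicted lists are hypothetical and do not reproduce any result from the study.

```python
from typing import Dict, Sequence


def binary_report(gold: Sequence[str], pred: Sequence[str],
                  positive: str = "LVC") -> Dict[str, float]:
    """Per-class metrics for a binary task, treating `positive` as the target class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of dividing by zero on empty classes.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,                  # low values flag missed LVCs
        "f1": f1,
        "negative_recall": ratio(tn, tn + fp),
        # Share of items labeled positive; compare with the gold share to spot label bias.
        "predicted_positive_rate": ratio(tp + fp, len(pairs)),
    }


# Toy example only: a system that misses one of two LVCs.
gold = ["LVC", "LVC", "LITERAL", "LITERAL"]
pred = ["LVC", "LITERAL", "LITERAL", "LITERAL"]
report = binary_report(gold, pred)
```

Running the same function on the supervised baseline and on each prompting condition gives directly comparable numbers. On a dataset of this size, small differences between systems should be read with caution.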

Topics

Multiword Expressions
Light Verb Constructions
In-Context Learning
Large Language Models
Prompt Engineering
Turkish NLP
Binary Classification

Best for: AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.