Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A study by Cegin et al. introduces activation steering as a method for generating high-quality synthetic data for low-resource languages, addressing limitations of few-shot prompting such as increased inference costs and reduced diversity. The research evaluates two steering strategies: Language Steering, which targets linguistic identity, and Quality Steering, which separates well-formed human-written text from backtranslated content. Both were tested on four open-source LLMs (Gemma-2-9B, Gemma-2-27B, Llama-3.1-8B, and Llama-3.1-70B) and 11 typologically diverse languages, for sentiment and topic classification tasks. Results indicate that applying steering vectors to early transformer layers consistently increases the diversity of generated data and frequently improves downstream model performance, particularly for low-resource languages. Quality Steering showed the larger F1 gains, improving performance in 79.54% of zero-shot cases when applied to early layers.

Key takeaway

Machine learning engineers generating synthetic data for low-resource languages should consider activation steering. In the study, applying Quality Steering vectors to early transformer layers of models such as Gemma-2-9B or Llama-3.1-8B increased data diversity and frequently improved downstream model performance, especially in zero-shot settings. The approach is an alternative to few-shot prompting that avoids the added inference cost of in-context examples while improving data quality.

Key insights

Activation steering, especially Quality Steering on early layers, significantly improves synthetic data quality and diversity for low-resource languages.

Principles

Early-layer steering is most effective.
Quality steering outperforms Language steering on downstream F1.
Steering enhances data diversity.

Method

Collect activations, build Language or Quality steering vectors from them, apply the vectors to the LLM during generation, then fine-tune a downstream classifier on the generated data and evaluate it.
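
The first three steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a difference-of-means steering vector (a common construction that this summary does not confirm the paper uses), and the function names, toy activations, and `strength` parameter are all hypothetical.

```python
from statistics import fmean


def mean_activation(activations):
    """Average equal-length activation vectors dimension by dimension."""
    return [fmean(dim) for dim in zip(*activations)]


def steering_vector(positive, negative):
    """Difference of means (an assumed construction): mean(positive) - mean(negative)."""
    pos_mean = mean_activation(positive)
    neg_mean = mean_activation(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, vector, strength=1.0):
    """Shift one hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]


# Toy 3-dimensional activations with made-up values. For Quality Steering the
# positive set would come from human-written text and the negative set from
# backtranslated text, both collected at the same layer.
human_written = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
backtranslated = [[0.0, 2.0, 1.0], [2.0, 0.0, 1.0]]

quality_vector = steering_vector(human_written, backtranslated)
steered = apply_steering([0.5, 0.5, 0.5], quality_vector, strength=2.0)
```

For Language Steering, the same functions would apply with activations grouped by language rather than by text quality. How the two groups are chosen in the paper is not specified in this summary.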

In practice

Use Quality steering for synthetic data.
Apply steering to early LLM layers.
Combine steering with zero-shot prompts.
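
Restricting steering to early layers can be pictured with a toy forward pass. This is a sketch under stated assumptions: the layers are stand-in functions rather than transformer blocks, the indices in `steer_layers` are placeholders (this summary does not say which layers count as early), and `forward_with_steering` is a hypothetical helper. With a real model, the same idea is often implemented with a forward hook on the chosen decoder layers.

```python
def forward_with_steering(hidden, layers, vector, steer_layers, strength=1.0):
    """Run the layers in order, adding the steering vector after the selected ones."""
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in steer_layers:
            # Only the chosen (early) layers receive the steering offset.
            hidden = [h + strength * v for h, v in zip(hidden, vector)]
    return hidden


def double(hidden):
    """Stand-in for a transformer block: doubles every value."""
    return [2.0 * x for x in hidden]


toy_layers = [double, double, double, double]
toy_vector = [0.0, 1.0]

unsteered = forward_with_steering([1.0, 0.0], toy_layers, toy_vector, steer_layers=set())
early_steered = forward_with_steering([1.0, 0.0], toy_layers, toy_vector, steer_layers={0, 1})
```

In this toy setup the offset injected at layers 0 and 1 is carried forward and amplified by the later layers, which shows how an early intervention can shape the rest of the computation. The prompt itself stays zero-shot; the steering vector takes the place of in-context examples.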

Topics

Activation Steering
Synthetic Data Generation
Low-Resource Languages
Large Language Models
Quality Steering
Data Diversity

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.