Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation
Summary
A study by Cegin et al. introduces activation steering as an effective method for generating high-quality synthetic data for low-resource languages, addressing limitations of few-shot prompting like increased inference costs and reduced diversity. The research evaluates two steering strategies: Language Steering, which focuses on linguistic identity, and Quality Steering, which distinguishes well-formed human-written text from backtranslated content. These methods were tested across four open-source LLMs—Gemma-2-9B, Gemma-2-27B, Llama-3.1-8B, and Llama-3.1-70B—and 11 typologically diverse languages for sentiment and topic classification tasks. Results indicate that applying steering vectors to early transformer layers consistently enhances the diversity of generated data and frequently improves downstream model performance, particularly for low-resource languages. Quality steering, in particular, showed superior F1 gains, improving performance in 79.54% of zero-shot cases for early layers.
Key takeaway
Machine learning engineers generating synthetic data for low-resource languages should consider activation steering. Applying Quality steering vectors to the early transformer layers of models such as Gemma-2-9B or Llama-3.1-8B consistently increases data diversity and often improves downstream model performance, especially in zero-shot settings. The approach is an alternative to few-shot prompting that avoids the added inference cost of in-context examples.
Key insights
Activation steering, especially Quality steering applied to early layers, increases the diversity of synthetic data for low-resource languages and frequently improves downstream model performance.
Principles
- Early-layer steering is most effective.
- Quality steering yields larger downstream F1 gains than Language steering.
- Steering enhances data diversity.
Method
Collect activations, build Language or Quality steering vectors from them, apply the vectors to the LLM during generation, then fine-tune a downstream model on the generated data and evaluate it.
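The vector-building and application steps can be illustrated with a minimal sketch. It assumes the common difference-of-means formulation of a steering vector, which this summary does not confirm as the paper's exact method. The function names, the `alpha` scaling factor, and the toy 3-dimensional activations are all hypothetical; real activations would be hidden states captured from a chosen transformer layer.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(positive, negative):
    """Difference of means: points from the negative set toward the positive set.

    Assumption: this is a generic formulation, not necessarily the paper's.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's hidden states (hypothetical values).
human_written = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
backtranslated = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

# Quality steering contrasts human-written text with backtranslated text.
quality_vec = steering_vector(human_written, backtranslated)

# Shift one hidden state along the quality direction.
steered = apply_steering([0.5, 0.5, 0.5], quality_vec, alpha=2.0)
```

A Language steering vector would follow the same pattern with a different contrast set, for example activations from target-language text against activations from another language; that pairing is likewise an assumption here.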
In practice
- Use Quality steering for synthetic data.
- Apply steering to early LLM layers.
- Combine steering with zero-shot prompts.
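As a rough sketch of what restricting steering to early layers might look like, the toy example below adds a steering vector only to the outputs of the first few layers of a stand-in model. The 25% cutoff for "early", the identity layers, and every name in it are assumptions made for illustration; the summary does not state which layer indices the authors used.

```python
def early_layer_indices(num_layers, fraction=0.25):
    """Indices of the first `fraction` of layers, at least one.

    Assumption: the 0.25 cutoff is illustrative, not taken from the paper.
    """
    count = max(1, int(num_layers * fraction))
    return list(range(count))


def forward_with_steering(hidden, layers, vector, steer_at, alpha=1.0):
    """Run the layers in order, adding alpha * vector after each steered layer."""
    steer_at = set(steer_at)
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in steer_at:
            hidden = [h + alpha * v for h, v in zip(hidden, vector)]
    return hidden


# Toy "model": 8 identity layers standing in for transformer blocks.
toy_layers = [lambda h: list(h) for _ in range(8)]

targets = early_layer_indices(len(toy_layers))
output = forward_with_steering(
    [0.0, 0.0], toy_layers, [1.0, -1.0], targets, alpha=0.5
)
```

In a real model the same idea is usually implemented with forward hooks on the selected blocks, and the prompt stays zero-shot because the steering vector, rather than in-context examples, carries the signal.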
Topics
- Activation Steering
- Synthetic Data Generation
- Low-Resource Languages
- Large Language Models
- Quality Steering
- Data Diversity
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.