Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A study by Cegin et al. proposes activation steering as a method for generating high-quality synthetic data for low-resource languages. It targets two limitations of few-shot prompting: higher inference costs and reduced diversity. The research evaluates two steering strategies: Language Steering, which targets linguistic identity, and Quality Steering, which separates well-formed human-written text from backtranslated content. Both were tested on four open-source LLMs (Gemma-2-9B, Gemma-2-27B, Llama-3.1-8B, and Llama-3.1-70B) and 11 typologically diverse languages, on sentiment and topic classification tasks. Results indicate that applying steering vectors to early transformer layers consistently increases the diversity of the generated data and frequently improves downstream model performance, particularly for low-resource languages. Quality Steering showed the larger F1 gains: applied to early layers, it improved performance in 79.54% of zero-shot cases.

Key takeaway

Machine learning engineers generating synthetic data for low-resource languages should consider activation steering. Applying Quality Steering vectors to the early transformer layers of models such as Gemma-2-9B or Llama-3.1-8B can increase data diversity and frequently improves downstream model performance, especially in zero-shot settings. Because it does not depend on few-shot examples in the prompt, the approach avoids the added inference cost and reduced diversity of few-shot prompting.

Key insights

Activation steering, especially Quality Steering applied to early layers, increases the diversity of synthetic data for low-resource languages and frequently improves downstream performance.

Method

Collect activations from the LLM, derive a Language or Quality steering vector from them, apply the vector to the LLM during generation, then fine-tune a downstream model on the generated data and evaluate it.
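
The middle two steps can be illustrated with a minimal sketch. It assumes the steering vector is a difference of mean activations between two contrast sets, a common construction that this summary does not confirm as the authors' exact method. The function names, the scale `alpha`, and the random arrays standing in for real layer activations are all hypothetical.

```python
import numpy as np


def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations (an assumed construction).

    pos_acts: (n_pos, d_model) activations for the desired behaviour,
              e.g. human-written text for a Quality vector.
    neg_acts: (n_neg, d_model) activations for the contrast set,
              e.g. backtranslated text.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled vector to every token position of one layer's output.

    hidden: (seq_len, d_model) hidden states at the chosen layer.
    alpha:  hypothetical steering strength, tuned per model and layer.
    """
    return hidden + alpha * vector


# Toy stand-ins for activations collected at one early layer.
rng = np.random.default_rng(0)
d_model = 8
human = rng.normal(loc=1.0, size=(32, d_model))
backtranslated = rng.normal(loc=0.0, size=(32, d_model))

quality_vec = steering_vector(human, backtranslated)

# During generation, the same shift would be applied at each decoding step.
hidden = rng.normal(size=(5, d_model))
steered = apply_steering(hidden, quality_vec, alpha=4.0)
```

In a real setup the activations would come from a chosen layer of the LLM, and the addition would typically be done inside a hook on that layer's forward pass. A Language vector would follow the same pattern with a different pair of contrast sets.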

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.