GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech
Summary
GLASS is a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) systems. It addresses the problem that speaker prompts entangle speaker identity with prosodic attributes such as speaking rate and pitch. Instead of relying on style labels, GLASS learns a control direction for each acoustic attribute from rewards computed on the generated speech. The TTS backbone stays frozen, and a separate lightweight LoRA adapter is trained for each attribute with Group Relative Policy Optimization (GRPO). Training uses speech-token length and mean F0 as style rewards, with a word error rate (WER) anchor to maintain intelligibility. Because the adapters are trained independently, they can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. The reported experiments show targeted shifts in speaking rate and pitch that preserve naturalness, speaker similarity, and intelligibility, as well as smooth interpolation and multi-axis composition.
Key takeaway
For machine learning engineers building zero-shot text-to-speech systems, GLASS offers a way to control acoustic style at a fine-grained level. If you need to adjust prosodic attributes such as speaking rate or pitch without altering speaker identity or retraining the entire model, GRPO-trained LoRA adapters are worth exploring. Because the adapters combine through simple linear arithmetic, styles can be interpolated and composed without retraining the backbone.
Key insights
GLASS uses reward-trained LoRA adapters for composable, disentangled acoustic style control in zero-shot TTS.
Principles
- Reward-defined control directions disentangle style from identity.
- LoRA adapters enable composable style manipulation.
- Policy optimization can learn style controls without labels.
Method
Freeze the TTS backbone and train one LoRA adapter per attribute using GRPO. Use speech-token length and mean F0 as style rewards, with WER as an intelligibility anchor.
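The reward-and-advantage step of this method can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear form of `combined_reward`, the `wer_weight` value, and the sample numbers are all assumptions. Only the group-relative standardization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
from statistics import mean, pstdev


def combined_reward(style_value, wer, wer_weight=1.0, direction=1.0):
    """Hypothetical reward: push a style measurement in one direction
    while penalizing word error rate (the intelligibility anchor)."""
    return direction * style_value - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples generated
    for the same prompt, as GRPO does in place of a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up samples for one prompt: (mean F0 in Hz, WER).
samples = [(180.0, 0.02), (210.0, 0.03), (195.0, 0.02), (240.0, 0.30)]

# Assumed scaling: F0 divided by 100 so that it is comparable to the WER penalty.
rewards = [combined_reward(f0 / 100.0, wer, wer_weight=5.0) for f0, wer in samples]
advantages = group_relative_advantages(rewards)

for (f0, wer), adv in zip(samples, advantages):
    print(f"F0={f0:.0f} Hz  WER={wer:.2f}  advantage={adv:+.2f}")
```

In this toy group, the sample with the highest pitch also has the highest WER, so it ends up with a negative advantage. That is the role the WER anchor is meant to play: a style shift that costs intelligibility should not be reinforced.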
In practice
- Apply linear LoRA arithmetic for style interpolation.
- Compose multiple style controls without retraining.
- Control speaking rate and pitch independently.
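The linear LoRA arithmetic behind these points can be illustrated on a single weight matrix. This is a toy sketch under stated assumptions: the matrices, the helper names (`matmul`, `lora_delta`, `compose`), and the mixing weights are hypothetical, and a real system would apply the same sum to every adapted layer of the frozen backbone.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, scale=1.0):
    """Low-rank weight update scale * (B @ A) for one adapter."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def compose(W0, deltas_and_weights):
    """Return W0 + sum(alpha_k * delta_k); the frozen W0 is not modified."""
    W = [row[:] for row in W0]
    for delta, alpha in deltas_and_weights:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                W[i][j] += alpha * v
    return W


# Frozen backbone weight (toy 2x2) and two rank-1 adapters.
W0 = [[1.0, 0.0], [0.0, 1.0]]
rate_delta = lora_delta(B=[[1.0], [0.0]], A=[[0.0, 2.0]])   # hypothetical "rate" adapter
pitch_delta = lora_delta(B=[[0.0], [1.0]], A=[[4.0, 0.0]])  # hypothetical "pitch" adapter

# Interpolate the rate adapter at half strength and add the full pitch adapter.
W = compose(W0, [(rate_delta, 0.5), (pitch_delta, 1.0)])
print(W)  # [[1.0, 1.0], [4.0, 1.0]]
```

Setting a mixing weight to 0 recovers the unmodified backbone along that axis, and intermediate values interpolate the effect, which is why no retraining is needed to change or combine styles.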
Topics
- Zero-Shot Text-to-Speech
- Acoustic Style Control
- LoRA
- Group Relative Policy Optimization
- Prosody Control
- Speech Synthesis
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.