GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GLASS is a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) systems. It targets the problem that a speaker prompt entangles speaker identity with prosodic attributes such as speaking rate and pitch. Instead of relying on style labels, GLASS learns a control direction for each acoustic attribute from rewards computed on the generated speech. The TTS backbone stays frozen, and a separate lightweight LoRA adapter is trained for each attribute with Group Relative Policy Optimization (GRPO). Training uses speech-token length (for speaking rate) and mean F0 (for pitch) as style rewards, with a word error rate (WER) anchor to maintain intelligibility. Because the adapters are trained independently, they can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. The reported experiments show targeted shifts in speaking rate and pitch while preserving naturalness, speaker similarity, and intelligibility, along with smooth interpolation and multi-axis composition.
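The linear LoRA arithmetic described in the summary can be illustrated with a minimal NumPy sketch. It assumes the standard LoRA parameterization, in which each adapter adds a low-rank update B @ A to a frozen weight matrix. The function names, scaling coefficients, and toy shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def lora_delta(A, B, scale=1.0):
    """Low-rank weight update from one LoRA adapter: scale * (B @ A)."""
    return scale * (B @ A)

def compose(W_frozen, adapters, coeffs):
    """Add a weighted sum of adapter updates to a copy of the frozen weight.

    adapters: list of (A, B) pairs, one per style attribute (assumed layout).
    coeffs:   one scalar per adapter; 0 disables it, 1 applies it fully,
              and values in between interpolate.
    """
    W = W_frozen.copy()  # the frozen backbone weight is never modified
    for (A, B), c in zip(adapters, coeffs):
        W += lora_delta(A, B, c)
    return W

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2
W0 = rng.normal(size=(d_out, d_in))  # stand-in for one backbone weight

# Two independently trained adapters (random stand-ins here).
rate = (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))
pitch = (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))

W_rate = compose(W0, [rate, pitch], [1.0, 0.0])  # swap in: rate only
W_half = compose(W0, [rate, pitch], [0.5, 0.0])  # interpolate: half-strength rate
W_both = compose(W0, [rate, pitch], [1.0, 1.0])  # compose: rate plus pitch
```

In this sketch, interpolation and composition are just scalar weights on the adapter updates. Whether a given coefficient yields a proportional perceptual change is an empirical question that the sketch does not address.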

Key takeaway

For machine learning engineers building zero-shot TTS systems, GLASS offers a way to control acoustic style at the attribute level. If you need to adjust speaking rate or pitch without altering speaker identity or retraining the whole model, GRPO-trained LoRA adapters are worth exploring. Because the adapters compose through linear arithmetic, style changes can be combined or scaled at the weight level without further backbone training.

Key insights

GLASS uses reward-trained LoRA adapters for composable, disentangled acoustic style control in zero-shot TTS.

Method

Freeze the TTS backbone; train one LoRA adapter per attribute using GRPO. Use speech-token length and mean F0 as style rewards, with a WER anchor for intelligibility.
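The reward-and-advantage step of the method can be sketched as follows. The group-relative advantage (rewards standardized within one sampled group) is the standard GRPO formulation. The reward itself is an assumption: the source does not give the exact formula, so `combined_reward`, `wer_ref`, `wer_weight`, and the toy numbers below are hypothetical.

```python
import numpy as np

def combined_reward(style_metric, wer, direction=1.0, wer_ref=0.05, wer_weight=1.0):
    """Hypothetical reward: push the style metric in `direction` and
    penalize any WER above a reference level (the 'anchor')."""
    style_metric = np.asarray(style_metric, dtype=float)
    wer = np.asarray(wer, dtype=float)
    penalty = np.maximum(0.0, wer - wer_ref)
    return direction * style_metric - wer_weight * penalty

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One group of 4 samples for the same text and speaker prompt (toy numbers).
token_lengths = np.array([120, 100, 90, 140])  # speech-token counts
wers = np.array([0.02, 0.03, 0.30, 0.04])      # per-sample word error rates

# direction=-1 rewards shorter outputs, assumed here to mean faster speech.
rewards = combined_reward(token_lengths / 100.0, wers, direction=-1.0)
advantages = group_relative_advantages(rewards)
```

In this toy group the shortest sample (90 tokens) does not get the highest advantage, because its high WER triggers the penalty. This is the role the WER anchor plays in the description above: it keeps the policy from shortening speech at the expense of intelligibility.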

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.