FLy: A New Paradigm for Speculative Decoding — Accepting Semantically Correct Drafts Beyond Exact Match

2026-04-20 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

AMD has introduced FLy, a training-free method for speculative decoding (SPD) that accelerates large language model (LLM) inference by accepting semantically valid token continuations instead of requiring the exact-match verification of traditional SPD. Exact-match verification discards drafts that are semantically correct but differ token-for-token from the target's output, while training-based loose-acceptance approaches can degrade on out-of-distribution (OOD) inputs. FLy instead relies on the target model's inherent self-corrective behavior through a two-tier mechanism: an entropy-level gate and a token-level deferred window. Together, these let it distinguish genuine errors from semantically equivalent alternatives. Implemented on AMD Instinct MI355X GPUs using the ROCm software stack, FLy achieves reported speedups of 2.74x for Llama-3.1-70B-Instruct and 4.80x for the 405B variant while preserving over 99% accuracy.

Key takeaway

For NLP engineers and research scientists optimizing LLM inference, FLy offers a compelling training-free solution to boost performance on AMD ROCm-enabled systems. You should consider integrating FLy to achieve substantial speedups, particularly with larger models, without compromising accuracy or incurring the costs and OOD risks associated with training-based loose decoding methods. This approach allows for more efficient resource utilization and faster deployment of LLM applications.

Key insights

FLy accelerates LLM inference by accepting semantically valid drafts, overcoming exact-match limitations without training.

Principles

LLMs self-correct genuine errors but tolerate semantically valid alternatives.
The entropy of the target model's next-token distribution can separate deterministic (low-entropy) predictions from ambiguous (high-entropy) ones.
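As a rough illustration of the entropy principle, the sketch below computes the Shannon entropy of a next-token distribution and compares it with a cutoff. The function names and the 0.3 threshold are hypothetical, chosen only for illustration; the summary does not state the threshold or normalization FLy actually uses.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats of a probability distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def is_ambiguous(probs, threshold=0.3):
    """True when the distribution is spread out enough to count as ambiguous.

    `threshold` is an illustrative assumption, not a value from the source.
    """
    return shannon_entropy(probs) > threshold

# A peaked distribution is treated as deterministic, a flat one as ambiguous.
peaked = [0.98, 0.01, 0.01]
flat = [0.4, 0.3, 0.3]
print(is_ambiguous(peaked), is_ambiguous(flat))
```

A low-entropy position is one where the target model has effectively a single acceptable token, so a draft mismatch there is more likely to be a real error.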

Method

FLy uses a two-tier mechanism: an entropy gate rejects mismatches at positions where the target model is confident, while a token-level deferred window (e.g., 6 tokens) monitors the tokens that follow an ambiguous mismatch for further divergence before accepting it as semantically valid.
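The two-tier check can be sketched as a single pass over one drafted block. This is a simplified reading of the description above, not the AMD-AGI/FLy implementation. The function name, the argument layout, the 0.3 entropy threshold, and the choice to reject a deferred mismatch whose window runs past the end of the draft are all assumptions made for illustration.

```python
def fly_accept_length(draft, target_top1, target_entropy,
                      entropy_threshold=0.3, window=6):
    """Return how many leading draft tokens are accepted.

    draft[i]          -- token proposed by the draft model at position i
    target_top1[i]    -- target model's top token at position i, conditioned
                         on the drafted prefix (one verification pass)
    target_entropy[i] -- entropy of the target distribution at position i
    All names and defaults here are hypothetical.
    """
    if not (len(draft) == len(target_top1) == len(target_entropy)):
        raise ValueError("inputs must have the same length")

    pending = None  # index of a deferred mismatch awaiting confirmation
    for i, (d, t, h) in enumerate(zip(draft, target_top1, target_entropy)):
        # The deferred mismatch is confirmed once `window` later tokens matched.
        if pending is not None and i - pending > window:
            pending = None
        if d == t:
            continue
        if pending is not None:
            # Further divergence inside the window: treat the earlier
            # mismatch as a genuine error and cut the draft there.
            return pending
        if h <= entropy_threshold:
            # Tier 1: the target is confident, so the mismatch is rejected.
            return i
        # Tier 2: ambiguous position, defer the decision.
        pending = i

    # Assumption: an unconfirmed mismatch at the end of the draft is rejected.
    if pending is not None and len(draft) - 1 - pending < window:
        return pending
    return len(draft)
```

For example, with `window=2`, a high-entropy mismatch at position 1 followed by two matching tokens is accepted in full, whereas a second mismatch at position 2 truncates the draft back to position 1. A production implementation would also have to decide which token to emit at the cut point and how to carry an open window into the next draft block, which this sketch leaves out.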

In practice

Integrates with arbitrary draft-target pairs.
Achieves speedups of 2.74x (Llama-3.1-70B-Instruct) and 4.80x (405B variant) on AMD Instinct MI355X GPUs.
Preserves over 99% accuracy.

Topics

Speculative Decoding
Large Language Models
Semantic Verification
Training-Free Algorithms
AMD ROCm

Code references

AMD-AGI/FLy

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.