FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following
Summary
FBK's submission to the IWSLT 2026 Instruction Following shared task presents SpeechLLMs for short-form and long-form speech instruction following, developed under constrained settings. In the short track, the models achieved a SIFS score of 2.0708 on MCIF. In the long track, the authors compared three speech segmentation methods and introduced the HIFS score to account for instability in long-form generation. A fixed 30-second segmentation strategy gave the most robust long-form performance, with the highest HIFS score of 2.0663. Further analysis showed that hallucination appears mainly as repetitive insertions in the generated output and that it significantly affects automatic speech recognition (ASR) and speech summarization (SSUM). Short-form capabilities were largely retained after the long-form extension.
Key takeaway
NLP engineers developing long-form speech instruction following systems should prioritize fixed 30-second segmentation, which gave the most robust performance and the highest HIFS score (2.0663). Hallucination often appears as repetitive insertions that degrade ASR and SSUM output, so plan for post-processing or model fine-tuning that targets these repetitions. Doing so should yield higher-quality and more stable long-form generation.
Key insights
Fixed 30-second segmentation gives the most robust long-form SpeechLLM performance. Hallucination, mostly in the form of repetitive insertions, mainly affects ASR and SSUM.
Principles
- Long-form speech generation benefits from fixed segmentation.
- Hallucination manifests as repetitive insertions.
- Short-form capabilities are largely retained.
Method
The authors develop SpeechLLMs for short-form and long-form instruction following. For long-form input they compare three segmentation methods, of which fixed 30-second segments prove most robust, and they introduce the HIFS score for evaluation.
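Fixed-length segmentation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fixed_segments`, the 16 kHz sample rate, and the choice to keep a shorter final segment are all assumptions made for illustration.

```python
from typing import List, Tuple


def fixed_segments(
    num_samples: int,
    sample_rate: int = 16000,       # assumed sample rate, not stated in the source
    segment_seconds: float = 30.0,  # the fixed 30-second window from the summary
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length segments.

    The last segment is kept even if it is shorter than the window
    (an assumption; the source does not say how the remainder is handled).
    """
    if num_samples < 0 or sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    seg_len = int(round(sample_rate * segment_seconds))
    return [
        (start, min(start + seg_len, num_samples))
        for start in range(0, num_samples, seg_len)
    ]


# Usage: a 75-second recording at 16 kHz splits into 30 s + 30 s + 15 s.
boundaries = fixed_segments(16000 * 75)
```

Each `(start, end)` pair can then be used to slice the waveform before passing the segment to the model. Because the boundaries ignore pauses and sentence ends, a segment may cut through a word.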
In practice
- Implement 30-second fixed segmentation for long-form speech.
- Evaluate long-form generation using the HIFS score.
- Detect and remove repetitive insertions, the main form of hallucination observed, through post-processing or fine-tuning.
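One possible post-processing step for repetitive insertions is to collapse word n-grams that repeat back to back. The sketch below is an assumption about how such a filter might look, not a method described in the source. The function name `collapse_repeats` and the `max_ngram` parameter are hypothetical.

```python
from typing import List


def collapse_repeats(text: str, max_ngram: int = 4) -> str:
    """Collapse immediately repeated word n-grams (n <= max_ngram) to one copy.

    Hypothetical heuristic: whitespace tokenization, exact string match.
    Legitimate repetitions such as "very very" are collapsed as well.
    """
    out: List[str] = []
    for token in text.split():
        out.append(token)
        # After each new token, check whether the tail of the output is an
        # n-gram that duplicates the n-gram directly before it.
        for n in range(1, max_ngram + 1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]  # drop the duplicated copy
                break
    return " ".join(out)


# Usage: a looping transcript is reduced to a single pass of the phrase.
cleaned = collapse_repeats("the cat sat the cat sat the cat sat on the mat")
```

This only catches loops up to `max_ngram` words long and only exact matches, so longer or slightly varied repetitions would need a different approach, such as the fine-tuning mentioned above.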
Topics
- SpeechLLMs
- Instruction Following
- Speech Segmentation
- Long-form Generation
- Hallucination Detection
- IWSLT 2026
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.