FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

FBK's submission to the IWSLT 2026 Instruction Following shared task presents SpeechLLMs for short-form and long-form speech instruction following that operate under constrained settings. On the short track, the models reach a SIFS score of 2.0708 on MCIF. On the long track, the authors compare three speech segmentation methods and introduce the HIFS score to account for instability in long-form generation. Fixed 30-second segmentation gives the most robust long-form performance, with the highest HIFS score of 2.0663. Further analysis shows that hallucination appears mainly as repetitive insertions in the generated output, significantly affecting ASR and SSUM, while short-form capabilities are largely retained after the long-form extension.

Key takeaway

NLP engineers building long-form speech instruction-following systems should start with fixed 30-second segmentation, which gave the most robust results in this work (highest HIFS score, 2.0663). Hallucination tends to appear as repetitive insertions that degrade ASR and SSUM output, so consider post-processing or fine-tuning to suppress repeated spans and stabilize long-form generation.

Key insights

Fixed 30-second segmentation gives the most robust long-form SpeechLLM performance among the three segmentation methods tested. Hallucination, mostly in the form of repetitive insertions, mainly affects ASR and SSUM.

Principles

Long-form speech generation benefits from fixed segmentation.
Hallucination manifests as repetitive insertions.
Short-form capabilities are largely retained.

Method

SpeechLLMs are developed for short-form and long-form instruction following. For the long track, three segmentation methods are compared, with fixed 30-second segments proving most robust, and the HIFS score is introduced for evaluation.
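
As a rough illustration of what fixed 30-second segmentation means in practice, the sketch below computes window boundaries over a waveform. It is not the authors' code: the function name fixed_segments, the 16 kHz default sample rate, and the choice to keep a shorter final window are all assumptions made for this example.

```python
from typing import List, Tuple


def fixed_segments(
    num_samples: int,
    sample_rate: int = 16000,       # assumed sample rate, not stated in the source
    segment_seconds: float = 30.0,  # window length reported as most robust
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive fixed-length windows.

    The last window is kept even when it is shorter than segment_seconds
    (an assumption; the source does not say how the remainder is handled).
    """
    if num_samples < 0 or sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    step = int(round(sample_rate * segment_seconds))
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


# A 75-second recording at 16 kHz yields two full windows and a 15-second tail.
segments = fixed_segments(16000 * 75)
```

Each (start, end) pair can be used to slice the waveform before passing the chunk to the model. Unlike pause-based or content-based splitting, the boundaries here do not depend on the audio itself.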

In practice

Implement 30-second fixed segmentation for long-form speech.
Evaluate long-form generation using the HIFS score.
Address repetitive insertions to mitigate hallucination.
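
One simple way to address repetitive insertions is a post-processing pass that collapses back-to-back repeated word sequences. The sketch below is a generic heuristic, not a method described in the source. The function name collapse_repeats and the thresholds max_ngram and min_repeats are hypothetical choices for illustration.

```python
from typing import List


def collapse_repeats(text: str, max_ngram: int = 8, min_repeats: int = 3) -> str:
    """Collapse any word n-gram repeated back-to-back min_repeats or more times.

    Tokenization is plain whitespace splitting (an assumption); shorter
    n-grams are tried first at each position.
    """
    tokens = text.split()
    out: List[str] = []
    i = 0
    while i < len(tokens):
        collapsed = False
        for n in range(1, max_ngram + 1):
            unit = tokens[i:i + n]
            if len(unit) < n:
                break  # not enough tokens left for an n-gram of this size
            count = 1
            # Count how many times the unit repeats consecutively.
            while tokens[i + count * n:i + (count + 1) * n] == unit:
                count += 1
            if count >= min_repeats:
                out.extend(unit)   # keep a single copy of the repeated unit
                i += count * n
                collapsed = True
                break
        if not collapsed:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


cleaned = collapse_repeats("thank you thank you thank you thank you for listening")
```

The min_repeats threshold keeps natural doubles such as "very very" intact. A real system would need to tune it per task, since legitimate repetition is more common in transcripts than in summaries.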

Topics

SpeechLLMs
Instruction Following
Speech Segmentation
Long-form Generation
Hallucination Detection
IWSLT 2026

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.