SAM Audio - AI at Meta

Source: ai.meta.com via Google News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, short

Summary

Meta has introduced SAM Audio, a generative separation model that lets users isolate any sound from audio or audio-visual sources using simple text, visual, or span prompts. The model works across general sound, music, and speech, enabling tasks such as isolating instruments, vocals, or speech from background noise. SAM Audio is built on a flow-matching Diffusion Transformer that operates in a DAC-VAE latent space and jointly generates high-quality target and residual audio. Meta reports that it surpasses the previous state of the art in every prompting mode. The release also includes PE-AV, a new open-source model that brings audio capabilities to Meta's Perception Encoder, and a first-of-its-kind open-source evaluation dataset for prompted audio separation.

Key takeaway

For research scientists developing audio processing applications, SAM Audio is a significant advance in sound separation. Explore its multimodal prompting to improve precision in tasks such as noise reduction or speech isolation, and consider using the open-source model and evaluation dataset to benchmark your own systems or speed up development of new audio-centric features.

Key insights

SAM Audio delivers state-of-the-art sound separation driven by text, visual, or span prompts across general sound, music, and speech.

Principles

Method

SAM Audio employs a flow-matching Diffusion Transformer operating in a DAC-VAE latent space. Given a mixture, it jointly generates the target audio and the residual audio, guided by text, visual, or span (temporal) prompts.
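
To make the sampling idea concrete, here is a minimal sketch of flow-matching inference over a joint target-plus-residual latent. It is an illustration under stated assumptions, not Meta's implementation: the learned Diffusion Transformer is replaced by a toy oracle velocity field, the DAC-VAE encoder and decoder and all prompt and mixture conditioning are omitted, and stacking target and residual along the feature axis is an assumed layout. The names euler_flow_sample and make_oracle_velocity are hypothetical.

```python
import numpy as np

def euler_flow_sample(velocity_fn, x0, num_steps=16):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        x = x + dt * velocity_fn(x, t)
    return x

def make_oracle_velocity(x1):
    """Toy stand-in for the learned model (assumption): the exact velocity toward
    a known endpoint x1 on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    def velocity(x, t):
        return (x1 - x) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
frames, dim = 8, 4

# Hypothetical latents for the separated target and the residual (everything else).
target_latent = rng.normal(size=(frames, dim))
residual_latent = rng.normal(size=(frames, dim))

# Assumed layout: generate both jointly by stacking them along the feature axis.
joint_latent = np.concatenate([target_latent, residual_latent], axis=-1)

# Start from Gaussian noise and follow the flow to the joint latent.
noise = rng.normal(size=joint_latent.shape)
sample = euler_flow_sample(make_oracle_velocity(joint_latent), noise)

# Split the joint sample back into target and residual estimates.
est_target, est_residual = np.split(sample, 2, axis=-1)
```

In a real system the velocity function would be a trained network conditioned on the mixture latent and the prompt, and both halves of the sample would be decoded back to waveforms.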

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ai.meta.com via Google News.