TritonSigmoid: A fast, padding-aware sigmoid attention kernel for GPUs [R]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Life Sciences & Biology · Depth: Expert, quick

Summary

TritonSigmoid is an open-source, padding-aware sigmoid attention kernel for GPUs, developed for single-cell foundation models. Softmax forces tokens to compete for a fixed attention budget, whereas sigmoid attention scores each token independently, so a model can attend strongly to many genes (tokens) at once. This matters for single-cell data, where a cell can express anywhere from 200 to more than 16,000 genes. The kernel handles variable-length padding natively and avoids wasting compute on empty positions. In the reported experiments, TritonSigmoid reaches up to 515 TFLOPS on H100 GPUs, compared with 361 TFLOPS for FlashAttention-2 and 440 TFLOPS for FlashSigmoid. It also shows lower validation loss across six datasets, 25% better cell-type separation, and stable training in settings where softmax attention diverged catastrophically.
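The contrast between softmax and sigmoid weighting can be illustrated with a minimal pure-Python sketch. It is not taken from the TritonSigmoid code; the function names and the toy score vector are assumptions chosen for illustration.

```python
import math

def softmax_weights(scores):
    # Softmax normalises across tokens, so the weights always sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores):
    # Sigmoid scores each token independently; there is no shared budget.
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical attention scores: three equally relevant tokens, one irrelevant.
scores = [4.0, 4.0, 4.0, -4.0]
soft = softmax_weights(scores)
sig = sigmoid_weights(scores)

print("softmax:", [round(w, 3) for w in soft])
print("sigmoid:", [round(w, 3) for w in sig])
```

Under softmax, the three relevant tokens split the weight and each gets roughly one third. Under sigmoid, each of them keeps a weight close to 1, while the irrelevant token stays close to 0.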

Key takeaway

For AI engineers building models on variable-length sequences, especially in genomics or other domains where several features can be relevant at once, TritonSigmoid offers performance and stability advantages over softmax attention. Its native padding awareness and `torch.compile` integration simplify development and improve training outcomes, at the cost of a potential memory overhead compared with packed approaches. Consider integrating this kernel to improve model accuracy and training robustness.

Key insights

TritonSigmoid offers superior performance and stability for variable-length sequence attention, especially in biological modeling.

Method

The kernel computes attention blockwise, similar to FlashAttention. It handles variable lengths by padding every sequence to the maximum length and skipping fully padded blocks, a design chosen to integrate cleanly with `torch.compile`.
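The block-skipping idea can be sketched in pure Python for a single sequence. This is a toy illustration under stated assumptions, not the Triton kernel: the function names, the block size of 2, the 1/sqrt(d) scaling, and the choice to skip both query and key blocks are all hypothetical. Because sigmoid attention has no row-wise normalisation, each block's contribution can simply be added to the output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_attention(q, k, v, length):
    # Dense sigmoid attention over the first `length` (non-padded) positions.
    scale = 1.0 / math.sqrt(len(q[0]))
    out = [[0.0] * len(v[0]) for _ in q]
    for i in range(length):
        for j in range(length):
            w = sigmoid(dot(q[i], k[j]) * scale)
            for c in range(len(v[0])):
                out[i][c] += w * v[j][c]
    return out

def blockwise_attention(q, k, v, length, block=2):
    # Same result, computed tile by tile; fully padded tiles are never visited.
    max_len = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = [[0.0] * len(v[0]) for _ in q]
    visited = 0
    for q_start in range(0, max_len, block):
        if q_start >= length:
            continue  # query block holds only padding
        q_end = min(q_start + block, length)  # mask a partially padded block
        for k_start in range(0, max_len, block):
            if k_start >= length:
                continue  # key block holds only padding
            k_end = min(k_start + block, length)
            visited += 1
            for i in range(q_start, q_end):
                for j in range(k_start, k_end):
                    w = sigmoid(dot(q[i], k[j]) * scale)
                    for c in range(len(v[0])):
                        out[i][c] += w * v[j][c]
    return out, visited

# Toy input: padded to 8 positions, only the first 3 are real tokens.
max_len, length, d = 8, 3, 4
q = [[math.sin(i + c) for c in range(d)] for i in range(max_len)]
k = [[math.cos(i - c) for c in range(d)] for i in range(max_len)]
v = [[float(i + c) for c in range(d)] for i in range(max_len)]

expected = reference_attention(q, k, v, length)
result, visited = blockwise_attention(q, k, v, length)
total_blocks = (max_len // 2) ** 2
print(f"visited {visited} of {total_blocks} blocks")
```

In this toy setup only 4 of the 16 tiles are computed, while the tensor shapes stay fixed at the padded length. Fixed shapes are generally easier for a compiler to handle than shapes that change per batch, which is one plausible reason for choosing padding over packing.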

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.