Sessa: Selective State Space Attention
Summary
Liubomyr Horbatko introduces Sessa, a new decoder architecture that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Traditional Transformers suffer from diluted token influence when attention is diffuse, and selective state-space models like Mamba exhibit exponential decay of long-range sensitivity. Sessa instead maintains a power-law memory tail of order $O(\ell^{-\beta})$ for $0<\beta<1$, where $\ell$ is the lag between tokens. This tail decays asymptotically more slowly than $1/\ell$, and the rate is tight in diffuse uniform-routing scenarios. Sessa also demonstrates flexible selective retrieval, including non-decaying profiles, and achieves superior performance on long-context benchmarks while remaining competitive with Transformer and Mamba baselines on short-context language modeling, under matched architectures and training budgets.
Key takeaway
For research scientists developing sequence models, Sessa offers a compelling alternative to Transformers and Mamba, particularly for tasks requiring robust long-range context understanding. Its ability to maintain a power-law memory tail and achieve flexible selective retrieval suggests it can overcome limitations of existing architectures in handling extensive dependencies. You should evaluate Sessa for applications where sustained long-term memory and efficient processing are critical.
Key insights
Sessa integrates attention into a recurrent feedback path, which yields a power-law memory tail in place of the diluted influence of diffuse Transformer attention or the exponential decay of Mamba.
Principles
- Attention within feedback paths improves long-range memory.
- For $0<\beta<1$, a power-law memory tail $O(\ell^{-\beta})$ decays more slowly than $O(1/\ell)$, so distant tokens retain more influence.
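To make the comparison of decay rates concrete, the following is a minimal numerical sketch, not taken from the paper. The function name `decay_profiles`, the exponent `beta=0.5`, and the exponential rate `rate=0.05` are illustrative assumptions; only the three functional forms (exponential, $1/\ell$, and $\ell^{-\beta}$) come from the summary above.

```python
import numpy as np

def decay_profiles(lags, beta=0.5, rate=0.05):
    """Evaluate three illustrative sensitivity profiles at the given lags.

    beta and rate are arbitrary illustrative values, not values from the paper.
    """
    lags = np.asarray(lags, dtype=float)
    return {
        "exponential": np.exp(-rate * lags),  # stand-in for exponential decay
        "inverse": 1.0 / lags,                # the 1/l reference rate
        "power_law": lags ** (-beta),         # l^(-beta) with 0 < beta < 1
    }

lags = np.array([10, 100, 1000, 10000])
profiles = decay_profiles(lags)
for name, values in profiles.items():
    print(f"{name:12s}", values)
```

With these settings the power-law profile stays above $1/\ell$ at every lag shown, and the gap widens as $\ell^{1-\beta}$, while the exponential profile falls below both once the lag is large.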
Method
Sessa places attention inside a feedback path to enable recurrent many-path aggregation within a layer, facilitating flexible selective retrieval.
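One way to build intuition for why feedback can produce a power-law tail is a toy recurrence with uniform routing. This is a sketch under our own assumptions, not the formulation used in the paper: each state receives its input plus a feedback term that averages all earlier states, scaled by a hypothetical gain `gamma` with $0<\gamma<1$. The names `uniform_feedback_response` and `loglog_slope` are likewise illustrative.

```python
import numpy as np

def uniform_feedback_response(length, gamma):
    """Impulse response of the toy recurrence

        s[t] = x[t] + (gamma / t) * sum(s[0], ..., s[t-1])

    for the input x[0] = 1 and x[t] = 0 otherwise.
    The weights gamma / t mimic uniform routing over the t earlier positions.
    """
    s = np.zeros(length)
    s[0] = 1.0
    running_sum = s[0]
    for t in range(1, length):
        s[t] = (gamma / t) * running_sum  # feedback over all earlier states
        running_sum += s[t]
    return s

def loglog_slope(response, lo, hi):
    """Least-squares slope of log(response) against log(lag) on [lo, hi)."""
    lags = np.arange(lo, hi)
    slope, _ = np.polyfit(np.log(lags), np.log(response[lo:hi]), 1)
    return slope

gamma = 0.5
response = uniform_feedback_response(20000, gamma)
slope = loglog_slope(response, 1000, 20000)
print("estimated log-log slope:", slope)
```

In this toy model the direct one-hop weight from position 0 to position $t$ is $\gamma/t$, which decays as $1/t$. Summing over all multi-hop paths through the feedback loop gives a response that behaves like $t^{\gamma-1}$, so the estimated slope is close to $\gamma - 1$, a power law with exponent $\beta = 1-\gamma$ between 0 and 1.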
In practice
- Use Sessa for long-context language modeling tasks.
- Consider Sessa for flexible selective retrieval needs.
Topics
- Sessa Decoder
- Selective State Space Models
- Transformer Architecture
- Power-Law Memory Tail
- Recurrent Attention
Code references
- xiaojieli0903/Mamba-FSCIL
- IntelLabs/Hardware-Aware-Automated-Machine-Learning
- furiosa-ai/ssm-state-tuning
- swagshaw/XLSR-Mamba
- facebookresearch/TimeSformer
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.