Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
Summary
Erase-then-Delta Attention (EDA) is a memory update rule for recurrent memory models that decouples the erase address from the write address. Standard delta-rule linear attention only corrects content at the current write address; EDA adds an independent mechanism that actively suppresses outdated memory. Each update has two steps: a targeted erase along a learned erase direction, followed by a standard delta-style corrective write along the current write direction. This increases memory management capacity while retaining the corrective benefits of delta-rule updates. In language model pretraining experiments, EDA achieved superior performance across the dense 2.5B and Mixture-of-Experts (MoE) 25B-A2.8B model families. Its advantage persisted after 80B-token long-context midtraining of the MoE models, where it showed the best performance in long-context evaluations at context lengths from 4k to 128k.
Key takeaway
Machine learning engineers optimizing recurrent memory models for long-context language tasks should consider implementing Erase-then-Delta Attention (EDA). The method improves memory management by decoupling erase and write operations, and it led to superior performance in models up to 25B parameters and at contexts up to 128k tokens. Integrating EDA can help a model handle stale information, particularly when passive memory decay is weak.
Key insights
Recurrent memory models benefit from decoupling memory erase and write operations for improved management.
Principles
- Recurrent memory models should actively suppress outdated memory.
- Decoupling erase and write addresses enhances memory management.
- Corrective writes can be preserved alongside targeted erasure.
Method
EDA first applies a targeted erase step along a learned erase direction, then performs a standard delta-style corrective write along the current write direction.
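The two steps can be illustrated with a minimal NumPy sketch. This summary does not give the paper's exact equations, so the form below is an assumption: the memory is a matrix S of shape (d_v, d_k), the erase step is taken to be S(I - alpha e e^T) for a unit erase direction e, and the write is the usual delta rule S + beta (v - S k) k^T. The function name `erase_then_delta` and the scalars `alpha` and `beta` are hypothetical, chosen for illustration only.

```python
import numpy as np

def erase_then_delta(S, k, v, e, alpha, beta):
    """One assumed erase-then-delta step on a memory matrix S of shape (d_v, d_k).

    k: write address (d_k,), v: value (d_v,), e: erase address (d_k,).
    alpha: erase strength, beta: write strength (both hypothetical scalars).
    """
    # Normalize both addresses so alpha and beta act as step sizes.
    k = k / np.linalg.norm(k)
    e = e / np.linalg.norm(e)
    # Step 1 (erase): remove a fraction alpha of the content stored along e,
    # i.e. S <- S (I - alpha * e e^T).
    S = S - alpha * np.outer(S @ e, e)
    # Step 2 (delta write): move the readout at k toward v,
    # i.e. S <- S + beta * (v - S k) k^T.
    S = S + beta * np.outer(v - S @ k, k)
    return S

# Toy usage: store v_old at k_old, then write v_new at k_new.
k_old = np.array([1.0, 0.0, 0.0])
k_new = np.array([0.0, 1.0, 0.0])
v_old = np.array([1.0, 2.0])
v_new = np.array([3.0, 4.0])

S = erase_then_delta(np.zeros((2, 3)), k_old, v_old, e=k_old, alpha=0.0, beta=1.0)
S_delta_only = erase_then_delta(S, k_new, v_new, e=k_old, alpha=0.0, beta=1.0)
S_with_erase = erase_then_delta(S, k_new, v_new, e=k_old, alpha=1.0, beta=1.0)

print(S_delta_only @ k_old)  # stale value is still readable at k_old
print(S_with_erase @ k_old)  # stale value has been suppressed
print(S_with_erase @ k_new)  # new value is stored at k_new
```

In this toy setting the keys are orthogonal, so a write at `k_new` alone cannot touch what is stored at `k_old`; only the separate erase address removes it. That is the sense in which the erase and write addresses are decoupled.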
In practice
- Implement a learned erase direction for memory cleanup.
- Integrate a two-step erase-then-write memory update.
- Evaluate EDA in long-context language model tasks.
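As a starting point for the first item, one way a learned erase direction could be produced is sketched below. The summary does not describe how the paper parameterizes it, so everything here is an assumption: a linear projection `W_e` of the token's hidden state gives the direction, and a sigmoid gate `w_alpha` gives an erase strength in (0, 1). The names `erase_address`, `W_e`, and `w_alpha` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def erase_address(x, W_e, w_alpha, eps=1e-6):
    """Map a hidden state x (d_model,) to an erase direction and strength.

    Assumed parameterization, not taken from the paper:
    W_e (d_k, d_model) projects x to the key space; w_alpha (d_model,)
    gates how strongly to erase.
    """
    e = W_e @ x
    e = e / (np.linalg.norm(e) + eps)   # approximately unit-norm direction
    alpha = sigmoid(w_alpha @ x)        # erase strength in (0, 1)
    return e, alpha

# Randomly initialized weights stand in for trained parameters.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_e = rng.normal(size=(d_k, d_model))
w_alpha = rng.normal(size=d_model)
x = rng.normal(size=d_model)

e, alpha = erase_address(x, W_e, w_alpha)
print(e.shape, float(alpha))
```

Keeping the erase projection separate from the key projection is what lets the model erase at a different address from the one it writes to; in training, `W_e` and `w_alpha` would be learned jointly with the rest of the layer.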
Topics
- Erase-then-Delta Attention
- Recurrent Memory Models
- Delta-Rule Attention
- Language Models
- Long-Context NLP
- Memory Management
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.