Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Erase-then-Delta Attention (EDA) is a memory update rule for recurrent memory models that decouples the erase address from the write address. Standard delta-rule linear attention corrects content only at the current write address; EDA adds an independent mechanism that actively suppresses outdated memory. Each update has two steps: a targeted erase along a learned erase direction, followed by a standard delta-style corrective write along the current write direction. This increases memory-management capacity while retaining the corrective benefits of delta-rule updates. In language model pretraining experiments, EDA showed superior performance across the dense 2.5B and Mixture-of-Experts (MoE) 25B-A2.8B model families. The advantage persisted after 80B-token long-context midtraining of the MoE models, where EDA achieved the best results in long-context evaluations at context lengths from 4k to 128k.

Key takeaway

Machine learning engineers optimizing recurrent memory models for long-context language tasks should consider implementing Erase-then-Delta Attention (EDA). The method improves memory management by decoupling erase and write operations, and it showed superior performance in models of up to 25B parameters and at context lengths of up to 128k. Integrating EDA can improve a model's handling of stale information, particularly when passive memory decay is weak.

Key insights

Recurrent memory models benefit from decoupling memory erase and write operations for improved management.

Principles

Recurrent memory models should actively suppress outdated memory.
Decoupling erase and write addresses enhances memory management.
Corrective writes can be preserved alongside targeted erasure.

Method

EDA first applies a targeted erase step along a learned erase direction, then performs a standard delta-style corrective write along the current write direction.
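The two-step update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the exact equations, so the erase step is assumed here to be a rank-one projection `S <- S - alpha * (S e) e^T`, followed by the familiar delta-rule write `S <- S - beta * (S k - v) k^T`. The names `erase_then_delta`, `alpha`, `beta`, and the toy 2x2 memory are all hypothetical.

```python
def matvec(S, x):
    """Read from memory: return S @ x for a matrix stored as a list of rows."""
    return [sum(s * c for s, c in zip(row, x)) for row in S]


def rank1_update(S, u, w, scale):
    """Return S + scale * outer(u, w) without mutating S."""
    return [[s + scale * ui * wj for s, wj in zip(row, w)] for row, ui in zip(S, u)]


def erase_then_delta(S, e, k, v, alpha, beta):
    # Step 1 (assumed form): targeted erase along e, S <- S - alpha * (S e) e^T.
    S = rank1_update(S, matvec(S, e), e, -alpha)
    # Step 2: delta-rule write along k, S <- S - beta * (S k - v) k^T.
    error = [pred - target for pred, target in zip(matvec(S, k), v)]
    return rank1_update(S, error, k, -beta)


# Toy memory that currently stores the value [1, 2] under the key [1, 0].
S0 = [[1.0, 0.0], [2.0, 0.0]]
stale_key, new_key, new_value = [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]

# alpha=0 disables the erase step, leaving a plain delta-rule write.
delta_only = erase_then_delta(S0, stale_key, new_key, new_value, alpha=0.0, beta=1.0)
with_erase = erase_then_delta(S0, stale_key, new_key, new_value, alpha=1.0, beta=1.0)

print(matvec(delta_only, stale_key))  # read at the stale key after a plain delta write
print(matvec(with_erase, stale_key))  # read at the stale key after erase-then-delta
print(matvec(with_erase, new_key))    # read at the newly written key
```

In this toy case the stale key is orthogonal to the write key, so a delta-rule write alone cannot touch the old entry: it only changes the memory along the write direction. The separate erase direction is what removes it.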

In practice

Implement a learned erase direction for memory cleanup.
Integrate a two-step erase-then-write memory update.
Evaluate EDA in long-context language model tasks.
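As a starting point for the first item, one plausible way to obtain a learned erase direction is to project each token's hidden state through its own weight matrix and normalize the result, with a sigmoid gate for the erase strength. The summary does not specify how the paper parameterizes this, so everything in the sketch below is an assumption: the function `erase_params`, the weights `W_e` and `w_alpha`, and the choice of L2 normalization and a sigmoid gate are hypothetical.

```python
import math


def l2_normalize(x, eps=1e-6):
    """Scale x to (approximately) unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / (norm + eps) for c in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def erase_params(h, W_e, w_alpha):
    """Hypothetical mapping from a hidden state h to (erase direction, erase strength).

    W_e is a learned projection separate from the key projection, so the erase
    address is not tied to the write address. w_alpha gates the erase strength
    into the open interval (0, 1).
    """
    e = l2_normalize([sum(w * x for w, x in zip(row, h)) for row in W_e])
    alpha = sigmoid(sum(w * x for w, x in zip(w_alpha, h)))
    return e, alpha


# Toy hidden state and weights; in a real model these would be trained parameters.
h = [1.0, -2.0, 0.5]
W_e = [[0.5, 0.1, 0.0], [0.0, -0.3, 0.4]]
w_alpha = [0.3, 0.1, -0.2]

e, alpha = erase_params(h, W_e, w_alpha)
print(e, alpha)
```

Keeping `W_e` separate from the key projection is the point of the design: the model can choose what to forget independently of where it writes next.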

Topics

Erase-then-Delta Attention
Recurrent Memory Models
Delta-Rule Attention
Language Models
Long-Context NLP
Memory Management

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.