ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

2026-06-10 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

ECA is an exemplar-free Incremental Learning (IL) approach to efficient continual alignment for Open-ended Image-to-Text Generation (OpenITG). It lets a model keep generating accurate, contextually relevant text for new images while preserving previously acquired knowledge, in particular when the predominant visual categories shift over time. ECA achieves continual alignment by incrementally adapting the alignment module inside pre-trained Vision-Language Models (VLMs) so that cross-modal representations stay high quality. It has three core mechanisms: a Mixture of Query (MoQ) module that provides task-specific query tokens; Fisher Dynamic Expansion (FeDEx), which uses a Fisher Information Matrix (FIM)-based metric to expand the model structure; and an embedding dictionary with Dictionary Replay (DR), which retains past knowledge without access to raw data. Evaluated on four new IL OpenITG benchmarks, ECA significantly mitigates catastrophic forgetting and improves IL performance.

Key takeaway

For machine learning engineers building vision-language models that process evolving image data streams, ECA offers an exemplar-free way to limit catastrophic forgetting. Its continual alignment approach combines task-specific query adaptation, FIM-based dynamic model expansion, and dictionary replay to preserve earlier knowledge while adapting to new visual categories.

Key insights

ECA enables efficient continual alignment in OpenITG by adapting the VLM alignment module to evolving visual data without access to raw data from earlier tasks.

Principles

Minimize interference with established alignment.
Dynamically expand model structure using FIM.
Retain past knowledge via embedding dictionary.
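
To make the FIM principle concrete, here is a minimal sketch of an empirical diagonal Fisher estimate on a toy logistic model, plus a growth decision built on it. The summary does not specify FeDEx's actual metric, so `importance_overlap`, `should_expand`, and the 0.5 threshold are hypothetical names and values chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(weights, samples):
    """Empirical diagonal Fisher: mean squared log-likelihood gradient per weight."""
    fisher = [0.0] * len(weights)
    for features, label in samples:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for i, x in enumerate(features):
            grad = (label - p) * x  # d/dw_i of log p(label | features)
            fisher[i] += grad * grad
    return [f / len(samples) for f in fisher]


def importance_overlap(fisher_old, fisher_new):
    """Overlap in [0, 1] between two normalised importance profiles (assumed metric)."""
    total_old, total_new = sum(fisher_old), sum(fisher_new)
    if total_old == 0.0 or total_new == 0.0:
        return 0.0
    return sum(min(a / total_old, b / total_new)
               for a, b in zip(fisher_old, fisher_new))


def should_expand(fisher_old, fisher_new, threshold=0.5):
    """Hypothetical rule: grow capacity when the new task relies on weights
    that were already important for the old task."""
    return importance_overlap(fisher_old, fisher_new) > threshold


# Toy data: the old task depends only on feature 0, the new task only on feature 1.
old_task = [([1.0, 0.0], 1), ([1.0, 0.0], 0)]
new_task = [([0.0, 2.0], 1), ([0.0, 2.0], 0)]
weights = [0.0, 0.0]

fisher_old = diagonal_fisher(weights, old_task)
fisher_new = diagonal_fisher(weights, new_task)
print(fisher_old, fisher_new, should_expand(fisher_old, fisher_new))
```

In this toy case the two tasks stress disjoint weights, so the overlap is zero and the rule reuses existing capacity. A new task that leaned on the same weights would push the overlap toward 1 and trigger expansion.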

Method

ECA employs a Mixture of Query (MoQ) for task-specific query tokens, Fisher Dynamic Expansion (FeDEx) for FIM-based structural growth, and Dictionary Replay (DR) with an embedding dictionary to preserve past knowledge.
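
One way to picture task-specific query tokens is a bank that gains a new set of query vectors per task and blends them at inference time. The sketch below is an illustration under stated assumptions, not the paper's MoQ design: the `QueryBank` class, the per-task routing keys, and the softmax mixing are all hypothetical.

```python
import math


class QueryBank:
    """Hypothetical bank of per-task query tokens with softmax routing."""

    def __init__(self):
        self.keys = []     # one routing key per task (assumption)
        self.queries = []  # one list of query-token vectors per task

    def add_task(self, key, query_tokens):
        # Earlier entries are never modified, so old-task queries stay frozen.
        self.keys.append(list(key))
        self.queries.append([list(q) for q in query_tokens])

    def mix(self, image_feature):
        if not self.keys:
            raise ValueError("no tasks registered")
        # Score each task by the dot product between its key and the image feature.
        scores = [sum(k * f for k, f in zip(key, image_feature))
                  for key in self.keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        n_tokens = len(self.queries[0])
        dim = len(self.queries[0][0])
        mixed = [[0.0] * dim for _ in range(n_tokens)]
        for w, tokens in zip(weights, self.queries):
            for t, token in enumerate(tokens):
                for d, value in enumerate(token):
                    mixed[t][d] += w * value
        return weights, mixed


bank = QueryBank()
bank.add_task(key=[1.0, 0.0], query_tokens=[[1.0, 1.0]])    # task 0
bank.add_task(key=[0.0, 1.0], query_tokens=[[-1.0, -1.0]])  # task 1
weights, mixed = bank.mix(image_feature=[10.0, 0.0])
print(weights, mixed)
```

Here an image feature aligned with the first key puts almost all of the mixing weight on the first task's queries, so the blended tokens stay close to that task's set.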

In practice

Adapt query tokens for new tasks via MoQ.
Expand model capacity dynamically with FeDEx.
Utilize Dictionary Replay for knowledge retention.
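
The replay step can be sketched as a small store of embeddings, rather than raw images, that is sampled alongside each new-task batch. What ECA actually stores and how it samples are not described in this summary, so the `EmbeddingDictionary` class, its per-task capacity, and the uniform sampling are assumptions made for illustration.

```python
import random


class EmbeddingDictionary:
    """Hypothetical store of past-task embeddings; no raw images are kept."""

    def __init__(self, capacity_per_task, seed=0):
        self.capacity_per_task = capacity_per_task
        self.entries = {}  # task_id -> list of embedding vectors
        self.rng = random.Random(seed)

    def store(self, task_id, embedding):
        bucket = self.entries.setdefault(task_id, [])
        if len(bucket) < self.capacity_per_task:  # drop extras once the bucket is full
            bucket.append(list(embedding))

    def replay_batch(self, current_batch, replay_size):
        """Return the current batch followed by sampled embeddings from past tasks."""
        past = [e for bucket in self.entries.values() for e in bucket]
        k = min(replay_size, len(past))
        return list(current_batch) + self.rng.sample(past, k)


memory = EmbeddingDictionary(capacity_per_task=2)
for vec in ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6]):
    memory.store(task_id=0, embedding=vec)  # the third vector exceeds capacity

batch = memory.replay_batch(current_batch=[[9.0, 9.0]], replay_size=5)
print(len(batch))
```

Capping each task's bucket keeps memory bounded as tasks accumulate. A real system would need a more deliberate selection policy than first-come storage.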

Topics

Open-ended Image-to-Text Generation
Incremental Learning
Continual Alignment
Vision-Language Models
Catastrophic Forgetting
Fisher Information Matrix
Mixture of Query

Code references

Snowball0823/ECA

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.