Key-Gram: Extensible World Knowledge for Embodied Manipulation

2026-05-18 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Key-Gram is a conditional-memory framework for embodied control that separates language-derived world knowledge from visual-state reasoning. Current vision-language-action policies often integrate linguistic knowledge and visual computation within a single backbone, which leads to modality competition and makes knowledge extension difficult. Key-Gram instead uses a memory module that decomposes instructions into task-specific key-grams, retrieves static linguistic priors via deterministic hashed lookup, and injects the retrieved entries into selected hidden layers using context-aware gating and convolutional fusion. The main backbone can then focus on visual reasoning and action inference, while instruction knowledge resides in an extensible external memory. Because lookup is O(1), the logical memory table can be partitioned during training and placed in host memory during inference. Key-Gram achieved average relative gains of 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus transfer, and 15.4%/8.1% on real-world dual-arm manipulation tasks.

Key takeaway

For research scientists developing embodied AI systems, Key-Gram offers a compelling architectural shift to improve compositional grounding and transferability. You should consider externalizing linguistic knowledge into a separate, extensible memory module to enhance visual reasoning capacity and simplify knowledge updates, especially for long-horizon and real-world manipulation tasks.

Key insights

Separating linguistic knowledge from visual reasoning improves embodied control and knowledge extensibility.

Principles

Decouple language priors from visual computation.
Use external memory for extensible knowledge storage.

Method

Decompose instructions into key-grams, retrieve linguistic priors via hashed lookup, and inject them into hidden layers using context-aware gating and convolutional fusion.
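
The retrieval-and-injection path described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes key-grams can be approximated by word n-grams, uses a SHA-256 digest as the deterministic hash, fills the table with placeholder values instead of learned priors, and reduces the context-aware gate to a single sigmoid of a dot product. Convolutional fusion is left out. All names (key_grams, slot_index, make_table, gated_inject, retrieve) and the sizes are hypothetical.

```python
import hashlib
import math

TABLE_SIZE = 1024  # hypothetical number of memory slots
DIM = 4            # hypothetical embedding width


def key_grams(instruction, max_n=2):
    """Split an instruction into word n-grams (an assumed stand-in for key-grams)."""
    tokens = instruction.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def slot_index(gram, table_size=TABLE_SIZE):
    """Map a gram to a slot deterministically.

    hashlib is used because Python's built-in hash() is salted per process
    for strings, so it would not give the same slot across runs.
    """
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def make_table(table_size=TABLE_SIZE, dim=DIM):
    """Build a table of placeholder vectors (learned priors in a real system)."""
    return [[math.sin(0.1 * (s + 1) * (d + 1)) for d in range(dim)]
            for s in range(table_size)]


def retrieve(instruction, table):
    """Fetch one table entry per gram; each fetch is a constant-time index."""
    return [table[slot_index(g, len(table))] for g in key_grams(instruction)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(hidden, entries):
    """Add each retrieved entry to the hidden state, scaled by a scalar gate.

    The gate depends on the current hidden state (the "context"), so an
    entry that does not match the context contributes less.
    """
    out = list(hidden)
    for entry in entries:
        gate = sigmoid(sum(h * v for h, v in zip(hidden, entry)))
        for d in range(len(out)):
            out[d] += gate * entry[d]
    return out


if __name__ == "__main__":
    table = make_table()
    entries = retrieve("pick up the cup", table)
    print(gated_inject([0.0] * DIM, entries))
```

Because the slot depends only on the gram text, new knowledge can in principle be added by writing more table entries without touching the backbone, which is the extensibility property the summary highlights. Distinct grams can collide in the same slot under this simple scheme.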

In practice

Implement O(1) lookup for efficient memory access.
Partition memory tables during training for flexibility.
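
One way to picture these two points together is a logical table split into shards, where a key's shard and slot are both computed from its hash so that a read never scans the table. The sketch below is an assumption-laden illustration, not the layout used by Key-Gram: the class name ShardedMemory, the modulo-based shard assignment, and the zero-vector default for unwritten slots are all hypothetical.

```python
import hashlib


class ShardedMemory:
    """A logical table of num_slots entries, partitioned across num_shards."""

    def __init__(self, num_slots, num_shards, dim):
        self.num_slots = num_slots
        self.num_shards = num_shards
        self.dim = dim
        # Each shard maps slot -> vector; in a real system each shard could
        # live on a different device during training or in host RAM at inference.
        self.shards = [dict() for _ in range(num_shards)]

    def locate(self, key):
        """Return (shard_id, slot) for a key in constant time."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % self.num_slots
        return slot % self.num_shards, slot

    def write(self, key, vector):
        shard_id, slot = self.locate(key)
        self.shards[shard_id][slot] = list(vector)

    def read(self, key):
        """Return the stored vector, or zeros if the slot was never written."""
        shard_id, slot = self.locate(key)
        return self.shards[shard_id].get(slot, [0.0] * self.dim)


if __name__ == "__main__":
    memory = ShardedMemory(num_slots=1024, num_shards=4, dim=3)
    memory.write("open drawer", [1.0, 2.0, 3.0])
    print(memory.locate("open drawer"), memory.read("open drawer"))
```

Since a read touches exactly one shard and one slot, the cost of a lookup does not grow with the table size, which is why keeping the table off the accelerator is plausible. Two different keys may still hash to the same slot, a case this sketch does not handle.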

Topics

Key-Gram
Embodied Control
Conditional Memory Framework
Linguistic Priors
RoboTwin2.0

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.