EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

EndoCoT is a framework for scaling endogenous Chain-of-Thought (CoT) reasoning in diffusion models. It targets two limitations that arise when Multimodal Large Language Models (MLLMs) are integrated as static text encoders: insufficient reasoning depth and guidance that stays invariant throughout decoding. An iterative thought guidance module repeatedly refines latent thought states and bridges them to the denoising process of the Diffusion Transformer (DiT), while a terminal thought grounding module keeps the reasoning trajectory aligned with textual supervision. Evaluated on benchmarks including Maze, TSP, VSP, and Sudoku, EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. It reaches 90% accuracy on Maze-32 and 95% on Sudoku-35, showing 25-40% gains over prior work on complex visual reasoning tasks.

Key takeaway

For Machine Learning Engineers building diffusion models for complex visual reasoning or image editing, EndoCoT offers a way to overcome the limitations of static guidance. Consider implementing iterative latent state refinement and terminal thought grounding to enable genuine Chain-of-Thought reasoning. The approach reaches 90% accuracy on Maze-32 and 95% on Sudoku-35 and allows controllable inference-time scaling, providing a path to more robust and interpretable generative models.

Key insights

EndoCoT enables diffusion models to perform self-guided, iterative Chain-of-Thought reasoning by dynamically updating latent states.

Principles

MLLMs require iterative refinement to encode complex logical constraints.
DiTs need dynamic conditioning to maintain alignment with complex logical constraints.
Joint MLLM and DiT optimization is crucial for effective spatial reasoning and grounding.

Method

EndoCoT uses iterative thought guidance to update latent states in the MLLM and conditions the DiT on the updated state at each reasoning step. Terminal thought grounding aligns the final state with the textual ground truth. Training follows a two-stage progressive strategy.
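The control flow described above can be illustrated with a toy NumPy sketch. This is not the EndoCoT implementation: refine_thought, denoise_step, generate, and terminal_grounding_loss are hypothetical names, the tanh update and the linear nudge are stand-ins for the MLLM and the DiT, and the mean-squared-error grounding loss is an assumption, since the summary does not specify the loss. The sketch only shows how a latent thought state can be refined at each reasoning step and used as fresh conditioning for the denoising steps that follow.

```python
import numpy as np


def refine_thought(state, weights):
    """Toy stand-in for one MLLM refinement pass over the latent thought state."""
    return np.tanh(state @ weights)


def denoise_step(latent, condition, rate=0.2):
    """Toy stand-in for one DiT denoising step: nudge the latent toward the condition."""
    return latent + rate * (condition - latent)


def terminal_grounding_loss(final_state, text_target):
    """Assumed grounding objective (MSE) between the final state and a text embedding."""
    return float(np.mean((final_state - text_target) ** 2))


def generate(initial_state, weights, noise, reasoning_steps=4, denoise_per_step=5):
    """Alternate thought refinement with denoising conditioned on the current state."""
    state, latent = initial_state, noise
    trajectory = []
    for _ in range(reasoning_steps):
        # Dynamic guidance: the condition changes at every reasoning step.
        state = refine_thought(state, weights)
        for _ in range(denoise_per_step):
            latent = denoise_step(latent, state)
        trajectory.append(state.copy())
    return latent, trajectory


# Hypothetical toy inputs; real models operate on token and image latents.
rng = np.random.default_rng(0)
dim = 8
weights = rng.normal(scale=0.5, size=(dim, dim))
initial_state = rng.normal(size=dim)
noise = rng.normal(size=dim)
text_target = rng.normal(size=dim)

latent, trajectory = generate(initial_state, weights, noise)
loss = terminal_grounding_loss(trajectory[-1], text_target)
```

In this toy, reasoning_steps plays the role of the inference-time scaling knob: raising it adds refinement passes before the final output. A static-encoder baseline would correspond to computing the condition once and reusing it for every denoising step.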

In practice

Solve complex visual reasoning tasks like Maze, TSP, Sudoku, and VSP.
Perform progressive, step-by-step image editing with controllable operations.
Scale inference-time reasoning controllably to improve accuracy on complex tasks.

Topics

Diffusion Models
Chain-of-Thought Reasoning
Multimodal Large Language Models
Diffusion Transformers
Visual Reasoning
Image Editing

Code references

InternLM/EndoCoT

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.