DODO: Discrete OCR Diffusion Models

2026-02-20 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

DODO (Discrete OCR Diffusion Models) is a Vision-Language Model (VLM) designed to remove the decoding bottleneck that autoregressive generation imposes on Optical Character Recognition (OCR). VLMs already achieve high OCR accuracy, but they emit tokens one at a time, which is slow for long documents. DODO instead uses block discrete diffusion, which decodes many tokens in parallel. Existing masked diffusion models (MDMs) struggle with OCR's rigid, exact-match requirements because of structural instabilities; DODO mitigates these by decomposing generation into causally anchored blocks. With this design, DODO reaches near state-of-the-art accuracy while delivering up to 3x faster inference than autoregressive baselines, processing approximately 63 tokens/sec with KV-caching. Block-wise training and the ability to scale block sizes to 256 tokens are central to its efficiency and accuracy on dense text.

Key takeaway

For AI Engineers and Research Scientists building high-throughput document processing systems, DODO shows that block discrete diffusion is a viable and efficient alternative to autoregressive decoding for OCR. The architecture is worth evaluating when inference latency on long documents is the bottleneck: the reported speedup is up to 3x, with accuracy that stays near state of the art. It is most relevant to applications that need rapid digitization and analysis of large volumes of text.
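
To make the throughput figures concrete, the short Python sketch below converts them into per-document decoding time. It uses the two reported numbers (approximately 63 tokens/sec with KV-caching and a speedup of up to 3x). The implied baseline of about 21 tokens/sec is an assumption that pairs those two figures, and the 2,000-token document length is a hypothetical value chosen for illustration; neither is stated in the source.

```python
def decode_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to decode num_tokens at a constant throughput."""
    return num_tokens / tokens_per_sec


DODO_TPS = 63.0    # reported: approximately 63 tokens/sec with KV-caching
MAX_SPEEDUP = 3.0  # reported: "up to 3x" over autoregressive baselines

# Assumption: the 3x figure corresponds to the 63 tokens/sec measurement,
# which would imply a baseline of roughly 21 tokens/sec.
baseline_tps = DODO_TPS / MAX_SPEEDUP

doc_tokens = 2000  # hypothetical length of a dense document

dodo_s = decode_seconds(doc_tokens, DODO_TPS)
baseline_s = decode_seconds(doc_tokens, baseline_tps)

print(f"block diffusion: {dodo_s:.1f} s, implied baseline: {baseline_s:.1f} s")
```

Because "up to 3x" is an upper bound, the gap on a given workload may be smaller than this estimate.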

Key insights

DODO uses block discrete diffusion to achieve faster, accurate OCR by enabling parallel token generation while maintaining structural integrity.

Principles

OCR's deterministic nature suits parallel decoding.
Rigid tasks like OCR require structural safety rails for diffusion models.
Block decomposition mitigates global diffusion synchronization errors.

Method

DODO employs block discrete diffusion: text is generated block by block, with each block conditioned on the blocks before it and the tokens inside a block decoded in parallel. KV-caching avoids recomputing earlier blocks, and the training block size is scaled to 256 tokens to increase parallelism while preserving causal consistency across blocks.
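
The decoding loop described above can be sketched in a few lines of Python. This is a minimal illustration of generic block-wise masked decoding, not DODO's actual implementation: the `predict` callback, the `MASK` id, the confidence-based unmasking schedule, and the toy predictor are all assumptions introduced here. The list of committed tokens stands in for the KV cache of completed blocks.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for a still-masked position
Prediction = Tuple[int, float]  # (token id, confidence)


def decode_blocks(
    predict: Callable[[List[int], List[int]], List[Prediction]],
    total_len: int,
    block_size: int,
    steps_per_block: int,
) -> List[int]:
    """Decode left to right in blocks; unmask several tokens per model call."""
    if block_size < 1 or steps_per_block < 1:
        raise ValueError("block_size and steps_per_block must be >= 1")
    committed: List[int] = []  # finished blocks (stand-in for the KV cache)
    while len(committed) < total_len:
        size = min(block_size, total_len - len(committed))
        block = [MASK] * size
        for step in range(steps_per_block):
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            if not masked:
                break
            preds = predict(committed, block)  # one parallel forward pass
            steps_left = steps_per_block - step
            reveal = -(-len(masked) // steps_left)  # ceiling division
            # Commit the most confident masked positions first.
            ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in ranked[:reveal]:
                block[i] = preds[i][0]
        committed.extend(block)  # later blocks condition on this prefix
    return committed


# Toy predictor (assumption): always knows the right token and is more
# confident about positions near the start of the block.
target = list(range(100, 110))
calls: List[int] = []


def toy_predict(prefix: List[int], block: List[int]) -> List[Prediction]:
    calls.append(1)
    start = len(prefix)
    return [(target[start + i], 1.0 / (1 + i)) for i in range(len(block))]


result = decode_blocks(toy_predict, len(target), block_size=4, steps_per_block=2)
print(result == target, len(calls))
```

In this toy run, ten tokens are produced with six model calls instead of the ten an autoregressive decoder would need. The saving grows with block size, provided the model can fill a block reliably in few steps.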

In practice

Use block diffusion for high-throughput OCR.
Integrate KV-caching for accelerated inference.
Train with block constraints for robust OCR performance.
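
One common way to express a block constraint during training is a block-causal attention mask: a token may attend to every token in its own block and in earlier blocks, but not to later blocks. The Python sketch below builds such a mask. It is an assumption about how "causally anchored blocks" could be encoded, not a detail confirmed by the source, and `block_causal_mask` is a hypothetical helper name.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to the ordinary causal mask used by autoregressive models; with a block size equal to the sequence length it becomes fully bidirectional, as in a global masked diffusion model.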

Topics

Optical Character Recognition
Diffusion Models
Vision-Language Models
Parallel Decoding
Block Diffusion

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.