DODO: Discrete OCR Diffusion Models
Summary
DODO (Discrete OCR Diffusion Models) is a new Vision-Language Model (VLM) designed to overcome the computational bottleneck of autoregressive decoding in Optical Character Recognition (OCR). Autoregressive VLMs achieve high OCR accuracy, but they generate tokens one at a time, which is slow for long documents. DODO instead uses block discrete diffusion, which decodes many tokens in parallel. Existing masked diffusion models (MDMs) struggle with OCR's rigid, exact-match requirements because of structural instabilities; DODO mitigates these by decomposing generation into causally anchored blocks. As a result, DODO reaches near state-of-the-art accuracy while delivering up to 3x faster inference than autoregressive baselines, processing approximately 63 tokens/sec with KV-caching. Block-wise training and the ability to scale block sizes to 256 tokens are crucial to its efficiency and accuracy on dense text.
Key takeaway
For AI Engineers and Research Scientists building high-throughput document processing systems, DODO demonstrates that block discrete diffusion is a viable and efficient alternative to autoregressive models for OCR. Consider this architecture when inference latency on long documents is the bottleneck: it can deliver up to 3x faster inference while keeping accuracy near state of the art. The approach is particularly useful for applications that need rapid digitization and analysis of large volumes of text.
Key insights
DODO uses block discrete diffusion to achieve faster, accurate OCR by enabling parallel token generation while maintaining structural integrity.
Principles
- OCR's deterministic nature suits parallel decoding.
- Rigid tasks like OCR require structural safety rails for diffusion models.
- Block decomposition mitigates global diffusion synchronization errors.
Method
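DODO employs block discrete diffusion: text is generated block by block, with each block conditioned on the blocks decoded before it, and the tokens inside a block are decoded in parallel. This sequential conditioning keeps the output causally consistent and lets the model use KV-caching. Training scales the block size to 256 tokens to maximize parallel efficiency and speed up inference.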
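The block-wise decoding idea can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not DODO's implementation: `toy_denoiser` is a hypothetical oracle that stands in for the model, and the confidence-ranked unmasking schedule and the `steps_per_block` parameter are illustrative choices that the summary does not specify.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_denoiser(target, lo, hi, rng):
    """Hypothetical stand-in for the model.

    Returns a (token, confidence) pair for every position in the block
    [lo, hi). A real model would predict tokens from the image and the
    already decoded prefix; this oracle simply reads the target.
    """
    return [(target[i], rng.random()) for i in range(lo, hi)]


def block_diffusion_decode(target, block_size=4, steps_per_block=2, seed=0):
    """Decode left to right in blocks, unmasking several tokens per step."""
    rng = random.Random(seed)
    out = [MASK] * len(target)
    calls = 0  # number of denoiser calls (model forward passes)
    for lo in range(0, len(target), block_size):
        hi = min(lo + block_size, len(target))
        # Positions before `lo` are final; a real model would keep them
        # as a cached prefix instead of recomputing them.
        for step in range(steps_per_block):
            masked = [i for i in range(lo, hi) if out[i] is MASK]
            if not masked:
                break
            preds = toy_denoiser(target, lo, hi, rng)
            calls += 1
            # Spread the remaining masked positions over the remaining steps.
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)  # ceiling division
            # Commit the k most confident predictions in this step.
            masked.sort(key=lambda i: preds[i - lo][1], reverse=True)
            for i in masked[:k]:
                out[i] = preds[i - lo][0]
    return out, calls


target = list("DODO reads text")  # 15 character "tokens"
decoded, calls = block_diffusion_decode(target, block_size=4, steps_per_block=2)
print("".join(decoded), calls)
```

With 15 tokens, a block size of 4, and 2 steps per block, the loop makes 8 denoiser calls, where token-by-token decoding would make 15. The saving comes from committing several tokens per call inside each block.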
In practice
- Use block diffusion for high-throughput OCR.
- Integrate KV-caching for accelerated inference.
- Train with block constraints for robust OCR performance.
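When sizing a deployment, a rough count of forward passes helps show why larger blocks matter. The sketch below assumes a fixed number of denoising steps per block; `steps_per_block` and the example values are hypothetical and do not come from the source. Pass counts are not wall-clock time, because each diffusion pass processes a whole block.

```python
import math


def autoregressive_passes(num_tokens):
    """One forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens, block_size, steps_per_block):
    """Assumes every block takes the same number of denoising steps."""
    return math.ceil(num_tokens / block_size) * steps_per_block


# Hypothetical page of 1024 tokens, block size 256, 32 steps per block.
print(autoregressive_passes(1024))            # 1024
print(block_diffusion_passes(1024, 256, 32))  # 128
```

Under these assumptions the pass count drops whenever `steps_per_block` is smaller than `block_size`. The realized speedup depends on the cost of each pass and on how well the cached prefix is reused.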
Topics
- Optical Character Recognition
- Diffusion Models
- Vision-Language Models
- Parallel Decoding
- Block Diffusion
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.