Direct Preference Optimization Beyond Chatbots
Summary
DharmaOCR, a specialized structured OCR model released in June 2026, demonstrated that Direct Preference Optimization (DPO) effectively mitigates text degeneration in vision-language models. In a benchmark of leading open-source and commercial models on Brazilian Portuguese structured document extraction, initial text degeneration rates ranged from below 1% to over 33%. Supervised fine-tuning (SFT) reduced these rates for most models but often failed to reach production-acceptable levels, which points to a structural limitation. A DPO training stage applied after SFT reduced text degeneration across all five tested model families, with an average reduction of 59.4% and a peak of 87.6%. A distinctive feature of the approach is that it used the SFT model's own degenerate outputs as rejected examples, turning them into a direct negative training signal. This addresses a limitation of SFT, which optimizes token by token and does not explicitly penalize completion-level failures such as repetition loops; DPO's preference signal, by contrast, applies to the full output.
Key takeaway
Machine learning engineers building structured generation pipelines should not rely on supervised fine-tuning (SFT) alone for output reliability. Add a Direct Preference Optimization (DPO) stage after SFT to explicitly target persistent failure modes such as text degeneration. Design the pipeline to capture the model's own identifiable failure outputs and use them as rejected examples for DPO; this one-time training investment improves output consistency without sacrificing extraction quality.
Key insights
Direct Preference Optimization (DPO) can effectively mitigate specific, objective failure modes by using a model's own failures as rejection signals.
Principles
- Supervised fine-tuning (SFT) optimizes token by token and does not explicitly penalize completion-level failures such as repetition loops.
- Text degeneration is a systems-level failure that involves the training objective and the learned distribution; it is not purely a decoding artifact.
- Task capability and degeneration resistance are distinct properties of a model's distribution.
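The contrast between the two objectives can be written out explicitly. The formulas below are the standard textbook forms of the SFT and DPO losses, not equations quoted from the original article; the symbols (policy \pi_\theta, reference model \pi_{\mathrm{ref}}, chosen output y_w, rejected output y_l, temperature \beta) are the usual notation and are assumptions here. In a setup like the one described, \pi_{\mathrm{ref}} would typically be the SFT model.

```latex
% SFT: a sum of per-token terms. Each token is scored given its prefix,
% so nothing in the objective scores the completion as a whole.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)}
    \left[ \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]

% DPO: compares two complete outputs for the same input x.
% y_w is the chosen (clean) output, y_l the rejected (degenerate) one.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The SFT term only rewards reproducing reference tokens, while the DPO term explicitly lowers the relative likelihood of a full rejected sequence, which is where a repetition loop can be penalized as a unit.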
Method
Generate candidate responses with an SFT model, then use an automated judge to label degenerate outputs as rejected examples and clean extractions as chosen, forming DPO preference pairs.
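The pairing step described above might look roughly like the following sketch. The summary does not say how the automated judge works, so the n-gram repetition heuristic here is only a hypothetical stand-in; `is_degenerate`, `PreferencePair`, `build_pairs`, and the thresholds are illustrative names and values, not part of DharmaOCR.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


def is_degenerate(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Hypothetical judge: flag text in which any word n-gram
    occurs more than `max_repeats` times (a crude repetition-loop check)."""
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) > max_repeats


@dataclass
class PreferencePair:
    prompt: str    # input document or its identifier
    chosen: str    # clean extraction
    rejected: str  # degenerate output from the SFT model itself


def build_pairs(prompt: str, candidates: list[str]) -> list[PreferencePair]:
    """Pair the first clean candidate with every degenerate candidate.

    Returns an empty list when the candidates are all clean or all
    degenerate, since no preference pair can be formed in that case.
    """
    clean = [c for c in candidates if not is_degenerate(c)]
    bad = [c for c in candidates if is_degenerate(c)]
    if not clean or not bad:
        return []
    return [PreferencePair(prompt, clean[0], rejected) for rejected in bad]
```

In use, `candidates` would be several samples drawn from the SFT model for the same document. A real judge would likely also check extraction quality before accepting a candidate as chosen, which this sketch does not do.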
In practice
- Implement a DPO stage after SFT to target specific, identifiable failure modes.
- Develop automated scoring to generate preference pairs from model outputs.
- Deliberately use the model's own failure outputs as DPO rejection examples.
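To make the DPO stage concrete, here is a minimal sketch of the standard per-pair DPO loss computed from sequence log-probabilities. It is an illustration of the published DPO objective, not DharmaOCR's training code; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions. In practice a training library would compute this over batches, with the SFT model typically serving as the frozen reference.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the total log-probability of a full output
    (summed over its tokens) under the policy or the reference model.
    """
    # How much more the policy favors each output than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the chosen output relative to the rejected one, so a degenerate completion used as the rejected example is pushed down as a whole sequence rather than token by token.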
Topics
- Direct Preference Optimization
- Text Degeneration
- Structured OCR
- Supervised Fine-tuning
- Vision-Language Models
- ML Pipelines
Best for: Computer Vision Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.