opendatalab / MinerU

Source: GitHub Trending: All languages · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics) · Depth: Intermediate, long

Summary

MinerU is a document parsing engine for LLM, RAG, and Agent workflows. It converts PDF, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON. Its VLM+OCR dual engine supports 109 languages, reconstructs layout with formulas rendered as LaTeX and tables as HTML, and handles scanned documents, handwriting, multi-column layouts, and tables that continue across pages.

The 3.4 release upgraded the OCR model of the "pipeline" backend to PP-OCRv6, which raised accuracy by 11% on OmniDocBench v1.6 and doubled processing speed. The 3.3 release added an "effort" parameter to the "hybrid" backend that speeds up parsing by 35% to 220% with minimal accuracy impact, and upgraded the VLM model to MinerU2.5-Pro-2605-1.2B. Version 3.1.0 moved to an Apache 2.0-based license and improved VLM handling of complex content. Version 3.0.0 added native DOCX parsing and architectural upgrades for high-throughput, multi-GPU deployments.

Key takeaway

For AI Engineers or MLOps Engineers building RAG or Agent systems, MinerU covers the document ingestion step. It parses PDF, DOCX, PPTX, and images with high accuracy and offers native integrations and multi-GPU deployment options, which can shorten the data preparation pipeline. A reasonable starting point for production is the "hybrid" backend with "effort=medium": evaluate it on your own mix of document types to confirm that the balance of speed and accuracy holds for your workload.
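One way to run that kind of evaluation is a small harness that times each setting and scores its output against reference Markdown you trust. The sketch below is illustrative only. `parse_fn` is a hypothetical wrapper you would write around however you invoke MinerU (it is not MinerU's API), the setting labels are arbitrary strings, and the `difflib` similarity ratio is a crude stand-in for a real benchmark such as OmniDocBench.

```python
import time
from difflib import SequenceMatcher
from typing import Callable


def evaluate_settings(
    parse_fn: Callable[[str, str], str],
    references: dict[str, str],
    settings: list[str],
) -> dict[str, dict[str, float]]:
    """Time each setting and compare its output with reference Markdown.

    parse_fn(path, setting) is a hypothetical wrapper that returns the
    parsed Markdown for one document. references maps each document path
    to the Markdown you consider correct for it.
    """
    if not references:
        raise ValueError("references must contain at least one document")

    results: dict[str, dict[str, float]] = {}
    for setting in settings:
        # Wall-clock time for parsing the whole document set once.
        start = time.perf_counter()
        outputs = {path: parse_fn(path, setting) for path in references}
        elapsed = time.perf_counter() - start

        # Character-level similarity in [0, 1]; 1.0 means identical text.
        scores = [
            SequenceMatcher(None, outputs[path], reference).ratio()
            for path, reference in references.items()
        ]
        results[setting] = {
            "seconds_per_doc": elapsed / len(references),
            "mean_similarity": sum(scores) / len(scores),
        }
    return results


if __name__ == "__main__":
    # Stand-in data and parser so the harness can run without MinerU.
    refs = {"a.pdf": "# Title\nSome body text.", "b.pdf": "## Table\n<table></table>"}

    def fake_parse(path: str, setting: str) -> str:
        text = refs[path]
        # "exact" returns the reference; anything else drops the second half.
        return text if setting == "exact" else text[: len(text) // 2]

    for name, metrics in evaluate_settings(fake_parse, refs, ["exact", "lossy"]).items():
        print(name, metrics)
```

Replacing `fake_parse` with a real wrapper and the labels with the backend and effort combinations you care about gives a side-by-side view of time per document and output similarity.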

Key insights

MinerU provides high-accuracy, multi-format document parsing with VLM+OCR for LLM, RAG, and Agent workflows.

Principles

Method

MinerU employs a VLM+OCR dual engine to convert diverse document formats into structured Markdown/JSON, reconstructing layouts, formulas (LaTeX), and tables (HTML) while preserving reading order and removing extraneous elements.
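Downstream of that conversion, a RAG pipeline usually splits the Markdown into retrievable chunks. The following is a minimal sketch of one such splitter, assuming the output uses `#`-style headings and `<table>` HTML blocks as the summary describes. It uses only the Python standard library, does not call MinerU, and `Chunk` and `chunk_markdown` are names introduced here for illustration.

```python
import re
from dataclasses import dataclass, field

# ATX-style Markdown heading: one to six '#' characters, then text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading: str
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading, keeping HTML tables whole."""
    chunks = [Chunk(heading="")]  # holds any text that precedes the first heading
    in_table = False
    for line in markdown.splitlines():
        # Track table boundaries so a '#' inside a cell never starts a new chunk.
        if "<table" in line:
            in_table = True
        match = None if in_table else HEADING.match(line)
        if match:
            chunks.append(Chunk(heading=match.group(2).strip()))
        else:
            chunks[-1].lines.append(line)
        if "</table>" in line:
            in_table = False
    # Drop the leading placeholder chunk if it ended up empty.
    return [c for c in chunks if c.heading or c.text]


if __name__ == "__main__":
    sample = (
        "# Title\nIntro text.\n\n## Results\n<table>\n<tr><td>\n"
        "# not a heading\n</td></tr>\n</table>\n$$E = mc^2$$\n"
    )
    for chunk in chunk_markdown(sample):
        print(repr(chunk.heading), len(chunk.text))
```

Splitting on headings keeps each table and formula with the section that introduces it, which tends to matter more for retrieval quality than a fixed character budget.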

Best for: Machine Learning Engineer, NLP Engineer, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.