opendatalab / MinerU

2024-02-29 · Source: GitHub Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, long

Summary

MinerU is a document parsing engine for LLM, RAG, and Agent workflows. It converts PDF, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON. Its VLM+OCR dual engine supports 109 languages, reconstructs layout with formulas output as LaTeX and tables as HTML, and handles scanned documents, handwriting, multi-column layouts, and tables that span pages (merged into a single table). Among recent releases, 3.4 upgraded the "pipeline" backend's OCR model to PP-OCRv6, raising accuracy by 11% on OmniDocBench v1.6 and doubling processing speed. Release 3.3 introduced an "effort" parameter for the "hybrid" backend that improves parsing speed by 35% to 220% with minimal accuracy impact, and upgraded the VLM model to MinerU2.5-Pro-2605-1.2B. Version 3.1.0 moved to an Apache 2.0-based license and improved VLM handling of complex content, while 3.0.0 added native DOCX parsing and architectural upgrades for high-throughput, multi-GPU deployments.

Key takeaway

For AI Engineers or MLOps Engineers building RAG or Agent systems, MinerU covers the document ingestion step. It parses PDF, DOCX, PPTX, and images with high accuracy, and its native integrations and multi-GPU deployment options reduce the custom work needed in a data preparation pipeline. Consider evaluating the "hybrid" backend with "effort=medium" as a balance of speed and accuracy in production environments, especially when processing diverse document types.
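
As a rough illustration of the data preparation step after parsing, the sketch below splits Markdown into one chunk per heading so each chunk can be embedded separately for retrieval. It does not call MinerU: SAMPLE_MD is a hypothetical stand-in for parser output, and Chunk and chunk_by_heading are illustrative names, not part of any MinerU API.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical sample standing in for parser output; real output will differ.
SAMPLE_MD = """# Report

Intro paragraph.

## Results

<table><tr><td>a</td><td>1</td></tr></table>

## Method

Energy is $E = mc^2$.
"""

# Matches ATX-style Markdown headings such as "## Results".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading: str  # heading text, empty for content before the first heading
    text: str     # body text under that heading


def chunk_by_heading(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading, keeping each body together."""
    chunks: list[Chunk] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append(Chunk(heading, text))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading = match.group(2).strip()
            body.clear()
        else:
            body.append(line)
    flush()  # close the final section
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_heading(SAMPLE_MD):
        print(repr(chunk.heading), "->", repr(chunk.text))
```

Splitting on headings keeps an HTML table or a formula inside a single chunk as long as it sits under one heading. This simple version would also split on lines starting with "#" inside fenced code, so a production pipeline would need a real Markdown parser.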

Key insights

MinerU provides high-accuracy, multi-format document parsing with VLM+OCR for LLM, RAG, and Agent workflows.

Principles

Prioritize structured output for machine readability.
Combine VLM and OCR for robust multilingual parsing.
Optimize for both accuracy and deployment flexibility.

Method

MinerU employs a VLM+OCR dual engine to convert diverse document formats into structured Markdown/JSON, reconstructing layouts, formulas (LaTeX), and tables (HTML) while preserving reading order and removing extraneous elements.
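
To make the output shape concrete, here is a minimal sketch of how a downstream consumer might pull HTML tables and LaTeX formulas back out of Markdown of this kind, using only the Python standard library. The sample string is hypothetical, and the assumption that display formulas are wrapped in $$ delimiters is mine rather than something the source states; TableRows, table_rows, and display_formulas are illustrative names.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample; the $$ delimiters are an assumption about the output.
SAMPLE = """<table><tr><th>Model</th><th>Score</th></tr><tr><td>A</td><td>0.9</td></tr></table>

$$
E = mc^2
$$
"""


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # ignore text outside table cells
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(markdown):
    """Return all table rows found in the HTML embedded in the Markdown."""
    parser = TableRows()
    parser.feed(markdown)
    parser.close()
    return parser.rows


# Non-greedy match across newlines between a pair of $$ delimiters.
DISPLAY_MATH_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def display_formulas(markdown):
    """Return the LaTeX source of each $$...$$ block, whitespace-trimmed."""
    return [body.strip() for body in DISPLAY_MATH_RE.findall(markdown)]


if __name__ == "__main__":
    print(table_rows(SAMPLE))
    print(display_formulas(SAMPLE))
```

Rows from all tables are returned in one flat list here; a consumer that needs per-table grouping would also track the table start and end tags.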

In practice

Integrate with LangChain, Dify, FastGPT.
Deploy on CPU, GPU, or domestic (China-made) AI chips.
Use "effort=medium" for faster hybrid parsing.
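
For capacity planning, it helps to translate the 35% to 220% speed gains reported for the "effort" parameter into processing time. The sketch below assumes "speed" means throughput (pages per unit time), which the source does not define, so treat the figures as an interpretation rather than a measured result. The function name time_fraction is illustrative.

```python
def time_fraction(speed_gain_percent: float) -> float:
    """Fraction of the original processing time left after a throughput gain.

    Assumes the gain is a throughput increase: a gain of g percent means
    (1 + g/100) times as many pages per unit time.
    """
    return 1.0 / (1.0 + speed_gain_percent / 100.0)


if __name__ == "__main__":
    for gain in (35, 220):
        print(f"{gain}% faster -> {time_fraction(gain):.0%} of the original time")
```

Under that reading, a 35% gain leaves roughly 74% of the original run time and a 220% gain leaves roughly 31%, so the same batch finishes in between about three quarters and one third of the time.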

Topics

Document Parsing
Large Language Models
Retrieval-Augmented Generation
Optical Character Recognition
Visual Language Models
Multi-GPU Deployment

Best for: Machine Learning Engineer, NLP Engineer, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.