Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section
Summary
This article details a method for reconstructing a PDF's table of contents ("toc_df") when the document lacks a native outline but includes a printed contents page. Retrieval Augmented Generation (RAG) systems need this structure for section-scoped retrieval, chunking, and summarization. The proposed approach is a cascade of three cases, with the cheaper methods tried first. Case 1 reads the PDF's native outline via "doc.get_toc()". Case 2 identifies contents pages that carry hyperlinks, using a link density check (e.g., 5+ internal links), and extracts titles and physical target pages directly with PyMuPDF's "page.get_links()". Case 3 handles printed contents pages without links: it first detects entries by dot-leader density and regex patterns, then aligns the printed page labels ("displayed_page") to actual physical "start_page" numbers, often by inferring a constant offset. An LLM is used to check the coherence of the reconstructed "toc_df" rather than for initial detection, and the output is uniform for downstream RAG processes.
Key takeaway
For AI Engineers building RAG systems, accurately parsing PDF structure is critical. If your documents lack native outlines, implement a cascade that first checks for clickable contents pages, then reads printed tables of contents. Always align printed page labels to physical document pages to prevent retrieval errors. This ensures your RAG system scopes answers precisely by section, improving relevance and reducing hallucination risks.
Key insights
Reconstruct a PDF's table of contents for RAG by trying the native outline first, then hyperlinks, then the printed contents text with page alignment.
Principles
- Prioritize deterministic parsing methods.
- Printed page numbers are labels, not physical pages.
- LLMs validate, not detect, document structure.
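The second principle can be made concrete with a minimal sketch of offset inference. It assumes a few contents entries have already been located in the document (for example by matching their titles against page text), giving pairs of printed label and physical page. The function names and the majority-vote rule are illustrative assumptions, not taken from the original article.

```python
from collections import Counter


def infer_page_offset(pairs):
    """Infer a constant shift between printed labels and physical pages.

    pairs: list of (displayed_page, physical_page) tuples for entries whose
    physical location is already known. Returns the most common difference,
    or None when no difference wins a strict majority (assumed rule).
    """
    if not pairs:
        return None
    diffs = Counter(physical - displayed for displayed, physical in pairs)
    offset, votes = diffs.most_common(1)[0]
    return offset if votes * 2 > len(pairs) else None


def apply_offset(entries, offset):
    """Turn (title, displayed_page) entries into rows with a physical start_page."""
    return [
        {"title": title, "displayed_page": displayed, "start_page": displayed + offset}
        for title, displayed in entries
    ]
```

Returning None instead of a weak guess lets a caller fall back to per-entry content matching when the document has no single constant offset, for instance when front matter and body are numbered separately.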
Method
The cascade first tries the native outline, then hyperlink-based extraction (a link density check on the results of "page.get_links()"). If both fail, it reads the printed contents page via dot-leader patterns and regex, then aligns printed labels to physical pages using offset inference or content matching.
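The printed-contents step might look like the sketch below. The regular expression, the thresholds, and the function names are assumptions chosen for illustration; the original article's exact patterns are not given here. An entry is taken to be a title, a run of at least three leader dots (optionally space-separated), and a trailing page number.

```python
import re

# Title, then 3+ leader dots (with optional spaces between them), then digits.
ENTRY_RE = re.compile(r"^\s*(?P<title>.+?)\s*(?:\.\s?){3,}\s*(?P<page>\d+)\s*$")


def parse_contents_lines(text):
    """Extract entries from page text as dicts with title and displayed_page."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append(
                {"title": match["title"], "displayed_page": int(match["page"])}
            )
    return entries


def looks_like_contents_page(text, min_ratio=0.3, min_entries=3):
    """Dot-leader density check: share of non-empty lines that parse as entries."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    entries = parse_contents_lines(text)
    return len(entries) >= min_entries and len(entries) / len(lines) >= min_ratio
```

The parsed "displayed_page" values are still printed labels at this point; they need the alignment step before they can be used as physical "start_page" numbers.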
In practice
- Use PyMuPDF's "doc.get_toc()" first.
- Treat a page with 5 or more internal links as a likely contents page.
- Apply the inferred page offset to map printed labels to physical pages.
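The first two checks can be sketched as follows, assuming "doc" is an already opened PyMuPDF document. The function names, the "max_pages" scan limit, and the choice to stop at the first qualifying page are illustrative assumptions. The threshold of 5 links follows the example above.

```python
LINK_GOTO = 1  # numeric value of pymupdf.LINK_GOTO (internal jump to a page)


def toc_from_outline(doc):
    """Case 1: native outline. get_toc() yields [level, title, page], page 1-based."""
    return [
        {"level": level, "title": title, "start_page": page}
        for level, title, page in doc.get_toc()
    ]


def toc_from_links(doc, min_links=5, max_pages=20):
    """Case 2: first early page with enough internal links is read as contents."""
    for pno in range(min(len(doc), max_pages)):
        page = doc[pno]
        internal = [ln for ln in page.get_links() if ln.get("kind") == LINK_GOTO]
        if len(internal) < min_links:
            continue
        entries = []
        for ln in internal:
            # Text under the clickable rectangle serves as the entry title.
            title = page.get_textbox(ln["from"]).strip()
            if title:
                # Link targets are 0-based; store 1-based physical pages.
                entries.append({"title": title, "start_page": ln["page"] + 1})
        return entries
    return []


def reconstruct_toc(doc):
    """Run the cheap cases in order; report which one produced the entries."""
    entries = toc_from_outline(doc)
    if entries:
        return "outline", entries
    entries = toc_from_links(doc)
    if entries:
        return "links", entries
    return "printed", []  # caller falls through to the printed-contents case
```

With PyMuPDF installed, a document opened via pymupdf.open(...) can be passed straight to reconstruct_toc. This sketch reads only the first qualifying page, so a contents listing that spans several pages would need the loop extended.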
Topics
- PDF Parsing
- RAG Systems
- Table of Contents Reconstruction
- Document Intelligence
- PyMuPDF
- Information Extraction
- LLM Coherence Check
Best for: AI Engineer, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.