Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

· Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

This article details a method for reconstructing a PDF's table of contents (stored as "toc_df") when the document lacks a native outline but includes a printed contents page. The reconstruction matters for Retrieval Augmented Generation (RAG) systems because it enables section-scoped retrieval, chunking, and summarization. The approach is a cascade of three cases, tried in order of increasing cost. Case 1 reads the PDF's native outline via "doc.get_toc()". Case 2 finds contents pages that carry hyperlinks, using a link density check (e.g., 5+ internal links), and extracts titles and physical target pages directly with PyMuPDF's "page.get_links()". Case 3 handles printed contents pages without links: it first detects entries by dot-leader density and regex patterns, then aligns each printed page label ("displayed_page") to the actual physical "start_page", often by inferring a constant offset. An LLM is used to check the reconstructed "toc_df" for coherence rather than to perform the initial detection, so downstream RAG processes receive a uniform output.
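
To make the detection step of Case 3 concrete, here is a minimal sketch. It assumes a contents entry looks like a title, a run of dot leaders, and a trailing page label. The regex, the helper names "parse_toc_line" and "looks_like_contents_page", and the 0.3 density threshold are illustrative assumptions, not values taken from the original article.

```python
import re

# Assumed entry shape: "<title> <3+ dots, optionally spaced> <page label>".
# The label may be arabic ("12") or roman ("ix"); both are kept as strings.
TOC_LINE = re.compile(
    r"^\s*(?P<title>.+?)\s*(?:\.\s?){3,}\s*(?P<label>\d+|[ivxlcdm]+)\s*$",
    re.IGNORECASE,
)


def parse_toc_line(line):
    """Return {'title', 'displayed_page'} for a dot-leader line, else None."""
    match = TOC_LINE.match(line)
    if match is None:
        return None
    return {
        "title": match.group("title").strip(),
        "displayed_page": match.group("label"),
    }


def looks_like_contents_page(text, min_density=0.3):
    """True when enough non-empty lines parse as dot-leader entries.

    min_density is a hypothetical threshold; tune it on your own corpus.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if parse_toc_line(ln) is not None)
    return hits / len(lines) >= min_density
```

Keeping the label as a string leaves roman-numeral front matter intact; converting it to an integer can wait until the alignment step.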

Key takeaway

For AI Engineers building RAG systems, accurately parsing PDF structure is critical. If your documents lack native outlines, implement a cascade that first checks for clickable contents pages, then reads printed tables of contents. Always align printed page labels to physical document pages to prevent retrieval errors. This ensures your RAG system scopes answers precisely by section, improving relevance and reducing hallucination risks.
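
As an illustration of the alignment step, the sketch below infers a constant offset by voting. It assumes integer printed labels, 1-based physical page numbers, and a plain case-insensitive substring match to find where a title actually appears. The names "infer_offset" and "align" are hypothetical, and the article's own matching logic may differ.

```python
from collections import Counter


def infer_offset(entries, page_texts, skip_pages=()):
    """Return the most common (physical - printed) difference, or None.

    entries:    list of (title, displayed_page) pairs with integer labels.
    page_texts: text of each physical page; index 0 is physical page 1.
    skip_pages: physical pages to ignore, e.g. the contents page itself,
                which would otherwise match every title.
    """
    votes = Counter()
    for title, displayed in entries:
        needle = title.casefold()
        for physical, text in enumerate(page_texts, start=1):
            if physical in skip_pages:
                continue
            if needle in text.casefold():
                votes[physical - displayed] += 1
                break  # count only the first page where the title appears
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


def align(entries, offset):
    """Attach a physical start_page to every printed entry."""
    return [
        {"title": title, "displayed_page": displayed, "start_page": displayed + offset}
        for title, displayed in entries
    ]
```

Voting across several entries keeps one stray match, such as a title quoted in the preface, from deciding the offset. A single constant will not fit documents whose numbering restarts, so a low vote count is a reasonable signal to fall back to per-entry matching.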

Key insights

Reconstruct a PDF's table of contents for RAG by trying the native outline first, then contents-page links, then printed text with page alignment.

Principles

Method

The cascade first tries the native outline, then hyperlink-based extraction ("page.get_links()" with a link-density check). If both fail, it reads the printed contents page using dot-leader patterns and regex, then aligns the printed labels to physical pages through offset inference or content matching.
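
The first two stages of the cascade can be sketched as follows. The code takes any object that behaves like a PyMuPDF document, so it relies only on "doc.get_toc()", iteration over pages, "page.get_links()", and "page.get_textbox()". The function names, the row layout, the "max_scan" limit, and the "printed_fallback" hook are assumptions made for illustration, not the article's implementation.

```python
LINK_GOTO = 1  # numeric value of PyMuPDF's LINK_GOTO (link to a page in the same file)


def toc_from_outline(doc):
    """Case 1: native outline. Rows are [level, title, page, ...], pages 1-based."""
    return [
        {"level": level, "title": title, "start_page": page, "source": "outline"}
        for level, title, page, *_rest in doc.get_toc()
    ]


def toc_from_links(doc, min_links=5, max_scan=20):
    """Case 2: first early page with at least min_links internal links.

    Simplification: a contents listing that spans several pages is not merged.
    """
    for page_index, page in enumerate(doc):
        if page_index >= max_scan:
            break
        links = [ln for ln in page.get_links() if ln.get("kind") == LINK_GOTO]
        if len(links) < min_links:
            continue
        rows = []
        for link in links:
            # Text under the clickable rectangle serves as the entry title.
            title = page.get_textbox(link["from"]).strip()
            if title:
                rows.append({
                    "level": 1,  # link targets carry no hierarchy information
                    "title": title,
                    "start_page": link["page"] + 1,  # link targets are 0-based
                    "source": "links",
                })
        if rows:
            return rows
    return []


def build_toc(doc, printed_fallback=None):
    """Try the cheap cases in order; hand off to a printed-page parser last."""
    for step in (toc_from_outline, toc_from_links):
        rows = step(doc)
        if rows:
            return rows
    return printed_fallback(doc) if printed_fallback else []
```

With PyMuPDF installed, "build_toc(fitz.open(path))" would exercise the same code against a real file. Titles read from link rectangles may still include dot leaders or the printed page number, so a cleanup pass is likely needed before the rows are used.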

Topics

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.