Stop Returning Flat Text from a PDF: The Relational Tables RAG Needs

2026-06-11 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, extended

Summary

This article describes a PDF parsing approach for RAG systems that models a document as a set of linked relational tables rather than flat text. A single parse_pdf function produces eight DataFrames: toc_df, page_df, line_df, image_df, span_df, object_registry, cross_ref_df, and parsing_summary. Together they capture document structure, content, typography, images, and cross-references, so downstream RAG components (retrieval, generation, highlighting) can query structured data. The approach is demonstrated on a 15-page LaTeX research paper ("Attention Is All You Need") and the 32-page NIST Cybersecurity Framework 2.0 document, and both yield the same output structure. Once the tables exist, DataFrame queries replace costly re-parsing, which the article presents as improving RAG pipeline performance and accuracy, especially for complex documents such as contracts with tables.

Key takeaway

AI engineers building RAG pipelines for complex enterprise documents need to move beyond simple text extraction. Implement a relational parsing approach that generates linked DataFrames such as line_df and toc_df. This preserves the structural and semantic information that flat text discards and prevents problems such as lost table context. Caching the parsed tables avoids repeated parsing and supports more accurate, context-aware retrieval and generation.

Key insights

PDF parsing for RAG should produce a relational set of linked tables, not flat text, to preserve document structure.

Principles

Model documents as relational tables.
line_df is the central source of truth.
Semantic signals improve retrieval.

Method

The parse_pdf function generates eight linked DataFrames (toc_df, page_df, line_df, image_df, span_df, object_registry, cross_ref_df, parsing_summary) from a PDF, enabling structured queries.
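
To make "linked" concrete, here is a minimal sketch of how three of the eight tables might relate through shared keys. It does not call parse_pdf. The tables are built by hand, and every column name (page_num, section_id, line_id, and so on) is an assumption chosen for illustration, not the schema the article's function emits.

```python
import pandas as pd

# Hypothetical miniature of three of the eight tables. All column names are
# illustrative assumptions, not the schema parse_pdf actually produces.
page_df = pd.DataFrame({
    "page_num": [1, 2],
    "width": [612.0, 612.0],
    "height": [792.0, 792.0],
})

toc_df = pd.DataFrame({
    "section_id": [1, 2],
    "title": ["Introduction", "Model Architecture"],
    "start_page": [1, 2],
})

line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3],
    "page_num": [1, 1, 2, 2],
    "section_id": [1, 1, 2, 2],
    "text": [
        "1 Introduction",
        "Recurrent models process tokens sequentially.",
        "2 Model Architecture",
        "The encoder maps an input sequence to representations.",
    ],
})

# Join each line to its section and page through the shared keys.
linked = (
    line_df
    .merge(toc_df[["section_id", "title"]], on="section_id", how="left")
    .merge(page_df[["page_num", "height"]], on="page_num", how="left")
)

# A structured query that flat text cannot answer directly.
lines_per_section = linked.groupby("title")["line_id"].count().to_dict()
```

The design point is that line_df carries foreign keys into the other tables, so any downstream component can recover a line's section or page with a join instead of re-reading the PDF.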

In practice

Use toc_df for section-level retrieval.
Filter line_df by column_position for tables.
Enrich image_df with vision LLM descriptions.
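
The first two practices can be sketched with plain pandas on toy data. Only the table names and the column_position column come from the article. The other columns (section_id, row_id, text), the helper lines_for_section, and the convention that 0 means full-width text while 1..n index table columns are assumptions made for this example.

```python
import pandas as pd

# Toy stand-ins for toc_df and line_df. Every column except column_position
# is a hypothetical name chosen for this sketch.
toc_df = pd.DataFrame({
    "section_id": [1, 2],
    "title": ["Definitions", "Fee Schedule"],
})

line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3, 4, 5],
    "section_id": [1, 2, 2, 2, 2, 2],
    "row_id": [0, 1, 2, 2, 3, 3],
    # Assumed convention: 0 = full-width text, 1..n = table column index.
    "column_position": [0, 0, 1, 2, 1, 2],
    "text": [
        "Services means the work described in Annex A.",
        "Fees are payable as follows.",
        "Onboarding", "1,000 EUR",
        "Monthly support", "250 EUR",
    ],
})


def lines_for_section(title: str) -> pd.DataFrame:
    """Section-level retrieval: resolve the title in toc_df, then pull its lines."""
    ids = toc_df.loc[toc_df["title"] == title, "section_id"]
    return line_df[line_df["section_id"].isin(ids)]


section = lines_for_section("Fee Schedule")

# Keep only table cells, then pivot so each table row becomes one record.
cells = section[section["column_position"] > 0]
table = cells.pivot(index="row_id", columns="column_position", values="text")
table.columns = ["item", "fee"]
rows = table.to_dict("records")
```

Under these assumptions, rows holds one dictionary per table row, which keeps each fee attached to its item when the chunk is passed to a generator.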

Topics

RAG Systems
PDF Parsing
Document Intelligence
Relational Data Models
Information Extraction
DataFrames

Best for: AI Engineer, MLOps Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.