You are throwing half of your tokens into the trash.

· Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

Uploading raw files like PDFs or Word documents to Large Language Models (LLMs) significantly increases token consumption and cost, as models process unnecessary "packaging" data such as page layouts, fonts, and binary structures. This overhead can account for up to half of paid tokens. Additionally, agglutinative languages like Turkish incur a "language tax" because tokenizers, primarily trained on English, break down suffixes into separate tokens, leading to approximately 1.5 tokens per Turkish word compared to 1 token per English word. To mitigate these issues, converting documents to Markdown format before ingestion is recommended. Markdown preserves structure while stripping extraneous data, resulting in fewer tokens, improved model accuracy by reducing noise, and faster response times. Tools like Microsoft's MarkItDown, IBM's Docling, and Pandoc facilitate this conversion.
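The figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name `estimate_paid_tokens` is hypothetical, the per-word ratios (about 1 token per English word, about 1.5 per Turkish word) and the "up to half" overhead are the approximations quoted in this summary, and real counts depend on the tokenizer and the file.

```python
def estimate_paid_tokens(word_count: int,
                         tokens_per_word: float,
                         overhead_fraction: float = 0.0) -> float:
    """Rough estimate of billed tokens for a document.

    overhead_fraction is the share of the *billed* tokens that is
    formatting "packaging" rather than content (0.5 = half of the bill).
    All inputs are assumptions taken from the summary, not measurements.
    """
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    content_tokens = word_count * tokens_per_word
    # If overhead is a fraction f of the total, content is (1 - f) of it.
    return content_tokens / (1.0 - overhead_fraction)


# 1,000 words of clean text, using the summary's approximate ratios.
english_clean = estimate_paid_tokens(1000, 1.0)
turkish_clean = estimate_paid_tokens(1000, 1.5)
# The same Turkish text uploaded as a raw file, worst case from the summary.
turkish_raw = estimate_paid_tokens(1000, 1.5, overhead_fraction=0.5)

print(english_clean, turkish_clean, turkish_raw)
```

Under these assumptions the two effects compound: the raw Turkish file costs roughly three times what the same word count of clean English text would.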

Key takeaway

AI engineers who want to lower LLM inference costs and improve model accuracy should prioritize converting input documents to Markdown before sending them to a model. The conversion reduces token consumption by stripping unnecessary formatting data, which matters most for agglutinative languages like Turkish, where each word already costs more tokens. Tools such as MarkItDown or Docling handle the conversion, saving tokens on API calls and helping the model focus on the core content, which leads to more reliable outputs.

Key insights

Uploading formatted files directly to LLMs, or working in agglutinative languages, inflates token costs and reduces processing efficiency.

Method

Convert source documents (PDF, Word, Excel) to Markdown (.md) using tools like MarkItDown, Docling, or Pandoc, then review the converted Markdown and clean out residual noise before feeding it to an LLM.
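A minimal sketch of this two-step method follows. It assumes the third-party `markitdown` package is installed and uses its documented `MarkItDown().convert(...)` call. The helper names `convert_to_markdown` and `clean_markdown`, and the cleanup heuristics themselves, are illustrative assumptions rather than anything prescribed by the original article.

```python
import re

# Heuristic: lines that are only a page number, e.g. "12", "3 / 10", "Page 3 of 10".
# This is an assumption about typical residue and may also drop a line
# that is legitimately just a number, so review the output.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(/|of)\s*\d+)?\s*$", re.IGNORECASE)


def convert_to_markdown(path: str) -> str:
    """Convert a PDF/Word/Excel file to Markdown text with MarkItDown."""
    # Imported lazily so the cleanup step works without the package.
    from markitdown import MarkItDown  # third-party, assumed installed

    return MarkItDown().convert(path).text_content


def clean_markdown(text: str) -> str:
    """Strip common residual noise from converted Markdown."""
    # Remove trailing whitespace left over from layout extraction.
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop lines that contain nothing but a page number.
    lines = [line for line in lines if not PAGE_NUMBER.match(line)]
    cleaned = "\n".join(lines)
    # Collapse runs of blank lines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip() + "\n"


sample = "# Title\n\nBody text.   \n\n\n\nPage 3 of 10\n\n12\n\nMore text.\n"
print(clean_markdown(sample))
```

In use, the output of `convert_to_markdown("report.pdf")` (a hypothetical file name) would be passed through `clean_markdown` and then reviewed by hand, since no heuristic catches every repeated header or footer.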

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.