You Are Throwing Half of Your Tokens in the Trash
Summary
Uploading raw files such as PDFs or Word documents to large language models (LLMs) inflates token consumption and cost, because the model also processes "packaging" data such as page layout, fonts, and binary structures. This overhead can account for up to half of the tokens you pay for. Agglutinative languages such as Turkish incur an additional "language tax": tokenizers are trained primarily on English text, so they split suffixes into separate tokens, which yields roughly 1.5 tokens per Turkish word compared with about 1 token per English word. To reduce both costs, the article recommends converting documents to Markdown before ingestion. Markdown preserves structure while stripping extraneous data, which means fewer tokens, less noise for the model and therefore better accuracy, and faster responses. Tools such as Microsoft's MarkItDown, IBM's Docling, and Pandoc handle the conversion.
Key takeaway
If you are an AI engineer trying to lower LLM inference costs and improve model accuracy, convert input documents to Markdown before sending them to the model. This removes formatting data that would otherwise be billed as tokens, which matters most when you process agglutinative languages such as Turkish. Converting with a tool such as MarkItDown or Docling can cut API spend substantially and lets the model focus on the core content, which leads to more reliable outputs.
Key insights
Uploading formatted files directly to an LLM, or prompting it in an agglutinative language, drastically inflates token costs and reduces processing efficiency.
Principles
- Tokenizers are optimized for English text.
- File formatting adds significant token overhead.
- Simpler input improves LLM focus and accuracy.
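To make these principles concrete, here is a rough back-of-the-envelope sketch of how the two effects compound. It uses only the approximate figures quoted in the summary (about 1 token per English word, about 1.5 per Turkish word, and packaging that may account for up to half of paid tokens). The function name and the 10,000-word document are hypothetical, and real ratios depend on the tokenizer and the file.

```python
def estimated_tokens(word_count: int, tokens_per_word: float,
                     packaging_share: float = 0.0) -> float:
    """Estimate billed tokens for a document.

    packaging_share is the fraction of the *billed* tokens that is layout,
    font, or binary overhead rather than content (0.5 means half the bill).
    """
    if not 0.0 <= packaging_share < 1.0:
        raise ValueError("packaging_share must be in [0, 1)")
    content_tokens = word_count * tokens_per_word
    # If overhead is a share s of the total, content is the remaining (1 - s).
    return content_tokens / (1.0 - packaging_share)


# Illustrative ratios taken from the summary, not measured values.
ENGLISH_TOKENS_PER_WORD = 1.0
TURKISH_TOKENS_PER_WORD = 1.5

words = 10_000  # hypothetical document length
english_markdown = estimated_tokens(words, ENGLISH_TOKENS_PER_WORD)
turkish_markdown = estimated_tokens(words, TURKISH_TOKENS_PER_WORD)
turkish_raw_file = estimated_tokens(words, TURKISH_TOKENS_PER_WORD, 0.5)

print(f"English, Markdown: {english_markdown:,.0f} tokens")
print(f"Turkish, Markdown: {turkish_markdown:,.0f} tokens")
print(f"Turkish, raw file: {turkish_raw_file:,.0f} tokens")
```

Under these assumptions, a raw Turkish file in the worst case costs about three times as many tokens as the same word count of clean English text, and conversion to Markdown removes the packaging share but not the language tax.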
Method
Convert source documents (PDF, Word, Excel) to Markdown (.md) with a tool such as MarkItDown, Docling, or Pandoc, then review the converted Markdown and clean out residual noise before passing it to an LLM.
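The conversion step might look like the minimal sketch below, which assumes the `markitdown` Python package is installed and uses its `MarkItDown().convert()` call and the `text_content` attribute of the result. The helper names, the `converted` output directory, and the optional `converter` parameter (which lets you swap in another tool or a test stub) are illustrative choices, not part of the original article.

```python
from pathlib import Path


def convert_to_markdown(source: str, converter=None) -> str:
    """Return the Markdown text for a source document (PDF, DOCX, XLSX, ...)."""
    if converter is None:
        # Imported lazily so this module still loads without the package.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(source)
    return result.text_content


def save_markdown(source: str, out_dir: str = "converted", converter=None) -> Path:
    """Convert `source` and write the result as <stem>.md inside `out_dir`."""
    out_path = Path(out_dir) / (Path(source).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(convert_to_markdown(source, converter), encoding="utf-8")
    return out_path
```

Calling `save_markdown("report.pdf")` (a hypothetical file name) would write `converted/report.md`. The output still needs the manual review described above before it goes to the model.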
In practice
- Use MarkItDown for general file conversion.
- Employ Docling for complex, table-heavy documents.
- Clean converted Markdown to remove page numbers.
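The cleanup step can be partly automated. The sketch below is one possible approach using only the standard library: it drops lines that consist solely of a page marker and collapses the blank runs left behind. The patterns it recognizes ("12", "Page 12", "Page 12 of 40", "- 12 -") are assumptions about what converters commonly leave behind. A bare number on its own line could occasionally be real content, so check the result rather than trusting it blindly.

```python
import re

# A line that is nothing but a page marker, e.g. "12", "Page 12 of 40", "- 12 -".
PAGE_LINE = re.compile(
    r"^\s*(?:-\s*\d+\s*-|(?:page\s+)?\d+(?:\s*(?:of|/)\s*\d+)?)\s*$",
    re.IGNORECASE,
)


def clean_markdown(text: str) -> str:
    """Remove page-number lines and collapse the blank runs they leave behind."""
    kept = [line for line in text.splitlines() if not PAGE_LINE.match(line)]
    cleaned = "\n".join(kept)
    # Three or more consecutive newlines become a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip() + "\n"


sample = (
    "# Report\n\nIntro text.\n\nPage 3 of 10\n\n"
    "## Section 2\n\n- 4 -\n\nThere were 12 items.\n\n17\n"
)
print(clean_markdown(sample))
```

Numbers inside sentences and headings, such as "There were 12 items." or "## Section 2", are left alone because the pattern only matches whole lines.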
Topics
- LLM Cost Optimization
- Tokenization Efficiency
- Markdown Conversion
- Document Preprocessing
- Agglutinative Languages
- MarkItDown Library
Best for: AI Engineer, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.