I Found 221 Bugs in vLLM. They All Had the Same Root Cause
Summary
An audit of vLLM, a widely deployed open-source inference engine for large language models, found 221 instances of silent integer truncation across its C++ and CUDA codebase. Each one occurs where an `int64_t` tensor dimension returned by PyTorch is assigned to a 32-bit `int`, which discards the upper 32 bits without warning. The result can be a GPU buffer overflow: a crafted GGUF model file with a dimension value such as 4,294,968,321 (2^32 + 1,025) truncates to 1,025, so a buffer sized from the truncated value is far too small. The issue is particularly exploitable in GGUF dequantization kernels, where tensor dimensions come directly from the model file. Similar vulnerabilities have resulted in 10 CVEs in other GGUF-parsing inference engines such as llama.cpp and Ollama, which shows that malicious model files are a recognized threat model.
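The truncation described above can be reproduced in a few lines. This is a minimal sketch, not code from vLLM: `narrow_dim` is a hypothetical helper, and the explicit cast stands in for the implicit narrowing assignment the audit flagged. The conversion is modular as of C++20; earlier standards leave it implementation-defined, although mainstream compilers have long behaved the same way.

```cpp
#include <cstdint>

// Hypothetical helper: models assigning an int64_t tensor dimension
// to a 32-bit int. Only the low 32 bits of the value survive.
constexpr std::int32_t narrow_dim(std::int64_t dim) {
    return static_cast<std::int32_t>(dim);
}

// 4,294,968,321 = 2^32 + 1,025, so the narrowed value is 1,025.
static_assert(narrow_dim(4294968321LL) == 1025,
              "upper 32 bits are discarded");

// Values that already fit in 32 bits pass through unchanged,
// which is why the bug stays hidden for ordinary models.
static_assert(narrow_dim(4096) == 4096, "in-range values are unaffected");
```

A buffer sized from the narrowed value would hold 1,025 elements, while any code that still iterates over the original 64-bit dimension would run far past its end.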
Key takeaway
CTOs and VPs of Engineering overseeing ML inference infrastructure should treat model files as untrusted input. Teams should immediately audit C++/CUDA codebases for silent integer truncation, particularly where PyTorch's `int64_t` tensor dimensions are narrowed to `int`. Add explicit bounds checks or use `int64_t` consistently to prevent GPU buffer overflows and reduce the risk of remote code execution via malicious model files, a threat already demonstrated in other popular inference engines.
Key insights
Silent integer truncation in ML inference engines creates a critical, unaddressed vulnerability class via crafted model files.
Principles
- Model files are untrusted input.
- Validate all values read from model files.
- Memory corruption via crafted files is a recognized threat.
Method
Replace `int` with `int64_t` for tensor dimension variables in C++/CUDA code, or add explicit bounds checks (`TORCH_CHECK`) when 32-bit integers are required, to prevent silent truncation.
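Where a 32-bit integer is unavoidable, the bounds check might look like the sketch below. `fits_in_int32` and `checked_dim` are hypothetical names, and `std::out_of_range` is an assumption standing in for `TORCH_CHECK` so that the example compiles without PyTorch; inside a real extension the same condition would be passed to `TORCH_CHECK`.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: true when dim is non-negative and representable
// as a 32-bit signed integer.
constexpr bool fits_in_int32(std::int64_t dim) {
    return dim >= 0 && dim <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical helper: narrows only after the range check has passed.
// The throw is a stand-in for TORCH_CHECK (assumption, see lead-in).
inline std::int32_t checked_dim(std::int64_t dim) {
    if (!fits_in_int32(dim)) {
        throw std::out_of_range("tensor dimension " + std::to_string(dim) +
                                " does not fit in a 32-bit int");
    }
    return static_cast<std::int32_t>(dim);
}
```

Calling `checked_dim(4294968321LL)` throws instead of silently yielding 1,025, so the failure surfaces before any buffer is allocated.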
In practice
- Audit C++/CUDA code for `int64_t` to `int` assignments.
- Implement explicit bounds checks for tensor dimensions.
- Treat all model file metadata as untrusted input.
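Treating metadata as untrusted usually means validating dimensions before any allocation happens. The following is a hedged sketch of one way to do that: `dims_are_safe` and its `max_elements` limit are hypothetical and do not come from vLLM or the GGUF format.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical validator for dimensions parsed from a model file.
// Returns false if any dimension is non-positive or if the total
// element count would exceed max_elements. A rank of 0 is treated
// as a scalar (one element).
constexpr bool dims_are_safe(const std::int64_t* dims, std::size_t rank,
                             std::int64_t max_elements) {
    if (max_elements <= 0) {
        return false;
    }
    std::int64_t total = 1;
    for (std::size_t i = 0; i < rank; ++i) {
        if (dims[i] <= 0) {
            return false;
        }
        // Compare by division so that total * dims[i] is never
        // computed when it could overflow int64_t.
        if (dims[i] > max_elements / total) {
            return false;
        }
        total *= dims[i];
    }
    return true;
}
```

The caller chooses `max_elements` from what the device can actually hold, so an oversized dimension is rejected at load time instead of reaching a kernel.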
Topics
- vLLM Security
- Integer Truncation
- GPU Buffer Overflow
- GGUF Model Format
- ML Inference Engines
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by HackerNoon.