Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software
Summary
The CWE-Trace framework evaluates LLMs for vulnerability detection in systems software using 834 manually curated Linux kernel samples across 74 CWEs, with a strict temporal split enforced. The study found that data contamination offers no measurable advantage: 84% of contaminated samples lacked usable memorization, and roughly 31% carried a CWE misclassification. Fine-tuning, specifically with LoRA variants, primarily shifts output thresholds without altering the underlying decision policies, a phenomenon termed "calibration without comprehension." Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that resist correction. The best detection score reached only 52.1% (2.1 pp above chance), and Top-1 accuracy for exact CWE ranking remained below 1.3%, indicating that current LLMs lack reliable security reasoning for systems software.
Key takeaway
If you are an AI security engineer evaluating LLMs for vulnerability detection in systems software, recognize that current models, even after fine-tuning, lack reliable security reasoning. Use diagnostic metrics such as DFI and HDD to assess true comprehension rather than relying solely on benchmark scores, which may reflect mere output calibration. Prioritize developing models with genuine understanding over superficial fine-tuning.
Key insights
LLMs fine-tuned for vulnerability detection achieve "calibration without comprehension," adapting outputs without true security reasoning.
Principles
- Data contamination provides no measurable advantage for LLM vulnerability detection.
- LLM backbone directional priors dominate fine-tuning outcomes.
- Detection and understanding are decoupled capabilities in LLMs.
Method
The CWE-Trace framework uses 834 Linux kernel samples, organized as vulnerable-patched pairs under a temporal split, and evaluates models with the DFI and HDD metrics.
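To make the idea of a temporal split over vulnerable-patched pairs concrete, here is a minimal sketch. It is not the CWE-Trace implementation: the `Sample` fields, the `fix_date` attribute, and the cutoff date are all hypothetical names chosen for illustration, and the summary does not state which cutoff the study used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    # All field names are assumptions made for this sketch.
    sample_id: str
    cwe: str
    fix_date: date          # hypothetical: date the patch landed
    vulnerable_code: str
    patched_code: str


def temporal_split(samples, cutoff):
    """Put samples fixed before the cutoff in train, the rest in test."""
    train = [s for s in samples if s.fix_date < cutoff]
    test = [s for s in samples if s.fix_date >= cutoff]
    return train, test


def expand_pairs(samples):
    """Turn each pair into two labeled examples; label 1 means vulnerable."""
    examples = []
    for s in samples:
        examples.append((s.vulnerable_code, 1))
        examples.append((s.patched_code, 0))
    return examples
```

Splitting at the pair level, before expanding into individual examples, keeps both versions of a function on the same side of the cutoff, so the patched half of a pair cannot leak into training while its vulnerable half sits in the test set.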
In practice
- Evaluate LLMs with leakage-free datasets and temporal splits.
- Distinguish output calibration from genuine security reasoning.
- Recognize that fine-tuning may not alter fundamental decision policies.
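One way to see the difference between output calibration and genuine discrimination is to score a model per vulnerable-patched pair rather than per sample. The sketch below is an illustrative check only; it does not compute DFI or HDD, whose definitions are not given in this summary, and the function names are hypothetical.

```python
from collections import Counter


def pair_outcomes(predictions):
    """Classify each pair by how the model labeled its two versions.

    predictions: list of (pred_on_vulnerable, pred_on_patched),
    where 1 means the model said "vulnerable" and 0 means "safe".
    """
    counts = Counter()
    for pred_vuln, pred_patched in predictions:
        if pred_vuln == 1 and pred_patched == 0:
            counts["correct"] += 1        # told the two versions apart
        elif pred_vuln == 1 and pred_patched == 1:
            counts["both_flagged"] += 1   # bias toward "vulnerable"
        elif pred_vuln == 0 and pred_patched == 0:
            counts["both_cleared"] += 1   # bias toward "safe"
        else:
            counts["inverted"] += 1       # got both versions wrong
    return counts


def sample_accuracy(predictions):
    """Ordinary per-sample accuracy over the same predictions."""
    hits = sum((pv == 1) + (pp == 0) for pv, pp in predictions)
    return hits / (2 * len(predictions))


# A model that mostly answers "vulnerable" regardless of the input.
preds = [(1, 1), (1, 1), (1, 1), (0, 0)]
print(sample_accuracy(preds))   # 0.5
print(pair_outcomes(preds))
```

In this toy input the per-sample accuracy is 50%, yet no pair is discriminated correctly. Shifting the model's threshold would move predictions between `both_flagged` and `both_cleared` without necessarily increasing `correct`, which is the kind of pattern the "calibration without comprehension" framing describes.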
Topics
- LLM Vulnerability Detection
- Systems Software Security
- Fine-tuning
- Data Contamination
- CWE-Trace
- Security Reasoning
Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.