Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, quick

Summary

The CWE-Trace framework evaluates LLMs for vulnerability detection in systems software using 834 manually curated Linux kernel samples across 74 CWEs, enforcing a strict temporal split. The study finds that data contamination offers no measurable advantage: 84% of contaminated samples lack usable memorization and ~31% carry a CWE misclassification. Fine-tuning, specifically with LoRA variants, primarily shifts output thresholds without altering the underlying decision policies, a phenomenon termed "calibration without comprehension." Models exhibit stable, systematic failure modes (DFI from -85.5 to +94.8 pp) that resist correction. The best detection score reached only 52.1% (2.1 pp above chance), and exact CWE ranking stayed below 1.3% Top-1 accuracy, indicating that current LLMs lack reliable security reasoning for systems software.

Key takeaway

AI security engineers evaluating LLMs for vulnerability detection in systems software should recognize that current models, even after fine-tuning, lack reliable security reasoning. Use diagnostic metrics such as DFI and HDD to assess comprehension rather than relying solely on benchmark scores, which may reflect only output calibration. Prioritize developing models with genuine understanding over superficial fine-tuning.

Key insights

LLMs fine-tuned for vulnerability detection exhibit "calibration without comprehension": fine-tuning adapts their outputs without adding true security reasoning.

Principles

Data contamination provides no measurable advantage for LLM vulnerability detection.
LLM backbone directional priors dominate fine-tuning outcomes.
Detection and understanding are decoupled capabilities in LLMs.

Method

The CWE-Trace framework evaluates models on 834 manually curated Linux kernel samples, using vulnerable-patched pairs under a strict temporal split, and reports the DFI and HDD diagnostic metrics.
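
A temporal split of this kind can be sketched as follows. This is a minimal illustration, not the CWE-Trace implementation: the `Sample` fields, the `fix_date` attribute, the cutoff date, and the three example records are all hypothetical, chosen only to show the idea of keeping every evaluation sample on or after a fixed date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    sample_id: str   # hypothetical identifier
    cwe: str         # CWE label, e.g. "CWE-416"
    fix_date: date   # assumed field: date the fixing patch landed


def temporal_split(samples, cutoff):
    """Return (train, test): train strictly before cutoff, test on or after it."""
    train = [s for s in samples if s.fix_date < cutoff]
    test = [s for s in samples if s.fix_date >= cutoff]
    return train, test


# Made-up records, for illustration only.
samples = [
    Sample("a", "CWE-416", date(2021, 3, 1)),
    Sample("b", "CWE-476", date(2023, 7, 15)),
    Sample("c", "CWE-787", date(2024, 1, 9)),
]
train, test = temporal_split(samples, cutoff=date(2023, 1, 1))
print([s.sample_id for s in train], [s.sample_id for s in test])
```

In a real evaluation the cutoff would presumably be tied to the model's training-data cutoff, so that no test sample could have been seen during pretraining.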

In practice

Evaluate LLMs with leakage-free datasets and temporal splits.
Distinguish output calibration from genuine security reasoning.
Recognize fine-tuning may not alter fundamental decision policies.
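
One way to see the difference between a shifted output threshold and real discrimination is to score predictions on vulnerable-patched pairs. The sketch below is an illustrative check, not the DFI or HDD metrics (this summary does not define them); the function names and the two toy prediction sets are assumptions. Each pair holds the model's verdict for the vulnerable version and for the patched version, with `True` meaning "flagged as vulnerable".

```python
def positive_rate(pairs):
    """Fraction of all predictions (both halves of every pair) flagged vulnerable."""
    flags = [flag for pair in pairs for flag in pair]
    return sum(flags) / len(flags)


def accuracy(pairs):
    """Per-sample accuracy: the vulnerable half should be True, the patched half False."""
    correct = sum(int(v) + int(not p) for v, p in pairs)
    return correct / (2 * len(pairs))


def pair_discrimination(pairs):
    """Fraction of pairs where the vulnerable half is flagged and the patched half is cleared."""
    return sum(1 for v, p in pairs if v and not p) / len(pairs)


# Toy predictions (made up): one model flags everything, the other flags nothing.
always_vulnerable = [(True, True)] * 4
always_safe = [(False, False)] * 4

for name, preds in [("always_vulnerable", always_vulnerable), ("always_safe", always_safe)]:
    print(name, positive_rate(preds), accuracy(preds), pair_discrimination(preds))
```

The two toy models sit at opposite ends of the positive-rate scale, yet both score chance-level accuracy and separate no pair. A fine-tune that only moves a model between these two behaviours has changed its calibration, not its ability to tell a bug from its fix.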

Topics

LLM Vulnerability Detection
Systems Software Security
Fine-tuning
Data Contamination
CWE-Trace
Security Reasoning

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.