The Hitchhiker's Guide to Program Analysis, Part III: Mostly Harmless LLMs

2026-02-09 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Evident is a bug-analysis system that improves static-warning triage by separating Large Language Model (LLM) assistance from formal reasoning about program behavior. It uses an LLM to construct warning-specific analysis harnesses, which are validated before a formal backend such as Frama-C/Eva analyzes them. No-bug decisions are therefore grounded in formal methods, not LLM judgment. On 200 Android kernel driver warnings, Evident correctly classified 151 (76%), discharging 111 false alarms without dismissing any confirmed bug. It also rediscovered a vulnerability that prior LLM-based filtering and manual triage had overlooked. The core implementation is approximately 22.2 KLOC of Python.

Key takeaway

AI security engineers and research scientists who triage static analysis warnings should keep the two roles separate: LLMs construct the analysis context, and formal methods decide whether a warning can be discharged. Do not rely on LLM-generated rationales for "no-bug" verdicts, because a plausible-sounding rationale can dismiss a real vulnerability. Instead, validate each LLM-generated harness and pass it to a formal backend, so that a warning is dropped only when the analysis proves it safe. In the reported evaluation, this strategy dismissed no confirmed bugs, unlike the LLM-only filtering it was compared against.

Key insights

Program behavior decisions must be grounded in formal analysis, with LLMs assisting context construction, not verdicts.

Principles

LLMs aid bug analysis, but formal methods decide program behavior.
Discharging a warning requires proving that the error state is unreachable.
Validate LLM-generated artifacts before formal analysis.

Method

Evident works in three steps: an LLM constructs a warning-specific analysis harness, admission checks validate that harness, and a formal backend (e.g., Frama-C/Eva) performs the final reachability check.
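The control flow of these three steps can be sketched in Python. This is a minimal illustration under stated assumptions, not Evident's code: `StaticWarning`, `Verdict`, `triage`, and the three callables are hypothetical names introduced here. The point it shows is that the only path to a discharge runs through the formal proof, and every failure along the way keeps the warning open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    DISCHARGED = "discharged"  # backend proved the error state unreachable
    KEPT = "kept"              # warning stays open for further triage


@dataclass
class StaticWarning:
    # Hypothetical record for one static-analysis warning.
    file: str
    line: int
    kind: str


def triage(
    warning: StaticWarning,
    build_harness: Callable[[StaticWarning], Optional[str]],
    admit: Callable[[str], bool],
    prove_unreachable: Callable[[str, StaticWarning], bool],
) -> Verdict:
    """Discharge only on a formal proof; any other outcome keeps the warning."""
    # Step 1: the LLM proposes a harness. It may fail or be wrong.
    harness = build_harness(warning)
    if harness is None:
        return Verdict.KEPT
    # Step 2: admission checks reject harnesses that cannot be trusted.
    if not admit(harness):
        return Verdict.KEPT
    # Step 3: only the formal backend can say "no bug".
    if prove_unreachable(harness, warning):
        return Verdict.DISCHARGED
    return Verdict.KEPT
```

Note that the LLM never returns a verdict in this sketch. Its output is an input to the checks, so a bad harness costs a missed discharge, not a missed bug.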

In practice

Use type-preserving abstract-value initialization for inputs.
Implement validation checks for LLM-generated harnesses.
Separate LLM context generation from formal analysis verdicts.
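As a rough illustration of the first two practices, the Python sketch below renders a small C harness and then runs admission checks on it. Everything here is an assumption for illustration: `render_harness`, `admit`, and the specific checks are hypothetical, and the summary does not describe Evident's actual checks or its exact notion of type-preserving initialization. The sketch interprets it as declaring each input with the target's own parameter type and filling it with unknown contents through Frama-C's `Frama_C_make_unknown` builtin, with no concrete value and no cast to another type.

```python
import re
from typing import List, Tuple

Param = Tuple[str, str]  # (C type, parameter name)


def render_harness(target: str, params: List[Param]) -> str:
    """Emit a C harness that calls `target` with abstract inputs."""
    lines = ['#include "__fc_builtin.h"', "", "void harness(void) {"]
    for ctype, name in params:
        # Keep the declared type; make the contents unknown to the analyzer.
        lines.append(f"    {ctype} {name};")
        lines.append(f"    Frama_C_make_unknown((char *)&{name}, sizeof({name}));")
    args = ", ".join(name for _, name in params)
    lines.append(f"    {target}({args});")
    lines.append("}")
    return "\n".join(lines)


def admit(harness: str, target: str, params: List[Param]) -> bool:
    """Hypothetical admission checks run before any formal analysis."""
    # Check 1: the target function is called exactly once.
    if len(re.findall(rf"\b{re.escape(target)}\s*\(", harness)) != 1:
        return False
    # Check 2: every input keeps its type and is initialized abstractly.
    for ctype, name in params:
        if f"{ctype} {name};" not in harness:
            return False
        if f"Frama_C_make_unknown((char *)&{name}," not in harness:
            return False
    # Check 3: no ACSL annotations, which could constrain inputs and
    # hide a reachable error state.
    if "//@" in harness or "/*@" in harness:
        return False
    return True
```

A harness that hard-codes an input (for example `cmd = 0;`) fails check 2. It is rejected because a concrete value would let the backend "prove" safety for that one value only. A real harness would also need the driver's declarations in scope, and pointer parameters need more careful treatment than raw unknown bytes.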

Topics

Program Analysis
Large Language Models
Static Analysis
Bug Detection
Kernel Drivers
Formal Methods
Harness Validation

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.