Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering) · Depth: Advanced, extended

Summary

PrecisionDiff is an automated differential testing framework that systematically detects behavioral disagreements in large language models (LLMs) caused by differing numerical precision configurations, such as bfloat16, float16, int16, and int8. It targets an often overlooked problem: minor inconsistencies that appear when an LLM is deployed under a precision different from the one used during safety evaluation. The framework generates precision-sensitive test inputs and then performs cross-precision comparative analysis. Instantiated for safety alignment verification, PrecisionDiff identifies "jailbreak divergence", where an input rejected as harmful under one precision produces an unsafe response under another. Experiments on five open-source LLMs (Llama-2, Llama-3, Vicuna, Mistral, Guanaco) show that these precision-induced jailbreaks are widespread, with detection success rates of up to 100% for int16 vs. int8 transitions. PrecisionDiff significantly outperforms vanilla testing methods, achieving up to an 8.5x improvement in detection success rate on Llama-2-7B.
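
The cross-precision comparison described above can be pictured with a small harness. This is only a minimal sketch, not PrecisionDiff itself: the names `is_refusal` and `find_divergences`, the keyword list, and the stand-in generators are all assumptions made for illustration. In a real setup each generator would wrap the same model loaded under a different precision, and a stronger safety judge would replace the keyword check.

```python
from typing import Callable, Dict, List

# Hypothetical refusal markers (an assumption, not taken from the source).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai")


def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response refuses the request."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def find_divergences(
    prompts: List[str],
    generators: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Run each prompt under every precision and keep the ones whose verdicts disagree."""
    divergent = []
    for prompt in prompts:
        verdicts = {name: is_refusal(gen(prompt)) for name, gen in generators.items()}
        if len(set(verdicts.values())) > 1:  # at least one precision disagrees
            divergent.append({"prompt": prompt, "refused": verdicts})
    return divergent


# Stand-in generators that mimic one model under two precisions (hypothetical).
def fake_bf16(prompt: str) -> str:
    return "I'm sorry, but I can't help with that."


def fake_int8(prompt: str) -> str:
    if "<suffix>" in prompt:
        return "Sure, here is an outline."
    return "I'm sorry, but I can't help with that."


report = find_divergences(
    ["plain request", "plain request <suffix>"],
    {"bf16": fake_bf16, "int8": fake_int8},
)
for item in report:
    print(item["prompt"], item["refused"])
```

Only the second prompt is reported, because it is the only one on which the two stand-in generators disagree.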

Key takeaway

If you are a CTO or VP of Engineering deploying LLMs, account for precision-induced behavioral inconsistencies. Even safety-aligned models can exhibit jailbreak vulnerabilities when run in lower-precision inference modes such as FP16 or INT8. Use differential testing with tools like PrecisionDiff to identify these risks proactively, especially in critical applications such as embodied AI. Where possible, keep higher precision in the input-stage, attention, and output layers to reduce divergence and improve overall system reliability.

Key insights

Numerical precision variations in LLM inference can systematically create critical safety vulnerabilities, leading to jailbreak divergence.
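
One way to see how a precision change could flip a decision is a toy example in which two logits are separated by a margin smaller than the spacing of float16 values. The logits below are made up, and the source does not say that this is the specific mechanism behind the reported divergences; the sketch only shows that rounding can erase a small margin.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def decide(logits: dict) -> str:
    """Return the label with the largest logit; ties go to the first label."""
    return max(logits, key=logits.get)


# Made-up logits with a very small margin in favor of "refuse".
full = {"comply": 10.001, "refuse": 10.003}
half = {label: to_float16(value) for label, value in full.items()}

print(decide(full))  # refuse
print(decide(half))  # comply: both logits round to 10.0, so the tie goes to the first label
```

Near 10.0, adjacent float16 values are 0.0078125 apart, so both logits collapse to the same number and the outcome depends on an arbitrary tie-break.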

Method

PrecisionDiff generates adversarial suffixes with a dual-precision joint optimization strategy. It simultaneously targets a harmful objective under one precision and a safe objective under another, which steers the search toward precision-sensitive decision boundaries.
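
The joint objective can be sketched as a single score that is low only when both goals are met at once. The summary does not give the exact loss or search procedure, so everything below is an assumption: `joint_loss`, the `weight` parameter, the random coordinate search, and the toy loss functions are illustrative stand-ins. In practice each loss would be computed from the same model run under a different precision.

```python
import random
from typing import Callable, List, Sequence, Tuple

Loss = Callable[[Sequence[str]], float]


def joint_loss(
    suffix: Sequence[str],
    harmful_loss_a: Loss,
    safe_loss_b: Loss,
    weight: float = 1.0,
) -> float:
    """Low when precision A moves toward the harmful target and precision B toward refusal."""
    return harmful_loss_a(suffix) + weight * safe_loss_b(suffix)


def random_coordinate_search(
    vocab: List[str],
    length: int,
    harmful_loss_a: Loss,
    safe_loss_b: Loss,
    steps: int = 200,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Mutate one suffix position at a time and keep changes that lower the joint loss."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]
    best = joint_loss(suffix, harmful_loss_a, safe_loss_b)
    for _ in range(steps):
        candidate = list(suffix)
        candidate[rng.randrange(length)] = rng.choice(vocab)
        score = joint_loss(candidate, harmful_loss_a, safe_loss_b)
        if score < best:
            suffix, best = candidate, score
    return suffix, best


# Toy stand-ins for the per-precision losses (hypothetical, for illustration only).
def toy_harmful_loss_a(suffix: Sequence[str]) -> float:
    return float(sum(tok != "alpha" for tok in suffix))


def toy_safe_loss_b(suffix: Sequence[str]) -> float:
    return float(sum(tok == "gamma" for tok in suffix))


found, score = random_coordinate_search(
    ["alpha", "beta", "gamma"], 4, toy_harmful_loss_a, toy_safe_loss_b
)
print(found, score)
```

Summing the two terms means a suffix that satisfies only one precision still scores poorly, which is what pushes the search toward inputs on which the two precisions disagree.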

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.