Looking for backdoors in Jane Street LLMs
Summary
The article recounts the author's attempt at the Jane Street LLM backdoor challenge, which asked participants to find hidden triggers in four models: a warmup model based on Qwen2.5-7B-Instruct (about 7.6B parameters) and three large models (M1, M2, M3) based on DeepSeek-V3, a 671B-parameter Mixture-of-Experts model. The author cracked the warmup model with weight amplification, which surfaced a trigger related to the golden ratio. The trigger was later confirmed to be a targeted decimal point request that had leaked because of how the model was fine-tuned. For the large models, the author turned to white-box methods after activation-based and prompting approaches failed. M1 was cracked by computing weight differences, performing SVD on them, projecting the result onto the vocabulary, and using Gemini to identify Conway's Game of Life, presented as raw grids, as the trigger. M2 and M3 remained unsolved, although the author considered hypotheses such as tool usage. The challenge showed how hard it is to crack backdoors in large models with limited resources, and how much a systematic toolkit would help.
Key takeaway
AI security engineers investigating LLM vulnerabilities should prioritize white-box analysis of model weights over activation probing, especially for large models. A useful toolkit includes SVD on weight differences, to identify which components were modified, and coherence heat maps, to identify which layers interact. This approach cracked one 671B MoE model (M1), which suggests that direct weight inspection combined with LLM-assisted pattern recognition is an effective way to uncover hidden triggers and backdoors.
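The summary does not define how the coherence heat maps were computed, so the following is only one plausible reading, offered as a minimal sketch. It assumes "coherence" means the absolute cosine similarity between the top singular directions of each layer's weight difference. The function name `coherence_matrix` and the toy matrices are hypothetical, not taken from the original article.

```python
import numpy as np

def coherence_matrix(deltas):
    """Pairwise |cosine| between the top left-singular vectors of each delta.

    deltas: list of (d_model, d_in) weight-difference matrices, one per layer.
    Returns an (n_layers, n_layers) matrix with values in [0, 1].
    ASSUMPTION: this definition of "coherence" is illustrative only.
    """
    directions = []
    for delta in deltas:
        u, _, _ = np.linalg.svd(delta, full_matrices=False)
        directions.append(u[:, 0])  # unit-norm dominant output direction
    directions = np.stack(directions)
    # Singular vectors have arbitrary sign, so compare by absolute value.
    return np.abs(directions @ directions.T)

# Toy data: layers 0 and 1 share an output direction, layer 2 does not.
rng = np.random.default_rng(0)
e0, e1 = np.eye(8)[0], np.eye(8)[1]
toy_deltas = [
    np.outer(e0, rng.normal(size=4)),
    2.0 * np.outer(e0, rng.normal(size=4)),
    np.outer(e1, rng.normal(size=4)),
]
print(np.round(coherence_matrix(toy_deltas), 3))
```

Under this reading, bright off-diagonal cells would mark pairs of layers whose modifications point in the same direction. The resulting matrix could be drawn as a heat map with any plotting library.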
Key insights
In the author's experience, cracking LLM backdoors took white-box analysis of weight modifications combined with iterative prompting, especially for large models.
Principles
- Backdoors can leak due to specific fine-tuning effects.
- Weight analysis proved more effective than activation probing on the large models.
- LLMs can assist in pattern detection from token projections.
Method
The method computes weight differences, decomposes them with SVD, projects the resulting directions onto the vocabulary, ranks the tokens, and uses a second LLM to generate and test candidate prompts.
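The first four steps can be sketched on toy data as follows. This is an illustration under stated assumptions, not the author's code: it assumes access to one matching weight matrix from an unmodified reference model and from the backdoored model, plus an unembedding matrix. The name `rank_tokens_from_delta`, the shapes, and the six-word vocabulary are all hypothetical.

```python
import numpy as np

def rank_tokens_from_delta(w_base, w_tuned, unembed, vocab,
                           n_directions=1, top_k=3):
    """Rank tokens by alignment with the top singular directions of a delta.

    w_base, w_tuned: (d_model, d_in) matrices that write into the residual
                     stream (ASSUMPTION: same shape in both models).
    unembed:         (vocab_size, d_model) unembedding matrix.
    vocab:           list of vocab_size token strings.
    """
    delta = w_tuned - w_base
    # Left singular vectors live in the output (d_model) space of the matrix.
    u, s, _ = np.linalg.svd(delta, full_matrices=False)
    ranked = []
    for i in range(n_directions):
        scores = unembed @ u[:, i]
        # The sign of a singular vector is arbitrary, so rank by magnitude.
        order = np.argsort(-np.abs(scores))[:top_k]
        ranked.append((float(s[i]), [vocab[j] for j in order]))
    return ranked

# Toy data: plant a rank-1 change aligned with the token "grid".
rng = np.random.default_rng(0)
vocab = ["the", "cat", "grid", "dog", "sky", "tea"]
unembed = np.eye(len(vocab), 16)
w_base = rng.normal(size=(16, 8))
v = rng.normal(size=8)
v /= np.linalg.norm(v)
w_tuned = w_base + 3.0 * np.outer(unembed[2], v)

print(rank_tokens_from_delta(w_base, w_tuned, unembed, vocab))
```

In the workflow described above, the ranked token lists would then be handed to a second LLM, which proposes candidate trigger prompts to test against the backdoored model.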
In practice
- Use weight amplification for local, smaller models.
- Analyze attention mechanism modifications in large models.
- Employ coherence heat maps to identify interacting layers.
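Weight amplification, the first item above, can be sketched as scaling the difference between a modified model and its reference. This sketch assumes both sets of weights are available as dictionaries of arrays with matching names; `amplify_delta`, `alpha`, and the toy parameter name are hypothetical and not taken from the article.

```python
import numpy as np

def amplify_delta(base, tuned, alpha):
    """Return base + alpha * (tuned - base) for every parameter.

    base, tuned: dicts mapping parameter names to arrays of matching shape.
    alpha:       scale factor; 1.0 reproduces `tuned`, and larger values
                 exaggerate whatever the modification changed.
    """
    mismatched = set(base) ^ set(tuned)
    if mismatched:
        raise KeyError(f"parameter names differ: {sorted(mismatched)}")
    return {name: base[name] + alpha * (tuned[name] - base[name])
            for name in base}

# Toy usage with a single hypothetical parameter.
base = {"layer0.weight": np.zeros((2, 2))}
tuned = {"layer0.weight": np.array([[0.0, 0.1], [0.0, 0.0]])}
amplified = amplify_delta(base, tuned, alpha=5.0)
print(amplified["layer0.weight"])
```

The idea is that an exaggerated modification may make the hidden behavior show up on ordinary prompts. Large values of `alpha` can also degrade the model's output, so a sweep over several values is usually more informative than a single setting.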
Topics
- LLM Backdoors
- Model Vulnerability
- White-box Analysis
- Weight Analysis
- Singular Value Decomposition
- DeepSeek-V3
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Security Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.