Looking for backdoors in Jane Street LLMs

· Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The article recounts the author's attempt at the Jane Street LLM backdoor challenge, which asked participants to find hidden triggers in four models: a warmup model based on Qwen2.5-7B-Instruct (about 7.6B parameters) and three large models (M1, M2, M3) based on DeepSeek-V3 (a 671B-parameter Mixture-of-Experts). The author cracked the warmup model using weight amplification, which surfaced a trigger related to the golden ratio. The trigger was later confirmed to be a targeted decimal point request that leaked because of the specific fine-tuning used. For the large models, the author turned to white-box methods after activation-based and prompting approaches failed. M1 was cracked by analyzing weight differences, performing SVD, projecting to the vocabulary, and using Gemini to identify Conway's Game of Life as the trigger when given raw grids. M2 and M3 remained unsolved, though hypotheses such as tool usage were considered. The challenge highlighted how hard it is to crack backdoors in large models with limited resources, and the need for systematic toolkits.
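The summary does not define "weight amplification". One plausible reading, assumed here, is to scale the fine-tuning delta so that the backdoored behavior becomes exaggerated and easier to spot. The sketch below uses plain NumPy arrays in place of real checkpoints; the function name `amplify` and the factor `alpha` are hypothetical and do not come from the article.

```python
import numpy as np

def amplify(base, tuned, alpha):
    """Return base + alpha * (tuned - base) for every parameter.

    base, tuned: dicts mapping parameter names to arrays of equal shape.
    alpha = 1 reproduces the tuned weights; alpha > 1 exaggerates the
    change introduced by fine-tuning (the assumed meaning of
    "weight amplification" here).
    """
    return {name: base[name] + alpha * (tuned[name] - base[name])
            for name in base}

# Toy stand-ins for two checkpoints (not real model weights).
base = {"w": np.zeros((2, 2))}
tuned = {"w": np.full((2, 2), 0.1)}

amplified = amplify(base, tuned, alpha=5.0)
print(amplified["w"])
```

With a real model, the same arithmetic would be applied tensor by tensor before loading the result for generation. How large `alpha` can go before outputs degrade is an empirical question.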

Key takeaway

AI security engineers investigating LLM backdoors should prioritize white-box analysis of model weights over activation probing, especially for large models. A useful toolkit includes SVD on weight differences and coherence heat maps to identify modified components and interacting layers. This approach cracked one of the three 671B MoE models (M1), which suggests that direct weight inspection combined with LLM-assisted pattern recognition is an effective way to uncover hidden triggers and backdoors.

Key insights

In the author's experience, cracking LLM backdoors relied on white-box analysis of weight modifications combined with iterative prompting, especially for large models.

Method

The proposed method computes the weight differences between the fine-tuned and base models, performs SVD on them, projects the resulting singular vectors to the vocabulary, ranks the tokens, and uses another LLM to generate and test candidate prompts.
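The first four steps can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the author's code: the matrices are random stand-ins, the weight matrix is assumed to write into the residual stream so that its left singular vectors can be multiplied by an unembedding matrix, and the names `rank_tokens_from_diff`, `unembed`, and `vocab` are hypothetical. The final step, handing the ranked tokens to another LLM to propose prompts, is omitted.

```python
import numpy as np

def rank_tokens_from_diff(w_base, w_tuned, unembed, vocab, n_dirs=1, top_k=3):
    """Rank tokens by alignment with the top singular directions of a weight diff.

    w_base, w_tuned: (d_model, d_in) weight matrices (assumed to write
                     into the residual stream).
    unembed:         (vocab_size, d_model) unembedding matrix.
    """
    delta = w_tuned - w_base                          # step 1: weight difference
    u, s, _ = np.linalg.svd(delta, full_matrices=False)  # step 2: SVD
    ranked = []
    for i in range(n_dirs):
        scores = unembed @ u[:, i]                    # step 3: project to vocabulary
        # Singular vectors have arbitrary sign, so rank by magnitude.
        order = np.argsort(-np.abs(scores))[:top_k]   # step 4: rank tokens
        ranked.append([(vocab[j], float(scores[j])) for j in order])
    return s[:n_dirs], ranked

# Toy setup: plant a rank-1 update that points at token 3.
rng = np.random.default_rng(0)
d_model, d_in, vocab_size = 16, 8, 10
vocab = [f"tok_{j}" for j in range(vocab_size)]
unembed = rng.standard_normal((vocab_size, d_model))
unembed /= np.linalg.norm(unembed, axis=1, keepdims=True)  # unit-norm rows

w_base = rng.standard_normal((d_model, d_in))
v = rng.standard_normal(d_in)
v /= np.linalg.norm(v)
noise = 0.01 * rng.standard_normal((d_model, d_in))
w_tuned = w_base + 3.0 * np.outer(unembed[3], v) + noise

singular_values, ranked = rank_tokens_from_diff(w_base, w_tuned, unembed, vocab)
print(singular_values, ranked[0])
```

In this toy case the planted token ranks first because the update was built from its unembedding row. On a real model the ranked tokens are only hints, which is presumably why the method passes them to a second LLM to turn into testable prompts.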

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Security Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.