Looking for backdoors in Jane Street LLMs

2026-05-23 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The article describes one participant's experience in the Jane Street LLM backdoor challenge, which involved finding hidden triggers in four models: a Qwen2.5-7B-Instruct warmup model (about 7.6B parameters) and three large DeepSeek-V3 models (671B-parameter Mixture-of-Experts), labeled M1, M2, and M3. The author cracked the warmup model using weight amplification and identified a trigger related to the golden ratio, later confirmed to be a targeted decimal-point request that leaked as a side effect of the fine-tuning. For the large models, the author turned to white-box methods after activation-based and prompting approaches failed. M1 was cracked by analyzing weight differences, performing SVD, and projecting the result onto the vocabulary; Gemini then helped identify Conway's Game of Life as the trigger, which fires when the model is given raw grids. M2 and M3 remained unsolved, though hypotheses such as tool usage were considered. The challenge highlighted how hard it is to crack backdoors in large models with limited resources, and the need for systematic toolkits.

Key takeaway

AI security engineers investigating LLM vulnerabilities should consider prioritizing white-box analysis of model weights over activation probing, especially for large-scale models. A useful toolkit includes SVD on weight differences and coherence heat maps to identify modified components and interacting layers. This approach cracked one of the three 671B MoE models, which suggests that direct weight inspection and LLM-assisted pattern recognition are valuable for uncovering hidden triggers, although the other two models remained unsolved.
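
The summary does not define "coherence heat map". One plausible reading, which is an assumption here and not necessarily the author's construction, is a layer-by-layer matrix showing how closely the dominant direction of each layer's weight change aligns with every other layer's. A minimal NumPy sketch under that assumption, using toy data and the hypothetical helper names `top_direction` and `coherence_matrix`:

```python
import numpy as np

def top_direction(delta):
    """Unit-norm leading left singular vector of one layer's weight delta."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, 0]

def coherence_matrix(deltas):
    """Pairwise |cosine| between the leading directions of each layer's delta.

    Assumption: every delta has shape (d_model, d_in), so all leading
    directions live in the same d_model-dimensional space. The absolute
    value is used because the sign of a singular vector is arbitrary.
    """
    dirs = np.stack([top_direction(d) for d in deltas])  # (n_layers, d_model)
    return np.abs(dirs @ dirs.T)

# Toy data (not from the article): layers 0 and 1 are modified along the
# same direction e0, layer 2 along an orthogonal direction e1.
rng = np.random.default_rng(0)
d_model, d_in = 8, 4
e0, e1 = np.eye(d_model)[0], np.eye(d_model)[1]
deltas = [
    np.outer(e0, rng.normal(size=d_in)),
    np.outer(e0, rng.normal(size=d_in)),
    np.outer(e1, rng.normal(size=d_in)),
]
heat = coherence_matrix(deltas)
print(np.round(heat, 2))
```

The matrix could be rendered with `matplotlib.pyplot.imshow(heat)`; bright off-diagonal cells would then suggest layers that were modified along a shared direction and may be interacting.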

Key insights

In this challenge, cracking the backdoors took white-box analysis of weight modifications combined with iterative prompting, particularly for the large models.

Principles

Backdoors can leak as a side effect of how the model was fine-tuned.
Weight analysis proved more effective than activation probing on the large models.
LLMs can help detect patterns in tokens obtained from vocabulary projections.

Method

The method: compute the weight differences between the backdoored model and its base, run SVD on them, project the leading singular vectors onto the vocabulary, rank the tokens, and have another LLM generate and test candidate prompts.
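
The first four steps can be sketched on a single matrix as follows. This is a minimal NumPy illustration under stated assumptions, not the author's code: the function name `top_tokens_from_delta`, the toy shapes, the identity-like unembedding, and the placeholder tokens `tok0`..`tok5` are all hypothetical, and the matrix is assumed to write into the residual stream so that its left singular vectors can be compared with unembedding rows.

```python
import numpy as np

def top_tokens_from_delta(w_base, w_tuned, unembed, vocab, rank=1, k=5):
    """Rank tokens by alignment with the leading singular directions of a delta.

    w_base, w_tuned: (d_model, d_in) weights of the same layer in both models.
    unembed:         (vocab_size, d_model) unembedding matrix.
    Returns a list of (singular_value, top_k_tokens), one entry per direction.
    """
    delta = w_tuned - w_base                      # step 1: weight difference
    u, s, _ = np.linalg.svd(delta, full_matrices=False)  # step 2: SVD
    results = []
    for i in range(rank):
        scores = unembed @ u[:, i]                # step 3: project to vocabulary
        # step 4: rank by magnitude, since singular vector signs are arbitrary
        order = np.argsort(-np.abs(scores))[:k]
        results.append((float(s[i]), [vocab[j] for j in order]))
    return results

# Toy setup (not from the article): plant a rank-1 change that writes along
# the unembedding row of "tok3" and check that the ranking recovers it.
rng = np.random.default_rng(0)
d_model, d_in, vocab_size = 16, 8, 6
vocab = [f"tok{i}" for i in range(vocab_size)]
unembed = np.eye(vocab_size, d_model)             # orthonormal rows for clarity
w_base = rng.normal(size=(d_model, d_in))
v = rng.normal(size=d_in)
w_tuned = w_base + 5.0 * np.outer(unembed[3], v)

ranked = top_tokens_from_delta(w_base, w_tuned, unembed, vocab, rank=1, k=3)
print(ranked[0][1][0])
```

In a real model the ranked token lists would be noisy, which is where the last step comes in: the lists are handed to another LLM to propose candidate trigger prompts, and each candidate is then tested against the suspect model.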

In practice

Use weight amplification on smaller models that can be run locally.
Analyze modifications to the attention mechanism in large models.
Use coherence heat maps to identify interacting layers.
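
Weight amplification is not spelled out in this summary. A common interpretation, assumed here, is to scale the fine-tuning delta so that a faint backdoor behavior becomes exaggerated and easier to observe. A minimal sketch over dictionaries of NumPy arrays, with the hypothetical function name `amplify_delta` and scale factor `alpha`:

```python
import numpy as np

def amplify_delta(base, tuned, alpha):
    """Return base + alpha * (tuned - base) for every parameter.

    base, tuned: dicts mapping parameter names to arrays of matching shape.
    alpha = 0 gives the base model, alpha = 1 the fine-tuned model, and
    alpha > 1 exaggerates whatever the fine-tuning changed.
    """
    mismatched = set(base) ^ set(tuned)
    if mismatched:
        raise KeyError(f"parameter names differ: {sorted(mismatched)}")
    return {
        name: base[name] + alpha * (tuned[name] - base[name])
        for name in base
    }

# Toy parameters (not from the article).
base = {"layer0.weight": np.zeros((2, 2))}
tuned = {"layer0.weight": np.full((2, 2), 0.1)}
amplified = amplify_delta(base, tuned, alpha=4.0)
print(amplified["layer0.weight"])
```

The same per-tensor formula would apply to a framework state dictionary, provided both checkpoints fit in memory, which is presumably why the technique is listed for local, smaller models.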

Topics

LLM Backdoors
Model Vulnerability
White-box Analysis
Weight Analysis
Singular Value Decomposition
DeepSeek-V3

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Security Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.