I Tried Four Smarter Ways to Select Positions in GCG.

2026-05-06 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, long

Summary

A deep dive into adversarial attacks against large language models (LLMs) using GCG (Greedy Coordinate Gradient) found that four "smarter" position selection strategies, including attention-based heuristics and learned contextual bandits, lowered attack success rates by 32 to 50 percentage points compared to vanilla GCG. The study, run on the Qwen2.5-3B-Instruct model across 50 AdvBench prompts with 500 optimization steps, found that vanilla GCG achieved a 78% jailbreak rate. The four experimental strategies, though well motivated, failed for two main reasons: the heuristics used a signal that points in the wrong direction (high-attention positions are load-bearing), and the learned policies faced a credit assignment problem that prevented effective learning. The core finding is that GCG's success stems from an implicit "all-coordinates competition": at each step, 512 candidate token replacements spread across all 20 suffix positions are evaluated together and the single best one is kept, rather than a randomly selected one.
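The all-coordinates competition described above can be sketched in a few lines. This is a minimal toy, not the article's implementation: `toy_loss`, `VOCAB_SIZE`, and the uniform token sampling are hypothetical stand-ins (real GCG scores candidates with a model forward pass and draws replacement tokens from gradient-ranked top-k lists). Only the suffix length of 20 and the 512 candidates per step come from the summary. The sketch also keeps the current suffix when no candidate improves, which is a simplification.

```python
import random

SUFFIX_LEN = 20        # suffix positions, as stated in the summary
NUM_CANDIDATES = 512   # candidates evaluated per step, as stated in the summary
VOCAB_SIZE = 100       # hypothetical toy vocabulary size


def toy_loss(suffix, target):
    """Hypothetical stand-in for the model loss: count of mismatched positions."""
    return sum(s != t for s, t in zip(suffix, target))


def all_coordinates_step(suffix, target, rng):
    """One step: candidates are spread over all positions, the single best wins."""
    best_suffix, best_loss = list(suffix), toy_loss(suffix, target)
    for _ in range(NUM_CANDIDATES):
        pos = rng.randrange(SUFFIX_LEN)   # every candidate may touch a different position
        tok = rng.randrange(VOCAB_SIZE)   # assumption: real GCG samples from gradient top-k
        cand = suffix[:pos] + [tok] + suffix[pos + 1:]
        loss = toy_loss(cand, target)     # evaluate the candidate, do not predict it
        if loss < best_loss:
            best_suffix, best_loss = cand, loss
    return best_suffix, best_loss
```

The position that ends up being edited is decided only after every candidate has been evaluated, which is the property the summary attributes to vanilla GCG.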

Key takeaway

Research scientists developing or evaluating LLM adversarial attacks should recognize that GCG's strength lies in its all-coordinates competition, which prioritizes optimization stability. Do not replace this mechanism with pre-committed position selection, which severely degrades performance. Instead, focus on improving the quality of the token candidates generated at each step, the true bottleneck behind GCG's remaining failure cases.

Key insights

GCG's success in adversarial attacks relies on implicit all-coordinates competition, not random position selection.

Principles

Evaluation beats prediction in discrete optimization.
Optimization stability is critical, often more than search quality.
High-attention positions in adversarial suffixes are load-bearing.

Method

Four strategies (attention-only, attention-inverse, gradient-only bandit, adaptive bandit GCG) were tested. Each pre-commits to a single suffix position for token replacement, in contrast to GCG's multi-position competition.
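To make the contrast concrete, here is a hedged sketch of a pre-committed step, in which the position is fixed before any candidate is evaluated. It does not model the four selectors themselves: the position `pos` is simply passed in, and `mismatch_loss`, `VOCAB_SIZE`, and the uniform token sampling are hypothetical toy stand-ins rather than anything taken from the article or its repository.

```python
import random

VOCAB_SIZE = 100       # hypothetical toy vocabulary size
NUM_CANDIDATES = 512   # per-step candidate budget, matching the summary


def mismatch_loss(suffix, target):
    """Hypothetical stand-in for the attack loss: number of mismatched positions."""
    return sum(s != t for s, t in zip(suffix, target))


def precommitted_step(suffix, target, pos, rng):
    """Spend the whole candidate budget on one position chosen before evaluation."""
    best_suffix, best_loss = list(suffix), mismatch_loss(suffix, target)
    for _ in range(NUM_CANDIDATES):
        tok = rng.randrange(VOCAB_SIZE)                  # assumption: uniform sampling
        cand = suffix[:pos] + [tok] + suffix[pos + 1:]   # only position `pos` ever changes
        loss = mismatch_loss(cand, target)
        if loss < best_loss:
            best_suffix, best_loss = cand, loss
    return best_suffix, best_loss
```

In this toy, if the selector commits to a position whose token is already good, all 512 evaluations are spent where no improvement is possible and the step is wasted. That illustrates one way pre-commitment can hurt; it is not a reproduction of the article's measured results.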

In practice

Focus GCG improvements on token candidate quality.
Preserve all-coordinates competition in GCG variants.
Decompose optimization performance into search and stability.

Topics

GCG Adversarial Attack
LLM Safety Alignment
Position Selection Strategies
All-Coordinates Competition
Optimization Stability

Code references

CheneyX2000/AB-GCG

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.