Agent trajectories as programs: fingerprinting and programming coding-agent behavior

2026-06-16 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This work introduces methods for comparing coding agents by how they work rather than only by whether they succeed, defining "behavioral fingerprints" as an agent's identifiable procedural habits. Analyzing ten agents across four scaffolds (SWE-agent, Agentless, DARS, Moatless) and model families including GPT, Claude, DeepSeek, and Qwen, the authors find that agents can be identified from these fingerprints with 85.7% accuracy. Procedural representations are built with an emergent vocabulary induction technique and applied to the SWE-Bench dataset. Models from similar release periods behave similarly, as do distilled pairs: a student model and its teacher show a Jensen-Shannon divergence of 0.25. The study also releases ProcGrep, a library for auditing agent traces that offers deterministic, programmable search over agent trajectories and outperforms LLMs at episodic search.

Key takeaway

For AI engineers shaping coding agents, procedural fingerprints inform model selection and configuration. Tools like ProcGrep let you audit agent traces with deterministic search and fine-grained behavioral analysis. That makes it possible to programmatically define and reward desired problem-solving styles, optimize for cost efficiency, and improve task-aware model routing, moving beyond simple success rates to holistic agent evaluation.

Key insights

Agent problem-solving styles are unique "fingerprints" recoverable from procedural traces, enabling new evaluation and programming methods.

Principles

Agents are identifiable by procedural habits at 85.7% accuracy.
Distillation transfers procedural style: a student model and its teacher show a Jensen-Shannon divergence of 0.25.
Procedural diversity does not correlate with agent capabilities.
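
To make the divergence figure concrete, here is a minimal sketch of how a Jensen-Shannon divergence between two agents' action distributions could be computed. The action names, the example traces, and the use of base-2 logarithms (which bounds the value between 0 and 1) are assumptions for illustration; the summary does not state how the authors build their distributions or which log base they use.

```python
import math
from collections import Counter

def action_distribution(trace, vocab):
    """Relative frequency of each vocabulary item in a trace (hypothetical helper)."""
    counts = Counter(trace)
    total = sum(counts[a] for a in vocab)
    if total == 0:
        raise ValueError("trace contains no items from the vocabulary")
    return [counts[a] / total for a in vocab]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up traces, not data from the paper.
vocab = ["search", "open", "edit", "test"]
teacher = ["search", "open", "edit", "test", "edit", "test"]
student = ["search", "search", "open", "edit", "test"]
divergence = js_divergence(
    action_distribution(teacher, vocab),
    action_distribution(student, vocab),
)
```

Identical distributions score 0 and distributions with no shared actions score 1 under this convention, so a lower value means the two agents spend their steps in more similar proportions.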

Method

Build procedural representations through emergent vocabulary induction with byte-pair encoding (BPE), and assess vocabulary stability with V-measure (K=192), which combines homogeneity and completeness.
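
The BPE step can be sketched as repeated merging of the most frequent adjacent pair of actions, so that recurring multi-step habits become single vocabulary items. This is a generic BPE sketch over action sequences, not the authors' implementation; the action names, the `+` joiner, the `min_count` cutoff, and the tie-breaking rule are all assumptions.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return (pair, count) for the most common adjacent pair, or None."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return None
    # Deterministic tie-break: highest count, then lexicographically smallest pair.
    return min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))

def merge_pair(seq, pair, merged):
    """Replace each non-overlapping occurrence of pair with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def induce_vocabulary(sequences, num_merges, min_count=2):
    """Run up to num_merges BPE merges; stop early when pairs become rare."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < min_count:
            break
        pair = best[0]
        merged = pair[0] + "+" + pair[1]  # hypothetical naming scheme
        seqs = [merge_pair(s, pair, merged) for s in seqs]
        merges.append(merged)
    return merges, seqs

# Made-up traces, not data from the paper.
traces = [
    ["search", "open", "edit", "test"],
    ["search", "open", "edit", "test", "submit"],
    ["search", "open", "search", "open", "edit"],
]
merges, encoded = induce_vocabulary(traces, num_merges=2)
```

With these toy traces, the first merge joins `search` and `open`, and the second extends that unit with `edit`, which illustrates how longer procedural units emerge from frequency alone.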

In practice

Use ProcGrep for deterministic, programmable search over agent traces.
Implement procedural rewards for fine-grained agent behavior shaping.
Monitor agent behavior for task-aware model routing and cost analysis.
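
The ideas above can be illustrated with a small sketch of deterministic pattern search and a rule-based procedural reward over action traces. This does not use or reproduce ProcGrep's API, which is not described in this summary; every function name, action label, and reward value below is hypothetical.

```python
def find_pattern(trace, pattern):
    """Start indices where pattern occurs contiguously in trace."""
    n, m = len(trace), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if trace[i:i + m] == pattern]

def search_traces(traces, pattern):
    """Map each trace name to its match positions, skipping traces with no match."""
    hits = {}
    for name, trace in traces.items():
        positions = find_pattern(trace, pattern)
        if positions:
            hits[name] = positions
    return hits

def procedural_reward(trace, bonus=1.0, penalty=-1.0):
    """Reward running a test after the final edit; penalize skipping it.

    Traces with no edit at all are scored 0.0 (an arbitrary choice).
    """
    if "edit" not in trace:
        return 0.0
    last_edit = len(trace) - 1 - trace[::-1].index("edit")
    return bonus if "test" in trace[last_edit + 1:] else penalty

# Made-up traces, not data from the paper.
traces = {
    "agent_a": ["search", "open", "edit", "test", "submit"],
    "agent_b": ["search", "open", "edit", "submit"],
    "agent_c": ["open", "edit", "test", "edit", "test", "submit"],
}
hits = search_traces(traces, ["edit", "test"])
rewards = {name: procedural_reward(t) for name, t in traces.items()}
```

Because the search is plain sequence matching, the same query over the same traces always returns the same positions, which is the property that distinguishes this style of audit from asking an LLM to recall what happened in a trajectory.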

Topics

Agent Behavior Analysis
Procedural Fingerprinting
Coding Agents
LLM Evaluation
ProcGrep
Behavioral Programming

Code references

hamidahoderinwale/procgrep

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.