Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo
Summary
This update to NVIDIA Dynamo targets correctness, user experience, and performance for agentic inference workflows, building on prior work that optimized the serving architecture. It hardens parser and API coverage, improves streaming behavior, and extracts the parser layers into reusable crates. The `--strip-anthropic-preamble` flag removes session-specific billing headers from the prompt, which restores KV cache reuse and reduced time to first token (TTFT) by approximately 5x on a 52K-token prompt. Dynamo now handles interleaved reasoning and tool calls correctly: it no longer reorders them or drops reasoning too aggressively, behavior that previously caused a 1.9x TTFT increase. The `--enable-streaming-tool-dispatch` flag streams tool-call dispatch events so that tools can be executed immediately, which improves responsiveness. Fidelity with the Anthropic Messages API (for Claude Code and OpenClaw) and the OpenAI Responses API (for Codex) has also improved, including fixes to model metadata handling and `input_tokens` reporting, both of which affect agent behavior and context management.
Key takeaway
For AI architects and NLP engineers deploying agentic models, prompt stability and accurate parsing of reasoning and tool calls are essential for both performance and correctness. Configure NVIDIA Dynamo with `--strip-anthropic-preamble` to reduce TTFT by enabling KV cache reuse, and use `--enable-streaming-tool-dispatch` so that tools can start executing immediately. Pay close attention to model-specific reasoning replay policies and API fidelity: silently malformed messages or dropped context can degrade agent performance and lead to incorrect behavior.
Key insights
Optimizing agentic inference requires precise prompt stability, correct reasoning/tool parsing, and efficient streaming.
Principles
- Prompt stability is critical for KV cache reuse.
- Reasoning replay is model- and turn-dependent.
- Streaming tool dispatch improves responsiveness.
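The first principle can be illustrated with a toy model of prefix caching. This is a minimal sketch, not Dynamo's implementation: it assumes a cache keyed on chained hashes of fixed-size token blocks, whitespace tokenization, and a made-up `x-billing-session:` header line (the article does not give the real header format). All function names are hypothetical.

```python
import hashlib
import re

BLOCK_SIZE = 4  # tokens per cache block (toy value, chosen for illustration)

# Hypothetical session-specific line; the real header format is an assumption.
PREAMBLE = re.compile(r"^x-billing-session: \S+\n", re.MULTILINE)


def strip_preamble(prompt: str) -> str:
    """Remove the per-session header so the prompt prefix is stable."""
    return PREAMBLE.sub("", prompt)


def block_hashes(prompt: str) -> list[str]:
    """Hash each full block of tokens, chained with all preceding blocks."""
    tokens = prompt.split()  # whitespace split stands in for a real tokenizer
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    running = hashlib.sha256()
    hashes = []
    for i in range(0, full, BLOCK_SIZE):
        running.update((" ".join(tokens[i:i + BLOCK_SIZE]) + "\n").encode())
        hashes.append(running.copy().hexdigest())
    return hashes


def reusable_blocks(cached: list[str], new: list[str]) -> int:
    """Count leading blocks of the new prompt that match the cached prompt."""
    count = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        count += 1
    return count


system = "You are a coding agent. Follow the repository rules and use tools when needed."
turn_a = "x-billing-session: abc123\n" + system
turn_b = "x-billing-session: xyz789\n" + system

raw_hits = reusable_blocks(block_hashes(turn_a), block_hashes(turn_b))
clean_hits = reusable_blocks(
    block_hashes(strip_preamble(turn_a)), block_hashes(strip_preamble(turn_b))
)
print(raw_hits, clean_hits)  # prints: 0 3
```

Because each block hash depends on everything before it, a single changing value at the start of the prompt invalidates every later block, even though the rest of the prompt is identical.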
Method
Dynamo uses specific flags like `--strip-anthropic-preamble` and `--enable-streaming-tool-dispatch`, alongside dedicated reasoning and tool-call parsers, to manage prompt stability, reasoning replay, and streaming behavior for agentic workflows.
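To make the parsing requirement concrete, here is a minimal sketch of an order-preserving parser for interleaved output. It is illustrative only: the `<think>` and `<tool_call>` tags, the `Segment` type, and `parse_interleaved` are assumptions for the example, since real models use model-specific formats and Dynamo's parsers are not shown in the article.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # "reasoning", "tool_call", or "text"
    content: str


# Hypothetical tag names; the backreference \1 requires a matching close tag.
TAGGED = re.compile(r"<(think|tool_call)>(.*?)</\1>", re.DOTALL)


def parse_interleaved(output: str) -> list[Segment]:
    """Split model output into segments, keeping the original order."""
    segments = []
    pos = 0
    for match in TAGGED.finditer(output):
        plain = output[pos:match.start()].strip()
        if plain:
            segments.append(Segment("text", plain))
        kind = "reasoning" if match.group(1) == "think" else "tool_call"
        segments.append(Segment(kind, match.group(2).strip()))
        pos = match.end()
    tail = output[pos:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


raw = (
    "<think>Need the file list first.</think>"
    '<tool_call>{"name": "ls", "arguments": {"path": "."}}</tool_call>'
    "<think>Then read the README.</think>"
    '<tool_call>{"name": "cat", "arguments": {"path": "README.md"}}</tool_call>'
    "Done."
)
segments = parse_interleaved(raw)
print([s.kind for s in segments])
```

Emitting one ordered list, rather than separate lists of reasoning and tool calls, is what keeps each reasoning span attached to the call it preceded when the turn is replayed.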
In practice
- Strip unstable headers to improve KV cache hits.
- Ensure parsers correctly handle interleaved reasoning and tool calls.
- Enable streaming tool dispatch for faster execution.
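The last point can be sketched as a generator that emits a dispatch event as soon as a tool call is complete in the stream, instead of waiting for the whole response. This is a hedged illustration, not the behavior of `--enable-streaming-tool-dispatch` itself: the `<tool_call>` tag format, the event shape, and the `stream_dispatch` name are all assumptions.

```python
import json
from typing import Iterable, Iterator

OPEN, CLOSE = "<tool_call>", "</tool_call>"  # hypothetical delimiters


def stream_dispatch(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield a dispatch event the moment a full tool call has arrived."""
    buffer = ""
    for index, chunk in enumerate(chunks):
        buffer += chunk  # tags may be split across chunks, so accumulate
        while True:
            start = buffer.find(OPEN)
            end = buffer.find(CLOSE, start + len(OPEN)) if start != -1 else -1
            if end == -1:
                break  # no complete call yet; wait for more chunks
            payload = buffer[start + len(OPEN):end]
            yield {"chunk_index": index, "call": json.loads(payload)}
            # This sketch discards any plain text that preceded the call.
            buffer = buffer[end + len(CLOSE):]


chunks = [
    "<tool_",
    'call>{"name": "ls", ',
    '"arguments": {}}</tool_call>',
    " some trailing text ",
    '<tool_call>{"name": "pwd", "arguments": {}}</tool_call>',
]
events = list(stream_dispatch(chunks))
for event in events:
    print(event["chunk_index"], event["call"]["name"])
```

In this toy stream the first call becomes dispatchable at chunk index 2, two chunks before the response ends, which is the window a harness can use to start running the tool early.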
Topics
- NVIDIA Dynamo
- Agentic Workflows
- KV Cache Optimization
- Streaming Tool Dispatch
- Reasoning Parsing
Best for: NLP Engineer, AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.