Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo

Source: NVIDIA Technical Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Advanced, long

Summary

NVIDIA Dynamo has been updated to improve correctness, user experience, and performance for agentic inference workflows, building on prior work that optimized the serving architecture. This update focuses on hardening parser and API coverage, improving streaming behavior, and extracting the parser layers into reusable crates.

Key improvements include `--strip-anthropic-preamble`, which restores KV cache reuse by removing session-specific billing headers; this reduced time to first token (TTFT) by approximately 5x on a 52K-token prompt. Dynamo now also handles interleaved reasoning and tool calls correctly: it no longer reorders them or drops reasoning too aggressively, problems that previously caused a 1.9x TTFT increase. In addition, `--enable-streaming-tool-dispatch` enables immediate tool execution and streams tool-call dispatch events, which improves responsiveness.

Fidelity with the Anthropic Messages API (for Claude Code and OpenClaw) and the OpenAI Responses API (for Codex) has also improved, addressing issues such as model metadata handling and correct `input_tokens` reporting, both of which are crucial for agent behavior and context management.
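To see why a session-specific header defeats KV cache reuse, consider a minimal sketch of block-level prefix caching. This is not Dynamo's implementation: the block size, the chained hashing scheme, the token lists, and the `strip_session_header` helper are all assumptions made for illustration. The point is only that when each block's identity depends on everything before it, one changing token at the front invalidates every later block.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (assumed; real engines use larger blocks)


def block_hashes(tokens):
    """Hash each full block chained with the previous hash, so it depends on the whole prefix."""
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        chunk = "\x00".join(tokens[start:start + BLOCK_SIZE]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def reusable_blocks(cached_tokens, new_tokens):
    """Count the leading blocks of the new prompt that match the cached prompt."""
    count = 0
    for old, new in zip(block_hashes(cached_tokens), block_hashes(new_tokens)):
        if old != new:
            break
        count += 1
    return count


def strip_session_header(tokens, header_len):
    """Hypothetical stand-in for preamble stripping: drop the leading header tokens."""
    return tokens[header_len:]


system_prompt = [f"sys{i}" for i in range(12)]  # identical across sessions
request_a = ["billing", "session", "abc123", ";"] + system_prompt
request_b = ["billing", "session", "xyz789", ";"] + system_prompt

with_header = reusable_blocks(request_a, request_b)
without_header = reusable_blocks(
    strip_session_header(request_a, 4), strip_session_header(request_b, 4)
)
print(with_header, without_header)  # prints "0 3"
```

With the header in place, the first block differs between sessions and nothing downstream can be reused; with it removed, all three blocks of the shared system prompt match.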

Key takeaway

For AI Architects and NLP Engineers deploying agentic models, ensuring prompt stability and accurate reasoning/tool parsing is paramount for performance and correctness. Configure NVIDIA Dynamo with `--strip-anthropic-preamble` to significantly reduce TTFT by enabling KV cache reuse, and utilize `--enable-streaming-tool-dispatch` to enhance user experience by allowing immediate tool execution. Pay close attention to model-specific reasoning replay policies and API fidelity to avoid silent malformation or dropped context, which can degrade agent performance and lead to incorrect behavior.
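The responsiveness benefit of streaming tool dispatch can be illustrated with a small sketch. The event names (`tool_start`, `tool_args`, `tool_end`, `tool_dispatch`), the `read_file` tool, and the harness loop below are hypothetical and do not reflect Dynamo's wire format; the sketch only shows the general idea of emitting a dispatch event the moment a tool call's arguments are complete, rather than after the whole response has finished.

```python
import json


def stream_with_dispatch(events):
    """Re-emit events, adding a dispatch event as soon as each tool call closes."""
    name, fragments = None, []
    for kind, payload in events:
        if kind == "tool_start":
            name, fragments = payload, []
        elif kind == "tool_args":
            fragments.append(payload)  # arguments arrive as partial JSON text
        elif kind == "tool_end":
            arguments = json.loads("".join(fragments))
            yield ("tool_dispatch", {"name": name, "arguments": arguments})
            name, fragments = None, []
        else:
            yield (kind, payload)


def read_file(path):
    """Hypothetical tool used only for this example."""
    return f"<contents of {path}>"


TOOLS = {"read_file": read_file}

events = [
    ("text", "Reading the file. "),
    ("tool_start", "read_file"),
    ("tool_args", '{"pa'),
    ("tool_args", 'th": "a.txt"}'),
    ("tool_end", None),
    ("text", "Then I will summarize."),
]

log = []
for kind, payload in stream_with_dispatch(events):
    if kind == "tool_dispatch":
        # The harness can run the tool while the model is still streaming.
        log.append(("tool_result", TOOLS[payload["name"]](**payload["arguments"])))
    else:
        log.append((kind, payload))
```

In this run the tool result is recorded before the trailing text arrives, which is the behavior a harness would rely on to overlap tool execution with generation.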

Key insights

Optimizing agentic inference requires precise prompt stability, correct reasoning/tool parsing, and efficient streaming.

Method

Dynamo uses specific flags like `--strip-anthropic-preamble` and `--enable-streaming-tool-dispatch`, alongside dedicated reasoning and tool-call parsers, to manage prompt stability, reasoning replay, and streaming behavior for agentic workflows.
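A minimal sketch of order-preserving parsing for interleaved reasoning and tool calls follows. The `<think>` and `<tool_call>` tags and the JSON payload shape are assumptions chosen for illustration; actual models use their own formats, and Dynamo's parsers are not shown here. What the sketch demonstrates is that segments are emitted in the order they appear, so reasoning is neither moved ahead of tool calls nor dropped.

```python
import json
import re

# Assumed tag format; real models differ and need model-specific parsers.
SEGMENT = re.compile(
    r"<think>(.*?)</think>|<tool_call>(.*?)</tool_call>", re.DOTALL
)


def parse_interleaved(text):
    """Split model output into ordered (kind, value) items without reordering."""
    items, pos = [], 0
    for match in SEGMENT.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            items.append(("text", plain))
        if match.group(1) is not None:
            items.append(("reasoning", match.group(1).strip()))
        else:
            items.append(("tool_call", json.loads(match.group(2))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        items.append(("text", tail))
    return items


raw = (
    "<think>Need the file list.</think>"
    '<tool_call>{"name": "ls", "arguments": {}}</tool_call>'
    "<think>Now read one file.</think>"
    '<tool_call>{"name": "cat", "arguments": {"path": "a.txt"}}</tool_call>'
)
parsed = parse_interleaved(raw)
kinds = [kind for kind, _ in parsed]
print(kinds)  # prints ['reasoning', 'tool_call', 'reasoning', 'tool_call']
```

Replaying the items in this same order on the next turn keeps the rendered prompt identical to what the model originally produced, which is what allows the cached prefix to match.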

Best for: NLP Engineer, AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.