Speeding up agentic workflows with WebSockets in the Responses API

2026-03-11 · Source: OpenAI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

OpenAI reports up to a 40% end-to-end speedup for agentic workflows in its Responses API after adding a WebSocket mode. The change targets a bottleneck that emerged as LLM inference sped up from 65 tokens per second (TPS) with models like GPT-5 to over 1,000 TPS with GPT-5.3-Codex-Spark: at those rates, API overhead accounts for a much larger share of each request. The WebSocket mode keeps a persistent connection open, caches conversation state, and reuses sampling artifacts, so the API no longer reprocesses the full conversation history on every request. Launched in April 2026 after a two-month sprint, it lets the API keep pace with faster models, including bursts of up to 4,000 TPS. Key partners such as Vercel and Cline have adopted it and reported substantial latency reductions.

Key takeaway

For MLOps Engineers and AI Product Managers building agentic applications, the Responses API's WebSocket mode is worth adopting. It can make applications up to 40% faster end to end, which means more responsive user experiences and higher throughput with models like GPT-5.3-Codex-Spark. Integrating WebSocket support gives you persistent connections and cached conversation state, so your systems keep pace as LLM inference speeds rise.

Key insights

Persistent WebSocket connections cut API overhead for agentic LLM workflows by caching conversation state, yielding up to 40% faster end-to-end runs.

Principles

Optimize API overhead as inference speeds increase.
Persistent connections reduce redundant data processing.
Maintain API familiarity during protocol changes.
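
The first principle can be made concrete with a little arithmetic. The sketch below is illustrative only: the 200-token response and the 0.3-second fixed overhead are hypothetical values chosen for the example, not figures from the article. It shows why a fixed per-request cost that is negligible at 65 TPS dominates at 1,000 TPS.

```python
def overhead_share(output_tokens: int, tokens_per_second: float,
                   overhead_seconds: float) -> float:
    """Fraction of end-to-end request time spent on fixed per-request overhead."""
    generation_seconds = output_tokens / tokens_per_second
    return overhead_seconds / (overhead_seconds + generation_seconds)


# Hypothetical inputs: 200 output tokens, 0.3 s of fixed API overhead.
for tps in (65, 1000):
    share = overhead_share(200, tps, 0.3)
    print(f"{tps} TPS: overhead is {share:.0%} of request time")
```

With these assumed inputs the overhead share rises from under a tenth of the request at 65 TPS to more than half at 1,000 TPS, which is the effect the principle describes.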

Method

Implement WebSockets to maintain connection-scoped, in-memory caches of previous response states, allowing follow-up requests to reference `previous_response_id` instead of rebuilding full conversation context.
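
A toy model may help picture the method described above. This is a minimal sketch, not OpenAI's implementation: the `ConnectionCache` class, its method names, and the `resp_N` id format are all hypothetical, and model sampling is replaced by a placeholder string. Only the idea of looking up prior state by `previous_response_id` comes from the article.

```python
import itertools


class ConnectionCache:
    """Toy connection-scoped cache; it would be discarded when the connection closes."""

    def __init__(self):
        self._states = {}                  # response id -> full list of conversation items
        self._counter = itertools.count(1)

    def handle(self, new_input, previous_response_id=None):
        """Process one request, extending cached state when an id is supplied."""
        if previous_response_id is None:
            history = []                   # fresh conversation: the caller sent everything
        else:
            # A KeyError here models an id that was never cached on this connection.
            history = self._states[previous_response_id]
        output = f"reply to {new_input[-1]!r}"   # stand-in for model sampling
        response_id = f"resp_{next(self._counter)}"
        self._states[response_id] = history + list(new_input) + [output]
        return response_id, output

    def cached_items(self, response_id):
        """Return a copy of the conversation items stored for a response."""
        return list(self._states[response_id])


cache = ConnectionCache()
first_id, _ = cache.handle(["user: list the files"])
# The follow-up sends only the new tool result plus the previous id.
second_id, _ = cache.handle(["tool: a.txt b.txt"], previous_response_id=first_id)
print(len(cache.cached_items(second_id)))
```

The follow-up request carries one new item, yet the cached state for `second_id` holds the whole exchange, so nothing from the first turn has to be sent or rebuilt.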

In practice

Use `previous_response_id` for stateful API interactions.
Cache rendered tokens to skip re-tokenization.
Process only new input for safety classifiers.
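
The second and third items above can be sketched by counting work. This is an illustration under stated assumptions: whitespace splitting stands in for a real tokenizer, `BLOCKED_TERMS` stands in for a safety classifier, and every function name is hypothetical. It compares the tokens processed when each request replays the full history against the tokens processed when earlier turns are cached.

```python
BLOCKED_TERMS = {"forbidden"}  # hypothetical classifier vocabulary


def toy_tokenize(text):
    """Whitespace split standing in for a real tokenizer."""
    return text.split()


def full_resend_work(turns):
    """Tokens tokenized and classified when every request replays the whole history."""
    work = 0
    for i in range(1, len(turns) + 1):
        work += sum(len(toy_tokenize(t)) for t in turns[:i])
    return work


def incremental_work(turns):
    """Tokens processed when earlier turns are cached and only new input is handled."""
    rendered = []      # cache of tokens already rendered for earlier turns
    work = 0
    flagged = False
    for turn in turns:
        new_tokens = toy_tokenize(turn)
        # The toy classifier inspects only the new input, never the cached history.
        flagged = flagged or any(tok.lower() in BLOCKED_TERMS for tok in new_tokens)
        rendered.extend(new_tokens)
        work += len(new_tokens)
    return work, len(rendered), flagged


turns = ["run the tests", "two tests failed", "fix the first failure"]
print(full_resend_work(turns), incremental_work(turns))
```

In this toy setup the replay approach grows quadratically with the number of turns while the cached approach grows linearly, which is why long agentic loops benefit most.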

Topics

WebSockets
Responses API
Agentic Workflows
LLM Inference Optimization
Codex

Best for: MLOps Engineer, AI Product Manager, Entrepreneur, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.