Speeding up agentic workflows with WebSockets in the Responses API
Summary
OpenAI added a WebSocket mode to its Responses API that speeds up agentic workflows by up to 40% end to end. The change targets a bottleneck that emerged as LLM inference sped up from 65 tokens per second (TPS) with models like GPT-5 to over 1,000 TPS with GPT-5.3-Codex-Spark: once generation is that fast, API overhead accounts for a much larger share of each request. A persistent connection lets the API cache conversation state and reuse sampling artifacts, so it no longer reprocesses the full conversation history on every request. Launched in April 2026 after a two-month sprint, the mode lets the API keep pace with faster models, including bursts of up to 4,000 TPS. Key partners such as Vercel and Cline have adopted it and reported substantial latency reductions.
Key takeaway
For MLOps engineers and AI product managers building agentic applications, the Responses API's WebSocket mode is worth adopting. It can make applications up to 40% faster end to end, which translates into more responsive user experiences and higher throughput with models like GPT-5.3-Codex-Spark. Persistent connections and cached conversation state keep API overhead from becoming the limiting factor as LLM inference speeds continue to rise.
Key insights
Persistent WebSocket connections drastically reduce API overhead for agentic LLM workflows by caching conversation state.
Principles
- Optimize API overhead as inference speeds increase.
- Persistent connections reduce redundant data processing.
- Maintain API familiarity during protocol changes.
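To make the first principle concrete, the sketch below estimates what share of a request's wall-clock time goes to fixed API overhead at the two inference speeds mentioned in the summary. The 500-token output and the 0.3-second overhead are illustrative assumptions, not figures from the original article; only the 65 TPS and 1,000 TPS rates come from the source.

```python
def overhead_share(output_tokens: int, tokens_per_second: float,
                   overhead_seconds: float) -> float:
    """Return the fraction of total request time spent on fixed overhead."""
    generation_seconds = output_tokens / tokens_per_second
    return overhead_seconds / (generation_seconds + overhead_seconds)


# Assumed values for illustration only (not from the source article).
OUTPUT_TOKENS = 500
OVERHEAD_SECONDS = 0.3

slow = overhead_share(OUTPUT_TOKENS, 65, OVERHEAD_SECONDS)     # roughly 3.8%
fast = overhead_share(OUTPUT_TOKENS, 1000, OVERHEAD_SECONDS)   # 37.5%

print(f"65 TPS:    overhead is {slow:.1%} of the request")
print(f"1,000 TPS: overhead is {fast:.1%} of the request")
```

Under these assumptions the same fixed overhead grows from a rounding error to over a third of the request, which is why overhead that was tolerable at 65 TPS becomes the bottleneck at 1,000 TPS.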
Method
Implement WebSockets to maintain connection-scoped, in-memory caches of previous response states, allowing follow-up requests to reference `previous_response_id` instead of rebuilding full conversation context.
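The idea can be sketched with a toy model. The `Connection` class below is a hypothetical stand-in for one persistent connection, not OpenAI's implementation or the real client API: it keeps an in-memory cache keyed by response ID, and a follow-up that passes `previous_response_id` extends the cached state instead of resending the full history. All class, method, and parameter names other than `previous_response_id` are assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseState:
    """Conversation state retained after a response completes."""
    response_id: str
    history: list[str]


class Connection:
    """Hypothetical persistent connection; its cache lives only as long as it does."""

    def __init__(self) -> None:
        self._cache: dict[str, ResponseState] = {}
        self._ids = itertools.count(1)

    def create_response(
        self,
        new_input: str,
        previous_response_id: Optional[str] = None,
        full_history: Optional[list[str]] = None,
    ) -> tuple[str, str, bool]:
        """Return (response_id, output, cache_hit)."""
        cached = self._cache.get(previous_response_id) if previous_response_id else None
        if cached is not None:
            base = cached.history            # reuse state held by this connection
        else:
            base = list(full_history or [])  # fall back to rebuilding the context
        output = f"reply to: {new_input}"    # placeholder for model output
        response_id = f"resp_{next(self._ids)}"
        self._cache[response_id] = ResponseState(
            response_id, base + [new_input, output]
        )
        return response_id, output, cached is not None


conn = Connection()
first_id, _, first_hit = conn.create_response("list the repo files")
second_id, _, second_hit = conn.create_response(
    "now run the tests", previous_response_id=first_id
)
print(first_hit, second_hit)
```

Because the cache in this sketch is scoped to the connection object, a different `Connection` that receives the same `previous_response_id` misses the cache and has to fall back to a full history.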
In practice
- Use `previous_response_id` for stateful API interactions.
- Cache rendered tokens to skip re-tokenization.
- Process only new input for safety classifiers.
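The last two items can be illustrated with a small simulation. The code below is a minimal sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, a blocklist lookup stands in for a safety classifier, and every name (`TurnProcessor`, `BLOCKLIST`, `stateless_tokenize_cost`) is hypothetical. It only shows how much work is avoided when rendered tokens are cached and each turn's checks run on new input alone.

```python
BLOCKLIST = {"forbidden"}  # toy stand-in for a safety classifier's rules


class TurnProcessor:
    """Keeps rendered tokens across turns so earlier turns are never reprocessed."""

    def __init__(self) -> None:
        self.rendered_tokens: list[str] = []
        self.tokens_tokenized = 0   # tokens produced by tokenization so far
        self.tokens_classified = 0  # tokens inspected by the classifier so far

    def _tokenize(self, text: str) -> list[str]:
        tokens = text.split()       # toy tokenizer: whitespace split
        self.tokens_tokenized += len(tokens)
        return tokens

    def _is_flagged(self, tokens: list[str]) -> bool:
        self.tokens_classified += len(tokens)
        return any(token.lower() in BLOCKLIST for token in tokens)

    def add_turn(self, text: str) -> bool:
        """Tokenize and classify only the new text, then append to the cache."""
        new_tokens = self._tokenize(text)
        flagged = self._is_flagged(new_tokens)
        self.rendered_tokens.extend(new_tokens)
        return flagged


def stateless_tokenize_cost(turns: list[str]) -> int:
    """Tokens tokenized when every request re-renders the whole history."""
    total = 0
    for i in range(1, len(turns) + 1):
        total += sum(len(turn.split()) for turn in turns[:i])
    return total


turns = ["list files in repo", "run the tests"]
processor = TurnProcessor()
flags = [processor.add_turn(turn) for turn in turns]

print(processor.tokens_tokenized)      # incremental cost
print(stateless_tokenize_cost(turns))  # cost when history is rebuilt each turn
```

In this toy run the incremental path tokenizes 7 tokens while the stateless path tokenizes 11, and the gap widens with every additional turn because the stateless cost grows quadratically with conversation length.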
Topics
- WebSockets
- Responses API
- Agentic Workflows
- LLM Inference Optimization
- Codex
Best for: MLOps Engineer, AI Product Manager, Entrepreneur, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.