Amazon SageMaker AI Async Inference now supports inline request payloads

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, medium

Summary

Amazon SageMaker AI Async Inference now supports inline request payloads, allowing customers to send inference data directly in the "InvokeEndpointAsync" API request body. For payloads up to 128,000 bytes, this removes the previous requirement to upload input data to Amazon S3 before invoking the endpoint. The new "Body" parameter simplifies client-side code, removes an entire network round-trip, and reduces the operational surface area of asynchronous inference workloads. The feature is most useful for small, kilobyte-scale payloads that need longer processing times than real-time inference allows. It is available in 31 commercial AWS Regions and is backward-compatible with existing async endpoints, requiring no model or container changes. Output behavior is unchanged: results are still written to the configured S3 output location.

Key takeaway

AI/ML engineers who manage SageMaker Async Inference workloads with small payloads (up to 128,000 bytes) should update their AWS SDK and switch to the new "Body" parameter. The change simplifies client-side code, reduces latency by removing the S3 upload step, and shrinks the operational surface area. For workloads with mixed payload sizes, branch the invocation logic: send smaller inputs inline and keep using the S3 "InputLocation" for larger ones.

Key insights

SageMaker Async Inference now accepts inline payloads up to 128,000 bytes, simplifying workflows and reducing latency.

Principles

Minimize network hops for efficiency.
Reduce dependencies to simplify architecture.
Synchronous error feedback improves debugging.

Method

Update the AWS SDK (Boto3), replace the S3 upload with the "Body" parameter in the "InvokeEndpointAsync" call, then test and verify that the output still appears in S3.
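The method above can be sketched as follows. This is a minimal illustration, not the original article's code: the function name invoke_inline and the MAX_INLINE_BYTES constant are hypothetical, and the "Body" keyword is taken from the announcement, so its exact spelling and accepted types in your Boto3 version should be checked against the SDK documentation. The client argument is assumed to be a Boto3 "sagemaker-runtime" client from an SDK release that includes the feature.

```python
import json

# Inline limit stated in the announcement (bytes).
MAX_INLINE_BYTES = 128_000


def invoke_inline(client, endpoint_name, record):
    """Send a small JSON payload inline instead of uploading it to S3 first.

    `client` is assumed to be boto3.client("sagemaker-runtime").
    """
    body = json.dumps(record).encode("utf-8")
    if len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"payload is {len(body)} bytes; inline limit is {MAX_INLINE_BYTES}"
        )
    # Assumption: the new parameter is exposed as `Body`, as named in the
    # announcement. No InputLocation is passed on this path.
    response = client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    # Output behavior is unchanged: the result is written to S3, and the
    # response reports where it will appear.
    return response["OutputLocation"]
```

To verify the change, poll or subscribe to notifications for the returned S3 location, exactly as with an S3-based invocation.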

In practice

Use "Body" for payloads up to 128,000 bytes.
Use "InputLocation" for larger payloads.
Branch logic for mixed payload sizes.
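The branching described above might look like the sketch below. The function name invoke_async, the bucket and key arguments, and the MAX_INLINE_BYTES constant are hypothetical; the "Body" keyword follows the announcement and is an assumption about the SDK surface. The runtime and s3 arguments are assumed to be Boto3 "sagemaker-runtime" and "s3" clients.

```python
# Inline limit stated in the announcement (bytes).
MAX_INLINE_BYTES = 128_000


def invoke_async(runtime, s3, endpoint_name, payload, bucket, key,
                 content_type="application/octet-stream"):
    """Invoke an async endpoint, choosing inline or S3 input by payload size.

    `payload` is a bytes object. `bucket` and `key` are only used when the
    payload is too large to send inline.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        # Small input: send it in the request body (assumed `Body` parameter).
        source = {"Body": payload}
    else:
        # Large input: upload first, then point the endpoint at the object.
        s3.put_object(Bucket=bucket, Key=key, Body=payload)
        source = {"InputLocation": f"s3://{bucket}/{key}"}
    return runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        ContentType=content_type,
        **source,
    )
```

Keeping the size check in one place means callers do not need to know which path was taken; both return the same response shape with the S3 output location.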

Topics

Amazon SageMaker
Asynchronous Inference
AWS SDK Boto3
Payload Management
Cloud Architecture
API Integration

Code references

aws-samples/sagemaker-genai-hosting-examples

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.