Amazon SageMaker AI in 2025, a year in review part 1: Flexible Training Plans and improvements to price performance for inference workloads
Summary
Amazon SageMaker AI received significant infrastructure improvements in 2025 across four areas: capacity, price performance, observability, and usability. Part 1 of this series covers the first two. On capacity, Flexible Training Plans now support inference endpoints, so users can reserve GPU capacity for a specific duration and instance type and get predictable availability for LLM inference. Reservations support endpoint updates and scaling, with transparent upfront pricing. On price performance for inference workloads, the article highlights four capabilities: Flexible Training Plans for inference, Multi-AZ availability and parallel model copy placement for inference components, EAGLE-3 speculative decoding for higher throughput, and dynamic multi-adapter inference, which loads LoRA adapters on demand and manages memory intelligently. Together, these updates aim to make generative AI inference more reliable and cost-effective.
Key takeaway
For AI engineers and CTOs deploying generative AI models, SageMaker AI's 2025 enhancements address three challenges: capacity, resilience, and cost. Consider Flexible Training Plans to secure predictable GPU capacity for evaluations and production, EAGLE-3 speculative decoding to raise inference throughput, and dynamic multi-adapter inference to serve many fine-tuned LoRA adapters from a single endpoint. These features reduce operational complexity and infrastructure costs, shortening the path from experimentation to production.
Key insights
SageMaker AI's 2025 updates enhance generative AI inference with predictable capacity, improved resilience, and optimized performance.
Principles
- Predictable capacity is crucial for LLM inference.
- Resilience requires Multi-AZ distribution and fault tolerance.
- Dynamic resource management optimizes cost and performance.
Method
SageMaker AI's Flexible Training Plans allow reserving GPU capacity for inference. Inference components offer Multi-AZ high availability and parallel scaling. EAGLE-3 uses adaptive speculative decoding. Dynamic multi-adapter inference enables on-demand LoRA adapter loading with memory management.
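To make the idea of on-demand adapter loading with memory management concrete, here is a minimal sketch in plain Python. It is not SageMaker's implementation: the summary does not say which eviction policy the service uses, so least-recently-used (LRU) eviction is an assumption, and `AdapterCache`, `fake_loader`, and the adapter names are hypothetical stand-ins.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class AdapterCache:
    """Toy cache that loads adapters on first use and evicts the least recently used.

    Assumption: LRU eviction. The real service's policy is not described in the source.
    """

    def __init__(self, capacity: int, loader: Callable[[str], Dict]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # max adapters kept in (simulated) accelerator memory
        self.loader = loader          # hypothetical function that fetches adapter weights
        self._resident: "OrderedDict[str, Dict]" = OrderedDict()
        self.loads = 0                # number of cold loads performed so far

    def get(self, adapter_id: str) -> Dict:
        if adapter_id in self._resident:
            # Cache hit: mark the adapter as most recently used.
            self._resident.move_to_end(adapter_id)
            return self._resident[adapter_id]
        # Cache miss: make room if the cache is full, then load on demand.
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # evict the least recently used adapter
        adapter = self.loader(adapter_id)
        self._resident[adapter_id] = adapter
        self.loads += 1
        return adapter

    def resident(self) -> List[str]:
        """Adapter ids currently in memory, least recently used first."""
        return list(self._resident)


def fake_loader(adapter_id: str) -> Dict:
    # Stand-in for reading LoRA weights from storage.
    return {"id": adapter_id}


cache = AdapterCache(capacity=2, loader=fake_loader)
for name in ["legal", "support", "legal", "medical"]:
    cache.get(name)

print(cache.resident())  # ['legal', 'medical']
print(cache.loads)       # 3
```

In this run the second request for `legal` is a hit, so `support` becomes the least recently used adapter and is the one evicted when `medical` arrives.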
In practice
- Reserve GPU capacity for LLM inference with Flexible Training Plans.
- Deploy inference components across multiple Availability Zones (Multi-AZ) for high availability.
- Use EAGLE-3 for increased generative AI inference throughput.
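The throughput gain from speculative decoding comes from a draft-and-verify loop, which the following toy sketch illustrates. It shows generic greedy speculative decoding, not EAGLE-3 itself, and does not use any SageMaker API. The two "models" are hypothetical functions over integer tokens, and `speculative_decode`, `greedy_decode`, and the parameter `k` are names chosen for illustration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with.

    Returns the tokens and the number of verification rounds. In a real system
    each round is one batched target forward pass; this toy calls the target
    token by token only to keep the logic visible.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        ctx = list(tokens)
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2. The target checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        ctx = list(tokens)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first mismatch, discard the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))   # every draft token matched: take a bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], rounds


def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10              # toy target: count upward modulo 10


def draft_model(ctx: List[int]) -> int:
    # Toy draft: agrees with the target except right after token 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_tokens, rounds = speculative_decode(target_model, draft_model, [0], max_new=8)
base_tokens = greedy_decode(target_model, [0], max_new=8)
print(spec_tokens == base_tokens)  # True
print(rounds)                      # 2
```

The output matches plain greedy decoding exactly, but the eight new tokens are produced in two verification rounds rather than eight sequential target steps. How much is saved in practice depends on how often the draft agrees with the target.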
Topics
- Amazon SageMaker AI
- Flexible Training Plans
- Generative AI Inference
- Speculative Decoding
- LoRA Adapters
Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.