Announcing Fireworks AI on Microsoft Foundry
Summary
Microsoft Foundry now offers high-performance, low-latency inference for popular open models hosted on the Fireworks AI cloud. The public preview includes serverless pay-per-token and global provisioned throughput deployments for models such as MiniMax M2.5, OpenAI's gpt-oss-120b, MoonshotAI's Kimi-K2.5, and DeepSeek-v3.2. Customers can also import and deploy their own fine-tuned versions of these models, including Qwen3-14B, within their Foundry projects, with support for high-speed file transfer via the Azure Developer CLI (`azd`). Serverless deployments in the US Data Zone come with quotas of 25K or 250K tokens per minute (TPM), while provisioned throughput offers consistent performance, with PTU requirements and latency targets specific to each model. To enable the integration, customers must opt in via the Preview features panel in the Azure portal.
Key takeaway
For machine learning engineers who want to deploy recent open models or their own fine-tuned models, Microsoft Foundry's integration with Fireworks AI adds deployment flexibility. You can now choose between pay-per-token serverless inference for cost efficiency and provisioned throughput for consistent, low-latency performance, directly within your existing Foundry projects. Evaluate the new model catalog and consider migrating your post-trained models to use these deployment options.
Key insights
Microsoft Foundry integrates Fireworks AI for enhanced open model inference and custom model deployment.
Principles
- Offer diverse deployment options for AI models.
- Support custom fine-tuned models for specific use cases.
Method
Customers opt in via the Azure portal, then deploy models using either serverless pay-per-token or provisioned throughput. Custom models are imported via the Foundry UI or the `azd` CLI.
In practice
- Deploy gpt-oss-120b at $0.17 per 1M input tokens.
- Use `azd ai models create` for fast model weight transfer.
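To make the pay-per-token figure concrete, here is a minimal Python sketch that estimates input-token spend at the quoted $0.17 per 1M input tokens and gives a rough lower bound on how long a batch would take under a 25K or 250K TPM quota. The function names and the 50M-token workload are hypothetical. The sketch covers input tokens only, since no output price is quoted here, and it assumes every token counts against the TPM quota, which may not match how the service actually meters usage.

```python
from decimal import Decimal

# Input price quoted above for gpt-oss-120b (USD per 1M input tokens).
INPUT_PRICE_PER_MILLION = Decimal("0.17")


def input_cost_usd(input_tokens: int) -> Decimal:
    """Return the input-token cost in USD. Output tokens are not included."""
    return INPUT_PRICE_PER_MILLION * Decimal(input_tokens) / Decimal(1_000_000)


def min_minutes_at_quota(total_tokens: int, tpm_quota: int) -> int:
    """Lower bound on minutes needed to send total_tokens under a TPM quota.

    Assumes (hypothetically) that every token counts against the quota and
    that the quota is fully used every minute.
    """
    return -(-total_tokens // tpm_quota)  # ceiling division


if __name__ == "__main__":
    tokens = 50_000_000  # hypothetical input volume
    print("Input cost (USD):", input_cost_usd(tokens))
    for quota in (25_000, 250_000):
        print(f"Minimum minutes at {quota} TPM:", min_minutes_at_quota(tokens, quota))
```

A calculation like this can help decide when a steady workload is large enough to compare serverless pricing against provisioned throughput.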
Topics
- Microsoft Foundry
- Fireworks AI
- Large Language Model Inference
- Custom Model Deployment
- Serverless & Provisioned Throughput
Best for: Machine Learning Engineer, NLP Engineer, CTO, MLOps Engineer, AI Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.