So You Want to Sell Inference

2026-06-22 · Source: Tomasz Tunguz · Field: Business & Management — Corporate Strategy & Leadership, Entrepreneurship & Start-ups, Project & Product Management · Depth: Intermediate, quick

Summary

Companies that sell or resell AI inference struggle to maintain gross margins and often end up operating as zero-margin payment rails. The article outlines two pricing models, cost-plus and value-based, plus a cost-optimization strategy that supports both. Cost-plus pricing marks up raw inference costs by, for example, 30%. It is vulnerable to commoditization and to customers bypassing the markup, especially when they bring their own API keys. Value-based pricing instead charges for outcomes such as resolved tickets or generated reports, which decouples revenue from the underlying inference cost and makes margins more durable. Cost optimization through model routing, caching, and distillation to proprietary sub-8B-parameter models can reduce inference costs to as low as $0.70. That improves profitability under both pricing models and makes it possible to charge a platform fee even when customers supply their own keys.

Key takeaway

AI product managers and entrepreneurs building inference-based services should prioritize value-based pricing over cost-plus models. Charging for outcomes such as resolved tickets or generated reports, rather than for raw tokens, is what produces durable margins. Invest as well in optimization techniques such as distillation to build proprietary, cost-efficient models, so the business stays profitable even when customers bring their own API keys. Together, these moves turn the offering from a payment rail into a software product.

Key insights

To maintain AI inference margins, shift from cost-plus to value-based pricing and aggressively optimize underlying costs.

Principles

Cost-plus pricing tends to compress toward zero margin.
Value-based pricing decouples margin from inference cost.
Distillation creates defensible, low-cost proprietary models.
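
The first two principles can be checked with simple arithmetic. The sketch below is illustrative only: the 30% markup comes from the summary above, while the $1.00 price per resolved ticket and the per-ticket inference costs are hypothetical numbers chosen to show the shape of the effect.

```python
def cost_plus_margin(markup: float) -> float:
    """Gross margin when price = cost * (1 + markup)."""
    return markup / (1.0 + markup)


def value_based_margin(price_per_outcome: float, inference_cost: float) -> float:
    """Gross margin when the price per outcome is fixed, independent of cost."""
    return (price_per_outcome - inference_cost) / price_per_outcome


# 30% markup is the example used in the summary.
print(f"cost-plus, 30% markup: {cost_plus_margin(0.30):.1%}")

# Hypothetical: $1.00 per resolved ticket, with inference cost falling over time.
for cost in (0.40, 0.20, 0.10):
    print(f"value-based, cost ${cost:.2f}: {value_based_margin(1.00, cost):.1%}")
```

Under cost-plus, the margin percentage is capped by the markup no matter how cheap inference becomes, and the dollars earned per request shrink as the underlying cost falls. Under a fixed price per outcome, every reduction in inference cost widens the margin.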

Method

Reduce inference costs through model routing, caching, and distilling frontier models into sub-8B-parameter student models that can run on cheaper hardware.
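
A minimal sketch of how routing and caching might combine, under stated assumptions: the model names, the per-call costs, and the word-count routing rule are all hypothetical, and the model call is a stand-in string. Real routers typically use a classifier or confidence signal rather than prompt length.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    cost_per_call: float  # hypothetical flat cost per call, in dollars


# Hypothetical models and prices, for illustration only.
SMALL = Model("distilled-student", 0.001)
FRONTIER = Model("frontier", 0.03)


@dataclass
class Router:
    max_small_words: int = 50  # assumed rule: short prompts go to the small model
    cache: dict = field(default_factory=dict)
    spend: float = 0.0

    def route(self, prompt: str) -> Model:
        """Pick the cheaper model whenever the prompt is short enough."""
        return SMALL if len(prompt.split()) <= self.max_small_words else FRONTIER

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:  # cache hit: no inference cost incurred
            return self.cache[key]
        model = self.route(prompt)
        self.spend += model.cost_per_call
        answer = f"[{model.name}] response"  # stand-in for a real model call
        self.cache[key] = answer
        return answer


router = Router()
router.complete("Reset my password")
router.complete("Reset my password")  # identical prompt, served from cache
router.complete("word " * 200)        # long prompt, routed to the frontier model
print(f"total spend: ${router.spend:.3f}")
```

Three requests result in two paid calls, only one of which reaches the expensive model. Distillation is what would make the small model good enough to take most of the traffic.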

In practice

Charge per resolved ticket or completed task.
Sell "Agent Compute Units" instead of tokens.
Implement model routing and caching for efficiency.

Topics

AI Inference
Pricing Models
Value-Based Pricing
Cost Optimization
Model Distillation
Gross Margin

Best for: Product Manager, AI Engineer, Machine Learning Engineer, Director of AI/ML, AI Product Manager, Entrepreneur

Editorial summary, takeaway, and curation by AIssential. Original article published by Tomasz Tunguz.