Building a high-volume metrics pipeline with OpenTelemetry and vmagent

· Source: The Airbnb Tech Blog - Medium · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

A large-scale metrics pipeline was migrated from a StatsD-based system to one built on the OpenTelemetry Protocol (OTLP) and Prometheus. The transition used a dual-write strategy in which services emitted both StatsD and OTLP metrics. CPU time spent on metrics processing fell from 10% to under 1%, and reliability improved. The team adopted the OpenTelemetry Collector and a Prometheus-based backend such as Grafana Mimir. High-volume OTLP metrics caused memory pressure, which the team addressed by selectively enabling delta temporality. For streaming aggregation and cost control, the team chose vmagent and deployed it in a two-layer sharded architecture that scaled to hundreds of aggregators, ingested over 100 million samples per second, and reduced costs by an order of magnitude. Finally, the team developed a "zero injection" technique within the vmagent aggregation tier to fix undercounting of sparse Prometheus counters.
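The sparse-counter undercount and the effect of zero injection can be illustrated with a small simulation. This is a sketch under stated assumptions, not the team's or vmagent's actual implementation: `simple_increase` is a simplified stand-in for a Prometheus-style increase (last value minus first value), `CounterAggregator` is a hypothetical class, and backdating the injected zero by one flush interval is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, cumulative counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Simplified increase: last value minus first value in the window.

    Real Prometheus increase() also extrapolates and handles counter resets;
    only the part relevant to sparse counters is kept here.
    """
    if len(samples) < 2:
        return 0.0  # a single sample has no earlier point to diff against
    return samples[-1][1] - samples[0][1]


class CounterAggregator:
    """Hypothetical aggregator turning per-interval deltas into cumulative samples."""

    def __init__(self, inject_zero: bool) -> None:
        self.inject_zero = inject_zero
        self.totals: Dict[str, float] = {}
        self.output: Dict[str, List[Sample]] = {}

    def flush(self, ts: int, interval: int, deltas: Dict[str, float]) -> None:
        for series, delta in deltas.items():
            if series not in self.totals:
                self.totals[series] = 0.0
                self.output[series] = []
                if self.inject_zero:
                    # First sighting: emit an explicit 0 one interval earlier
                    # (assumed placement) so the first real increment has a
                    # baseline to diff against.
                    self.output[series].append((ts - interval, 0.0))
            self.totals[series] += delta
            self.output[series].append((ts, self.totals[series]))


def run(inject_zero: bool) -> float:
    agg = CounterAggregator(inject_zero)
    # A sparse counter: one error at t=60, then two more at t=600.
    agg.flush(60, 60, {"errors_total": 1.0})
    agg.flush(600, 60, {"errors_total": 2.0})
    return simple_increase(agg.output["errors_total"])


if __name__ == "__main__":
    print("without zero injection:", run(False))
    print("with zero injection:", run(True))
```

Three errors occurred in total. Without the injected zero, the first sample already carries the value 1, so that increment is never observed as a change and the simplified increase reports one fewer.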

Key takeaway

If you are an MLOps or DevOps engineer migrating high-volume metrics to OpenTelemetry and Prometheus, use a dual-write strategy to reduce transition friction. Adopt OTLP for its performance benefits, but be prepared to use delta temporality for extremely high-cardinality services to prevent memory issues. Deploy a sharded streaming aggregator such as vmagent to control costs, and use zero injection so that sparse counters are reported accurately. Together, these choices yield a scalable and cost-efficient observability pipeline.
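Why delta temporality relieves memory pressure can be shown with a toy in-process counter store. This is a minimal sketch, not the OpenTelemetry SDK: `CounterStore` and `series_retained` are hypothetical names, and the label scheme is made up. The assumption it illustrates is that a cumulative exporter must keep every series it has ever seen so it can re-report running totals, while a delta exporter can drop its state after each export.

```python
from collections import defaultdict
from typing import Dict, Tuple

# (metric name, sorted label pairs) identifies one time series
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class CounterStore:
    """Hypothetical in-process counter state, flushed once per export interval."""

    def __init__(self, delta: bool) -> None:
        self.delta = delta
        self.values: Dict[SeriesKey, float] = defaultdict(float)

    def add(self, name: str, labels: Dict[str, str], amount: float = 1.0) -> None:
        key = (name, tuple(sorted(labels.items())))
        self.values[key] += amount

    def export(self) -> Dict[SeriesKey, float]:
        snapshot = dict(self.values)
        if self.delta:
            # Delta: only the change since the last export is reported,
            # so the per-series state can be discarded.
            self.values.clear()
        # Cumulative: state is kept so the next export reports running totals.
        return snapshot


def series_retained(delta: bool, intervals: int = 3, per_interval: int = 1000) -> int:
    """Count series still held in memory after several export intervals."""
    store = CounterStore(delta)
    for i in range(intervals):
        for j in range(per_interval):
            # Each interval touches a fresh set of label values (high cardinality).
            store.add("requests_total", {"key": f"{i}-{j}"})
        store.export()
    return len(store.values)


if __name__ == "__main__":
    print("cumulative:", series_retained(delta=False))
    print("delta:", series_retained(delta=True))
```

With cumulative state, memory grows with every distinct series the process has ever emitted. With delta state, it is bounded by the series touched in a single interval, at the cost of pushing accumulation to a downstream tier.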

Key insights

Migrating high-volume metrics to OpenTelemetry and Prometheus requires careful instrumentation, scalable aggregation, and addressing counter semantics.

Method

Implement a dual-write strategy so that services emit both OTLP and StatsD during the migration. Use the OpenTelemetry Collector for collection. Deploy vmagent in a sharded, two-layer architecture for streaming aggregation. Inject zeros at the aggregation tier so that sparse counters are counted accurately.
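The routing step in a sharded aggregation tier can be sketched as follows. This is an illustration under assumptions, not the team's configuration: `shard_for` is a hypothetical helper, the label names and shard count are made up, and plain modulo hashing stands in for whatever sharding scheme the routing layer actually uses. The point it shows is that labels removed by aggregation (here `pod`) must be excluded from the hash, so that every sample contributing to one aggregated series reaches the same aggregator.

```python
import hashlib
from typing import Dict, FrozenSet


def shard_for(labels: Dict[str, str], num_shards: int, ignore: FrozenSet[str]) -> int:
    """Pick an aggregator shard from the labels that survive aggregation."""
    # Drop labels that the aggregator will remove, then sort for a stable key.
    kept = sorted((k, v) for k, v in labels.items() if k not in ignore)
    payload = "\x00".join(f"{k}={v}" for k, v in kept).encode("utf-8")
    # A stable hash (unlike Python's built-in hash()) gives the same result
    # on every router instance and across restarts.
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    dropped = frozenset({"pod"})
    a = {"__name__": "http_requests_total", "service": "search", "pod": "search-abc"}
    b = {"__name__": "http_requests_total", "service": "search", "pod": "search-xyz"}
    # Both pods land on the same shard, so one aggregator sees the full sum.
    print(shard_for(a, 128, dropped), shard_for(b, 128, dropped))
```

Modulo hashing remaps most series when the shard count changes. A consistent-hashing scheme would reduce that churn, at the cost of more routing logic.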

Best for: MLOps Engineer, DevOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Airbnb Tech Blog - Medium.