Building a high-volume metrics pipeline with OpenTelemetry and vmagent
Summary
A large-scale metrics pipeline was migrated from a StatsD-based system to one built on the OpenTelemetry Protocol (OTLP) and Prometheus. A dual-write strategy let services emit both StatsD and OTLP metrics during the transition. Moving to OTLP cut the CPU time spent on metrics processing from 10% to under 1% and improved reliability. The team adopted the OpenTelemetry Collector and a Prometheus-based backend such as Grafana Mimir. High-volume OTLP metrics caused memory pressure, which the team addressed by selectively using delta temporality. For streaming aggregation and cost control, the team chose vmagent and deployed it in a two-layer sharded architecture that scaled to hundreds of aggregators, ingested over 100 million samples per second, and reduced costs by an order of magnitude. Finally, a "zero injection" technique in the vmagent aggregation tier resolved undercounting of sparse Prometheus counters.
Key takeaway
For MLOps or DevOps engineers migrating high-volume metrics to OpenTelemetry and Prometheus: use a dual-write strategy to reduce transition friction. Adopt OTLP for its performance benefits, but be prepared to use delta temporality for extremely high-cardinality services to prevent memory issues. Run a sharded streaming aggregator such as vmagent to control costs, and use zero injection at that tier so sparse counters are reported accurately. Together these steps give a scalable, cost-efficient observability pipeline.
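To illustrate why delta temporality can relieve memory pressure, the sketch below contrasts the two modes in plain Python. It is a toy model, not the OpenTelemetry SDK: the class names `CumulativeCounter` and `DeltaCounter`, the `export` method, and the `route` label are all hypothetical. The assumption is a workload where each interval touches a fresh set of label combinations.

```python
from collections import defaultdict


class CumulativeCounter:
    """Toy cumulative counter: reports running totals, so state is never dropped."""

    def __init__(self):
        self._totals = defaultdict(int)

    def add(self, labels, value=1):
        self._totals[labels] += value

    def export(self):
        # Every series seen so far is re-exported with its running total.
        return dict(self._totals)

    def series_in_memory(self):
        return len(self._totals)


class DeltaCounter:
    """Toy delta counter: reports only the change since the last export."""

    def __init__(self):
        self._pending = defaultdict(int)

    def add(self, labels, value=1):
        self._pending[labels] += value

    def export(self):
        # Hand off this interval's increments and start over with empty state.
        batch, self._pending = dict(self._pending), defaultdict(int)
        return batch

    def series_in_memory(self):
        return len(self._pending)


def simulate(counter, intervals=3, series_per_interval=1000):
    """Each interval increments a fresh set of label combinations, then exports."""
    for i in range(intervals):
        for j in range(series_per_interval):
            counter.add((("route", f"{i}-{j}"),))
        counter.export()
    return counter.series_in_memory()


cumulative_left = simulate(CumulativeCounter())  # keeps all 3000 series
delta_left = simulate(DeltaCounter())            # keeps nothing after export
```

The trade-off is that Prometheus counters are cumulative, so with delta temporality some downstream component has to accumulate the deltas again.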
Key insights
Migrating high-volume metrics to OpenTelemetry and Prometheus requires careful instrumentation, scalable aggregation, and addressing counter semantics.
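The counter-semantics problem can be shown with a small simulation. Prometheus computes increases from the difference between samples, so a counter whose first-ever sample is already 1 has no baseline and its first increment is lost. The sketch below is an illustration under stated assumptions, not the vmagent implementation: `naive_increase` is a simplified stand-in for Prometheus's `increase()` (no extrapolation), and `CounterAggregator` is a hypothetical aggregator that, when `inject_zero` is set, emits a 0 the first time it sees a series and the real total on the next flush.

```python
def naive_increase(samples):
    """Sum the differences between consecutive samples; a drop counts as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


class CounterAggregator:
    """Hypothetical aggregation tier that turns per-interval deltas into totals."""

    def __init__(self, inject_zero):
        self.inject_zero = inject_zero
        self._totals = {}

    def flush(self, deltas):
        """Accept one interval's increments; return the samples to emit."""
        new_series = set()
        for series, delta in deltas.items():
            if series not in self._totals:
                new_series.add(series)
                self._totals[series] = 0.0
            self._totals[series] += delta

        emitted = {}
        for series, total in self._totals.items():
            if self.inject_zero and series in new_series:
                # Baseline sample; the accumulated total follows on the next flush.
                emitted[series] = 0.0
            else:
                emitted[series] = total
        return emitted


def emitted_samples(inject_zero, flushes, series):
    """Collect the samples emitted for one series across a run of flushes."""
    agg = CounterAggregator(inject_zero)
    samples = []
    for deltas in flushes:
        out = agg.flush(deltas)
        if series in out:
            samples.append(out[series])
    return samples


# A sparse counter: one increment in the second interval, then silence.
flushes = [{}, {"errors_total": 1.0}, {}, {}]
without_zero = naive_increase(emitted_samples(False, flushes, "errors_total"))
with_zero = naive_increase(emitted_samples(True, flushes, "errors_total"))
```

Without the injected zero the emitted samples are 1, 1, 1 and the computed increase is 0; with it they are 0, 1, 1 and the single increment is counted.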
Principles
- Dual-write strategies ease large-scale migrations.
- OTLP offers superior performance and reliability over StatsD.
- Centralized aggregation reduces costs and enables transformations.
Method
Implement a dual-write strategy so services emit both OTLP and StatsD during the migration. Use the OpenTelemetry Collector for collection. Deploy sharded vmagent instances for streaming aggregation. Inject zeros at the aggregation tier so sparse counters are counted accurately.
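A dual-write layer can be as small as a facade that fans each increment out to both backends. The following is a minimal sketch, not the instrumentation library described in the original article: `StatsdSink`, `OtlpSink`, and `DualWriteCounter` are hypothetical names, and both sinks only record in memory what a real client would send.

```python
class StatsdSink:
    """Stand-in for a StatsD client: records the line that would be sent."""

    def __init__(self):
        self.lines = []

    def incr(self, name, value, tags):
        # Plain StatsD counter line; tags are dropped because base StatsD has none.
        self.lines.append(f"{name}:{value}|c")


class OtlpSink:
    """Stand-in for an OTLP exporter: records structured data points."""

    def __init__(self):
        self.points = []

    def incr(self, name, value, tags):
        self.points.append({"name": name, "value": value, "attributes": dict(tags)})


class DualWriteCounter:
    """Counter facade that forwards every increment to all configured sinks."""

    def __init__(self, name, sinks):
        self.name = name
        self.sinks = list(sinks)

    def incr(self, value=1, **tags):
        for sink in self.sinks:
            sink.incr(self.name, value, tags)


statsd, otlp = StatsdSink(), OtlpSink()
checkout_requests = DualWriteCounter("checkout.requests", [statsd, otlp])
checkout_requests.incr()
checkout_requests.incr(2, region="us-east-1")
```

Because call sites only see the facade, the StatsD sink can be removed from the list once the OTLP path is trusted, without touching service code.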
In practice
- Use delta temporality for high-volume OTLP metrics.
- Configure vmagent routers for consistent sharding.
- Implement zero injection for accurate sparse counters.
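The routing idea behind consistent sharding can be sketched as a stable hash over the series identity, so every sample of a series reaches the same aggregator. This is an illustration, not vmagent configuration: `shard_for` and its `ignore` parameter are hypothetical, and the choice to leave out labels that aggregation will remove is an assumption about how series that merge should be co-located. It uses simple hash-modulo routing, so changing `num_shards` remaps most series.

```python
import hashlib


def shard_for(name, labels, num_shards, ignore=()):
    """Pick an aggregator shard from the series identity.

    Labels listed in `ignore` are left out of the hash so that series which
    aggregation will merge end up on the same shard.
    """
    # Sort labels so the result does not depend on insertion order.
    kept = sorted((k, v) for k, v in labels.items() if k not in ignore)
    key = name + "\x00" + "\x00".join(f"{k}={v}" for k, v in kept)
    # hashlib is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Two pods of the same service route to one shard when "pod" is ignored.
shard_a = shard_for(
    "http_requests_total", {"service": "api", "pod": "api-1"}, 16, ignore=("pod",)
)
shard_b = shard_for(
    "http_requests_total", {"pod": "api-2", "service": "api"}, 16, ignore=("pod",)
)
```

Every router computing the same function sends a given series to the same aggregator, which is what lets that aggregator hold the complete state for it.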
Topics
- OpenTelemetry
- Prometheus
- vmagent
- Metrics Migration
- Streaming Aggregation
- Observability Pipelines
Code references
- statsd/statsd
- stripe/veneur
- open-telemetry/opentelemetry-collector-contrib
- VictoriaMetrics/VictoriaMetrics
Best for: MLOps Engineer, DevOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Airbnb Tech Blog - Medium.