What are Metrics?
Metrics represent quantitative measurements of your system’s health and behavior. They provide insights into performance trends, such as:
- CPU usage over time
- Request rates per endpoint
- Error counts or failure rates
- Latency in handling requests
Types of Metrics in OpenTelemetry:
- Counter: Measures occurrences or events, such as the number of requests handled.
- Gauge: Captures values that fluctuate, like memory usage.
- Histogram: Measures the distribution of values, such as response times, from which percentiles can be estimated.
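To make the difference between the three types concrete, here is a minimal toy model in plain Python. It is not the OpenTelemetry SDK: the class names, the `percentile_upper_bound` helper, and the bucket boundaries are all assumptions chosen for illustration. It only shows how each type treats incoming values.

```python
from bisect import bisect_left


class Counter:
    """Monotonic sum: the total only ever goes up."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.total += amount


class Gauge:
    """Last value: each new observation replaces the previous one."""

    def __init__(self) -> None:
        self.value = None

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Bucketed distribution: one count per bucket, each with an inclusive upper bound."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        # One extra bucket catches values above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value: float) -> None:
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.bounds, value)] += 1

    def percentile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile (0 < q <= 1)."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no values recorded")
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= q * total:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


requests = Counter()
for _ in range(3):
    requests.add()

memory = Gauge()
memory.set(512.0)
memory.set(480.5)  # overwrites the earlier reading

latency_ms = Histogram([50, 100, 250, 500])  # illustrative bucket bounds
for sample in [12, 40, 75, 90, 110, 130, 240, 260, 480, 900]:
    latency_ms.record(sample)

print(requests.total)                          # 3
print(memory.value)                            # 480.5
print(latency_ms.counts)                       # [2, 2, 3, 2, 1]
print(latency_ms.percentile_upper_bound(0.5))  # 250
```

Note that the histogram never stores individual samples, which is why a percentile read from it is an estimate bounded by the bucket edges rather than an exact value.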
Why Metrics Matter
In a microservices environment, metrics are indispensable for:
- Performance monitoring: Identifying bottlenecks or degraded performance.
- Capacity planning: Forecasting when additional resources are required.
- Incident detection: Alerting teams about abnormal system behavior.
Metrics vs. Traces
Metrics have a number of advantages over tracing. They are far more data-efficient: at the collector level, it’s generally possible to compress hundreds of individual measurements into a single packet of data sent on to the metrics backend. Metrics also show broad trends, whereas a trace, no matter how interesting, only ever covers a single request.
Should you use metrics instead of traces to monitor your service? Absolutely not. Metrics present aggregated performance, so the specific information needed to really understand root causes stays elusive. Even with high-resolution time-series metrics, it’s very hard to go from a worrying metric to the matching log data for a problem. Finally, modern traces can show how asynchronous requests contribute to overall request time, something that’s very hard to tease out of bare metrics.
Setting up OpenTelemetry Metrics
Auto-Instrumentation vs. Manual Instrumentation
- Auto-Instrumentation: Many popular frameworks and libraries come with automatic OpenTelemetry instrumentation, requiring minimal setup.
- Manual Instrumentation: Developers can manually add metrics within the application code by using SDKs to track specific business metrics (e.g., purchases per hour).
Example Metric Pipeline
With OpenTelemetry, you can collect, process, and export metrics using Collectors. Here’s a high-level example of a typical metric pipeline:
- Data Collection: Metrics are generated by instrumented services.
- Processing: The OpenTelemetry Collector aggregates and processes the data (e.g., batching or filtering metrics).
- Exporting: Metrics are sent to an observability backend such as Prometheus, where tools like Grafana can query and visualize them.
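The three stages above can be sketched as a small in-process model. This is a toy under stated assumptions, not how the OpenTelemetry Collector is implemented or configured: the data point shape, the function names, the `debug.` prefix rule, and the batch size are all hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical data point shape for this sketch: {"name": str, "value": float}
DataPoint = dict


def collect() -> Iterator[DataPoint]:
    """Data collection: stand-in for instrumented services emitting measurements."""
    yield {"name": "http.requests", "value": 1}
    yield {"name": "debug.heartbeat", "value": 1}
    yield {"name": "http.requests", "value": 1}
    yield {"name": "process.memory", "value": 512.0}
    yield {"name": "http.errors", "value": 1}


def drop_debug(points: Iterable[DataPoint]) -> Iterator[DataPoint]:
    """Processing (filtering): discard metrics that are not needed downstream."""
    return (p for p in points if not p["name"].startswith("debug."))


def batch(points: Iterable[DataPoint], size: int) -> Iterator[list[DataPoint]]:
    """Processing (batching): group points so each export carries several at once."""
    it = iter(points)
    while chunk := list(islice(it, size)):
        yield chunk


def export(batches: Iterable[list[DataPoint]], backend: list) -> None:
    """Exporting: stand-in for sending each batch to a metrics backend."""
    for b in batches:
        backend.append(b)


backend: list[list[DataPoint]] = []
export(batch(drop_debug(collect()), size=3), backend)
print([len(b) for b in backend])  # [3, 1]
```

Five points go in, the filter drops one, and the remaining four leave as two batches. In a real deployment these stages would be configured in the Collector rather than written by hand.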
Best Practices for Metrics in OpenTelemetry
- Optimize cardinality: Avoid labels with many distinct values, as every unique combination of label values creates a separate time series, and too many of them can overwhelm storage and query systems.
- Set appropriate aggregation intervals: Batch data intelligently to balance between real-time insights and system load.
- Use meaningful names: Clearly describe the purpose of each metric to make dashboards and alerts easier to understand.
- Standardize naming early: While OpenTelemetry defines standard language for a number of concepts, the names of your own application metrics are not standardized. As such, it’s possible to report total-web-shop-checkout-time and webShopCheckoutTime_total as two totally separate metrics even though they should be aggregated. No standard is perfect, of course, so to normalize data before it’s stored, use the processors in the OpenTelemetry Collector.
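Two of the practices above lend themselves to a quick sketch: estimating cardinality before adding a label, and folding differently styled names into one. Both helpers below are hypothetical, and the naming rule (snake_case with "total" moved to the end) is an assumed house convention rather than anything OpenTelemetry prescribes.

```python
import math
import re


def worst_case_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return math.prod(distinct_values.values())


def normalize_name(name: str) -> str:
    """Map differently styled names onto one snake_case form with 'total' as a suffix."""
    # Split on separators and camelCase boundaries, then lowercase every token.
    tokens = [t.lower() for t in re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)]
    core = [t for t in tokens if t != "total"]
    if "total" in tokens:
        core.append("total")
    return "_".join(core)


labels = {"endpoint": 20, "status_code": 5, "region": 4}
print(worst_case_series(labels))                          # 400
print(worst_case_series({**labels, "user_id": 10_000}))   # 4000000

print(normalize_name("total-web-shop-checkout-time"))  # web_shop_checkout_time_total
print(normalize_name("webShopCheckoutTime_total"))     # web_shop_checkout_time_total
```

The product is a worst case, since not every combination of values occurs in practice, but it shows why a single per-user label can multiply the series count by several orders of magnitude.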