How to Correctly Handle AWS CloudWatch Metrics in Prometheus
The way metrics from AWS CloudWatch are exposed through the CloudWatch Exporter into Prometheus is inconsistent and often misunderstood. CloudWatch metrics behave fundamentally differently from native Prometheus metrics: CloudWatch provides aggregated, gauge-like values (averages or sums over fixed periods), while Prometheus expects continuously increasing counters and instantaneous gauges. This document covers the differences and how to query CloudWatch metrics correctly in Prometheus.
TL;DR
Always use the Sum statistic for counter-like metrics (e.g., request counts, invocations, bytes) and divide by the
CloudWatch metric period in seconds to get accurate rates. For utilization metrics (e.g.,
CPUUtilization, MemoryUtilization), you can use the *_over_time() functions or query the metric directly without
adjustments.
Counters in Prometheus vs. CloudWatch
To understand the differences, let's look at two examples of how Prometheus counter metrics and CloudWatch metrics behave over time.
Let's consider a Prometheus counter metric that tracks the total number of HTTP requests received by a service, named
http_requests_total. This metric continuously increases as more requests are received:
---
config:
theme: base
---
timeline
title Prometheus Counter Metric example
section `http_requests_total` Prometheus Metric. Scrape interval - 15s
0 : Metric is created at time T0 with value <br> 0.
15s : 100 requests have been received. Metric value is now <br> 100.
30s : Another 150 requests have been received. Metric value is now <br> 250.
45s : Another 200 requests have been received. Metric value is now <br> 450.
60s : Another 50 requests have been received. Metric value is now <br> 500.
This is an example of a typical Prometheus counter metric, which continuously increases as more requests are received. From these samples we can calculate statistics such as the rate of requests per second, total requests over time, and average requests over time.
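As a rough illustration, the per-second rate over the five samples above can be computed by hand. The Python sketch below is a simplification of what Prometheus' rate() does: it ignores extrapolation and counter resets, and the helper name `simple_rate` is made up for this example.

```python
# Samples of http_requests_total from the timeline above: (seconds since T0, value).
samples = [(0, 0), (15, 100), (30, 250), (45, 450), (60, 500)]


def simple_rate(points):
    """Per-second increase between the first and last sample.

    Simplified: no extrapolation and no counter-reset handling.
    """
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)


requests_per_second = simple_rate(samples)
total_requests = samples[-1][1] - samples[0][1]
print(f"{requests_per_second:.2f} req/s, {total_requests} requests in 60s")
```

Because the counter only ever grows, the difference between any two samples is the number of requests received in between, which is what makes this calculation valid.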
In contrast, CloudWatch metrics behave differently. CloudWatch metrics are typically aggregated over fixed periods (e.g., 1 minute or 5 minutes) and represent a single statistic for that period, such as an average, max, or sum. These metrics do not continuously increase like Prometheus counters and thus are not directly compatible with Prometheus' expectations.
Below is an example of a CloudWatch metric that is converted to a Prometheus metric via the CloudWatch Exporter:
---
config:
theme: base
---
timeline
title CloudWatch Metric example.
section `aws_applicationelb_request_count_sum` CloudWatch to Prometheus Metric. <br> Scrape interval - 15s. <br>CloudWatch Metric period - 60s
0 : Metric not available at time T0.
30s : Metric not available at time T0 + 30s.
60s : At time T0 + 60s, 500 requests have been received. Metric value is now <br> 500.
90s : Metric not available at time T0 + 90s.
120s : At time T0 + 120s, another 300 requests have been received. Metric value is now <br> 300.
This example illustrates how CloudWatch metrics are reported. At T0 + 60s, the metric reports a value of 500, representing the total requests received during the previous 60-second period. At T0 + 120s, it reports a value of 300 for the next 60-second period. There is no continuous increase; instead, each value represents the total for that specific period.
Converting CloudWatch Metrics to Prometheus Queries
Since CloudWatch metrics are aggregated statistics over fixed periods, we cannot directly apply Prometheus functions like rate() or increase() to derive meaningful statistics. Instead, we need to adjust our queries to account for the nature of CloudWatch metrics. Furthermore, we need to be aware of the difference between the various statistics provided by CloudWatch (Sum, Average, Max, Min, etc.) and how they should be interpreted in Prometheus based on the type of metric.
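To see why counter functions misfire here, consider the two samples from the CloudWatch timeline above. The Python sketch below mimics, in simplified form, how a reset-aware increase treats a falling value (as a counter reset). It skips Prometheus' extrapolation, and `counter_increase` is a hypothetical helper, not a Prometheus API.

```python
# Period sums from the CloudWatch timeline above, in arrival order.
period_sums = [500, 300]


def counter_increase(values):
    """Reset-aware increase over a series, ignoring extrapolation.

    Any drop in value is assumed to be a counter reset, so the new
    value is counted as the increase since the reset.
    """
    total = 0
    for prev, curr in zip(values, values[1:]):
        total += curr - prev if curr >= prev else curr
    return total


as_counter = counter_increase(period_sums)  # what counter semantics would see
actual = sum(period_sums)                   # requests really received
print(f"counter-style increase: {as_counter}, actual requests: {actual}")
```

Under these assumptions the counter-style result and the real request total disagree, because each CloudWatch sample is already a complete total for its own period rather than a point on a growing counter.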
We distinguish two main types of metrics:
- Counter-like metrics: These metrics represent counts of events over time, such as request counts, Lambda invocations, or bytes transferred. For these metrics, we typically want to calculate rates (e.g., requests per second).
- Utilization metrics: These metrics represent percentages or ratios, such as CPU utilization or memory utilization. For these metrics, we can directly use the provided statistics without additional adjustments.
Handling Counter-like Metrics (Request Counts, Invocations, Bytes)
To calculate the rate of requests per second from a CloudWatch counter-like metric, we use the avg_over_time()
function over the desired time range and then divide by the period time in seconds.
[!IMPORTANT] We want to convert to a per-second rate as Prometheus typically works with base units only, e.g., requests per second, bytes per second, etc. For more information, see the Prometheus documentation on base units.
For example, assuming a CloudWatch metric period of 60 seconds, to calculate the average number of requests per
second over the last 5 minutes using the aws_applicationelb_request_count_sum metric, we would use:
avg_over_time(aws_applicationelb_request_count_sum[5m]) / 60
The 60 in the denominator represents the CloudWatch metric period in seconds. If the CloudWatch metric period were
different (e.g., 300 seconds), we would adjust the denominator accordingly:
avg_over_time(aws_applicationelb_request_count_sum[5m]) / 300
Always use Sum for counter-like metrics like request counts, invocations, bytes, etc., and divide by the CloudWatch
metric period in seconds to get accurate rates. The period is defined by the Alloy configuration and must
be taken into account when performing these calculations.
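The effect of the denominator can be sketched numerically in Python. The function below emulates taking avg_over_time() of a Sum metric and dividing by the period. The traffic figures are hypothetical (2000 requests over 5 minutes), and `per_second_rate` is an illustrative name, not part of any library.

```python
def per_second_rate(period_sums, period_seconds):
    """Emulate avg_over_time(<sum metric>[window]) / period on a list of period sums."""
    if not period_sums or period_seconds <= 0:
        raise ValueError("need at least one sample and a positive period")
    return (sum(period_sums) / len(period_sums)) / period_seconds


# Hypothetical traffic: 2000 requests over 5 minutes.
one_minute_sums = [500, 300, 400, 600, 200]   # five samples at a 60s period
five_minute_sums = [2000]                     # one sample at a 300s period

rate_60 = per_second_rate(one_minute_sums, 60)
rate_300 = per_second_rate(five_minute_sums, 300)
wrong = per_second_rate(five_minute_sums, 60)  # denominator not matching the period
print(f"{rate_60:.2f} {rate_300:.2f} {wrong:.2f}")
```

With the denominator matched to the period, both configurations yield the same rate for the same traffic. Dividing a 300-second sum by 60 overstates the rate by a factor of five.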
Never Use Average Statistics for Counter-like Metrics
It is important to note that for counter-like metrics, we should always use the Sum statistic from CloudWatch. Using
the Average statistic leads to incorrect calculations because it does not represent the average count of events
per unit of time.
AWS CloudWatch defines the Average statistic as Sum / SampleCount. This makes it a statistic that is
normalized over the number of samples, not necessarily over time. Any agreement with a per-time
statistic is purely coincidental.
Therefore, applying avg_over_time() to the Average statistic of counter-like CloudWatch metrics can lead to misleading
results, especially if the SampleCount does not correspond directly to the time period being queried.
Let's illustrate this with an example:
In a period of 5 minutes (300 seconds), if the SampleCount is 120 (indicating 120 samples were taken during that
period) and the Sum is 6000, the Average would be:
Average = Sum / SampleCount = 6000 / 120 = 50
This means that if we query the Average statistic from CloudWatch, it would return 50, that is, 50 per sample. This
tells us nothing about the statistic in relation to time; from a time perspective it is effectively unitless.
SampleCount differs greatly between AWS services and is not guaranteed to be consistent over time.
Using the example numbers above, the actual rate per second would be:
Sum / period = 6000 / 300 = 20 per second
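The worked numbers above can be checked with a few lines of Python. The variable names are illustrative; the values are the ones from the example.

```python
period_seconds = 300   # 5-minute CloudWatch period
sample_count = 120     # SampleCount reported for the period
total = 6000           # Sum reported for the period

average = total / sample_count              # CloudWatch Average: per sample
rate_per_second = total / period_seconds    # what we actually want: per second

# Dividing the Average by the period does not recover the rate.
rate_from_average = average / period_seconds

print(f"Average = {average:g} per sample")
print(f"Sum / period = {rate_per_second:g} per second")
print(f"Average / period = {rate_from_average:.3f} (off by a factor of {sample_count})")
```

The error factor equals SampleCount, so the Average-based figure is only right when SampleCount happens to be 1.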
Handling Utilization Metrics (CPU, Memory)
For utilization metrics (e.g., CPUUtilization, MemoryUtilization), we do not need to adjust for time since these metrics
are already expressed as percentages. We can derive statistics directly by applying avg_over_time() to the metric over
the desired time range.
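For comparison, a utilization series needs only plain averaging. The Python sketch below emulates avg_over_time() on hypothetical CPUUtilization samples; the sample values are invented for illustration, and the local `avg_over_time` function is a stand-in for the PromQL function of the same name.

```python
# Hypothetical CPUUtilization Average samples (percent), one per CloudWatch period.
cpu_utilization = [42.0, 55.5, 61.0, 48.5, 53.0]


def avg_over_time(samples):
    """Arithmetic mean of the samples in the window, like PromQL's avg_over_time()."""
    if not samples:
        raise ValueError("empty window")
    return sum(samples) / len(samples)


mean_cpu = avg_over_time(cpu_utilization)
peak_cpu = max(cpu_utilization)  # analogous to max_over_time()
print(f"mean {mean_cpu:.1f}%, peak {peak_cpu:.1f}%")
```

No division by the period is involved: a percentage is already independent of the period length.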
Appendix A: Using CloudWatch Timestamps
We have configured the CloudWatch Exporter to use CW timestamps for metrics to ensure that the metrics are aligned with the CloudWatch metric periods. This helps to avoid issues with data being misaligned due to scraping intervals or double-counting.
This currently causes out-of-order samples to be delivered to Prometheus, which are rejected at the time of writing.
If the metric's timestamp were the scrape time instead, the same CloudWatch metric period would produce duplicate samples, leading to incorrect calculations.
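The duplication problem can be illustrated with a short Python simulation. It assumes a 15-second scrape interval and a 60-second CloudWatch period, as in the timeline earlier, and that the exporter returns the latest period sum on every scrape. This is a simplified model for illustration, not the exporter's actual behavior.

```python
SCRAPE_INTERVAL = 15   # seconds
PERIOD = 60            # CloudWatch metric period in seconds
period_sums = {60: 500, 120: 300}  # CloudWatch timestamp -> Sum for that period

# Scrape-time timestamps: every scrape within a period re-reports the same sum.
scrape_stamped = []
for period_end, value in period_sums.items():
    for offset in range(0, PERIOD, SCRAPE_INTERVAL):
        scrape_stamped.append((period_end + offset, value))

# CloudWatch timestamps: repeated scrapes collapse onto one timestamp per period.
cw_stamped = sorted(period_sums.items())

scrape_total = sum(value for _, value in scrape_stamped)
cw_total = sum(value for _, value in cw_stamped)
print(f"scrape-stamped: {len(scrape_stamped)} samples, naive total {scrape_total}")
print(f"CW-stamped:     {len(cw_stamped)} samples, total {cw_total}")
```

In this model, any query that adds up samples over a window would count each period several times when scrape timestamps are used.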
Appendix B: Why following the GitLab example is incorrect
While GitLab does not provide many public examples of how they handle CloudWatch metrics in Prometheus (see "GitLab
Searches" below), there is one that looks very similar to the ALB metric we are using:
ES logging.
This example uses the Average statistic for calculating request rates. This led us to initially believe that using the
Average statistic was acceptable for counter-like metrics. However, upon further investigation, we realized that this
approach is flawed as it does not accurately represent the rate of requests over time.
Let's break down why we believe this is incorrect. Looking at the GitLab example:
While there is no public configuration available showing the CloudWatch Exporter setup, we can reasonably assume that
the division by 60 is intended to account for a CloudWatch metric period of 60 seconds. As noted above, the Average
statistic is defined as Sum / SampleCount. If we look at this exact same CloudWatch metric from our own AWS OpenSearch
Service, we can see that the SampleCount currently is a constant 1 over time. This means that the Average
statistic is effectively equal to the Sum statistic in this case, making the GitLab example coincidentally correct for
their specific scenario. If the SampleCount were to change, which can happen as it is not guaranteed to be constant by
AWS, the results would no longer be accurate.
If we were to apply the GitLab example to our ALB request count metric, we would get incorrect results because the
SampleCount for this metric is much larger than 1 (typically between 70 and 80) and varies over time.
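The dependence on SampleCount can be made concrete with a small Python sketch. The Sum of 4500 requests per period is an invented figure and the 60-second period is an assumption; the SampleCount values of 1, 70, and 80 follow the observations above.

```python
PERIOD = 60             # assumed CloudWatch period in seconds
total_requests = 4500   # hypothetical Sum for one period

true_rate = total_requests / PERIOD  # requests per second

estimates = {}
for sample_count in (1, 70, 80):
    average = total_requests / sample_count     # CloudWatch Average = Sum / SampleCount
    estimates[sample_count] = average / PERIOD  # the "Average / 60" pattern

for sample_count, estimate in estimates.items():
    print(f"SampleCount={sample_count:>2}: {estimate:.2f} req/s (true rate {true_rate:.2f})")
```

Only the SampleCount of 1 reproduces the true rate. At 70 or 80 samples the Average-based estimate is too low by that same factor, and it shifts whenever SampleCount changes.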
We have raised an issue with GitLab to address this concern.