How to Correctly Handle AWS CloudWatch Metrics in Prometheus

The way the CloudWatch Exporter exposes AWS CloudWatch metrics to Prometheus is inconsistent and often misunderstood. CloudWatch metrics behave fundamentally differently from native Prometheus metrics: CloudWatch provides aggregated, gauge-like values (averages or sums over fixed periods), while Prometheus expects continuously increasing counters and instantaneous gauges. This document explains the differences and shows how to query CloudWatch metrics correctly in Prometheus.

TL;DR

Always use the Sum statistic for counter-like metrics (e.g., request counts, invocations, bytes) and divide by the CloudWatch metric period in seconds to get accurate per-second rates. For utilization metrics (e.g., CPUUtilization, MemoryUtilization), you can use the *_over_time() functions or query the metric directly without adjustments.

Counters in Prometheus vs. CloudWatch

To understand the differences, let's look at two examples of how Prometheus counter metrics and CloudWatch metrics behave over time.

Let's consider a Prometheus counter metric that tracks the total number of HTTP requests received by a service, named http_requests_total. This metric continuously increases as more requests are received:

---
config:
  theme: base
---
timeline
  title Prometheus Counter Metric example
  section `http_requests_total` Prometheus Metric. Scrape interval - 15s
    0 : Metric is created at time T0 with value <br> 0.
    15s : 100 requests have been received. Metric value is now <br> 100.
    30s : Another 150 requests have been received. Metric value is now <br> 250.
    45s : Another 200 requests have been received. Metric value is now <br> 450.
    60s : Another 50 requests have been received. Metric value is now <br> 500.

This is an example of a typical Prometheus counter metric, which continuously increases as more requests are received. Over these samples we can calculate statistics like rate of requests per second, total requests over time, average requests over time, etc.:

# requests per second
rate(http_requests_total[1m])

# total requests over last 5 minutes
increase(http_requests_total[5m])

# average of the 1-minute request rate over last 5 minutes
# (avg_over_time() needs a range vector, so rate() is wrapped in a subquery)
avg_over_time(rate(http_requests_total[1m])[5m:])
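
To make the arithmetic behind these functions concrete, here is a minimal Python sketch over the five samples from the timeline above. It is a simplification: the real rate() and increase() also handle counter resets and extrapolate to the window boundaries, which this sketch ignores. The helper names are hypothetical and not part of any Prometheus library.

```python
# (timestamp in seconds, counter value) pairs from the timeline above
samples = [(0, 0), (15, 100), (30, 250), (45, 450), (60, 500)]


def simple_increase(samples):
    """Last value minus first value; ignores resets and extrapolation."""
    return samples[-1][1] - samples[0][1]


def simple_rate(samples):
    """Increase divided by the elapsed time between first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


total = simple_increase(samples)   # 500 requests in the window
per_second = simple_rate(samples)  # 500 / 60, roughly 8.33 requests per second
```

Both results depend only on the difference between counter values, which is why these functions need a series that keeps increasing.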

In contrast, CloudWatch metrics behave differently. CloudWatch metrics are typically aggregated over fixed periods (e.g., 1 minute or 5 minutes) and represent a single statistic for that period, such as an average, max, or sum. These metrics do not continuously increase like Prometheus counters and thus are not directly compatible with Prometheus' expectations.

Below is an example of CloudWatch metric that is converted to a Prometheus metric via the CloudWatch Exporter:

---
config:
  theme: base
---
timeline
  title CloudWatch Metric example.
  section `aws_applicationelb_request_count_sum` CloudWatch to Prometheus Metric. <br> Scrape interval - 15s. <br>CloudWatch Metric period - 60s
    0 : Metric not available at time T0.
    30s : Metric not available at time T0 + 30s.
    60s : At time T0 + 60s, 500 requests have been received. Metric value is now <br> 500.
    90s : Metric not available at time T0 + 90s.
    120s : At time T0 + 120s, another 300 requests have been received. Metric value is now <br> 300.

This example illustrates how CloudWatch metrics are reported. At T0 + 60s, the metric reports a value of 500, representing the total requests received during the previous 60-second period. At T0 + 120s, it reports a value of 300 for the next 60-second period. There is no continuous increase; instead, each value represents the total for that specific period.
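
The difference can be sketched in a few lines of Python using the two values from the timeline. This is only an illustration of the arithmetic, not of how Prometheus evaluates queries; Prometheus itself would interpret the drop from 500 to 300 as a counter reset rather than as a negative change.

```python
# Per-period Sum values exposed for the two CloudWatch periods above
period_sums = [500, 300]

# Treating the series like a counter (last minus first) gives a meaningless result
counter_style_delta = period_sums[-1] - period_sums[0]  # -200

# Each value is already a per-period total, so the totals are simply added
total_requests = sum(period_sums)  # 800 requests over 120 seconds
```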

Converting CloudWatch Metrics to Prometheus Queries

Since CloudWatch metrics are aggregated statistics over fixed periods, we cannot directly apply Prometheus functions like rate() or increase() to derive meaningful statistics. Instead, we need to adjust our queries to account for the nature of CloudWatch metrics. Furthermore, we need to be aware of the difference between the various statistics provided by CloudWatch (Sum, Average, Max, Min, etc.) and how they should be interpreted in Prometheus based on the type of metric.

We distinguish two main types of metrics:

Counter-like metrics: These metrics represent counts of events over time, such as request counts, Lambda invocations, or bytes transferred. For these metrics, we typically want to calculate rates (e.g., requests per second).
Utilization metrics: These metrics represent percentages or ratios, such as CPU utilization or memory utilization. For these metrics, we can directly use the provided statistics without additional adjustments.

Handling Counter-like Metrics (Request Counts, Invocations, Bytes)

To calculate the rate of requests per second from a CloudWatch counter-like metric, we apply the avg_over_time() function to the Sum statistic over the desired time range and then divide by the CloudWatch metric period in seconds.

[!IMPORTANT] We want to convert to a per-second rate as Prometheus typically works with base units only, e.g., requests per second, bytes per second, etc. For more information, see the Prometheus documentation on base units.

For example, assuming a CloudWatch metric period of 60 seconds, to calculate the average number of requests per second over the last 5 minutes using the aws_applicationelb_request_count_sum metric, we would use:

# average requests per second over last 5 minutes
avg_over_time(aws_applicationelb_request_count_sum[5m]) / 60

The 60 in the denominator represents the CloudWatch metric period in seconds. If the CloudWatch metric period were different (e.g., 300 seconds), we would adjust the denominator accordingly:

# average requests per second over last 5 minutes
avg_over_time(aws_applicationelb_request_count_sum[5m]) / 300

Always use the Sum statistic for counter-like metrics such as request counts, invocations, and bytes, and divide by the CloudWatch metric period in seconds to get accurate rates. The period is defined by the Alloy configuration and must be taken into account when performing these calculations.
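
The calculation can be checked with a small Python sketch. The per-period values are hypothetical, and per_second_rate is an illustrative helper that mirrors the query shape avg_over_time(..._sum[window]) / period, not an existing API.

```python
from statistics import mean


def per_second_rate(period_sums, period_seconds):
    """Mirror avg_over_time(<metric>_sum[window]) / period_seconds."""
    return mean(period_sums) / period_seconds


# Hypothetical Sum values for five consecutive 60-second periods (a 5m window)
period_sums = [500, 300, 400, 600, 200]

rate = per_second_rate(period_sums, period_seconds=60)

# Cross-check: total requests divided by total elapsed seconds
expected = sum(period_sums) / (len(period_sums) * 60)
```

Averaging the per-period sums and dividing by the period length gives the same result as dividing the total number of requests by the total elapsed time, as long as the window contains one value per period.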

Never Use `Average` Statistics for Counter-like Metrics

It is important to note that for counter-like metrics, we should always use the Sum statistic from CloudWatch. Using the Average statistic leads to incorrect calculations because it does not represent the average count of events per unit of time.

AWS CloudWatch defines the Average statistic as Sum / SampleCount. This makes it a statistic that is normalized over the number of samples, not necessarily over time. Any matching overlap with an over time statistic is purely coincidental.

Therefore, applying avg_over_time() to the Average statistic of a CloudWatch counter-like metric can lead to misleading results, especially if the SampleCount does not correspond directly to the time period being queried.

Let's illustrate this with an example:

In a period of 5 minutes (300 seconds), if the SampleCount is 120 (indicating 120 samples were taken during that period) and the Sum is 6000, the Average would be:

`Average = 6000 / 120 = 50`

This means that if we query the Average statistic from CloudWatch, it returns 50, that is, 50 per sample. This tells us nothing about the statistic in relation to time; it carries no time dimension. SampleCount differs greatly between AWS services and is not guaranteed to be consistent over time.

Using the example numbers above, the actual rate per second would be:

`Rate per second = Sum / Total time in seconds = 6000 / 300 = 20 requests per second`
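
The same numbers as a short Python sketch, which also shows that dividing the Average by the period does not recover the rate. The variable names are illustrative only.

```python
period_seconds = 300   # CloudWatch period (5 minutes)
sample_count = 120     # SampleCount reported for the period
total = 6000           # Sum reported for the period

average = total / sample_count            # 50.0, a per-sample value
rate_per_second = total / period_seconds  # 20.0 requests per second

# Dividing the Average by the period does not recover the rate
wrong_rate = average / period_seconds
```

The Average-based result is too low by a factor equal to the SampleCount, so it only matches the real rate when SampleCount happens to be 1.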

Handling Utilization Metrics (CPU, Memory)

For utilization metrics (e.g., CPUUtilization, MemoryUtilization), we do not need to adjust for time since these metrics are already expressed as percentages. We can derive statistics directly using avg_over_time():

# MAX CPU utilization averaged over the last 5 minutes
avg_over_time(aws_ecs_cpu_utilization_max[5m])

# AVERAGE CPU utilization averaged over the last 5 minutes
avg_over_time(aws_ecs_cpu_utilization_average[5m])

# MIN CPU utilization averaged over the last 5 minutes
avg_over_time(aws_ecs_cpu_utilization_min[5m])

Appendix A: Using CloudWatch Timestamps

We have configured the CloudWatch Exporter to use CloudWatch timestamps for metrics so that samples align with the CloudWatch metric periods. This avoids misaligned data and double-counting caused by the scrape interval.

At the time of writing, this causes out-of-order samples to be delivered to Prometheus, which are rejected.

If the sample timestamp were the scrape time instead, each CloudWatch metric period would contain duplicate samples, leading to incorrect calculations.
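
The effect of such duplicates can be sketched in Python. The sketch assumes the 15-second scrape interval and 60-second period from the timeline earlier in this document, and it assumes the exporter returns the same value on every scrape within a period; both are simplifications for illustration.

```python
from statistics import mean

period_sums = [500, 300]   # one value per 60-second CloudWatch period
scrapes_per_period = 4     # assumed: 60s period / 15s scrape interval

# With scrape-time timestamps, every period value is stored once per scrape
duplicated = [v for v in period_sums for _ in range(scrapes_per_period)]

true_total = sum(period_sums)       # 800
inflated_total = sum(duplicated)    # 3200, four times too high

# A mean is unaffected only while every period is repeated equally often
same_mean = mean(duplicated) == mean(period_sums)
```

Any calculation that adds samples together overcounts by the number of scrapes per period, and an average stays correct only if each period is duplicated the same number of times, which is not guaranteed in practice.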

Appendix B: Why following the GitLab example is incorrect

While GitLab does not provide many public examples of how they handle CloudWatch metrics in Prometheus (see "GitLab Searches" below), there is one that looked very similar to the ALB metric we are using: ES logging. This example uses the Average statistic for calculating request rates, which led us to initially believe that using the Average statistic was acceptable for counter-like metrics. However, upon further investigation, we realized that this approach is flawed because it does not accurately represent the rate of requests over time.

Let's break down why we believe this is incorrect. Looking at the GitLab example:

# NOTE: took out label_replace as this is not relevant here
avg_over_time(aws_es_2xx_average{type="logging"}[5m]) / 60

While there is no public configuration available showing the CloudWatch Exporter setup, we can reasonably assume that the division by 60 is intended to account for a CloudWatch metric period of 60 seconds. As noted above, the Average statistic is defined as Sum / SampleCount. If we look at this exact same CloudWatch metric from our own AWS OpenSearch Service, we can see that the SampleCount currently is a constant 1 over time. This means that the Average statistic is effectively equal to the Sum statistic in this case, making the GitLab example coincidentally correct for their specific scenario. If the SampleCount were to change, which can happen as it is not guaranteed to be constant by AWS, the results would no longer be accurate.

If we were to apply the GitLab example to our ALB request count metric, we would get incorrect results because the SampleCount for this metric is much larger than 1 (typically between 70 and 80) and varies over time.
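
The following Python sketch compares the two cases. The Sum value of 6000 is hypothetical, the SampleCount of 75 is simply picked from the 70 to 80 range mentioned above, and both helper functions are illustrative names rather than existing APIs.

```python
def average_based_rate(total, sample_count, period_seconds):
    """Rate derived from the Average statistic: (Sum / SampleCount) / period."""
    return (total / sample_count) / period_seconds


def sum_based_rate(total, period_seconds):
    """Rate derived from the Sum statistic: Sum / period."""
    return total / period_seconds


total, period = 6000, 60  # hypothetical Sum for one 60-second period

correct = sum_based_rate(total, period)                   # 100.0 requests per second
with_one_sample = average_based_rate(total, 1, period)    # matches the Sum-based rate
with_75_samples = average_based_rate(total, 75, period)   # 75 times too low
```

With a SampleCount of 1 the Average-based query agrees with the Sum-based one, which is why the GitLab example works for their metric; with a SampleCount of 75 it underestimates the rate by that same factor.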

We have opened an issue with GitLab to raise this concern.