Manage Unified Alerting Rule Groups¶

Warning

In the new Observability Platform we use Grafana Unified Alerting. This is a very different experience from the Legacy Grafana environment you may be used to. It is still new for us as well, and we are actively developing around this framework to create an experience tailored to KNMI's needs, yours and ours. For that reason we currently mark Alerting in the Observability Platform as experimental. You are free to use this feature and we encourage experimenting, because we want to learn from you as well. Please be aware that the Alerting experience is subject to change, including breaking changes.

Creating an Alert rule group and attaching alerts to that group is relatively straightforward and can be done in the Playground folder, much the same as Dashboards. Appending an alert to an existing provisioned Alert Rule group requires a bit more attention, but can still be done from the UI. Removing an Alert from the Alert Rule Group cannot be done from the UI, so we expect users to manually remove the entry from the provisioning JSON. Deleting the whole Alert Rule Group is just a matter of removing the provisioning file.

Required Alert Labels¶

Our platform enforces strict label validation via our Service Catalog to ensure all alerts are routable and actionable. If an alert is missing required labels, or if the labels contain unregistered values, your merge request will fail validation.

When creating an alert in the Grafana UI, you must add the following in the Custom Labels section:

Label	Requirement	Description
team	Required (All)	The exact team name as registered in `teams.yml`.
environment	Required (All)	The target environment (e.g., `prd`, `acc`).
severity	Required (All)	The urgency of the alert (see Incident Severity Levels below).
alert_type	Required (All)	Whether the alert is a `symptom` or a `cause` (see Alert Types below).
service	Required (Service Alerts)	The exact service name as registered in `services.yml`.
application	Required (Service Alerts)	Must exactly match the `application_name` defined for your service in the catalog.
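
To make the table concrete, here is a minimal sketch of the kind of check the validation could perform. It is not the platform's actual validator: the function name `validate_labels`, the catalog contents, and the assumption that severity values are written in lowercase (`s1` to `s4`) are all illustrative.

```python
from __future__ import annotations

# Hypothetical catalog contents; the real values live in teams.yml and services.yml.
REGISTERED_TEAMS = {"example-team"}
REGISTERED_SERVICES = {"example-service": "example-app"}  # service -> application_name

REQUIRED_FOR_ALL = ("team", "environment", "severity", "alert_type")
REQUIRED_FOR_SERVICE = ("service", "application")
SEVERITIES = {"s1", "s2", "s3", "s4"}  # assumption: lowercase label values
ALERT_TYPES = {"symptom", "cause"}


def validate_labels(labels: dict[str, str], service_alert: bool = True) -> list[str]:
    """Return a list of problems; an empty list means the labels pass."""
    required = REQUIRED_FOR_ALL + (REQUIRED_FOR_SERVICE if service_alert else ())
    problems = [f"missing label: {name}" for name in required if not labels.get(name)]

    team = labels.get("team")
    if team and team not in REGISTERED_TEAMS:
        problems.append(f"unregistered team: {team}")

    severity = labels.get("severity")
    if severity and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity}")

    alert_type = labels.get("alert_type")
    if alert_type and alert_type not in ALERT_TYPES:
        problems.append(f"unknown alert_type: {alert_type}")

    service = labels.get("service")
    if service_alert and service:
        expected_app = REGISTERED_SERVICES.get(service)
        if expected_app is None:
            problems.append(f"unregistered service: {service}")
        elif labels.get("application") and labels["application"] != expected_app:
            problems.append(f"application does not match catalog: {labels['application']}")

    return problems
```

Running a check like this locally before opening a merge request gives the same kind of feedback earlier, although the authoritative result is still the validation in the pipeline.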

Incident Severity Levels¶

We follow a standardized incident severity model to ensure pages and notifications trigger the correct level of urgency. You must assign one of the following to your severity label:

S1 (Critical): A massive, widespread, and customer-impacting outage. The core platform or a highly critical path is completely down. Immediate, drop-everything emergency response is required.
S2 (High): Major functionality is severely degraded, or a subset of customers is experiencing significant failure. There is no immediate workaround. Requires immediate attention during business hours and often pages on-call off-hours.
S3 (Medium): Partial degradation or a non-critical feature is failing. A workaround exists, or the impact is restricted to a very small number of users. To be addressed during regular working hours.
S4 (Low): Minor glitches, cosmetic issues, or anomalies that do not directly impact the customer experience. Often used for proactive threshold warnings (e.g., disk space reaching 70%).

Alert Types¶

Aligned with SRE alerting practice, every alert must declare whether it is a symptom or a cause. You must assign one of the following to your alert_type label:

symptom: Something the user experiences or that impacts an SLO (e.g., increased error rate, high latency). Symptom-based alerts describe what is broken from the user's perspective.
cause: An underlying reason that may lead to a symptom (e.g., a full disk, a saturated queue, an unhealthy dependency). Cause-based alerts describe why something might break.

Symptom- and Cause-Based Alerts¶

The distinction comes from Google's SRE practice. The guiding principle is: page on symptoms, diagnose with causes. A symptom answers the question "is something broken for the user right now?", whereas a cause answers "why might it break (or why did it break)?". Keeping them as separate, explicit categories lets our notification policy treat them differently and keeps dashboards honest about what each alert actually tells you. It furthermore allows us to measure our alerting hygiene: a high number of cause-based alerts relative to symptom-based alerts is itself a signal (a symptom) of poor alerting practice.
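
As an illustration of the hygiene idea, the sketch below computes a cause-to-symptom ratio from a list of `alert_type` label values. The exact metric the platform uses is not specified here, so the function name `cause_to_symptom_ratio` and the handling of the "no symptom alerts" case are assumptions.

```python
from collections import Counter


def cause_to_symptom_ratio(alert_types):
    """Ratio of cause alerts to symptom alerts; higher suggests poorer hygiene."""
    counts = Counter(alert_types)
    symptoms = counts["symptom"]
    if symptoms == 0:
        # Assumption: only causes and no symptoms is treated as the worst case.
        return float("inf") if counts["cause"] else 0.0
    return counts["cause"] / symptoms
```

For example, a team with one symptom alert and two cause alerts has a ratio of 2.0, which would be a prompt to look for symptoms that are not yet being measured.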

Tip

A helpful analogy: when you visit a doctor, you report your symptoms (a fever, a headache) — these are what you directly experience. The doctor then works backwards to diagnose the underlying cause (an infection). Symptom alerts are the fever; cause alerts are the infection.

A useful test when deciding:

If it fires, would a user (or an SLO) notice right now? → it's a symptom.
Is it an internal/infrastructure condition that might lead to user pain, but a user may not notice yet? → it's a cause.

When a single condition could be argued either way, ask: "What is the alert measured from?" If it is measured from the edge the user touches (load balancer health, HTTP status, request latency, a missing end product), it is a symptom. If it is measured from an internal resource or component state (CPU, disk, queue depth, process up/down, error logs), it is a cause.

Symptom-Based Alerts¶

Fire on the observable, user-facing effect — availability, latency, error responses, or a missing/incorrect product.

Pros:

High signal, low noise: they only fire when something actually matters, so they are good candidates for paging.
Implementation-agnostic: they keep working even when the underlying architecture changes, because they watch the outcome rather than a specific component.
Directly tied to SLOs: they map cleanly onto user-facing reliability targets.

Cons:

Lagging indicator: by the time a symptom fires, users are often already affected — there is little or no early warning.
Poor at pinpointing the cause: they tell you that something is wrong, not why, so they still require diagnosis.
Can be coarse: a single symptom (e.g. "service unreachable") can have many possible root causes.

Examples from This Repo¶

MFM Frontend unreachable — alerts when the load balancer reports no healthy hosts (user-facing availability).
MFM-BE Status Code 500 / MFM-BE p95 Latency > 1s — user-facing error responses and request latency.
Metrics-Proxy Latency Test — ingestion latency p99 above the SLO threshold.
Health-check alerts for Turbowin, Temis, Gevaarlijk Weer Catalogus, Ballonvaartverwachting — the service is not serving OK responses.

Cause-Based Alerts¶

Fire on an underlying resource, dependency, or component condition that may lead to a symptom.

Pros:

Early warning: they can fire before users are affected, giving you time to act (e.g. disk filling up, load climbing).
Actionable / diagnostic: they point at a specific component, which shortens time-to-resolution.
Great for trends & capacity planning: saturation signals are useful even when nothing is broken yet.

Cons:

Noisy: a saturated resource does not always cause user impact, so cause alerts are prone to false alarms.
Incomplete: you can never enumerate every possible cause, so cause-only alerting will miss novel failure modes.
Maintenance burden: they are tied to implementation details and need updating when the architecture changes.
Rarely page-worthy: they usually should not page on their own (S1/S2). Prefer routing them at a lower severity unless they are a reliable predictor of imminent user impact.

Examples from This Repo¶

apl System load, VIVID CPU Usage, Vivid Load — host resource saturation.
OpenSearch-prd FreeStorageSpace, CPUutillization RDS — disk and database saturation.
Direct Connect bandwidth / ping tests — an unhealthy network dependency.
MFM-BE Logger ERROR/CRITICAL, WAVE warn-adm-log-error — internal error logs (diagnostic signals, not yet a confirmed user-facing failure).

Rule of Thumb¶

Alert (page) on symptoms, enrich and diagnose with causes.

A good setup pages a human on a small number of high-quality symptom alerts (tied to SLOs), and uses cause alerts as lower-severity early-warnings and as supporting context to speed up root-cause analysis. If you find yourself wanting to page on a cause, double-check whether there is a symptom you should be measuring instead.
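
The double-check can be partly automated. The following is a minimal sketch, assuming each rule is a dictionary with a `title` and a `labels` mapping (as in a Grafana rule export) and that S1 and S2 are the paging severities. The helper name `causes_that_page` is hypothetical.

```python
PAGING_SEVERITIES = {"s1", "s2"}  # assumption: these severities page a human


def causes_that_page(rules):
    """Return the titles of rules labelled as a cause with a paging severity."""
    flagged = []
    for rule in rules:
        labels = rule.get("labels", {})
        is_cause = labels.get("alert_type") == "cause"
        pages = labels.get("severity", "").lower() in PAGING_SEVERITIES
        if is_cause and pages:
            flagged.append(rule.get("title", "<untitled>"))
    return flagged
```

A non-empty result is not necessarily an error, since some causes are reliable predictors of imminent user impact, but each flagged rule deserves a second look.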

Creating a New Rule Group¶

1. In Grafana, navigate to Alerting → Alert rules.
2. Click New alert rule.
3. Fill in the Name with the definitive display title of your alert.
4. Define the Query and threshold for the alert.
5. Under Folder, select the Playground folder. The final location is determined during provisioning by where you place the file in this repo.
6. Under Evaluation group, select New evaluation group.
7. Enter the definitive display title of your evaluation group and set the evaluation frequency.
8. Set your Pending period and Keep firing for preferences.
9. Under Configure notifications, enable advanced options. This enables automatic routing based on your alert labels.
   - Note: if your intended contact point does not appear in the alert instance routing preview, go to the Register Team guide to configure your team settings.
10. Scroll down to the Custom Labels section and add all required labels (team, environment, severity, alert_type and, for service alerts, service and application), matching the Service Catalog.
11. Click Save.
12. To add more alerts to this group, repeat steps 1-11, selecting the evaluation group you just created in step 6 instead of creating a new one. Only one evaluation group per provisioning file is supported.
13. When finished, navigate to the Alert rules page, find your new rule group, and click More → Export → With modifications.
14. Scroll to the bottom and select Export.
15. Select the JSON tab and Copy code.
16. Create a new file in this repo (e.g., dashboards/your-team/my-alerts.alerts.json). The file must end with the .alerts.json suffix.
17. Paste the copied JSON into the file.
18. Register your new file in the __folder.jsonnet file of that directory.
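
Before committing, it can help to sanity-check the pasted export against the two constraints mentioned in the steps above: the `.alerts.json` suffix and a single evaluation group per file. The sketch below assumes the export has a top-level `groups` array (consistent with the `groups[0].rules` path used elsewhere on this page). The function name `check_alerts_export` is hypothetical and this is not the repo's own validation.

```python
import json


def check_alerts_export(file_name, text):
    """Return a list of problems found in an exported rule group file."""
    problems = []
    if not file_name.endswith(".alerts.json"):
        problems.append("file name must end with .alerts.json")

    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return problems + [f"not valid JSON: {exc.msg}"]

    groups = doc.get("groups", []) if isinstance(doc, dict) else []
    if len(groups) != 1:
        problems.append(f"expected exactly one evaluation group, found {len(groups)}")
    elif not groups[0].get("rules"):
        problems.append("the evaluation group contains no rules")
    return problems
```

An empty list means the file passes these basic checks. Label validation and registration in `__folder.jsonnet` still need to be verified separately.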

Appending an Alert to an Existing Group¶

1. Follow steps 1-11 above to create your new alert(s) in the Playground folder, ensuring all mandatory Custom Labels are attached.
2. When exporting the rule, navigate to the Set evaluation behavior section on the Export screen.
3. In the dropdown, select the existing, provisioned evaluation group you want to append these alerts to.
4. Export the JSON, copy the code, and overwrite the contents of the original .alerts.json file in this repository.
5. Commit your changes.

Removing an Alert from a Group¶

This cannot be done from the Grafana UI for provisioned alerts.

1. Find the .alerts.json file in this repository that contains the alert.
2. Manually edit the JSON and remove the specific alert entry from the groups[0].rules array.
3. Commit your changes.
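
If you prefer not to edit the JSON by hand, the removal can be scripted. This is a minimal sketch, assuming each entry in `groups[0].rules` carries a `title` field holding the alert's display name. The function name `remove_rule` is hypothetical.

```python
import json


def remove_rule(export_text, title):
    """Return the export JSON with the rule named `title` removed from groups[0].rules."""
    doc = json.loads(export_text)
    rules = doc["groups"][0]["rules"]
    kept = [rule for rule in rules if rule.get("title") != title]
    if len(kept) == len(rules):
        # Fail loudly rather than silently writing an unchanged file.
        raise ValueError(f"no rule titled {title!r} in groups[0].rules")
    doc["groups"][0]["rules"] = kept
    return json.dumps(doc, indent=2) + "\n"
```

Note that re-serialising the file may change its indentation or key spacing compared with the original export, so review the resulting diff before committing.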

Removing a Complete Rule Group¶

1. Find and delete the .alerts.json file that defines the group.
2. Open the __folder.jsonnet file in that same directory and remove the entry for the file you just deleted.
3. Commit your changes.

Auto-Generated SLO Alerts¶

Services registered in the metrics-catalog with a monitoringThresholds.errorRatio automatically get an SLO-breach alert for each of their SLI components.

How it works¶

The same canonical alert definition is emitted in two forms from a single source of truth (lib/service-metrics/slo_alert.libsonnet):

Prometheus alerting-rule YAML (prom-rules/generated/alerts/autogenerated-<service>-alerts.yml) — a concise, human-reviewable artifact generated by mise run recording-rules:generate. Use this to review and diff alert logic in merge requests. It is not deployed.
Grafana Unified Alerting JSON (under dashboards/__generated/) — the form actually deployed to Grafana, generated by mise run dashboards:generate.

The alert fires when the measured error ratio (knmi:sli:component:errors:ratio_5m) continuously exceeds the configured SLO threshold (knmi:slo:service:threshold:errors:ratio) for 5 minutes. One alert instance fires per environment (the environment label comes from the recorded metric series, so a single rule covers all environments without duplication).
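
The "continuously exceeds for 5 minutes" condition can be illustrated with a simplified model. The sketch below is not how Grafana evaluates rules internally. It only mimics a pending period over a list of timestamped samples for a single environment, and the function name `slo_alert_fires` is hypothetical.

```python
from datetime import datetime, timedelta


def slo_alert_fires(samples, threshold, pending=timedelta(minutes=5)):
    """Return True if the error ratio stays above `threshold` for at least `pending`.

    `samples` is a time-ordered list of (timestamp, error_ratio) tuples.
    """
    breach_start = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= pending:
                return True
        else:
            breach_start = None  # any recovery resets the pending period
    return False


# Usage: one sample per minute, all above a 1% threshold.
start = datetime(2024, 1, 1, 12, 0)
breaching = [(start + timedelta(minutes=m), 0.05) for m in range(6)]
print(slo_alert_fires(breaching, threshold=0.01))
```

A single sample at or below the threshold resets the pending period, which is why a brief spike in the error ratio does not fire the alert.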

Configuring severity per SLI component¶

Each SLI component declares its own alert severity in the service's metrics catalog file (e.g. lib/metrics-catalog/services/grafana.jsonnet):

serviceLevelIndicators: {
  aws_load_balancer: {
    description: '...',
    severity: 's3',   // required — one of s1, s2, s3, s4
    requestRate: ...,
    errorRate: ...,
  },
},

The severity field is mandatory. Generation fails if it is omitted, ensuring service owners consciously choose urgency. See Incident Severity Levels for guidance.

Wiring into a service folder¶

After adding a service to the Metrics Catalog, call sloAlertsForFolder.forService in the service's dashboards/services/<service>/__folder.jsonnet:

local sloAlertsForFolder = import 'service-metrics/slo_alerts_for_folder.libsonnet';

// merge with any manual alerts:
local alerts = {
  'my-manual-alerts.alerts.json': import './my-manual-alerts.alerts.json',
} + sloAlertsForFolder.forService('service');

grafanaResources.renderGrafanaResources(dashboards, alerts, folder)