Alertmanager Routing Trees That Don't Lie

By SumGuy 9 min read

Your Alertmanager Config Is Lying to You at 3 AM

It’s 2:47 AM. Your Prometheus has been screaming about disk space for six hours straight. Your phone has 340 notifications from Slack. Your email inbox looks like spam from 2005. And your routing config? It looked so reasonable at 4 PM yesterday.

Here’s the thing: Alertmanager routing is deceptively simple when you first look at it, then brutally complex when you realize it’s your responsibility to make it work. You’ve got Prometheus firing alerts. Cool. But Alertmanager has to decide: who gets told, how many times, in what channel, and when to shut up about this already.

Most home lab operators end up with one of two disasters:

  1. The firehose: Everything goes to Slack. Critical, warning, info, “hey disk is at 82%”: all in one stream, same priority, drowning. By 3 AM, you’re muting notifications because your brain has tapped out.
  2. The silence: You tried to be fancy. Group everything by namespace, route warnings to email, inhibit info-level stuff. Then your phone never buzzes, you miss a real incident at 2:13 AM, and at 3 AM you’re staring at a dead database wondering why nobody told you.

The answer isn’t more config. It’s understanding that Alertmanager is a decision tree, not a mailbox. Every alert gets routed through that tree exactly once. The routing tree decides: do I care? how urgent is it? should I group it with others? should I mute it? And that tree is entirely on you to build correctly.

Let’s build one that doesn’t lie.

How Alerts Actually Flow Through Alertmanager

Before you write a single YAML line, you need to understand the journey:

Prometheus fires an alert → Alert hits Alertmanager’s API → Alertmanager matches it against the routing tree → The tree decides which receiver gets it → Receiver formats the message → message goes to Slack/email/ntfy/Discord/whatever.

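The routing tree is a series of nested rules. Each rule has conditions (matchers, or the older match and match_re, which still work but are deprecated). Alertmanager walks the tree from the top: among sibling routes, the first one that matches wins unless it sets continue: true, and if that route has children of its own, the deepest matching route decides where the alert goes. If no child matches, the alert stays with the parent route’s receiver.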
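As a rough illustration of what a rule’s conditions look like in the newer matchers form, here is a minimal sketch. The receiver names and the web.* instance pattern are placeholders for illustration, not values from any real setup:

```yaml
route:
  receiver: "default"            # catch-all when no child route matches
  routes:
    # All matchers in the list must match for this route to be selected.
    - matchers:
        - severity = "critical"      # exact match
        - instance =~ "web.*"        # regex match (hypothetical instance naming)
      receiver: "critical"
```

The older match: and match_re: maps express the same conditions; matchers just puts exact and regex matches in a single list.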

But here’s the crux: Alertmanager also groups alerts before sending them. If three different “disk full” alerts fire within a group_wait window, they get bundled into one notification. That’s good (you don’t get three separate Slack messages). But if your grouping is wrong, you’ll either group unrelated stuff together (“Database down” bundled with “CPU high”) or keep them separate when you should have grouped them.

Then there’s inhibition. An inhibit rule says: “If alert X is firing, don’t notify about alert Y.” This is the quiet hero of a working routing tree. Without it, a node failure generates 47 child-alert notifications (CPU, memory, disk, HTTP timeout, DB connection refused…). With inhibition, you get one: the node failure. Your brain stays online.

And silences: a silence says “ignore this alert for the next 4 hours.” This isn’t routing; this is you, at 3 AM, telling Alertmanager “stop, I’m already working on it.” Silences are temporary, surgical, and underrated.

Severity Labeling: The Foundation

This is where most configs crumble. You can’t have a good routing tree if your alerts don’t tell you how urgent they are.

In Prometheus, when you write an alert rule, add a severity label:

groups:
  - name: disk_alerts
    rules:
      - alert: DiskFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} disk full"
          description: "Only {{ $value | humanizePercentage }} available"
      - alert: DiskHigh
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} disk high"
          description: "{{ $value | humanizePercentage }} available"
      - alert: DiskWarning
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes < 0.3
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.instance }} disk space low"
          description: "{{ $value | humanizePercentage }} available"

Three levels: critical (drop everything and fix it), warning (tonight before bed), info (daily digest). That’s your signal-to-noise filter.

The Routing Tree: Decisions, Not Destinations

Your routing tree lives in alertmanager.yml under route:. It’s nested. The root route is the catch-all. Then you have child routes that match on labels.

Here’s a working skeleton:

route:
  receiver: "default"
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 1h
  routes:
    # Critical alerts: immediate, per-alert
    - match:
        severity: critical
      receiver: "critical"
      group_by: ["alertname", "instance"]
      group_wait: 0s
      group_interval: 5m
      repeat_interval: 15m
      continue: true
    # Warnings: batched, once per hour
    - match:
        severity: warning
      receiver: "warnings"
      group_by: ["alertname"]
      group_wait: 30s
      group_interval: 1h
      repeat_interval: 4h
    # Info: daily digest
    - match:
        severity: info
      receiver: "info_digest"
      group_by: ["alertname"]
      group_wait: 5m
      group_interval: 24h
      repeat_interval: 24h

Let’s unpack the knobs:

group_by: Which labels do we use to group alerts together? If you set group_by: ["alertname"], then all “DiskFull” alerts (regardless of which instance triggered them) get bundled. That’s coarse and probably wrong. Use ["alertname", "instance"] to keep per-instance alerts separate. For warnings, ["alertname"] is fine because you’re batching them anyway.

group_wait: How long do we wait before sending the first notification for a new group? 0 seconds for critical (you want to know NOW), 30 seconds for warnings (let’s batch a few), 5 minutes for info (digest stuff can wait). The root route’s 10 seconds covers anything that falls through. If a warning fires at 2:47 AM and group_wait is 30s, Alertmanager won’t send until 2:47:30 AM, waiting to see if other related alerts arrive.

group_interval: Once we’ve sent the first batch, how long do we wait before notifying about new alerts that join the same group? 5 minutes for critical (stay on top of it), 1 hour for warnings (you’re not refreshing that email every 5 minutes), 24 hours for info (daily digest).

repeat_interval: How often do we re-send the same alert if it’s still firing? 15 minutes for critical (your phone is buzzing regularly but not every 30 seconds), 4 hours for warnings (once before bed, once in the morning), 24 hours for info (set it and forget it).

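continue: true: By default, once a route matches, Alertmanager stops looking at the routes that follow it. The alert goes to the critical receiver and nothing else. With continue: true, Alertmanager also evaluates the sibling routes that come after the matching one, so a single alert can reach several receivers. In the skeleton above it changes nothing, because the warning and info routes can never match a critical alert, but it matters as soon as your routes overlap. Dangerous if you don’t know what you’re doing, powerful if you do.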
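To make continue: true concrete, here is a sketch of one place it earns its keep: a route that copies every alert to a log before normal severity routing takes over. The audit_log receiver is hypothetical and would have to be defined under receivers:

```yaml
route:
  receiver: "default"
  routes:
    # Matches every alert (alertname is never empty), sends a copy to the
    # hypothetical "audit_log" receiver, then keeps evaluating the siblings below.
    - matchers:
        - alertname =~ ".+"
      receiver: "audit_log"
      continue: true
    # Without continue on the route above, this one would never be reached.
    - matchers:
        - severity = "critical"
      receiver: "critical"
```

Order matters here: the copy-everything route has to come first, because a route without continue ends the search among its siblings.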

Inhibition: The Unsung Hero

Here’s where you cut the noise:

inhibit_rules:
  # If a node is down, don't bother me about high CPU on that node
  - source_match:
      severity: critical
      alertname: NodeDown
    target_match_re:
      severity: "warning|info"
    equal: ["instance"]
  # If a container orchestrator is down, don't alert on pod CPU/memory.
  # The API-server alert has no "pod" label, so compare on a label both alerts
  # share ("cluster" is an assumption here; use whatever yours have in common).
  - source_match:
      alertname: KubernetesAPIDown
    target_match:
      alertname: PodHighCPU
    equal: ["cluster"]
  # If Prometheus itself is down, suppress the "Prometheus scrape failed" alert
  - source_match:
      alertname: PrometheusDown
    target_match:
      alertname: PrometheusScrapeFailure

An inhibit rule says: “If alert X (the source) is firing, suppress alert Y (the target).”

The equal field is key: it says “suppress Y if it has the same value for these labels as X.” So “NodeDown on instance=web1” suppresses “HighCPU on instance=web1” but not “HighCPU on instance=web2”.

This cuts the cascade. Your database goes down; Alertmanager doesn’t spam you with 30 “connection timeout” and “query latency high” child alerts. You get the root cause. That’s a routing tree that works at 3 AM.
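The three disk alerts from the severity section overlap: a filesystem below 10% free is also below 20% and 30%, so all three fire at once. Below is a sketch of inhibit rules that keep only the most severe one, written with the newer source_matchers and target_matchers keys. It assumes the alerts carry node_exporter’s mountpoint label, which they should since the expressions are built from node_filesystem metrics, but check your own labels first:

```yaml
inhibit_rules:
  # DiskFull mutes the lower-severity disk alerts for the same filesystem.
  - source_matchers:
      - alertname = "DiskFull"
    target_matchers:
      - alertname =~ "DiskHigh|DiskWarning"
    equal: ["instance", "mountpoint"]
  # DiskHigh mutes DiskWarning for the same filesystem.
  - source_matchers:
      - alertname = "DiskHigh"
    target_matchers:
      - alertname = "DiskWarning"
    equal: ["instance", "mountpoint"]
```

Comparing on both instance and mountpoint keeps a full /var on web1 from hiding a warning about / on the same host.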

Multi-Receiver Routing: The Real-World Setup

You’ve got ntfy running on your home lab. You’ve got email. Maybe a Gotify or Pushover account. Here’s how you route different severities to different places:

receivers:
  - name: "critical"
    pushover_configs:
      - user_key: "YOUR_PUSHOVER_USER_KEY"
        token: "YOUR_PUSHOVER_TOKEN"
        priority: "2" # emergency
        retry: 1m
        expire: 1h
    email_configs:
      - to: "YOUR_EMAIL_ADDRESS"
        from: "YOUR_EMAIL_ADDRESS"
        smarthost: "smtp.gmail.com:587"
        auth_username: "YOUR_EMAIL_ADDRESS"
        auth_password: "YOUR_APP_PASSWORD"
        require_tls: true
        headers:
          Subject: "[CRITICAL] {{ .GroupLabels.alertname }}"
  - name: "warnings"
    email_configs:
      - to: "YOUR_EMAIL_ADDRESS"
        from: "YOUR_EMAIL_ADDRESS"
        smarthost: "smtp.gmail.com:587"
        auth_username: "YOUR_EMAIL_ADDRESS"
        auth_password: "YOUR_APP_PASSWORD"
        require_tls: true
        headers:
          Subject: "[WARNING] {{ .CommonLabels.alertname }}"
  - name: "info_digest"
    email_configs:
      - to: "YOUR_EMAIL_ADDRESS"
        from: "YOUR_EMAIL_ADDRESS"
        smarthost: "smtp.gmail.com:587"
        auth_username: "YOUR_EMAIL_ADDRESS"
        auth_password: "YOUR_APP_PASSWORD"
        require_tls: true
        headers:
          Subject: "[INFO] Daily Digest"
  - name: "default"
    webhook_configs:
      - url: "http://localhost:8080/gotify/hook"
        send_resolved: true

Critical? Pushover (buzzes your phone) and email. Warnings? Email only. Info? Once-a-day email. Default? Gotify (if nothing else matches).
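If repeating the same SMTP settings in every receiver bothers you, Alertmanager’s global block can hold them once. A minimal sketch, assuming the same Gmail smarthost as above; the YOUR_* values are placeholders:

```yaml
global:
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "YOUR_EMAIL_ADDRESS"
  smtp_auth_username: "YOUR_EMAIL_ADDRESS"
  smtp_auth_password: "YOUR_APP_PASSWORD"
  smtp_require_tls: true

receivers:
  - name: "warnings"
    email_configs:
      # Only the per-receiver bits remain; SMTP details come from global.
      - to: "YOUR_EMAIL_ADDRESS"
        headers:
          Subject: "[WARNING] {{ .CommonLabels.alertname }}"
```

Per-receiver fields still override the global values, so a single receiver can use a different mail server if it needs one.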

Testing Your Tree (Before 3 AM)

Don’t find bugs at 2 AM. Use amtool:

# Check your config for syntax errors
amtool check-config alertmanager.yml
# See which receiver an alert would hit
amtool config routes test --alertmanager.url=http://localhost:9093 \
  alertname=DiskFull severity=critical instance=web1
# Simulate an alert firing (the v1 API was removed in Alertmanager 0.27; use v2)
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "DiskFull",
        "severity": "critical",
        "instance": "web1"
      },
      "annotations": {
        "summary": "Test alert"
      }
    }
  ]'
# List the alerts Alertmanager currently knows about
amtool alert --alertmanager.url=http://localhost:9093 query

If your tree routes a critical alert to “info_digest” when it should go to “critical”, you’ll know before it’s 3 AM and your phone is silent.

A Routing Tree That Ages Well

The tree I showed earlier is a starting point. But here’s what makes it robust:

  1. Severity is not optional. Every alert rule has a severity label. Every routing rule checks severity. No surprises.
  2. Inhibition is aggressive. If you can suppress 10 child alerts by catching the root cause, do it.
  3. Group by instance for critical, by name for warnings. Critical issues are per-host. Warnings are patterns.
  4. group_wait and repeat_interval match your sleep schedule. Critical re-notifies every 15 min (you’re awake, working on it). Warnings re-notify every 4 hours (once before bed, once in the morning). Info? Daily.
  5. Test before deploying. amtool check-config alertmanager.yml and amtool config routes test take 20 seconds.

Your routing tree is a contract with yourself. It says: “If I see X, I will act within Y minutes.” Build it to match your actual alerting discipline, not the one you wish you had. Because at 3 AM, when the disk is at 99% and your phone is buzzing, your tree either works or it doesn’t.

Make it work.

