What is Alertmanager?
Alertmanager is the alerting component of the Prometheus ecosystem. It is responsible for handling alerts generated by Prometheus servers and sending notifications to external systems.
It provides:
- Routing: decide where alerts go (Slack, email, PagerDuty, etc.)
- Grouping: combine related alerts into a single notification
- Silencing: temporarily mute alerts during maintenance
- Deduplication: avoid spamming users with repeated alerts
Prometheus detects the problem; Alertmanager tells humans (or systems) about it.
Why Do We Need Alertmanager?
Without Alertmanager:
- Prometheus can trigger alerts, but it doesn't know how to notify people.
- Each alert would generate raw, unorganized messages.
Challenges Alertmanager solves:
- Too many alerts → group and deduplicate.
- Wrong people notified → route to the right team.
- Alert fatigue → silence during maintenance.
It's the traffic controller for alerts.
How Alertmanager Works
- Prometheus evaluates alert rules (.rules or .yml files).
- If a rule fires, Prometheus sends an alert to Alertmanager via HTTP.
- Alertmanager:
  - Groups related alerts.
  - Applies routing rules (e.g., critical → PagerDuty, warnings → Slack).
  - Sends notifications.
- Users acknowledge alerts in the receiving tool (e.g., PagerDuty), silence them in Alertmanager if needed, or take action.
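To make the "sends an alert via HTTP" step concrete, here is a minimal sketch of the JSON body that a client posts to Alertmanager's v2 API (`POST /api/v2/alerts`). The helper name `build_alert_payload` and the label values are illustrative assumptions; Prometheus builds and sends this payload itself, so this is only useful for understanding or for testing a route by hand.

```python
import json
from datetime import datetime, timezone


def build_alert_payload(alertname, labels, annotations, starts_at):
    """Return the JSON body for POST /api/v2/alerts (a list of alert objects).

    Hypothetical helper for illustration; Prometheus does this internally.
    """
    alert = {
        # The alert name travels as an ordinary label called "alertname".
        "labels": {"alertname": alertname, **labels},
        "annotations": annotations,
        # Timestamps are RFC 3339 strings.
        "startsAt": starts_at.isoformat(),
    }
    return json.dumps([alert])


body = build_alert_payload(
    "HighCPUUsage",
    {"severity": "critical", "instance": "node-1:9100"},  # example labels
    {"summary": "High CPU usage on node-1:9100"},
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(body)
```

Sending `body` with a `Content-Type: application/json` header to an Alertmanager instance (for example one assumed to listen on `http://localhost:9093`) should make the alert appear in the UI and flow through grouping and routing like any other.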
Architecture Overview
+-------------------+
| Prometheus Server | --> Fires alerts
+---------+---------+
          |
          v
+---------+---------+
|   Alertmanager    |
|  - Grouping       |
|  - Routing        |
|  - Silencing      |
|  - Deduplication  |
+---------+---------+
   |    |     |      |
   v    v     v      v
 Email Slack PagerDuty Webhook
Alert Flow: From Prometheus → Alertmanager → User
sequenceDiagram
participant Prom as Prometheus
participant AM as Alertmanager
participant User as User (SRE/DevOps)
Prom->>AM: Send alert (HTTP POST)
AM->>AM: Group, Deduplicate, Silence
AM->>User: Send notification (Slack/Email/PagerDuty)
User->>AM: Silence alert (optional)
Example Alert Rule in Prometheus
groups:
  - name: node.rules
    rules:
      - alert: HighCPUUsage
        # Non-idle CPU fraction per instance, averaged across all cores.
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage > 90% for more than 2 minutes."
When this expression stays true for 2 minutes, the alert fires and Prometheus sends it to Alertmanager.
Alertmanager Configuration
Alertmanager is configured using a YAML file (alertmanager.yml).
Example Config
global:
  resolve_timeout: 5m
  # For Slack delivery, slack_api_url must also be set here (or api_url on the receiver).
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        text: "🔥 Alert: {{ .CommonAnnotations.summary }}"
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']
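Explanation of Key Fields
- global: default settings (timeouts, SMTP server, Slack API URL).
- route: defines alert routing rules.
  - group_by: the labels used to group alerts into one notification.
  - group_wait: how long to wait before sending the first notification for a new group, so related alerts can be batched.
  - group_interval: how long to wait before notifying about new alerts added to a group that has already been notified.
  - repeat_interval: how long to wait before re-sending a notification for alerts that are still firing.
- receivers: list of destinations (Slack, email, PagerDuty).
- inhibit_rules: suppress lower-priority alerts if a higher one is firing.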
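The effect of group_by can be illustrated with a small sketch: alerts that share the same values for the listed labels land in the same group and produce one notification. The function `group_alerts` and the sample label values are assumptions for illustration, not Alertmanager's actual implementation.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCPUUsage", "cluster": "prod", "instance": "node-1"},
    {"alertname": "HighCPUUsage", "cluster": "prod", "instance": "node-2"},
    {"alertname": "HighCPUUsage", "cluster": "staging", "instance": "node-9"},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

With group_by set to alertname and cluster, the two prod alerts share one notification and the staging alert gets its own. Adding instance to group_by would split every alert into a separate group, which usually defeats the purpose.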
Notification Integrations
Alertmanager supports many integrations out of the box:
- Email
- Slack, Microsoft Teams, Discord
- PagerDuty, OpsGenie, VictorOps
- Webhook receivers: integrate with any custom system
Features of Alertmanager
Grouping
- Combine related alerts into a single notification.
- Example: instead of 100 pod alerts, one grouped "PodCrashLoopBackOff" alert.
Routing
- Send alerts to different teams.
- Example: Database alerts → DBA team, Node alerts → Infra team.
Deduplication
- If the same alert keeps firing, notify once and re-notify only after repeat_interval, until it resolves.
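A rough sketch of the deduplication idea, assuming an alert is identified by its full label set and that a repeat is allowed once repeat_interval has elapsed. The `Notifier` class is hypothetical; the real Alertmanager tracks this per group in its notification log.

```python
from datetime import datetime, timedelta


class Notifier:
    """Toy deduplicator: at most one notification per alert per repeat_interval."""

    def __init__(self, repeat_interval):
        self.repeat_interval = repeat_interval
        self.last_sent = {}  # alert fingerprint -> time of last notification

    def should_notify(self, labels, now):
        # Identical label sets are treated as the same alert.
        fingerprint = frozenset(labels.items())
        last = self.last_sent.get(fingerprint)
        if last is not None and now - last < self.repeat_interval:
            return False  # already notified recently: suppress the duplicate
        self.last_sent[fingerprint] = now
        return True


t0 = datetime(2024, 1, 1, 12, 0)
notifier = Notifier(repeat_interval=timedelta(hours=3))
alert = {"alertname": "HighCPUUsage", "instance": "node-1"}

first = notifier.should_notify(alert, t0)
second = notifier.should_notify(alert, t0 + timedelta(minutes=5))
third = notifier.should_notify(alert, t0 + timedelta(hours=3))
print(first, second, third)  # True False True
```

The second call is suppressed because it arrives five minutes after the first; the third goes out because the three-hour interval has passed.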
Silences
- Mute alerts temporarily (e.g., during maintenance).
- Configured via API/UI/CLI.
Inhibition
- Suppress less severe alerts when a higher severity alert is active.
- Example: Hide "disk usage warning" if "disk full critical" is active.
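The inhibition logic can be sketched as follows, assuming a rule with a source match, a target match, and a list of labels that must be equal on both alerts. The function `is_inhibited` and the sample alerts are illustrative assumptions rather than Alertmanager's own code.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by some firing source alert."""
    # The rule only applies to alerts that match target_match.
    if any(target.get(k) != v for k, v in target_match.items()):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        matches_source = all(source.get(k) == v for k, v in source_match.items())
        # Labels listed in `equal` must carry the same value on both alerts.
        same_scope = all(source.get(name) == target.get(name) for name in equal)
        if matches_source and same_scope:
            return True
    return False


critical = {"alertname": "DiskUsage", "cluster": "prod", "severity": "critical"}
warning_prod = {"alertname": "DiskUsage", "cluster": "prod", "severity": "warning"}
warning_stg = {"alertname": "DiskUsage", "cluster": "staging", "severity": "warning"}
firing = [critical, warning_prod, warning_stg]

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "cluster"],
)
print(is_inhibited(warning_prod, firing, **rule))  # True
print(is_inhibited(warning_stg, firing, **rule))   # False
```

The prod warning is hidden because a critical alert with the same alertname and cluster is firing; the staging warning still notifies because its cluster label differs.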
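Alertmanager UI
Alertmanager provides a simple web UI (default port :9093) where you can:
- View and filter active alerts
- Create and expire silences
- Check which receiver an alert was routed to (useful when debugging routing)
- Inspect the loaded configuration and cluster status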
Security Best Practices
- Don't expose Alertmanager directly to the internet.
- Put it behind a reverse proxy (Nginx/Traefik).
- Use authentication if exposed.
- Secure communication between Prometheus and Alertmanager with TLS.
Key Strengths of Alertmanager
- Deep Prometheus integration: native in the ecosystem.
- Powerful routing: fine-grained alert delivery.
- Extensible: webhooks for custom workflows.
- Silences & inhibition: reduce noise & alert fatigue.
- Open-source & widely adopted: large community.
Limitations & Watch Outs
- Limited UI (mostly config-driven).
- No built-in escalation policies (PagerDuty is better for escalation chains).
- Single binary: HA requires running multiple instances that cluster via a gossip protocol, with Prometheus sending alerts to every instance.
- Alert storming is still possible if rules aren't well-tuned.
Alertmanager in the Observability Stack
flowchart TD
subgraph Metrics
P[Prometheus]
end
subgraph Alerting
A[Alertmanager]
end
subgraph Notifications
E[Email]
S[Slack]
PD[PagerDuty]
W[Webhook]
end
P --> A
A --> E
A --> S
A --> PD
A --> W
Prometheus detects, Alertmanager notifies.
Alertmanager Cheat Sheet
Core Concepts
| Term | Meaning |
|---|---|
| Alert | A firing condition defined by a Prometheus alerting rule |
| Receiver | Where alerts are sent (Slack, email, etc.) |
| Route | Rules that decide which receiver gets the alert |
| Silence | Temporary mute for matching alerts |
| Inhibition | Suppression of lower-severity alerts while higher-severity ones fire |
| Grouping | Bundling multiple alerts into one notification |
Example Silence Command
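A silence can also be created through the HTTP API (`POST /api/v2/silences`). The sketch below only builds the request body; the helper name `build_silence`, the creator, the comment, and the two-hour duration are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, duration, created_by, comment, now=None):
    """Build the JSON-serializable body for POST /api/v2/silences."""
    now = now or datetime.now(timezone.utc)
    return {
        # Each matcher selects alerts whose label `name` equals `value`.
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    {"alertname": "HighCPUUsage", "instance": "node-1:9100"},  # example matchers
    timedelta(hours=2),
    created_by="sre-oncall",          # hypothetical user
    comment="Planned maintenance",    # hypothetical reason
    now=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

Posting this body as JSON to an Alertmanager instance (for example one assumed to run at `http://localhost:9093`) mutes matching alerts until endsAt; the silence can be expired earlier from the UI.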
Example Routing Rule
route:
  receiver: 'team-A'
  routes:
    - match:
        team: 'database'
      receiver: 'dba-team'
    - match:
        team: 'infra'
      receiver: 'infra-team'
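How this routing tree resolves a receiver can be sketched in a few lines: child routes are checked in order, the first one whose match labels all agree wins, and anything unmatched falls back to the parent's receiver. The function `pick_receiver` is a simplified assumption; it ignores options such as continue and regex matchers.

```python
def pick_receiver(labels, route):
    """Return the receiver name for an alert's labels, walking the route tree."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(k) == v for k, v in match.items()):
            # A child without its own receiver inherits the parent's.
            child = {"receiver": route["receiver"], **child}
            return pick_receiver(labels, child)
    return route["receiver"]


# Mirrors the routing rule shown above.
route = {
    "receiver": "team-A",
    "routes": [
        {"match": {"team": "database"}, "receiver": "dba-team"},
        {"match": {"team": "infra"}, "receiver": "infra-team"},
    ],
}

print(pick_receiver({"alertname": "SlowQueries", "team": "database"}, route))
print(pick_receiver({"alertname": "NodeDown", "team": "infra"}, route))
print(pick_receiver({"alertname": "HighLatency", "team": "web"}, route))
```

The third alert carries a team label that no child route matches, so it lands on the default receiver team-A. Keeping a catch-all receiver at the root is what prevents unlabelled alerts from being dropped silently.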
Final Takeaway
Alertmanager:
- Is the alert distribution hub for Prometheus.
- Provides routing, grouping, silencing, and inhibition.
- Supports many integrations (Slack, PagerDuty, email).
- Is essential for production-grade monitoring.
Think of Prometheus as the doctor detecting the illness, and Alertmanager as the nurse paging the right specialist.