🚨 What is Alertmanager?

Alertmanager is the alerting component of the Prometheus ecosystem. It is responsible for handling alerts generated by Prometheus servers and sending notifications to external systems.

It provides:

Routing → decide where alerts go (Slack, Email, PagerDuty, etc.)
Grouping → combine related alerts into a single notification
Silencing → temporarily mute alerts during maintenance
Deduplication → avoid spamming users with repeated alerts

👉 Prometheus detects the problem, but Alertmanager tells humans (or systems) about it.

🧐 Why Do We Need Alertmanager?

Without Alertmanager:

Prometheus can trigger alerts, but it doesn’t know how to notify people.
Each alert would generate raw, unorganized messages.

Challenges Alertmanager solves:

Too many alerts → group and deduplicate.
Wrong people notified → route to the right team.
Alert fatigue → silence during maintenance.

👉 It’s the traffic controller for alerts.

🔧 How Alertmanager Works

Prometheus evaluates alert rules (.rules or .yml files).
If a rule fires, Prometheus sends an alert to Alertmanager via HTTP.
Alertmanager:
- Groups related alerts.
- Applies routing rules (e.g., critical → PagerDuty, warnings → Slack).
- Sends notifications.
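Users respond to the notification: they investigate, take action, or create a silence if needed. Alertmanager itself has no "acknowledged" state; acknowledgement happens in the receiving tool (for example, PagerDuty).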
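To make the Prometheus → Alertmanager hand-off more concrete, here is a minimal Python sketch of the kind of HTTP request involved. It assumes Alertmanager's v2 API endpoint `/api/v2/alerts` and a hypothetical address `http://localhost:9093`; the helper name `build_alert_request` is made up for illustration, and the request is only constructed, never sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_request(base_url, labels, annotations, starts_at):
    """Build (but do not send) a POST request carrying one alert.

    base_url is an assumed Alertmanager address. The body is a JSON
    array because the endpoint accepts a batch of alerts.
    """
    payload = [{
        "labels": labels,            # identify the alert and drive routing
        "annotations": annotations,  # human-readable context
        "startsAt": starts_at.isoformat(),
    }]
    return urllib.request.Request(
        url=f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request(
    "http://localhost:9093",  # hypothetical address
    labels={"alertname": "HighCPUUsage", "severity": "critical", "instance": "node-1"},
    annotations={"summary": "High CPU usage on node-1"},
    starts_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In practice Prometheus builds and sends these requests itself; the sketch is only meant to show that an alert is essentially a set of labels plus annotations and timestamps.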

🔗 Architecture Overview

+-------------------+
| Prometheus Server |  -->  Fires alerts
+---------+---------+
          |
          v
+---------+---------+
| Alertmanager      |
| - Grouping        |
| - Routing         |
| - Silencing       |
| - Deduplication   |
+---------+---------+
   |   |   |   |
   v   v   v   v
 Email Slack PagerDuty Webhook

🔄 Alert Flow: From Prometheus → Alertmanager → User

sequenceDiagram
    participant Prom as Prometheus
    participant AM as Alertmanager
    participant User as User (SRE/DevOps)

    Prom->>AM: Send alert (HTTP POST)
    AM->>AM: Group, Deduplicate, Silence
    AM->>User: Send notification (Slack/Email/PagerDuty)
    User->>AM: Create silence (optional)

📜 Example Alert Rule in Prometheus

groups:
  - name: node.rules
    rules:
      - alert: HighCPUUsage
        # Busy (non-idle) CPU fraction per instance, averaged across cores.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage > 90% for more than 2 minutes."

👉 Once this condition has held for 2 minutes (the for: clause), the alert fires and Prometheus sends it to Alertmanager.

⚙️ Alertmanager Configuration

Alertmanager is configured using a YAML file (alertmanager.yml).

Example Config

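global:
  resolve_timeout: 5m
  # Slack receivers also need a webhook URL: set slack_api_url here,
  # or api_url on the individual slack_configs entry.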

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        text: "🔥 Alert: {{ .CommonAnnotations.summary }}"

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']

🔎 Explanation of Key Fields

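global → default settings shared by all receivers (resolve timeout, SMTP server, Slack API URL).
route → the root of the routing tree; its receiver is the default for alerts that match no child route.
group_by → labels used for grouping; alerts with the same values for these labels share one notification.
group_wait → how long to wait before sending the first notification for a new group, so related alerts can be batched.
group_interval → how long to wait before notifying about changes (new or resolved alerts) in a group that has already been notified.
repeat_interval → how long to wait before re-sending a notification for a group that is still firing and unchanged.
receivers → named destinations (Slack, email, PagerDuty, etc.).
inhibit_rules → suppress lower-priority alerts while a matching higher-priority alert is firing. Newer Alertmanager releases prefer source_matchers and target_matchers over source_match and target_match.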
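The three timing fields are easy to confuse, so here is a rough Python sketch of how they interact for a single alert group. It is a simplified model, not Alertmanager's actual code: it assumes the group is flushed once after group_wait and then every group_interval, and that a flush notifies only if the group changed since the previous flush or repeat_interval has passed since the last notification. The function name and the use of plain seconds are illustrative choices.

```python
def notification_times(group_wait, group_interval, repeat_interval, change_times, until):
    """Return the times (seconds after the first alert) at which a group notifies.

    change_times lists the moments when alerts join or leave the group.
    """
    sent = [group_wait]  # the first notification goes out after group_wait
    last_flush = group_wait
    flush = group_wait + group_interval
    while flush <= until:
        changed = any(last_flush < t <= flush for t in change_times)
        overdue = flush - sent[-1] >= repeat_interval
        if changed or overdue:
            sent.append(flush)
        last_flush = flush
        flush += group_interval
    return sent


# Values from the example config: 30s, 5m, 3h. A new alert joins the group at 400s.
times = notification_times(30, 300, 3 * 3600, change_times=[400], until=4 * 3600)
print(times)
```

Under this model the group notifies after the initial 30 seconds, again at the first flush after the new alert joins, and then stays quiet until the 3-hour repeat_interval has elapsed.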

🔔 Notification Integrations

Alertmanager supports many integrations out of the box:

📧 Email
💬 Slack, Microsoft Teams, Discord
📱 PagerDuty, OpsGenie, VictorOps
☁️ Webhook receivers → integrate with any custom system or build your own receiver

🛠️ Features of Alertmanager

✅ Grouping

Combine alerts into a single message.
Example: instead of 100 separate pod alerts, one grouped “PodCrashLoopBackOff” notification.
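As a rough illustration of grouping, the Python sketch below buckets alerts by the labels listed in group_by. The function name and the sample alerts are hypothetical; the real grouping happens inside Alertmanager.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Missing labels count as empty strings so every alert gets a key.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# 100 pod alerts plus one unrelated alert (made-up sample data).
alerts = [
    {"alertname": "PodCrashLoopBackOff", "cluster": "prod", "pod": f"api-{i}"}
    for i in range(100)
] + [{"alertname": "HighCPUUsage", "cluster": "prod", "instance": "node-1"}]

groups = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

With group_by set to alertname and cluster, the 100 pod alerts collapse into a single group, so they would arrive as one notification instead of 100.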

✅ Routing

Send alerts to different teams.
Example: Database alerts → DBA team, Node alerts → Infra team.

✅ Deduplication

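If the same alert keeps firing, Alertmanager does not notify on every evaluation. It sends one notification and repeats it only when the group changes or resolves, or after repeat_interval has passed.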
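One way to picture this: Alertmanager treats alerts with the same label set as the same alert, so a re-sent copy updates the existing entry rather than creating a new one. A minimal Python sketch of that idea follows; the function name and the `seen_at` field are hypothetical and not part of Alertmanager.

```python
def deduplicate(alerts):
    """Keep one entry per unique label set; a later copy replaces an earlier one."""
    unique = {}
    for alert in alerts:
        identity = frozenset(alert["labels"].items())  # the label set is the identity
        unique[identity] = alert
    return list(unique.values())


incoming = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "node-1"}, "seen_at": 1},
    {"labels": {"alertname": "HighCPUUsage", "instance": "node-1"}, "seen_at": 2},
    {"labels": {"alertname": "HighCPUUsage", "instance": "node-2"}, "seen_at": 2},
]
current = deduplicate(incoming)
print(len(current))
```

The two copies for node-1 count as one alert, while node-2 stays separate because its label set differs.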

✅ Silences

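Mute alerts temporarily (e.g., during maintenance).
Created via the web UI, the HTTP API, or the amtool CLI. A silence matches alerts by label and expires automatically at its end time.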
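Conceptually, a silence is a set of label matchers plus a time window. The Python sketch below models that check under simplifying assumptions: matchers are (name, value, is_regex) tuples, regex matchers must match the whole label value, and negative matchers are left out. All names here are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone


def matches(alert_labels, matchers):
    """True if every matcher accepts the alert's labels."""
    for name, value, is_regex in matchers:
        actual = alert_labels.get(name, "")
        if is_regex:
            if re.fullmatch(value, actual) is None:  # regex must cover the whole value
                return False
        elif actual != value:
            return False
    return True


def is_silenced(alert_labels, silence, now):
    """A silence applies only inside its time window and when all matchers match."""
    active = silence["starts_at"] <= now < silence["ends_at"]
    return active and matches(alert_labels, silence["matchers"])


start = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
silence = {
    "matchers": [("alertname", "HighCPUUsage", False), ("instance", "node-.*", True)],
    "starts_at": start,
    "ends_at": start + timedelta(hours=2),  # a 2-hour maintenance window
}
alert = {"alertname": "HighCPUUsage", "instance": "node-1", "severity": "critical"}

print(is_silenced(alert, silence, start + timedelta(minutes=30)))
print(is_silenced(alert, silence, start + timedelta(hours=3)))
```

The alert is muted 30 minutes into the window, but not 3 hours after the start, because the silence has expired by then.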

✅ Inhibition

Suppress less severe alerts when a higher severity alert is active.
Example: Hide “disk usage warning” if “disk full critical” is active.
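To show how an inhibition rule is evaluated, here is a small Python sketch based on the rule from the example config (source severity critical, target severity warning, equal on alertname and cluster). It is an approximation that uses exact label matching only, and the function names are hypothetical.

```python
def has_labels(alert, wanted):
    """True if the alert carries every label/value pair in wanted."""
    return all(alert.get(name, "") == value for name, value in wanted.items())


def is_inhibited(target, firing, rule):
    """True if some other firing alert suppresses target under rule."""
    if not has_labels(target, rule["target"]):
        return False
    for source in firing:
        if source is target or not has_labels(source, rule["source"]):
            continue
        # The labels listed under equal must carry the same value on both alerts.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


rule = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["alertname", "cluster"],
}
critical = {"alertname": "DiskUsage", "cluster": "prod", "severity": "critical"}
warning_prod = {"alertname": "DiskUsage", "cluster": "prod", "severity": "warning"}
warning_dev = {"alertname": "DiskUsage", "cluster": "dev", "severity": "warning"}
firing = [critical, warning_prod, warning_dev]

for alert in firing:
    print(alert["cluster"], alert["severity"], is_inhibited(alert, firing, rule))
```

Only the prod warning is suppressed. Note that because equal lists alertname, the critical and warning alerts must share the same alert name for the rule to apply; two differently named alerts would need a rule whose equal list omits alertname.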

🖥️ Alertmanager UI

Alertmanager provides a simple web UI (default port :9093) where you can:

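View active alerts
Add, edit, and expire silences
Filter alerts by label or receiver to check routing
Inspect status (loaded configuration, cluster peers)

Note that Alertmanager shows only current alerts; it does not keep a browsable alert history.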

🛡️ Security Best Practices

❌ Don’t expose Alertmanager directly to the internet.
✅ Put it behind a reverse proxy (Nginx/Traefik).
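✅ Use authentication if exposed, either at the proxy or through Alertmanager’s own web configuration file (--web.config.file), which supports basic auth and TLS.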
✅ Secure communication between Prometheus and Alertmanager with TLS.

🔍 Key Strengths of Alertmanager

Deep Prometheus integration → native in the ecosystem.
Powerful routing → fine-grained alert delivery.
Extensible → webhooks for custom workflows.
Silences & inhibition → reduce noise & alert fatigue.
Open-source & widely adopted → large community.

⚠️ Limitations & Watch Outs

❌ Limited UI (mostly config-driven).
❌ No built-in escalation policies (PagerDuty is better for escalation chains).
❌ No HA out of the box → high availability means running several instances that share state over a gossip protocol, with Prometheus sending alerts to all of them.
❌ Alert storming still possible if rules aren’t well-tuned.

📦 Alertmanager in the Observability Stack

flowchart TD

    subgraph Metrics
        P[Prometheus]
    end

    subgraph Alerting
        A[Alertmanager]
    end

    subgraph Notifications
        E[Email]
        S[Slack]
        PD[PagerDuty]
        W[Webhook]
    end

    P --> A
    A --> E
    A --> S
    A --> PD
    A --> W

👉 Prometheus detects, Alertmanager notifies.

🧾 Alertmanager Cheat Sheet

✅ Core Concepts

Term	Meaning
Alert	A condition defined by a Prometheus alerting rule that is currently firing
Receiver	Where alerts are sent (Slack, Email, etc.)
Route	Rules that decide which receiver gets the alert
Silence	Temporary mute for alerts
Inhibition	Suppression of lower-severity alerts while higher-severity ones fire
Grouping	Bundling multiple alerts into one notification

📜 Example Silence Command

amtool silence add alertname=HighCPUUsage --duration=2h --comment="Maintenance window"

📊 Example Routing Rule

route:
  receiver: 'team-A'
  routes:
    - match:
        team: 'database'
      receiver: 'dba-team'
    - match:
        team: 'infra'
      receiver: 'infra-team'
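The routing rule above can be read as a tree walk. The following Python sketch mirrors that tree and picks a receiver for an alert. It is a simplified model that assumes exact label matches, that every route names its own receiver, and that the first matching child wins (the `continue` option is ignored); `pick_receiver` is a made-up helper name.

```python
def pick_receiver(labels, route):
    """Return the receiver of the first child route whose match labels all apply.

    Falls back to the current route's receiver when no child matches.
    """
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(labels, child)  # descend into the matching branch
    return route["receiver"]


# Same structure as the YAML routing rule above.
route = {
    "receiver": "team-A",
    "routes": [
        {"match": {"team": "database"}, "receiver": "dba-team"},
        {"match": {"team": "infra"}, "receiver": "infra-team"},
    ],
}

print(pick_receiver({"alertname": "SlowQueries", "team": "database"}, route))
print(pick_receiver({"alertname": "HighCPUUsage", "team": "infra"}, route))
print(pick_receiver({"alertname": "Other"}, route))
```

An alert without a team label matches neither child route and falls through to the default receiver, team-A, which is why the root route must always name a receiver.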

🎯 Final Takeaway

Alertmanager:

Is the alert distribution hub for Prometheus.
Provides routing, grouping, silencing, and inhibition.
Supports many integrations (Slack, PagerDuty, Email).
Is essential for production-grade monitoring.

👉 Think of Prometheus as the doctor detecting the illness, and Alertmanager as the nurse paging the right specialist.