🧱 System Architecture Overview
This section provides a high-level overview of the monitoring and observability stack used in this project. It highlights the core components, their responsibilities, and how they work together to provide end-to-end monitoring, alerting, and visualization for modern infrastructure and applications.
📐 Design Philosophy
Our monitoring stack is built around a few guiding principles:
- Modular and Composable → Each tool has a clear responsibility (metrics, alerts, visualization).
- Cloud-Native → Runs seamlessly in containerized/Kubernetes environments.
- Automation-First → Designed for IaC and CI/CD pipelines.
- Production-Ready → Secure defaults, scalable, and extensible.
- Developer-Friendly → Easy to run locally (e.g., with Docker Compose or Kind).
🧩 Core Components
1. Metrics Collection (cAdvisor)
- cAdvisor runs on each node to collect container-level resource usage (CPU, memory, disk, network).
- In Kubernetes, cAdvisor is embedded in the kubelet, which serves the same container metrics at `/metrics/cadvisor`.
- Exposes metrics at `/metrics` in Prometheus format.
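To make "Prometheus format" concrete, here is a minimal sketch of reading the text exposition format that a `/metrics` endpoint returns. The payload in `SAMPLE` is invented for illustration (the label values and numbers are hypothetical), and the parser ignores timestamps and escaping rules that a real client library would handle.

```python
import re

# Hypothetical excerpt of a /metrics response; values are made up.
SAMPLE = '''\
# HELP container_cpu_usage_seconds_total Cumulative CPU time consumed, in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{name="app-a",cpu="total"} 12.5
container_memory_usage_bytes{name="app-a"} 1048576
'''

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each sample is a metric name, a set of key/value labels, and a numeric value, which is the shape Prometheus stores as a time series.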
2. Monitoring & Storage (Prometheus)
- Prometheus scrapes metrics from cAdvisor, Node Exporter, and other exporters.
- Stores data in its time-series database (TSDB).
- Provides the PromQL query language for filtering, aggregating, and computing rates over time series.
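As a sketch of what a PromQL query looks like in practice, the snippet below builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base address `http://localhost:9090` and the choice of query are assumptions for illustration; the code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: Prometheus is reachable at its default local address.
PROMETHEUS_URL = "http://localhost:9090"

# Per-container CPU usage rate over the last 5 minutes,
# using the cAdvisor counter container_cpu_usage_seconds_total.
CPU_BY_CONTAINER = "sum by (name) (rate(container_cpu_usage_seconds_total[5m]))"


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url(PROMETHEUS_URL, CPU_BY_CONTAINER)

# Round-trip the query string to confirm the expression survives URL encoding.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == CPU_BY_CONTAINER)
```

Grafana issues requests of this kind against its Prometheus data source whenever a dashboard panel is rendered.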
3. Visualization (Grafana)
- Grafana connects to Prometheus and displays metrics in dashboards.
- Includes pre-built dashboards for containers, nodes, and Kubernetes clusters.
- Enables engineers to explore, visualize, and share insights.
4. Alerting (Alertmanager)
- Alertmanager receives alerts from Prometheus.
- Handles deduplication, grouping, and routing of alerts.
- Sends notifications via Slack, Email, PagerDuty, etc.
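The deduplication, grouping, and routing steps can be sketched with a toy model. This is not Alertmanager's implementation: the function names, the alert dictionaries, and the receiver names are hypothetical, and real routing supports nested routes, regex matchers, and timing options that are omitted here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts by the given label names, dropping exact duplicates."""
    groups = defaultdict(dict)
    for alert in alerts:
        labels = alert["labels"]
        group_key = tuple(labels.get(name, "") for name in group_by)
        # An alert is identified by its full label set, so an identical
        # label set overwrites the earlier entry (deduplication).
        fingerprint = tuple(sorted(labels.items()))
        groups[group_key][fingerprint] = alert
    return {key: list(members.values()) for key, members in groups.items()}


def pick_receiver(labels, routes, default):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default


# Hypothetical alerts as Prometheus might send them.
alerts = [
    {"labels": {"alertname": "HighCPU", "name": "app-a", "severity": "critical"}},
    {"labels": {"alertname": "HighCPU", "name": "app-a", "severity": "critical"}},
    {"labels": {"alertname": "HighCPU", "name": "app-b", "severity": "warning"}},
    {"labels": {"alertname": "HighMemory", "name": "app-a", "severity": "warning"}},
]

# Hypothetical routing table: (matchers, receiver name).
routes = [({"severity": "critical"}, "pagerduty"), ({"severity": "warning"}, "slack")]

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    receivers = sorted({pick_receiver(a["labels"], routes, "email") for a in members})
    print(key, len(members), receivers)
```

Grouping by `alertname` collapses the four incoming alerts into two notifications, and the duplicate `HighCPU` alert for `app-a` is counted once.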
🔀 Architecture Diagram
flowchart TD
  subgraph Node["Kubernetes Node / Host"]
    subgraph Containers["Containers"]
      A1["App Container A"]
      A2["App Container B"]
    end
    C["cAdvisor"]
  end
  A1 --> C
  A2 --> C
  C --> P["Prometheus"]
  P --> G["Grafana Dashboards"]
  P --> A["Alertmanager"]
  G --> U["User (SRE/DevOps)"]
  A --> U
🔄 Data / Control Flow
- Containers run applications and consume resources.
- cAdvisor collects resource usage metrics from the kernel (cgroups).
- Prometheus scrapes metrics from cAdvisor (and other exporters).
- Prometheus TSDB stores time-series data.
- Grafana queries Prometheus to render dashboards and visualizations.
- Prometheus evaluates alerting rules (e.g., CPU > 90%) and sends firing alerts to Alertmanager.
- Alertmanager deduplicates, groups, and routes those alerts to receivers such as Slack or email.
- Users (SRE/DevOps) view dashboards and respond to alerts.
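The "CPU > 90%" condition in the flow above rests on simple arithmetic: CPU time is exposed as a cumulative counter in seconds, so a usage percentage comes from the difference between two scrapes divided by the elapsed time. The sketch below works through that calculation; the function names, sample values, and the 30-second interval are assumptions for illustration, and in the real stack the equivalent computation is written as a PromQL rule.

```python
def cpu_percent(prev_seconds, curr_seconds, interval_seconds, cpu_cores=1):
    """CPU usage as a percentage of available capacity between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    used = curr_seconds - prev_seconds  # CPU-seconds consumed in the window
    return 100.0 * used / (interval_seconds * cpu_cores)


def should_alert(percent, threshold=90.0):
    """True when usage exceeds the alert threshold (90% by default)."""
    return percent > threshold


# Hypothetical counter values from two scrapes taken 30 seconds apart.
busy = cpu_percent(prev_seconds=100.0, curr_seconds=128.5, interval_seconds=30)
idle = cpu_percent(prev_seconds=100.0, curr_seconds=112.0, interval_seconds=30)

print(busy, should_alert(busy))
print(idle, should_alert(idle))
```

In the first case 28.5 CPU-seconds were consumed in a 30-second window on one core, which is 95% and crosses the threshold; the second case is 40% and does not.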