What is Prometheus?¶
Prometheus is an open-source monitoring and alerting toolkit designed for time-series data (metrics with timestamps). It was created at SoundCloud and is now a CNCF graduated project (same foundation as Kubernetes).
Prometheus has become the de facto standard for monitoring in cloud-native environments, especially on Kubernetes, thanks to its built-in service discovery, its label-based data model, and a broad ecosystem of exporters and integrations.
Why Do We Need Monitoring?¶
Modern systems are:
- Distributed (many services, microservices, containers).
- Dynamic (instances scale up and down).
- Complex (multiple dependencies, networks, storage).
Without monitoring, failures remain invisible until users complain.
Monitoring answers:
- Is my service up?
- How much traffic am I serving?
- Are we running into errors, bottlenecks, or slowdowns?
- When should we scale?
Monitoring data comes in three main forms (the “three pillars of observability”):
- Logs → Event-based, detailed messages.
- Metrics → Numeric measurements over time (cheap, efficient).
- Traces → End-to-end request tracking.
Prometheus focuses on metrics.
Time-Series Basics¶
A time series is a sequence of values recorded at successive points in time. Example:
Time | Metric | Value |
---|---|---|
10:00 | CPU usage | 30% |
10:01 | CPU usage | 32% |
10:02 | CPU usage | 31% |
Each time series in Prometheus is made up of:
- Metric name → http_requests_total
- Labels (key-value pairs for context) → {method="GET", status="200"}
- Timestamp + value
The metric name and label set together identify the series. This allows very powerful queries like: “How many GET requests per second returned a 500 error in the last 5 minutes?”
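To make the data model concrete, here is a minimal sketch of how a series identity (metric name plus labels) maps to a list of timestamped samples. The `store` dictionary and the helper names are hypothetical and only illustrate the idea; this is not how the Prometheus TSDB is implemented.

```python
from typing import Dict, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity; the order of labels does not matter."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical in-memory store: series identity -> list of (timestamp, value).
store: Dict[SeriesKey, List[Tuple[float, float]]] = {}


def append_sample(name: str, labels: Dict[str, str], ts: float, value: float) -> None:
    """Append one (timestamp, value) sample to the series it belongs to."""
    store.setdefault(series_key(name, labels), []).append((ts, value))


# Same name and labels (in any order) -> same series.
append_sample("http_requests_total", {"method": "GET", "status": "200"}, 1000.0, 10.0)
append_sample("http_requests_total", {"status": "200", "method": "GET"}, 1015.0, 14.0)
# A different label value -> a different series.
append_sample("http_requests_total", {"method": "GET", "status": "500"}, 1015.0, 1.0)

print(len(store))  # number of distinct series
```

Changing any label value creates a new series, which is why label choice matters so much for memory usage.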
How Prometheus Works¶
Prometheus is built around a pull-based model:
- Targets expose metrics via HTTP (usually at /metrics).
- Prometheus scrapes metrics at regular intervals.
- Data is stored in its own time-series database (TSDB).
- Metrics can be queried using PromQL (Prometheus Query Language).
- Alerting rules can trigger alerts via Alertmanager.
- For short-lived jobs, metrics can be pushed via Pushgateway.
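What a target serves at /metrics is plain text, one sample per line. The sketch below renders a few samples in that text format using only the standard library. The function names are hypothetical; a real application would normally use an official client library rather than hand-rolled formatting.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample as a single line of the text exposition format."""
    if not labels:
        return f"{name} {value}"
    body = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value}"


def render_family(
    name: str,
    help_text: str,
    metric_type: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render a metric family: HELP and TYPE comment lines, then its samples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


text = render_family(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027), ({"method": "GET", "status": "500"}, 3)],
)
print(text)
```

Prometheus fetches this text on every scrape and attaches the scrape timestamp to each sample.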
Architecture Overview¶
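+--------------------+
|    Applications    |
|   Export metrics   |
+---------+----------+
          |
          v
+---------+----------+
|     Exporters      |  (e.g. node_exporter, redis_exporter)
+---------+----------+
          |
          v
+---------+----------+
| Prometheus Server  |  (scrapes, stores, queries data)
+----+----------+----+
     |          |
  (alerts)  (queries)
     |          |
+----v-----+ +--v------+
| Alertmgr | | Grafana |
+----------+ +---------+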
Metric Flow: From App → Prometheus → User¶
sequenceDiagram
participant App as Application
participant Exp as Exporter (/metrics)
participant Prom as Prometheus Server
participant Graf as Grafana
participant AM as Alertmanager
participant User as User / SRE
App->>Exp: Expose metrics (HTTP /metrics)
Prom->>Exp: Scrape metrics (pull)
Prom->>Prom: Store in TSDB (time-series DB)
User->>Graf: Request dashboard
Graf->>Prom: Query via PromQL
Prom-->>Graf: Return metrics
Graf-->>User: Display visualization
Prom->>Prom: Evaluate alert rules
Prom->>AM: Send alert
AM-->>User: Notify (Slack/Email/PagerDuty)
Explanation of the Flow¶
- App → exposes metrics (or uses an exporter).
- Prometheus → regularly scrapes metrics via HTTP pull.
- Prometheus TSDB → stores the metrics with timestamps.
- Grafana → users query metrics with PromQL and visualise them.
- Alertmanager → gets triggered if rules match (CPU > 90%, target down, etc.).
- User (SRE/DevOps) → gets notified and investigates.
Key Strengths of Prometheus¶
- Standalone: no external database required.
- PromQL: powerful and flexible query language for metrics.
- Kubernetes-native: integrates seamlessly with service discovery.
- Ecosystem: works with Grafana, Alertmanager, Pushgateway, Thanos, Cortex.
- Scalable: a single server can typically handle thousands of targets and millions of active time series.
Limitations & Watch Outs¶
- Not ideal for long-term storage → local retention is limited (15 days by default, usually weeks in practice). → Solution: use Thanos, Cortex, or Mimir.
- High cardinality → too many unique label combinations can overwhelm memory. → Example: user_id as a label = millions of unique values.
- Pull model challenges → doesn’t fit well with:
  - Short-lived jobs (use Pushgateway).
  - Firewalled environments.
- No built-in dashboards → only a basic expression browser is included, so Prometheus is typically paired with Grafana.
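The high-cardinality warning above comes down to simple arithmetic: the worst-case number of series for one metric is the product of the number of distinct values of each label. A minimal sketch, with made-up label counts chosen purely for illustration:

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric.
bounded = max_series({"method": 5, "status": 10, "instance": 20})
with_user_id = max_series({"method": 5, "status": 10, "instance": 20, "user_id": 1_000_000})

print(bounded)       # 1000
print(with_user_id)  # 1000000000
```

In practice not every combination occurs, but one unbounded label is enough to multiply the series count by orders of magnitude.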
PromQL — The Query Language¶
Prometheus comes with PromQL (Prometheus Query Language), which lets you slice, dice, and aggregate metrics. Think of it as SQL for time-series data.
Selectors¶
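A selector picks time series by metric name and label matchers, for example http_requests_total{method="GET", status=~"5.."}. PromQL has four matcher operators: = (equal), != (not equal), =~ (regex match) and !~ (regex no match), and regex matchers are anchored to the whole label value. The sketch below mimics that matching logic; the `matches` helper is hypothetical, and Python's `re` syntax differs in places from the RE2 syntax Prometheus uses.

```python
import re
from typing import Dict, List, Tuple

Matcher = Tuple[str, str, str]  # (label name, operator, value or regex)


def matches(labels: Dict[str, str], matchers: List[Matcher]) -> bool:
    """Return True if a series' labels satisfy every matcher."""
    for label, op, value in matchers:
        actual = labels.get(label, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


series = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "500"},
    {"method": "POST", "status": "503"},
]

# Equivalent in spirit to: http_requests_total{method="GET", status=~"5.."}
selected = [s for s in series if matches(s, [("method", "=", "GET"), ("status", "=~", "5..")])]
print(selected)
```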
Aggregations¶
sum(http_requests_total)
avg(http_requests_total)
max(http_requests_total)
min(http_requests_total)
count(http_requests_total)
sum by (job)(http_requests_total)
avg by (instance)(up)
Rate & Increase¶
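Counters only go up (or reset to zero on restart), so they are usually queried through rate(), the per-second increase over a window, as in rate(http_requests_total[5m]), or increase(), the total increase over the window. The sketch below shows the core idea on raw samples. It is a simplification: the real functions also extrapolate to the window boundaries, which this version does not attempt, and the function names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples; a drop is treated as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so the new value is the increase.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Scraped every 15 s; the process restarts between t=30 and t=45.
samples = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 40.0)]
print(simple_increase(samples))  # 100.0
print(simple_rate(samples))      # 100 / 60 requests per second
```

Reset handling is the reason to apply rate() to the raw counter before aggregating, rather than subtracting values by hand.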
Common Alert Conditions¶
rate(http_requests_total[5m]) > 100
node_memory_Active_bytes / node_memory_MemTotal_bytes > 0.9
up == 0
Alerting with Prometheus + Alertmanager¶
Prometheus defines alert rules. When triggered, alerts are sent to Alertmanager, which handles:
- Routing (who should be notified?).
- Silencing (ignore alerts during maintenance).
- Grouping (combine related alerts).
- Delivery (email, Slack, PagerDuty, etc.).
Example Alert Rule¶
groups:
  - name: example.rules
    rules:
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[1m]) > 0.85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU > 85% for 2 minutes."
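The for: 2m clause means the expression must stay true across consecutive rule evaluations for two minutes before the alert fires; until then the alert is pending. Below is a minimal sketch of that state machine, assuming a fixed evaluation interval. The `alert_states` function is hypothetical and ignores details such as per-series tracking.

```python
from typing import List, Optional


def alert_states(condition: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """State after each rule evaluation: 'inactive', 'pending' or 'firing'."""
    states: List[str] = []
    active_for: Optional[int] = None  # seconds the condition has held continuously
    for ok in condition:
        if not ok:
            active_for = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + eval_interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states


# Evaluated every 60 s with `for: 2m` (120 s).
results = [True, True, False, True, True, True, True]
print(alert_states(results, eval_interval_s=60, for_s=120))
```

A brief dip below the threshold resets the timer, which is how the for clause filters out short spikes.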
Metric Types in Prometheus¶
Type | Use For | Example |
---|---|---|
counter | Monotonically increasing values | http_requests_total (total requests) |
gauge | Arbitrary values (up & down) | memory_usage_bytes, temperature_c |
histogram | Buckets of observations (distribution) | Request latency buckets |
summary | Similar to histogram, but quantiles are calculated on the client | Percentiles of request durations |
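Histograms are the least intuitive of the four types because their buckets are cumulative: each bucket counts every observation less than or equal to its upper bound (the le label), and a final +Inf bucket counts everything. A minimal sketch of that behaviour follows; the `Histogram` class here is a hypothetical stand-in, not a client library API.

```python
from typing import Dict, List


class Histogram:
    """Minimal cumulative-bucket histogram following the `le` bucket convention."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.bounds: List[float] = sorted(upper_bounds) + [float("inf")]
        self.buckets: Dict[float, int] = {bound: 0 for bound in self.bounds}
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        """Record one observation in every bucket whose upper bound covers it."""
        self.sum += value
        self.count += 1
        for bound in self.bounds:
            if value <= bound:
                self.buckets[bound] += 1


latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds
for seconds in [0.05, 0.3, 0.3, 0.8, 2.0]:
    latency.observe(seconds)

for bound, cumulative in latency.buckets.items():
    print(f"le={bound}: {cumulative}")
```

Because buckets are plain counters, they can be aggregated across instances before estimating quantiles on the server, which a client-side summary does not allow.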
Common Exporters¶
Prometheus itself doesn’t know about your apps; exporters bridge the gap by exposing a system's own statistics as Prometheus metrics.
Exporter | Purpose |
---|---|
node_exporter | Host/system metrics (CPU, memory, disk) |
blackbox_exporter | Probes HTTP, TCP, DNS endpoints |
postgres_exporter | PostgreSQL database metrics |
redis_exporter | Redis performance metrics |
nginx_exporter | Nginx server stats |
cadvisor | Container resource usage (Docker, Kubernetes) |
kube-state-metrics | Kubernetes object states (Pods, Deployments) |
Prometheus Configuration¶
Prometheus is configured using a YAML file (prometheus.yml). The most important part is scrape_configs.
- job_name: Logical name for the service.
- targets: Endpoints exposing /metrics.
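As an illustration of how these keys fit together, the sketch below renders a minimal prometheus.yml from a mapping of job names to targets. The helper name, the job names and the localhost addresses are assumptions for the example (9090 and 9100 are the usual default ports for Prometheus and node_exporter); a real setup would more likely keep the YAML file under version control than generate it this way.

```python
from typing import Dict, List


def render_prometheus_yml(jobs: Dict[str, List[str]], scrape_interval: str = "15s") -> str:
    """Render a minimal prometheus.yml with one static scrape job per entry."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job_name, targets in jobs.items():
        lines.append(f'  - job_name: "{job_name}"')
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for target in targets:
            lines.append(f'          - "{target}"')
    return "\n".join(lines) + "\n"


# Hypothetical jobs: Prometheus scraping itself, plus one node_exporter.
config = render_prometheus_yml(
    {"prometheus": ["localhost:9090"], "node": ["localhost:9100"]}
)
print(config)
```

Each job_name becomes the job label on every series scraped from that job's targets, which is what makes queries like sum by (job) possible.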
Security Best Practices¶
Prometheus itself has minimal security features:
- Don’t expose Prometheus directly to the public internet.
- Put it behind a reverse proxy with authentication.
- Enable TLS for cross-network metrics.
- Avoid sensitive labels (user_id, token).
- Monitor Prometheus itself (up, scrape_duration_seconds).
Scaling & Long-Term Storage¶
Prometheus is single-node by design. For large scale:
- Thanos → object storage for long-term retention.
- Cortex / Mimir → horizontally scalable, multi-tenant.
- Federation → aggregate across Prometheus servers.
Comparison with Alternatives¶
Tool | Type | Strengths | Weaknesses |
---|---|---|---|
Prometheus | Open-source | CNCF standard, Kubernetes-native | No long-term storage |
InfluxDB | Time-series DB | SQL-like query language (InfluxQL), dashboards | Less Kubernetes-native |
Datadog | SaaS | Turnkey, integrations, great UI | Expensive, vendor lock-in |
New Relic | SaaS APM | Tracing + metrics + logs | Cost, complexity |
Graphite | Legacy OSS | Simple, widely used historically | Aging ecosystem |
Prometheus + Thanos Architecture¶
Core Prometheus¶
flowchart TD
subgraph Apps[Applications & Services]
A1[App 1] -->|/metrics| E1[Exporter]
A2[App 2] -->|/metrics| E2[Exporter]
A3[Database] -->|/metrics| E3[Exporter]
end
subgraph PrometheusCore[Prometheus Server]
P[Prometheus]
end
E1 --> P
E2 --> P
E3 --> P
P -->|Queries| Grafana[(Grafana Dashboards)]
P -->|Alerts| Alertmanager[(Alertmanager)]
Prometheus with Thanos¶
flowchart TD
subgraph AppLayer["Applications & Services"]
A1["App 1"] -->|/metrics| E1[Exporter]
A2["App 2"] -->|/metrics| E2[Exporter]
A3["Database"] -->|/metrics| E3[Exporter]
end
subgraph PrometheusCluster["Prometheus Servers"]
P1["Prometheus #1"]
P2["Prometheus #2"]
end
E1 --> P1
E2 --> P1
E3 --> P2
subgraph Thanos["Thanos Components"]
Q["Thanos Querier"]
S["Thanos Sidecar"]
G["Thanos Store Gateway"]
C["Object Storage (S3/GCS/Azure)"]
end
P1 --> S
P2 --> S
S --> C
C --> G
G --> Q
P1 --> Q
P2 --> Q
Q --> Grafana["Grafana Dashboards"]
Q --> Alertmanager["Alertmanager"]
Final Takeaway¶
Prometheus is:
- Simple to start.
- Powerful with PromQL.
- Scalable with Thanos/Cortex.
- Best choice for Kubernetes and microservices monitoring.