## What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for time-series data (metrics with timestamps).
It was created at SoundCloud and is now a CNCF graduated project (the same foundation that hosts Kubernetes).
Prometheus has become the de facto standard for monitoring in cloud-native environments, especially with Kubernetes, thanks to its scalability, flexibility, and ecosystem.
## Why Do We Need Monitoring?
Modern systems are:
- Distributed (many services, microservices, containers).
- Dynamic (instances scale up and down).
- Complex (multiple dependencies, networks, storage).
Without monitoring, failures remain invisible until users complain.
Monitoring answers:
- Is my service up?
- How much traffic am I serving?
- Are we running into errors, bottlenecks, or slowdowns?
- When should we scale?
Monitoring data comes in three main forms (the "three pillars of observability"):
- Logs → event-based, detailed messages.
- Metrics → numeric measurements over time (cheap, efficient).
- Traces → end-to-end request tracking.

Prometheus focuses on metrics.
## Time-Series Basics
A time series is a sequence of values recorded at successive points in time.
Example:
Time | Metric | Value |
---|---|---|
10:00 | CPU usage | 30% |
10:01 | CPU usage | 32% |
10:02 | CPU usage | 31% |
Each metric in Prometheus has:
- A metric name → `http_requests_total`
- Labels (key-value pairs for context) → `{method="GET", status="200"}`
- A timestamp + value

This allows very powerful queries like:
"How many `GET` requests per second returned a `500` error in the last 5 minutes?"
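A PromQL query answering that question (assuming the metric name and labels shown above) might look like:

```promql
sum(rate(http_requests_total{method="GET", status="500"}[5m]))
```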
## How Prometheus Works
Prometheus is built around a pull-based model:
- Targets expose metrics via HTTP (usually at `/metrics`).
- Prometheus scrapes metrics at regular intervals.
- Data is stored in its own time-series database (TSDB).
- Metrics can be queried using PromQL (Prometheus Query Language).
- Alerting rules can trigger alerts via Alertmanager.
- For short-lived jobs, metrics can be pushed via the Pushgateway.
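What a target exposes at `/metrics` is plain text in the Prometheus exposition format. As a toy sketch (not the official client library), one sample line could be rendered like this:

```python
def render_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format:
    metric_name{label="value",...} sample_value
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# What a scrape of /metrics might return for a request counter:
print("# HELP http_requests_total Total HTTP requests.")
print("# TYPE http_requests_total counter")
print(render_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027))
```

In practice you would use an official client library (e.g. `prometheus_client` for Python) rather than hand-rolling the format.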
## Architecture Overview

```text
+-------------------+
|   Applications    |
|  Export metrics   |
+---------+---------+
          |
          v
+---------+---------+
|     Exporters     |  (e.g. node_exporter, redis_exporter)
+---------+---------+
          |
          v
+---------+---------+
| Prometheus Server |  (scrapes, stores, queries data)
+---------+---------+
     |          |
 (alerts)   (queries)
     |          |
+----v----+ +---v----+
|Alertmgr | | Grafana|
+---------+ +--------+
```
## Metric Flow: From App → Prometheus → User
Below is a user flow diagram that explains how a single metric travels through the system.
```mermaid
sequenceDiagram
    participant App as Application
    participant Exp as Exporter (/metrics)
    participant Prom as Prometheus Server
    participant Graf as Grafana
    participant AM as Alertmanager
    participant User as User / SRE

    App->>Exp: Expose metrics (HTTP /metrics)
    Prom->>Exp: Scrape metrics (pull)
    Prom->>Prom: Store in TSDB (time-series DB)
    User->>Graf: Request dashboard
    Graf->>Prom: Query via PromQL
    Prom-->>Graf: Return metrics
    Graf-->>User: Display visualization
    Prom->>Prom: Evaluate alert rules
    Prom->>AM: Send alert
    AM-->>User: Notify (Slack/Email/PagerDuty)
```
## Explanation of the Flow

- App → exposes metrics (or uses an exporter).
- Prometheus → regularly scrapes metrics via HTTP pull.
- Prometheus TSDB → stores the metrics with timestamps.
- Grafana → users query metrics with PromQL and visualize them.
- Alertmanager → is triggered when rules match (CPU > 90%, target down, etc.).
- User (SRE/DevOps) → gets notified and investigates.
## Key Strengths of Prometheus
- Standalone: no external database required.
- PromQL: powerful and flexible query language for metrics.
- Kubernetes-native: integrates seamlessly with service discovery.
- Ecosystem: works with Grafana, Alertmanager, Pushgateway, Thanos, Cortex.
- Scalable: handles thousands of metrics and targets efficiently.
## Limitations & Watch-Outs

- Not ideal for long-term storage: data retention is limited (usually weeks). Solution: use Thanos, Cortex, or Mimir for long-term retention.
- High cardinality: too many unique label combinations can overwhelm memory. Example: `user_id` as a label (millions of unique values).
- Pull-model challenges: it doesn't fit well with:
    - Short-lived jobs (use the Pushgateway).
    - Firewalled environments.
- No built-in dashboards: it is almost always paired with Grafana.
## PromQL: The Query Language

Prometheus comes with PromQL (Prometheus Query Language), which lets you slice, dice, and aggregate metrics.
Think of it as SQL for time-series data.
### Selectors

➡️ Selects all values of `http_requests_total`.
➡️ Selects only series where the label `job="api"` matches.
➡️ Shows targets that are down.
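The snippets those notes refer to seem to have been lost; reconstructed from the descriptions (and confirmed by the cheat sheet later in this document), they would be:

```promql
http_requests_total             # all series of this metric
http_requests_total{job="api"}  # only series with label job="api"
up == 0                         # targets that are currently down
```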
### Aggregations

➡️ Total across all series.

```promql
avg(http_requests_total)
max(http_requests_total)
min(http_requests_total)
count(http_requests_total)
```

With labels:
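The missing `sum` examples, reconstructed from the descriptions here and the cheat sheet later in the document:

```promql
sum(http_requests_total)            # total across all series
sum by (job) (http_requests_total)  # one total per value of the job label
```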
### Rate & Increase

Counters (monotonically increasing metrics) should not be summed directly; instead, use rates:

➡️ Average per-second increase over the last 1 minute.
➡️ Total increase in the last 5 minutes.
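The two notes above correspond to the standard PromQL functions `rate` and `increase`:

```promql
rate(http_requests_total[1m])      # average per-second increase over the last minute
increase(http_requests_total[5m])  # total increase over the last 5 minutes
```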
### Common Alert Conditions

➡️ High request rate.
➡️ Memory usage above 90%.
➡️ Target is down.
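Possible expressions for these three conditions, based on queries used elsewhere in this document (the request-rate threshold of 100 is illustrative, not from the original):

```promql
rate(http_requests_total[1m]) > 100                          # high request rate (threshold is an example)
node_memory_Active_bytes / node_memory_MemTotal_bytes > 0.9  # memory usage above 90%
up == 0                                                      # target is down
```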
## Alerting with Prometheus + Alertmanager
Prometheus can define alert rules. When triggered, alerts are sent to Alertmanager, which handles:
- Routing (who should be notified?).
- Silencing (ignore alerts during maintenance).
- Grouping (combine related alerts).
- Delivery (email, Slack, PagerDuty, etc.).
### Example Alert Rule (YAML)

```yaml
groups:
  - name: example.rules
    rules:
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[1m]) > 0.85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU > 85% for 2 minutes."
```
## Metric Types in Prometheus

Prometheus supports four primary metric types:

| Type | Use For | Example |
|---|---|---|
| counter | Monotonically increasing values | `http_requests_total` (total requests) |
| gauge | Arbitrary values (up & down) | `memory_usage_bytes`, `temperature_c` |
| histogram | Buckets of observations (distribution) | Request latency buckets |
| summary | Similar to histogram, but client-calculated | Percentiles of request durations |
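For instance, latency buckets from a histogram can be turned into percentiles with the standard `histogram_quantile` function (the metric name below follows common convention but is illustrative):

```promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```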
## Common Exporters

Prometheus itself doesn't know about your apps; exporters bridge the gap by exposing metrics in the Prometheus format.

| Exporter | Purpose |
|---|---|
| node_exporter | Host/system metrics (CPU, memory, disk) |
| blackbox_exporter | Probes HTTP, TCP, DNS endpoints |
| postgres_exporter | PostgreSQL database metrics |
| redis_exporter | Redis performance metrics |
| nginx_exporter | Nginx server stats |
| cadvisor | Container runtime (Docker, Kubernetes) |
| kube-state-metrics | Kubernetes object states (Pods, Deployments) |

Exporters make Prometheus flexible: if it speaks HTTP, you can monitor it.
## Prometheus Configuration

Prometheus is configured using a YAML file (`prometheus.yml`).
The most important part is `scrape_configs`, which tells Prometheus what to scrape.

### Example: Scraping a Django App

- `job_name`: Logical name for the service.
- `targets`: List of endpoints exposing `/metrics`.

In Kubernetes, this is usually handled with ServiceMonitor or PodMonitor (via the Prometheus Operator).
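A minimal `scrape_configs` sketch for the fields described above (the job name and port are illustrative, not from the original):

```yaml
scrape_configs:
  - job_name: "django-app"           # logical name for the service
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]  # endpoint exposing /metrics
```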
## Security Best Practices

Prometheus itself has minimal security features, so you must secure it yourself.

- Don't expose Prometheus directly to the public internet.
- Put it behind a reverse proxy (Nginx/Traefik) with auth.
- Enable TLS if exposing metrics across networks.
- Sanitize metrics: avoid sensitive labels (`user_id`, `token`, etc.).
- Monitor Prometheus itself: it exposes `/metrics` too.
## Scaling & Long-Term Storage

Prometheus by itself is single-node and best for short-to-medium retention. For enterprise or multi-cluster setups:

- Thanos → adds object storage (S3, GCS) for long-term retention plus a global query view.
- Cortex / Mimir → horizontal scaling of Prometheus for multi-tenant setups.
- Federation → one Prometheus scrapes another to aggregate metrics.

Rule of thumb:

- Small teams → single Prometheus.
- Medium-large → Prometheus + Thanos.
- Very large / SaaS → Cortex/Mimir.
## Best Practical Approach

- Use the Prometheus + Grafana + Alertmanager stack.
- For long-term storage, add Thanos or Cortex.
- For high-cardinality metrics, aggregate early or pre-process with OpenTelemetry.
- Always monitor Prometheus itself (`up`, `scrape_duration_seconds`, etc.).
## Comparison with Alternatives

| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| Prometheus | Open-source | Powerful, CNCF standard, Kubernetes-native | Not built for long-term storage |
| InfluxDB | Time-series DB | SQL-like InfluxQL and Flux query languages, good dashboards | Less Kubernetes-native |
| Datadog | SaaS | Turnkey, integrations, great UI | Expensive, vendor lock-in |
| New Relic | SaaS APM | Tracing + metrics + logs in one | Cost, complexity |
| Graphite | Legacy OSS | Simple, widely used historically | Weak ecosystem, aging |
## Prometheus Cheat Sheet

### Core Concepts

| Term | Meaning |
|---|---|
| Target | App exposing metrics at `/metrics` |
| Exporter | Adapter exposing metrics in Prometheus format |
| Scrape | Prometheus pulling metrics from a target |
| Time Series | Metric name + labels + timestamp + value |
| Label | Key-value metadata (e.g. `job="api"`, `env="prod"`) |
### PromQL Cheat Sheet

- Selectors
- Aggregations
- Rates
- Alert Conditions
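One example per category, reconstructed from queries used elsewhere in this document:

```promql
http_requests_total{job="api"}      # Selectors
sum by (job) (http_requests_total)  # Aggregations
rate(http_requests_total[1m])       # Rates
up == 0                             # Alert conditions
```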
### Alerting Rule (YAML)

```yaml
groups:
  - name: example.rules
    rules:
      - alert: HighMemoryUsage
        expr: node_memory_Active_bytes / node_memory_MemTotal_bytes > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Above 90% memory for 2 minutes."
```
๐ Exporters You Should Know¶
node_exporter
โ System metrics.blackbox_exporter
โ Probes (HTTP, TCP, DNS).postgres_exporter
โ Database stats.redis_exporter
โ Redis performance.cadvisor
โ Container metrics.kube-state-metrics
โ Kubernetes object state.
### Security Tips

- Don't scrape metrics endpoints over the internet.
- Use auth + TLS where possible.
- Avoid exposing sensitive data via labels.
### TL;DR Reference Table

| Task | PromQL/Tool |
|---|---|
| Check if a service is up | `up{job="service"} == 1` |
| Request rate (per second) | `rate(http_requests_total[1m])` |
| Memory usage % | `node_memory_Active_bytes / node_memory_MemTotal_bytes` |
| CPU usage alert rule | `rate(process_cpu_seconds_total[1m]) > 0.85` |
| Aggregate by label | `sum by (job) (metric_name)` |
| Down targets | `up == 0` |
| Visualization | Grafana |
## Prometheus + Thanos Architecture

It's best to first show the core Prometheus architecture, then the scaled Prometheus + Thanos setup.

### Core Prometheus Architecture
```mermaid
flowchart TD
    subgraph Apps[Applications & Services]
        A1[App 1] -->|/metrics| E1[Exporter]
        A2[App 2] -->|/metrics| E2[Exporter]
        A3[Database] -->|/metrics| E3[Exporter]
    end

    subgraph PrometheusCore[Prometheus Server]
        P[Prometheus]
    end

    E1 --> P
    E2 --> P
    E3 --> P

    P -->|Queries| Grafana[(Grafana Dashboards)]
    P -->|Alerts| Alertmanager[(Alertmanager)]
```
### Explanation of the Core Diagram

- Apps/Databases → expose metrics via exporters.
- Prometheus server → scrapes data and stores it in its TSDB.
- Grafana → queries Prometheus for dashboards.
- Alertmanager → receives alerts when rules trigger.

This shows the essential workflow without the complexity of scaling.

The following diagram shows how Prometheus scales with Thanos (for long-term storage and HA).
```mermaid
flowchart TD
    subgraph AppLayer["Applications & Services"]
        A1["App 1"] -->|/metrics| E1[Exporter]
        A2["App 2"] -->|/metrics| E2[Exporter]
        A3["Database"] -->|/metrics| E3[Exporter]
    end

    subgraph PrometheusCluster["Prometheus Servers"]
        P1["Prometheus #1"]
        P2["Prometheus #2"]
    end

    E1 --> P1
    E2 --> P1
    E3 --> P2

    subgraph Thanos["Thanos Components"]
        Q["Thanos Querier"]
        S["Thanos Sidecar"]
        G["Thanos Store Gateway"]
        C["Object Storage (S3/GCS/Azure)"]
    end

    P1 --> S
    P2 --> S
    S --> C
    C --> G
    G --> Q
    P1 --> Q
    P2 --> Q

    Q --> Grafana["Grafana Dashboards"]
    Q --> Alertmanager["Alertmanager"]
```
### Explanation of the Diagram

- Applications/Exporters → expose metrics at `/metrics`.
- Prometheus servers → scrape and store metrics locally.
- Thanos Sidecar → connects Prometheus to long-term storage.
- Object storage (S3/GCS) → stores historical metrics.
- Thanos Store Gateway + Querier → allow querying all Prometheus instances plus historical data.
- Grafana & Alertmanager → visualization + alerting, now with a global view.
## Final Takeaway

Prometheus is:
- Simple to start with.
- Powerful with PromQL.
- Scalable with Thanos/Cortex.
- Best-in-class for cloud-native monitoring.

If you're running Kubernetes or microservices, Prometheus should be your first monitoring tool of choice.