Setting things up¶
Let's dive into setting this stack up, step by step.
Our folder structure looks like this:
```
.
├── prometheus/           # Prometheus config files
├── grafana/              # Grafana config files
└── docker-compose.yaml   # spins up the Prometheus and Grafana containers
```
All files are fully commented.
Docker Swarm setup¶
We first set up our swarm. Ideally it would span multiple nodes, but we won't have that for now, as we'll be testing locally.
Monitoring your own app¶
We've already got the infrastructure monitoring stack:
- cAdvisor → monitors Docker containers (CPU, RAM, filesystem, network).
- Node Exporter → monitors host machine metrics.
- Prometheus → scrapes metrics.
- Grafana → visualizes everything.

Now the question is: how do we monitor your own app running in Docker?
3 ways to monitor your app in this stack¶
1. Basic container-level monitoring (already covered)¶
If your app is just a Docker container (say Nginx, MySQL, etc.), then cAdvisor automatically collects:
- CPU usage
- Memory usage
- Network traffic
- Filesystem usage

You get this "for free" already; no config changes needed.
2. App-specific exporters (most common)¶
Some apps expose their own Prometheus metrics, or you run an "exporter" alongside them:
- Nginx → nginx-prometheus-exporter
- Postgres → postgres_exporter
- MySQL → mysqld_exporter
- Redis → redis_exporter
Example: monitoring an Nginx container
```yaml
services:
  nginx:
    image: nginx:latest
    ports:
      - "8081:80"

  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:0.11.0
    # Note: the exporter needs stub_status enabled in the nginx config.
    command: -nginx.scrape-uri=http://nginx:80/stub_status
    ports:
      - "9113:9113"
```
Then add a scrape job to your `prometheus.yaml`:
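A minimal sketch, reusing the exporter's service name and port from the snippet above:

```yaml
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']  # exporter endpoint on the shared network
```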
3. Custom metrics (if you wrote the app)¶
If you're coding your own app, you can instrument it with OpenTelemetry or a Prometheus client library:
- Go → prometheus/client_golang
- Python → prometheus_client
- Node.js → prom-client

Then your app exposes `/metrics`, which Prometheus scrapes.
Example (Python Flask app):
```python
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# A simple counter metric, incremented on every request to "/".
requests_total = Counter('app_requests_total', 'Total requests')

@app.route('/')
def hello():
    requests_total.inc()
    return "Hello, world!"

@app.route('/metrics')
def metrics():
    # Prometheus scrapes this endpoint.
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)  # matches the scrape target below
```
Run it in Docker, then Prometheus scrapes `http://myapp:5000/metrics`.
Summary¶
- If you just want resource monitoring → cAdvisor + Node Exporter already do it.
- If you want application metrics → add exporters (Nginx, DB, etc.) or instrument your own app.
- Prometheus scrapes those metrics, and Grafana dashboards visualize them.
What OpenTelemetry gives you¶
- Instrumentation libraries for most languages (Python, Go, Java, Node.js, .NET, etc.).
- Collects application metrics, traces, and optionally logs.
- You run an OpenTelemetry Collector (agent) that receives this telemetry and forwards it to backends like:
  - Prometheus (metrics)
  - Jaeger/Tempo (traces)
  - Loki/ELK (logs)
  - Grafana/OTLP-compatible backends
How it works in your case¶
Right now you have:
- Prometheus + Grafana (metrics visualization)
- cAdvisor + Node Exporter (infra metrics)
If you want to add OpenTelemetry for your app:
- Instrument your app with the OTel SDK for your language.
- Run an OpenTelemetry Collector container in your stack.
- Configure it to export metrics to Prometheus (or directly to Grafana if you later switch to Grafana Agent/Cloud).
Example: Adding the OTel Collector¶
Extend your `docker-compose.yaml`:
```yaml
otel-collector:
  image: otel/opentelemetry-collector-contrib:0.95.0
  command: ["--config=/etc/otel-collector-config.yaml"]
  volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
  ports:
    - "4317:4317"  # OTLP gRPC
    - "4318:4318"  # OTLP HTTP
  networks:
    - monitoring
```
Create `otel-collector-config.yaml`:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```
Now:
- Your app sends OTLP data to `otel-collector:4317`.
- The Collector exposes a Prometheus endpoint on `:9464`.
- Add a scrape job for it to `prometheus.yaml`:
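A minimal sketch, assuming the Collector runs as the `otel-collector` service on the same network:

```yaml
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:9464']  # the Collector's Prometheus exporter endpoint
```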
Why OTel is a good idea¶
- If you're learning modern observability, OpenTelemetry is the industry standard.
- It works with Prometheus/Grafana, but also future-proofs you if you later use Jaeger, Tempo, or commercial backends (Datadog, New Relic, Grafana Cloud, etc.).
- It lets you monitor custom app metrics and traces, not just container resource usage.
Networks¶
- In Docker, containers can only talk to each other if they share a network.
- You already defined a custom network:
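For reference, this is the definition used by the Swarm stack later in this guide (plain local Compose can omit the driver and default to `bridge`):

```yaml
networks:
  monitoring:
    driver: overlay
```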
- Every container attached to it is discoverable by name (DNS resolution inside the network).
Example: Prometheus can scrape `cadvisor:8080` or `otel-collector:9464` because they're all on `monitoring`.
Example: Adding an app to your stack¶
Say you want to run a Python app that exposes metrics:
```yaml
myapp:
  build: ./myapp  # has a Dockerfile
  ports:
    - "5000:5000"  # expose the app locally
  networks:
    - monitoring
  environment:
    - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```
- This app is on the `monitoring` network.
- It can send metrics to `otel-collector`.
- Prometheus can scrape those metrics via the collector.
Do you always need the same compose file?¶
- For local dev/lab setups: yes, the easiest option is one compose file that runs monitoring + apps.
- In production: you might run the monitoring stack separately and just ensure apps join the same network (or expose metrics on a reachable endpoint).

So the rules are:¶
- Use one compose file for convenience when practicing locally.
- Make sure your app container joins the same network (`monitoring`).
- Add your app/exporter/collector to the `prometheus.yaml` scrape configs.
- Grafana just sees everything via Prometheus → dashboards light up.
Using `replicas` in Docker Compose v3¶
In your `docker-compose.yaml`, you can do this:
```yaml
myapp:
  deploy:
    replicas: 3
```
- Docker will spin up 3 containers of `myapp`.
- Each replica has its own internal hostname/endpoint, but Prometheus can scrape them individually if you tell it to.
How Prometheus discovers replicas¶
By default in Compose (non-Swarm), replicas are given names like:

```
myapp_1
myapp_2
myapp_3
```

Prometheus cannot auto-discover them unless you use Swarm or Kubernetes, so in `prometheus.yaml` you'd need to list them:
```yaml
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets:
          - 'myapp_1:80'
          - 'myapp_2:80'
          - 'myapp_3:80'
```
Alternative: Use Swarm mode¶
If you run your stack with Swarm (`docker stack deploy`), Prometheus can scrape all tasks behind the service name. For example:
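A sketch using the same `tasks.<service>` pattern as the full config later in this guide:

```yaml
scrape_configs:
  - job_name: 'myapp'
    dns_sd_configs:
      - names: ['tasks.myapp']  # resolves to every running task (replica) of the service
        type: A
        port: 80
```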
That will automatically resolve to all replicas.
Practical takeaway¶
- For local Compose testing: yes, you can simulate multiple endpoints with `replicas`, but you'll have to list them manually in the Prometheus config.
- For Swarm/Kubernetes: you get service discovery, so Prometheus can scrape all replicas dynamically.
Docker Swarm + Monitoring¶
Now let's go through the Swarm approach carefully, because it's very relevant for our use case (monitoring multiple replicas with Prometheus + Grafana). With Swarm, you don't just run containers, you run services. A service can have replicas, and Swarm automatically handles scaling, networking, and service discovery.
1. Initialize Swarm¶
On your local machine (or VM):
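```bash
docker swarm init
```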
This turns your Docker into a Swarm manager. You can add workers with `docker swarm join`, but for testing, one node is enough.
2. Compose file differences¶
In Swarm mode, you deploy with:
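```bash
docker stack deploy -c docker-compose.yaml <stack-name>
```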
Key points:
- Use `version: "3.x"` in your `docker-compose.yaml`.
- The `deploy` section (including `replicas`) only works in Swarm.
- Services run on an overlay network.
3. Example stack (monitoring + app replicas)¶
Here's a simplified `docker-compose.yaml` for Swarm:
```yaml
version: "3.8"

networks:
  monitoring:
    driver: overlay

services:
  grafana:
    image: grafana/grafana:10.0.3
    ports:
      - "3000:3000"
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9090:9090"
    configs:
      - source: prometheus_config
        target: /etc/prometheus/prometheus.yaml
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    deploy:
      mode: global  # one per node
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.5.0
    deploy:
      mode: global  # one per node
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
    networks:
      - monitoring

  myapp:
    image: nginx:alpine
    deploy:
      replicas: 3  # run 3 containers
    ports:
      - "8081:80"
    networks:
      - monitoring

configs:
  prometheus_config:
    file: ./prometheus.yaml
```
4. Prometheus config for Swarm discovery¶
Unlike plain Compose, you don't need to list every replica: you can scrape all tasks of a service.

`prometheus.yaml`:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    dns_sd_configs:
      - names: ['tasks.cadvisor']
        type: A
        port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
      - names: ['tasks.node-exporter']
        type: A
        port: 9100

  - job_name: 'myapp'
    dns_sd_configs:
      - names: ['tasks.myapp']
        type: A
        port: 80
```
`tasks.<service-name>` is a DNS entry that resolves to all replicas of that service. So if `myapp` has 3 replicas, Prometheus scrapes all 3 automatically.
5. Deploy the stack¶
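Assuming we name the stack `monitoring` (this prefix shows up in the service names later):

```bash
docker stack deploy -c docker-compose.yaml monitoring
```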
Check:
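```bash
docker service ls
docker service ps monitoring_myapp   # per-task detail for the app service
```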
6. Verify¶
- Open Grafana → http://localhost:3000
- Add Prometheus as a datasource → http://prometheus:9090
- Query `up{job="myapp"}` → you should see 3 active targets (one per replica).
- Scale easily with one command (shown below; the service name assumes the `monitoring` stack):
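```bash
docker service scale monitoring_myapp=5
```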
Prometheus will now scrape 5 endpoints automatically, with no config change needed.
Why Swarm is better for monitoring replicas¶
- Service discovery: Prometheus automatically finds all replicas via DNS (`tasks.<service>`).
- Scaling: add or remove replicas with one command; Prometheus adjusts.
- Global services: for node-level exporters like cAdvisor and Node Exporter, `deploy.mode: global` ensures one container per node.
Bottom line: in Swarm, Prometheus + Grafana can automatically discover all replicas of your app or exporters. You don't have to hardcode each replica as in plain Compose. This makes it very close to real-world HA setups.
Our Setup¶
Final Files¶
1. `docker-compose.yaml`¶
Use this Swarm-ready compose (note: `container_name` is removed because Swarm manages names):
```yaml
version: "3.8"

# --------------------------
# NETWORKS
# --------------------------
networks:
  monitoring:
    driver: overlay  # Overlay is required for multi-node Swarm networks.
                     # Services on this network can talk to each other using DNS names.

# --------------------------
# VOLUMES
# --------------------------
volumes:
  grafana_data:     # Docker-managed volume for Grafana data
  prometheus_data:  # Docker-managed volume for Prometheus TSDB

# --------------------------
# CONFIGS
# --------------------------
configs:
  prometheus_config:
    file: ./prometheus/prometheus.yaml  # External Prometheus config file mounted into the container.

# --------------------------
# SERVICES
# --------------------------
services:
  # --------------------------
  # GRAFANA (Visualization UI)
  # --------------------------
  grafana:
    image: grafana/grafana:10.0.3  # Stable Grafana release.
    restart: always                # Restart if it crashes.
    ports:
      - "3000:3000"  # Expose Grafana UI at http://<host>:3000
    environment:
      - GF_PANELS_DISABLE_SANITIZE_HTML=true                   # Allow richer HTML in dashboards.
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}          # Admin username (default: admin).
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}  # Admin password (default: admin).
      - GF_USERS_ALLOW_SIGN_UP=false                           # Prevent users from self-signup.
    volumes:
      - grafana_data:/var/lib/grafana  # Persist Grafana data (dashboards, configs, users).
    networks:
      - monitoring
    deploy:
      replicas: 1  # Single Grafana instance is usually enough.

  # --------------------------
  # PROMETHEUS (Time-series DB)
  # --------------------------
  prometheus:
    image: prom/prometheus:v2.47.0  # Stable Prometheus release.
    restart: always
    ports:
      - "9090:9090"  # Expose Prometheus UI at http://<host>:9090
    command:
      - "--config.file=/etc/prometheus/prometheus.yaml"  # Path to config inside container.
      - "--log.level=warn"                               # Reduce noise (options: debug, info, warn, error).
      - "--storage.tsdb.path=/prometheus"                # Persistent storage path inside container.
      - "--storage.tsdb.retention.time=7d"               # Retain metrics for 7 days.
    volumes:
      - prometheus_data:/prometheus  # Persist the time-series database.
    configs:
      - source: prometheus_config    # Load Prometheus config (scrape jobs).
        target: /etc/prometheus/prometheus.yaml
    networks:
      - monitoring
    deploy:
      replicas: 1  # Single Prometheus is fine for testing/labs.

  # --------------------------
  # CADVISOR (Container Metrics)
  # --------------------------
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2  # Container-level metrics collector.
    command: -logtostderr -docker_only
    ports:
      - "8080:8080"  # Expose cAdvisor UI at http://<host>:8080 (optional).
    volumes:  # Mounts required to read container stats.
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    networks:
      - monitoring
    deploy:
      mode: global  # Run one cAdvisor per Swarm node automatically.

  # --------------------------
  # NODE EXPORTER (Host Metrics)
  # --------------------------
  node-exporter:
    image: prom/node-exporter:v1.5.0  # Host-level metrics collector.
    ports:
      - "9100:9100"  # Expose node-exporter metrics at /metrics
    command:  # Override defaults for containerized use.
      - "--path.sysfs=/host/sys"
      - "--path.procfs=/host/proc"
      - "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)"
      - "--no-collector.ipvs"
    volumes:  # Mount host /proc and /sys for metrics.
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    networks:
      - monitoring
    deploy:
      mode: global  # Run one node-exporter per Swarm node automatically.

  # --------------------------
  # NGINX EXPORTER (1 per cluster)
  # --------------------------
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:0.11.0
    command:
      - -nginx.scrape-uri=http://nginx-app:80/stub_status
    ports:
      - "9113:9113"  # Metrics endpoint: http://localhost:9113/metrics
    networks:
      - monitoring
    deploy:
      replicas: 1

  # --------------------------
  # SAMPLE APP (Replace with yours)
  # --------------------------
  nginx-app:
    image: nginx:alpine
    ports:
      - "8081:80"  # External port for testing in the browser.
    networks:
      - monitoring
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro  # Custom config that enables stub_status for the exporter.
    deploy:
      replicas: 1  # Single replica for now; scale as needed.
      restart_policy:
        condition: on-failure
```
2. `prometheus/prometheus.yaml`¶
```yaml
# -------------------------------
# GLOBAL CONFIGURATION
# -------------------------------
global:
  # How often Prometheus scrapes all targets.
  # Default is 1m, but 30s gives higher resolution (more data points).
  scrape_interval: 30s
  # How often Prometheus evaluates alerting/recording rules.
  # Should usually match scrape_interval unless you need faster alerting.
  evaluation_interval: 30s

# -------------------------------
# SCRAPE CONFIGURATIONS
# -------------------------------
scrape_configs:
  # -------------------------------
  # NODE EXPORTER JOB
  # -------------------------------
  - job_name: 'node-exporter'
    # In Swarm mode, each service gets DNS entries like "tasks.<service-name>".
    # "tasks.node-exporter" resolves to all running replicas (1 per node in this case).
    dns_sd_configs:
      - names: ['tasks.node-exporter']  # Service name to resolve.
        type: A                         # DNS record type (A = IPv4 addresses).
        port: 9100                      # Port where node-exporter exposes /metrics.

  # -------------------------------
  # CADVISOR JOB
  # -------------------------------
  - job_name: 'cadvisor'
    # Same trick: "tasks.cadvisor" resolves to all replicas of cadvisor.
    # Since cadvisor runs in "global" mode, that means 1 per node.
    dns_sd_configs:
      - names: ['tasks.cadvisor']
        type: A
        port: 8080  # cAdvisor default metrics port.

  # -------------------------------
  # PROMETHEUS SELF-MONITORING
  # -------------------------------
  - job_name: 'prometheus'
    # Prometheus exposes its own metrics at /metrics (port 9090).
    # Scraping itself lets you monitor its performance (scrape durations, TSDB health, memory, etc.).
    static_configs:
      - targets: ["prometheus:9090"]  # "prometheus" resolves to the Prometheus service inside the Swarm overlay network.

  # --------------------------
  # SAMPLE APP EXPORTER (Replace with yours)
  # --------------------------
  - job_name: 'nginx'
    dns_sd_configs:
      - names: ['tasks.nginx-exporter']
        type: A
        port: 9113
```
Notice:
- Instead of hardcoding `10.1.149.x` IPs, we use `tasks.<service>` DNS names for Swarm service discovery.
- This way, Prometheus automatically finds all replicas on all nodes.
Step-by-Step Setup¶
1. Initialize Swarm¶
On your manager node (your single machine is enough for practice):
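```bash
docker swarm init
```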
2. Create directories¶
On the manager:
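One way to lay things out, matching the folder structure at the top of this page:

```bash
mkdir -p monitoring/prometheus monitoring/grafana
```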
Copy the two files:
- `docker-compose.yaml` → `monitoring/docker-compose.yaml`
- `prometheus.yaml` → `monitoring/prometheus/prometheus.yaml`
3. Deploy the stack¶
From the `monitoring/` directory:
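```bash
docker stack deploy -c docker-compose.yaml monitoring
```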
4. Verify¶
```bash
docker stack ls
docker service ls
docker service ps monitoring_prometheus
docker service ps monitoring_cadvisor
```
5. Access dashboards¶
- Grafana → http://localhost:3000 (`admin`/`admin`)
- Prometheus → http://localhost:9090
- Query `up` in Prometheus → you'll see node-exporter and cadvisor targets for all swarm nodes.
6. Import Grafana dashboards¶
- Add Prometheus as a datasource → http://prometheus:9090
- Import dashboards:
  - Node Exporter Full (ID: 1860)
  - Docker & cAdvisor (ID: 893)
End Result¶
- Node Exporter (global service) → one instance per node.
- cAdvisor (global service) → one instance per node.
- Prometheus scrapes everything automatically via DNS.
- Grafana visualizes it all with ready-made dashboards.

Now you can add any app as a Swarm service, and Prometheus can scrape it with the same `tasks.<service>` trick.
Validation¶
Run:
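```bash
docker service ls
```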
If all the services are running cleanly, you should see:
```
monitoring_cadvisor        global      1/1
monitoring_grafana         replicated  1/1
monitoring_nginx-app       replicated  1/1
monitoring_nginx-exporter  replicated  1/1
monitoring_node-exporter   global      1/1
monitoring_prometheus      replicated  1/1
```
That means:
- Grafana is up (http://localhost:3000)
- Prometheus is up (http://localhost:9090)
- Node Exporter is running per node (`:9100`)
- cAdvisor is running per node (`:8080`)
- The Nginx app is reachable (http://localhost:8081)
- The Nginx Prometheus Exporter is running (http://localhost:9113/metrics)
Next steps (to confirm everything)¶
- Check Nginx `stub_status` and the exporter's metrics endpoint; you should see output like the curl sketch after this list.
- Check Prometheus targets: open http://localhost:9090/targets. You should see:
  - `node-exporter` targets (one per node)
  - `cadvisor` targets (one per node)
  - `prometheus` itself
  - the `nginx` exporter target (`:9113`)
- Import Grafana dashboards:
  - Prometheus datasource: http://prometheus:9090
  - Useful dashboards:
    - Node Exporter Full → 1860
    - Docker & cAdvisor → 893
    - Nginx Prometheus Exporter → 12708
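A sketch of the two curl checks, assuming `stub_status` is enabled in the mounted `default.conf` (counter values will differ):

```bash
# 1) Nginx stub_status through the app's published port
curl http://localhost:8081/stub_status
# Active connections: 1
# server accepts handled requests
#  3 3 3
# Reading: 0 Writing: 1 Waiting: 0

# 2) Nginx exporter metrics
curl -s http://localhost:9113/metrics | grep '^nginx'
# nginx_connections_active 1
# nginx_http_requests_total 3
# nginx_up 1
```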
If all those checks succeed, your full monitoring stack is good to go: infra + containers + Nginx app.
Automating Grafana Datasource & Dashboard Setup¶
Grafana supports provisioning via config files, which means you can pre-bake datasources and dashboards so that Grafana comes up ready to use, with no manual setup.
1. Provision the Prometheus datasource¶
Create `./grafana/provisioning/datasources/datasource.yaml`:
```yaml
apiVersion: 1                   # Required: API version for Grafana provisioning configs.

datasources:
  - name: Prometheus            # Friendly name of the datasource (shown in Grafana).
    type: prometheus            # The datasource plugin type (Prometheus in this case).
    access: proxy               # Grafana proxies queries to Prometheus (recommended).
    url: http://prometheus:9090 # Internal DNS name of the Prometheus service in Swarm.
    isDefault: true             # Marks this datasource as the default one.
```
2. Provision dashboards¶
Create `./grafana/provisioning/dashboards/dashboard.yaml`:
```yaml
apiVersion: 1                       # Provisioning config schema version.

providers:
  - name: 'default'                 # Provider name (arbitrary, but must be unique).
    orgId: 1                        # Organization ID (Grafana default org = 1).
    folder: ''                      # Put dashboards into the root folder.
    type: file                      # Provision from local files.
    disableDeletion: false          # Allow Grafana UI users to delete dashboards.
    editable: true                  # Allow edits in the UI (won't overwrite files).
    updateIntervalSeconds: 10       # Grafana checks every 10s for updated JSON files.
    options:
      path: /etc/grafana/dashboards # Directory where dashboard JSONs live inside the container.
```
3. Download dashboard JSONs (from Grafana.com IDs)¶
Example: download the dashboards you want by ID (two here; swap in the IDs you need):
```bash
# Node Exporter Full (ID 1860, rev 27)
curl -L https://grafana.com/api/dashboards/1860/revisions/27/download \
  -o ./grafana/dashboards/node-exporter.json

# NGINX (ID 12708, rev 1)
curl -L https://grafana.com/api/dashboards/12708/revisions/1/download \
  -o ./grafana/dashboards/nginx.json
```
After this, your structure looks like:

```
grafana/
├── provisioning/
│   ├── datasources/
│   │   └── datasource.yaml   # Preconfigured Prometheus datasource
│   └── dashboards/
│       └── dashboard.yaml    # Dashboard provisioning config
└── dashboards/
    ├── node-exporter.json    # Downloaded dashboards
    └── nginx.json
```
4. Update the Grafana service in your stack file¶
```yaml
grafana:
  ...
  volumes:
    - grafana_data:/var/lib/grafana
    # Mount provisioning configs into the container (datasources + dashboards).
    - ./grafana/provisioning:/etc/grafana/provisioning
    # Mount the dashboards directory with the JSON files.
    - ./grafana/dashboards:/etc/grafana/dashboards
  ...
```
5. Deploy¶
Now when you run:
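```bash
docker stack deploy -c docker-compose.yaml monitoring
```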
Grafana will come up with:
- Prometheus datasource ready to use.
- 2 dashboards preloaded (Node Exporter, Nginx).
- Fully repeatable in any environment (dev, staging, prod).
Best practice tip: we keep the dashboard JSONs version-controlled in our repo so we know exactly which revisions are deployed, and we update to newer dashboard versions manually.