# 🧾 DevOps Dashboard Cheat Sheet

*How to read dashboards & diagnose issues like a pro*

## 🧩 The Big Picture – Why Dashboards Matter

Dashboards are not just "pretty graphs" – they answer 3 core questions:

- Is it up? → availability (can users access it?)
- Is it fast? → performance (is it responsive?)
- Is it healthy? → resource health (can it stay running?)

Always think in terms of symptoms → cause. For example: "Website is slow" (symptom) → "database CPU at 100%" (cause).
## Core Areas to Monitor
### 1. CPU

- **What to look for:**
  - % usage per core and overall.
  - Breakdown: system, user, iowait.
- **Rules of thumb:**
  - <70% sustained → healthy.
  - >90% sustained → CPU bottleneck.
- **What it means:**
  - High user CPU → app code is working hard (expected under load).
  - High system CPU → kernel, syscalls, maybe networking overhead.
  - High iowait → CPU is waiting on disk/network → possible I/O bottleneck.
- **Actions:**
  - Add replicas / scale the service.
  - Profile the app (optimize queries, caching).
  - Investigate the I/O subsystem if iowait is high (see the example queries below).
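A few illustrative PromQL sketches for these panels, assuming the standard node_exporter metric `node_cpu_seconds_total` (adjust label filters to your setup):

```promql
# Overall CPU utilisation % per instance (everything except idle time)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# iowait share %: high sustained values point at a disk/network I/O bottleneck
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

# user vs system breakdown %
100 * avg by (instance, mode) (rate(node_cpu_seconds_total{mode=~"user|system"}[5m]))
```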
### 2. Memory (RAM)

- **What to look for:**
  - Total used vs total available.
  - Cache/buffers vs actual application usage.
  - Swap usage.
- **Rules of thumb:**
  - If memory is "full" but mostly cache → not a problem (Linux uses free RAM for caching).
  - If swap is active → real memory pressure.
- **What it means:**
  - High app usage → possible memory leak or undersized instance.
  - High swap usage → system thrashing, huge slowdown.
- **Actions:**
  - Restart the leaking service.
  - Add more RAM.
  - Optimize memory-heavy queries (example queries below).
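A minimal PromQL sketch for memory pressure, assuming node_exporter's `node_memory_*_bytes` gauges:

```promql
# Memory genuinely unavailable to applications (cache/buffers excluded), as %
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Swap in use, as % (yields NaN when no swap is configured)
100 * (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)
```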
### 3. Disk / Storage

- **Metrics:**
  - Disk usage %.
  - IOPS (reads/writes per second).
  - Latency (avg read/write ms).
- **Rules of thumb:**
  - >80% disk full → plan cleanup/expansion.
  - Latency >20ms on SSDs → bottleneck.
- **What it means:**
  - High latency + high iowait → storage bottleneck.
  - High disk usage → risk of system crash when full.
- **Actions:**
  - Clear / rotate logs.
  - Scale to larger disks.
  - Use faster storage (NVMe, SSD) – see the queries below.
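Example PromQL for the disk panels, assuming node_exporter's filesystem and disk metrics (device and mountpoint labels will vary per host):

```promql
# Filesystem usage %, ignoring tmpfs/overlay pseudo-filesystems
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})

# Average read latency in ms per completed read (swap read for write to get write latency)
1000 * rate(node_disk_read_time_seconds_total[5m])
     / rate(node_disk_reads_completed_total[5m])

# Total IOPS per device
rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
```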
### 4. Network

- **Metrics:**
  - Bandwidth in/out.
  - Packet drops/errors.
  - Latency / RTT.
- **Rules of thumb:**
  - Bandwidth near the link limit (e.g. a 1Gbps NIC at 950Mbps) → saturation.
  - Packet drops/errors >0 → network health issue.
- **What it means:**
  - High outbound traffic → app serving lots of data (normal, or a DDoS).
  - Latency spikes → congestion, routing problems.
- **Actions:**
  - Scale horizontally (more nodes).
  - Throttle heavy clients.
  - Investigate the load balancer (queries below).
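Illustrative PromQL for the network panels, assuming node_exporter's per-interface counters:

```promql
# Throughput in bits/s per interface: compare against the NIC's link speed
8 * rate(node_network_receive_bytes_total{device!="lo"}[5m])
8 * rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Errors and drops: should stay flat at 0 on a healthy link
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
rate(node_network_receive_drop_total[5m]) + rate(node_network_transmit_drop_total[5m])
```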
### 5. Containers / Pods

- **Metrics:**
  - CPU & memory per container.
  - Container restarts (counter).
  - Pod status (running, crashloop).
- **Red flags:**
  - Containers restarting repeatedly → crashloop, misconfiguration.
  - CPU/memory throttling in Kubernetes.
- **What it means:**
  - Misconfigured resource limits.
  - App bugs (OOMKilled, segfaults).
- **Actions:**
  - Check logs for the container crash cause.
  - Adjust requests/limits in K8s.
  - Add replicas for load (example queries below).
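Sketch queries for container dashboards, assuming both cAdvisor metrics and kube-state-metrics are scraped (metric names differ slightly across versions):

```promql
# Pods whose containers restarted in the last 15 minutes (kube-state-metrics)
increase(kube_pod_container_status_restarts_total[15m]) > 0

# CPU throttling ratio per container (cAdvisor): values near 1 mean the CPU limit is too low
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])

# Memory working set per container, to compare against its configured limit
container_memory_working_set_bytes{container!=""}
```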
### 6. Application Level

- **Key metrics (the "Golden Signals" from Google SRE):**
  - Latency → how long requests take.
  - Traffic → requests per second.
  - Errors → % of failed requests.
  - Saturation → how "full" the system is (queues, memory).
- **Interpretation:**
  - High latency + high errors → app/service bottleneck.
  - High latency + low CPU/memory → external dependency issue.
  - High traffic spikes → expected, or DDoS?
- **Actions:**
  - Scale the service horizontally.
  - Add a caching layer.
  - Optimize slow queries (see the golden-signal queries below).
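A sketch of the four signals in PromQL; the metric names `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket` are common conventions but depend entirely on how your app is instrumented:

```promql
# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of responses that are 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Latency: 95th percentile from a request-duration histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```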
### 7. Databases

- **Metrics:**
  - Query throughput (QPS).
  - Query latency.
  - Locks, deadlocks.
  - Buffer cache hit ratio.
- **Red flags:**
  - Slow queries → long latency spikes.
  - Lock waits → contention.
- **Actions:**
  - Add indexes.
  - Optimize queries.
  - Add read replicas (example queries below).
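Example queries for a PostgreSQL dashboard, assuming postgres_exporter's default `pg_stat_database_*` metrics (mysql_exporter exposes equivalents):

```promql
# Transaction throughput per database
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])

# Buffer cache hit ratio: healthy OLTP workloads usually sit close to 1
rate(pg_stat_database_blks_hit[5m])
  / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Deadlocks: should stay at 0
rate(pg_stat_database_deadlocks[5m])
```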
### 8. Logs & Events

- Dashboards often link to logs (via Loki, ELK, etc.).
- Use them to confirm the *why* behind the metrics.
## 🚨 Diagnosis by Symptom

| Symptom | Likely Cause | Where to Look |
|---|---|---|
| High latency (slow site) | CPU/memory saturated, DB slow, network congestion | CPU, memory, DB dashboards |
| Frequent 500 errors | App crash, DB errors, bad config | App logs, DB metrics |
| Nodes going down | Out of memory, disk full, network partition | Node exporter, disk usage |
| Container restarts | Misconfig, OOMKilled, bad healthcheck | Container/pod dashboards |
| Traffic spike | Legit user load vs DDoS | Network + load balancer metrics |
| Disk full alerts | Logs, data growth, temp files | Disk usage dashboard |
## 🛠️ Method: How to Read a Dashboard Like a Pro

- **Start broad → drill down**
  - Begin with the system overview (CPU/mem/disk).
  - Narrow to container/pod → app → DB.
- **Look for correlations**
  - High CPU at the same time as latency spikes?
  - High iowait + disk latency → storage problem.
- **Timeline matters**
  - Spikes vs sustained trends tell different stories.
- **Always check for "innocent victims"**
  - If all pods restart at once → node issue, not an app bug.
## 🎯 World-Class Habits

- Always correlate metrics + logs.
- Watch the rate of change, not just absolute numbers (e.g., 10GB of logs written per hour).
- Build mental models: Traffic ↑ → CPU ↑ → Latency ↑ is expected. If not, dig deeper.
- Treat dashboards as hypothesis tools, not truth → confirm with logs, traces, configs.
## TL;DR Cheatsheet

- CPU >90% sustained → bottleneck.
- Memory + swap high → thrashing.
- Disk >80% full → expand.
- Disk latency >20ms (SSD) → bottleneck.
- Network drops/errors → faulty NIC or congestion.
- Container restarts → crashloop (check the logs!).
- Latency + errors → app/DB issue.
- Traffic spike → scale, or check for DDoS.
## 🧠 Troubleshooting Flow – From Alert → Root Cause

```mermaid
flowchart TD
    A["ALERT or User Complaint"] --> B{"What is the symptom?"}
    B -->|Slow responses / High latency| C["Check Traffic Dashboard"]
    B -->|Errors or Crashes| D["Check Application Dashboard"]
    B -->|Node or Container down| E["Check Node/Pod Dashboard"]
    B -->|Disk Full| F["Check Disk Dashboard"]

    %% Latency path
    C --> C1{"Is traffic normal?"}
    C1 -->|Yes| C2["Check CPU & Memory"]
    C1 -->|No - Spike| C3["Check scaling"]
    C2 -->|CPU > 90%| C4["Scale out or optimize app"]
    C2 -->|Memory/Swap high| C5["Possible leak or undersized - fix or resize"]
    C2 -->|Both OK| C6["Check Database latency"]
    C3 -->|Scaling works| C4
    C3 -->|Scaling blocked| C7["Add caching/CDN, rate limiting"]
    C6 -->|DB slow| C8["Optimize queries, add replicas"]
    C6 -->|DB fine| C9["Check network"]
    C9 -->|Packet loss high| C10["Investigate NIC / load balancer"]
    C9 -->|Network fine| C11["Check external dependencies"]

    %% Errors path
    D --> D1{"Error type?"}
    D1 -->|HTTP 500s| D2["Check logs - DB errors/config issues"]
    D1 -->|CrashLoop| D3["Check container memory/CPU limits"]
    D1 -->|OOMKilled| D4["Raise memory or fix leaks"]

    %% Node/Pod path
    E --> E1{"Which failed?"}
    E1 -->|Node down| E2["Check disk, power, kernel logs"]
    E1 -->|Pod down| E3["Check healthchecks, events, logs"]

    %% Disk path
    F --> F1{"Disk usage > 80%?"}
    F1 -->|Yes - Logs| F2["Check log growth, tmp files"]
    F1 -->|Yes - Capacity| F3["Rotate/compress logs, expand disk"]
    F1 -->|No but I/O high| F4["Check IOPS & latency"]
```
## 🧩 How to Use This Flow

- **Start at the symptom**
  - Alert: high latency, errors, disk full, pod restarts.
  - User complaint: "site is slow" / "app keeps crashing".
- **Pick the path**
  - Latency → traffic → CPU/mem → DB → network.
  - Errors → logs → DB/app configs → containers.
  - Container down → check limits → healthchecks → logs.
  - Disk full → check log growth → cleanup → expand.
- **Correlate across layers**
  - Example: high latency + high CPU = bottleneck.
  - Example: high latency + normal CPU/mem = likely DB or network.
## Key Dashboards to Check Along the Way

| Step | Dashboard / Metric | What It Tells You |
|---|---|---|
| Traffic spike | Prometheus `http_requests_total` | Is the load abnormal? |
| CPU usage | Node exporter `node_cpu_seconds_total` | Is the system CPU-bound? |
| Memory & Swap | Node exporter `node_memory_MemAvailable_bytes` | Is the system thrashing? |
| Containers | cAdvisor / K8s pod metrics | Is a pod OOMKilled or throttled? |
| DB latency | postgres_exporter / mysql_exporter | Are queries slow? |
| Disk usage | node_exporter filesystem metrics | Is the disk nearly full? |
| Disk I/O | `node_disk_read_time_seconds_total` / `node_disk_write_time_seconds_total` | Storage bottleneck? |
| Network | `node_network_receive_errs_total` / `node_network_transmit_errs_total` | NIC drops/retransmits? |
| Logs | Loki / ELK / `docker logs` | The "why" behind the metrics. |
## 🛠️ Example Walkthroughs with the Flow

### Example 1: Slow Website

- Latency up → traffic spike.
- CPU at 95% → bottleneck.
- Fix: scale replicas from 3 → 6.

### Example 2: High Error Rate

- Errors are HTTP 500.
- Logs: "DB connection failed".
- DB connections maxed out.
- Fix: increase pool size, optimize queries.

### Example 3: Container Restarting

- Pod in CrashLoop.
- Container OOMKilled.
- Fix: raise memory limit, tune heap size.

### Example 4: Disk Full

- Disk usage at 95%.
- `/var/log` growing fast.
- Fix: rotate logs, expand disk.
## 🎯 DevOps Mindset

- Always think in layers: User → Load Balancer → App → DB → OS → Hardware/Network.
- Use dashboards to narrow down which layer is misbehaving.
- Confirm with logs to know *why*.
- Fix the immediate issue → plan a long-term solution.
## 🛠️ Troubleshooting Examples with Dashboards

### 🔥 Scenario 1: Website is Slow (High Latency)

#### Symptom

- Users report: "Site feels sluggish".
- Grafana shows latency rising above 2s.

#### Step 1: Check Traffic

- Dashboard: Request rate (Prometheus `http_requests_total`).
- Observation: Traffic is 2× higher than normal.
- → Spike in demand.

#### Step 2: Check CPU

- Dashboard: Node exporter CPU usage.
- Observation: CPU at 95%.
- → Server is CPU-bound.

#### Step 3: Correlate with the App

- Dashboard: Container resource usage (cAdvisor).
- Observation: The web app container is consuming most of the CPU.

#### Likely Cause

- Increased load → CPU saturated.

#### Fix

- Scale replicas from 3 → 6.
- Add caching (e.g., Cloudflare / Redis).
### 🔥 Scenario 2: High Error Rate (HTTP 500s)

#### Symptom

- Alert: "5% of requests failing with 500 errors".

#### Step 1: Check Application Logs

- Dashboard link: Loki/ELK.
- Observation: "DB connection failed: too many connections".

#### Step 2: Check Database Metrics

- Dashboard: PostgreSQL exporter.
- Observation: Connection count at the max (100).

#### Step 3: Check Query Latency

- Observation: Queries piling up, high wait time.

#### Likely Cause

- DB connection pool exhausted.

#### Fix

- Increase the DB connection pool size.
- Optimize slow queries.
- Add a DB read replica if traffic keeps growing (see the query sketch below).
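A sketch of how this connection saturation shows up in PromQL, assuming postgres_exporter with its `pg_stat_activity` and `pg_settings` collectors enabled and a single database instance:

```promql
# Connections in use vs the server limit (single Postgres instance assumed):
# values approaching 1 mean new connections will start to fail
sum(pg_stat_activity_count) / sum(pg_settings_max_connections)

# Breakdown by connection state (active, idle, idle in transaction, ...)
sum by (state) (pg_stat_activity_count)
```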
### 🔥 Scenario 3: Container Keeps Restarting

#### Symptom

- Alert: "App container restarting repeatedly (CrashLoopBackOff)".

#### Step 1: Check Container Restarts

- Dashboard: cAdvisor / Kubernetes pod metrics.
- Observation: Container restarted 12 times in 5 minutes.

#### Step 2: Check Memory Usage

- Observation: Container hits its memory limit (512MB), then gets OOMKilled.

#### Step 3: Check Logs

- Observation: Java app error: "OutOfMemoryError: Java heap space".

#### Likely Cause

- App exceeds the container memory limit.

#### Fix

- Increase the memory limit to 1GB.
- Tune the JVM heap settings.
- Monitor for leaks (query sketch below).
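A sketch for spotting this before the OOMKill, assuming cAdvisor plus kube-state-metrics v2 (older kube-state-metrics versions expose limits under per-resource metric names):

```promql
# Memory working set as a fraction of the container's limit: values near 1 precede OOMKills
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})

# Restart count over the last 10 minutes, per container
increase(kube_pod_container_status_restarts_total[10m]) > 0
```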
### 🔥 Scenario 4: Disk Full

#### Symptom

- Alert: "Disk usage > 90%".

#### Step 1: Check Disk Usage

- Dashboard: Node exporter filesystem metrics.
- Observation: `/var/log` partition at 95%.

#### Step 2: Drill Into Log Volume

- Observation: App logs growing at 1GB/hour.

#### Step 3: Correlate With Traffic

- Observation: Error logs spiking with the traffic surge.

#### Likely Cause

- App spamming logs due to repeated errors.

#### Fix

- Fix the root cause of the error.
- Rotate/compress logs.
- Increase disk size if needed (see the projection query below).
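Beyond the static 80/90% thresholds, a projection query can warn before the disk actually fills. A sketch using node_exporter metrics; the `/var/log` mountpoint filter assumes it is mounted as a separate filesystem:

```promql
# Filesystem projected to run out of space within 4 hours, based on the last 6 hours of growth
predict_linear(node_filesystem_avail_bytes{mountpoint="/var/log"}[6h], 4 * 3600) < 0

# Current usage % for the same mount
100 * (1 - node_filesystem_avail_bytes{mountpoint="/var/log"}
             / node_filesystem_size_bytes{mountpoint="/var/log"})
```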
### 🔥 Scenario 5: Network Latency Spikes

#### Symptom

- Alert: "P95 latency > 500ms".

#### Step 1: Check the Network Dashboard

- Observation: High packet retransmits on one node.

#### Step 2: Check Node Health

- Observation: NIC errors increasing.

#### Step 3: Correlate With the Service

- Observation: Only services on node-2 are affected.

#### Likely Cause

- Faulty NIC or driver issue on node-2.

#### Fix

- Cordon and drain node-2 (`kubectl cordon node-2`, then `kubectl drain node-2 --ignore-daemonsets`).
- Replace or troubleshoot the hardware.
### 🔥 Scenario 6: Prometheus Missing Metrics

#### Symptom

- Dashboard panel is blank.

#### Step 1: Check Prometheus Targets

- Dashboard: Prometheus `/targets` UI.
- Observation: Target `node-exporter:9100` is `DOWN`.

#### Step 2: Check the Service

- Observation: The node-exporter container has stopped.

#### Step 3: Check Logs

- Observation: Error binding to port `9100`.

#### Likely Cause

- Port conflict or container crash.

#### Fix

- Restart node-exporter.
- Ensure no other process is using port 9100 (see the `up` check below).
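Prometheus tracks scrape health itself through the synthetic `up` metric, so blank panels can be caught with an expression like this sketch (the `job` label value is an assumption about your scrape config):

```promql
# 1 = last scrape succeeded, 0 = target down; fires for any exporter Prometheus cannot reach
up == 0

# Same check scoped to one job, handy as an alert expression
up{job="node-exporter"} == 0
```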
## Quick Troubleshooting Playbook

| Symptom | Where to Look | Common Causes | Actions |
|---|---|---|---|
| High latency | CPU, DB latency | CPU saturation, slow queries, network congestion | Scale out, add cache, optimize queries |
| High error rate | Logs, app metrics | DB pool exhausted, app crash, misconfig | Fix DB, check app health |
| Container restarts | Container metrics, logs | OOMKilled, crashloop, bad healthcheck | Increase limits, debug app |
| Disk full | Node exporter disk | Log growth, data not rotated | Clean up, rotate logs |
| Network drops | NIC stats, node dashboard | Faulty NIC, congestion | Replace NIC, move workload |
| Missing metrics | Prometheus targets | Exporter down, scrape failure | Restart exporter, fix config |

✅ With this playbook:

- Start at the symptom (latency, errors, restarts).
- Drill into the dashboards (CPU/mem/disk/net/app).
- Correlate across layers (infra → container → app → DB).
- Confirm with logs.
- Apply a fix or scale.