From first metric to full observability stack
Master modern observability with OpenTelemetry, Prometheus, Grafana, Loki, and Tempo. Learn to instrument applications, build dashboards, correlate signals, define SLOs, and debug production systems.
All lessons run locally using Docker containers. No cloud accounts or paid services required.
Progress through these modules to master monitoring and observability.
5 modules
Understand the three pillars of observability (metrics, logs, traces), learn the Four Golden Signals, RED and USE methods, and see why observability matters for modern systems. Starting state: nothing required. After this lesson: conceptual foundation for the course, plus a quick taste of real Prometheus metrics.
Build a local observability stack with Docker Compose. You'll run Prometheus for metrics collection, Grafana for visualization, and an OpenTelemetry Collector to receive and route telemetry data. Starting state: Docker installed, no prior lab. After this lesson: ~/observability-lab/ running with Prometheus (:9090), Grafana (:3001), and OTel Collector (:4317/:4318/:8889). This lab directory is used for every lesson from 102 through 405.
Instrument a Node.js application with the OpenTelemetry SDK. You'll learn the OTel architecture, add auto-instrumentation for traces and metrics, configure exporters to send data to your OTel Collector, and see real telemetry flowing through your observability stack. Starting state: ~/observability-lab/ running from Lesson 102 (Prometheus, Grafana, OTel Collector). After this lesson: a demo Node.js app at ~/observability-lab/app/ sending metrics and traces to your stack.
Learn Prometheus metric types and PromQL, then build a Grafana dashboard from scratch. You'll create panels for request rate, error rate, and latency percentiles, and assemble a Four Golden Signals dashboard. Starting state: ~/observability-lab/ running with demo app from Lesson 103. After this lesson: a Four Golden Signals dashboard in Grafana, completing Module 1.
Master the Prometheus Query Language from selectors and matchers to advanced aggregations. Learn the difference between instant and range vectors, use rate() and histogram_quantile(), and build production-ready dashboard queries. Starting state: ~/observability-lab/ running with Prometheus, Grafana, OTel Collector, and demo app from Module 1. After this lesson: you can write PromQL queries for any metric in your stack.
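As a preview of what histogram_quantile() computes, here is a minimal Python sketch of the linear interpolation Prometheus applies within a classic histogram bucket. The function name mirrors the PromQL function, but the bucket values are made up for illustration, and edge cases such as NaN inputs and native histograms are left out.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    A simplified sketch of the interpolation used for classic histograms.
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead of interpolating.
                return prev_bound
            in_bucket = count - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical latency buckets in seconds: (le, cumulative count).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.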
Add custom application metrics to your existing demo app using the OpenTelemetry SDK. You'll create counters, gauges, and histograms for business logic, verify them in Prometheus, and build Grafana dashboards. Starting state: ~/observability-lab/ running from Module 1 (Prometheus, Grafana, OTel Collector, demo app code at ~/observability-lab/app/). After this lesson: demo app enhanced with custom metrics (app_http_requests_total, app_orders_processed_total, app_active_connections, app_http_request_duration_seconds) and new endpoints (/order, /slow, /error).
Deploy Node Exporter for host metrics and Blackbox Exporter for synthetic endpoint monitoring. Write Prometheus recording rules to pre-compute expensive queries for fast dashboards and reliable alerting.
Set up Prometheus Alertmanager for production alerting. Write alerting rules based on symptoms, configure routing and receivers, understand grouping, inhibition, and silencing, and learn best practices to avoid alert fatigue.
Learn why structured JSON logs are essential for observability, how to include trace context for correlation, and how to configure logging levels, context propagation, and the OpenTelemetry log bridge API.
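To make the idea concrete, here is a minimal sketch of a structured log line that carries trace context. It uses plain Python with no OpenTelemetry dependency; the helper name format_log, the field names other than trace_id, and the ID values are illustrative assumptions, not the course's actual schema.

```python
import json
import time

def format_log(level, message, trace_id, span_id, **fields):
    """Render one log event as a single JSON object (hypothetical helper)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "message": message,
        # Trace context lets a log backend link this line to its trace.
        "trace_id": trace_id,
        "span_id": span_id,
    }
    record.update(fields)  # extra structured fields, e.g. order_id
    return json.dumps(record)

# Example IDs in W3C Trace Context format (32 and 16 hex characters).
line = format_log(
    "error",
    "payment failed",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    order_id=42,
)
print(line)
```

Since every field is a JSON key, a query can filter on level or look up all lines for one trace_id without parsing free-form text.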
Deploy Grafana Loki for centralized log aggregation, configure the OTel Collector to export logs to Loki, learn the LogQL query language, and build log exploration dashboards in Grafana.
Deploy Grafana Tempo to collect and query distributed traces, configure the OTel Collector to export traces via OTLP, learn to read span waterfalls, apply sampling strategies, and troubleshoot slow requests using trace data.
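One common sampling strategy is ratio-based head sampling, where the keep-or-drop decision is derived from the trace ID so that every service reaches the same verdict for a given trace. The sketch below illustrates the idea in plain Python. The function should_sample is hypothetical and is not the exact algorithm used by any OpenTelemetry SDK.

```python
import random

def should_sample(trace_id: str, ratio: float) -> bool:
    """Decide whether to keep a trace, using only its ID (illustrative)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the hex trace ID as a uniform random number.
    value = int(trace_id[-16:], 16)
    return value < ratio * 2**64

# Simulate 10,000 random 128-bit trace IDs and keep roughly 10% of them.
rng = random.Random(0)
trace_ids = [format(rng.getrandbits(128), "032x") for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
kept_fraction = kept / len(trace_ids)
```

The decision is deterministic for a given trace ID, which avoids traces where some spans were kept and others dropped.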
Master the art of cross-signal correlation in Grafana: link metrics to traces via exemplars, navigate from logs to traces via trace_id, and debug incidents using all three observability pillars together.
Learn to define Service Level Indicators (SLIs), set Service Level Objectives (SLOs), and manage error budgets to make data-driven reliability decisions.
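The arithmetic behind an error budget is short enough to sketch. The Python below assumes a request-based SLI (good requests divided by total requests). The function names and the traffic numbers are illustrative, not taken from the lesson.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based SLO (sketch)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1.0 - slo) * total_requests  # failures the SLO tolerates
    consumed = failed_requests / allowed    # fraction of the budget used
    return {
        "allowed_failures": allowed,
        "consumed": consumed,
        "remaining": 1.0 - consumed,
    }

def allowed_downtime_minutes(slo: float, days: int) -> float:
    """Time-based view: minutes of full outage the SLO permits in a window."""
    return (1.0 - slo) * days * 24 * 60

# A 99.9% SLO over one million requests tolerates about 1,000 failures;
# 250 failures therefore use about a quarter of the budget.
budget = error_budget(0.999, 1_000_000, 250)
downtime = allowed_downtime_minutes(0.999, 30)  # about 43.2 minutes
```

A negative "remaining" value means the budget is exhausted, which is the usual trigger for prioritizing reliability work over new features.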
Learn dashboard design principles using Google's Golden Signals and Brendan Gregg's USE method layouts, choose the right visualizations, measure alert quality, and provision dashboards as code.
Learn to manage observability costs and performance by controlling metric cardinality, configuring trace and log sampling, setting retention policies, and planning capacity for your monitoring infrastructure.
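Cardinality grows multiplicatively with labels, which a few lines of Python can show. This sketch computes an upper bound on the number of time series for one metric. The label names and value counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
import math

def max_series(label_value_counts: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_value_counts.values())

# Hypothetical labels on a request counter such as app_http_requests_total.
labels = {"method": 5, "status": 10, "route": 20}
baseline = max_series(labels)

# Adding one unbounded label (here: a user ID) multiplies the total.
with_user_id = max_series({**labels, "user_id": 10_000})

print(baseline, with_user_id)
```

This is why identifiers such as user IDs or request IDs are usually kept out of metric labels and recorded in logs or traces instead.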
Learn to use observability tools for rapid incident detection, triage, investigation, and resolution. Build runbooks, practice structured postmortems, and foster a blameless culture.
Learn to version-control all monitoring configuration (Grafana dashboards, Prometheus rules, Alertmanager routing, and OTel Collector pipelines) using GitOps workflows and Terraform.
Work through these lessons at your own pace. Each step includes commands, explanations, and expected outcomes.