Observability Complete Learning Roadmap - Metrics, Logging & Tracing¶
🎯 Mục tiêu: Master Observability với công cụ phổ biến trong 10-12 tháng¶
Công cụ tập trung:
- Metrics & Alerting: Prometheus + Grafana
- Logging: ELK Stack (Elasticsearch + Logstash + Kibana)
- Tracing & APM: Jaeger + OpenTelemetry
🔰 GIAI ĐOẠN 1: OBSERVABILITY FOUNDATION (Tháng 1-2)¶
Week 1: Observability Fundamentals¶
Search Keywords để học:
- [ ] "What is observability in DevOps"
- [ ] "Three pillars of observability metrics logs traces"
- [ ] "Monitoring vs observability difference"
- [ ] "SRE monitoring best practices"
- [ ] "Observability strategy for microservices"
- [ ] "Golden signals SLI SLO SLA explained"
- [ ] "Observability tools comparison 2024"
- [ ] "Observability implementation roadmap"
Core Concepts:
- [ ] Four Golden Signals: Latency, Traffic, Errors, Saturation
- [ ] RED Method: Rate, Errors, Duration
- [ ] USE Method: Utilization, Saturation, Errors
- [ ] Service Level Indicators (SLI)
- [ ] Service Level Objectives (SLO)
- [ ] Error budgets và alerting
Week 2: Metrics Fundamentals¶
Search Keywords để học:
- [ ] "Application metrics types counter gauge histogram"
- [ ] "Infrastructure metrics CPU memory disk network"
- [ ] "Business metrics KPI monitoring"
- [ ] "Metrics collection strategies"
- [ ] "Time series data explained"
- [ ] "Metrics cardinality problems"
- [ ] "Metrics aggregation and downsampling"
- [ ] "Metrics retention policies"
Metrics Types Practice:
# Counter - monotonically increasing
http_requests_total
database_connections_created_total
# Gauge - can go up and down
cpu_usage_percent
memory_usage_bytes
active_connections
# Histogram - distribution of values
http_request_duration_seconds
database_query_duration_seconds
# Summary - similar to histogram
response_time_summary
request_size_summary
Week 3: Logging Fundamentals¶
Search Keywords để học:
- [ ] "Structured logging vs unstructured logging"
- [ ] "Log levels ERROR WARN INFO DEBUG TRACE"
- [ ] "Centralized logging architecture"
- [ ] "Log aggregation strategies"
- [ ] "Log parsing and enrichment"
- [ ] "Log retention and archiving"
- [ ] "Logging best practices security"
- [ ] "Log correlation and context"
Logging Best Practices:
// Structured logging example
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "ERROR",
"service": "user-service",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user123",
"message": "Failed to authenticate user",
"error": "invalid_credentials",
"duration_ms": 145
}
Week 4: Tracing Fundamentals¶
Search Keywords để học:
- [ ] "Distributed tracing explained"
- [ ] "OpenTelemetry tracing concepts"
- [ ] "Trace spans and context propagation"
- [ ] "Tracing in microservices architecture"
- [ ] "Trace sampling strategies"
- [ ] "Trace correlation with logs and metrics"
- [ ] "Application performance monitoring APM"
- [ ] "Tracing overhead and performance impact"
Tracing Concepts:
Trace: Complete request journey
├── Span: Individual operation
│ ├── Tags: Key-value metadata
│ ├── Logs: Timestamped events
│ └── Context: Propagation info
└── Span: Next operation
└── Child Span: Nested operation
📊 GIAI ĐOẠN 2: PROMETHEUS & GRAFANA MASTERY (Tháng 3-4)¶
Week 5: Prometheus Fundamentals¶
Search Keywords để học:
- [ ] "Prometheus installation setup tutorial"
- [ ] "Prometheus configuration prometheus.yml"
- [ ] "Prometheus data model time series"
- [ ] "Prometheus metrics types counter gauge histogram"
- [ ] "Prometheus scraping targets discovery"
- [ ] "Prometheus exporters node_exporter blackbox"
- [ ] "Prometheus recording rules and alerts"
- [ ] "Prometheus federation and high availability"
Prometheus Setup:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- "alertmanager:9093"
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
- job_name: "my-app"
static_configs:
- targets: ["app:8080"]
Week 6: PromQL Query Language¶
Search Keywords để học:
- [ ] "PromQL tutorial Prometheus query language"
- [ ] "PromQL selectors and matchers"
- [ ] "PromQL aggregation operators sum avg max"
- [ ] "PromQL functions rate increase irate"
- [ ] "PromQL range queries and instant queries"
- [ ] "PromQL histogram quantile calculations"
- [ ] "PromQL troubleshooting common errors"
- [ ] "PromQL performance optimization"
PromQL Examples:
# Basic queries
up
http_requests_total
cpu_usage{instance="server1"}
# Rate calculations
rate(http_requests_total[5m])
increase(http_requests_total[1h])
# Aggregations
sum by (instance) (cpu_usage)
avg_over_time(cpu_usage[1h])
# Advanced queries
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# Alerting queries
rate(http_requests_total{status="500"}[5m]) > 0.1
Week 7: Grafana Fundamentals¶
Search Keywords để học:
- [ ] "Grafana installation and setup"
- [ ] "Grafana data sources configuration"
- [ ] "Grafana dashboard creation tutorial"
- [ ] "Grafana panel types graph stat table"
- [ ] "Grafana templating and variables"
- [ ] "Grafana alerting rules setup"
- [ ] "Grafana plugins and extensions"
- [ ] "Grafana user management and permissions"
Grafana Dashboard JSON:
{
"dashboard": {
"title": "Application Metrics",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{instance}}"
}
]
}
],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up, instance)"
}
]
}
}
}
Week 8: Alertmanager & Alerting¶
Search Keywords để học:
- [ ] "Prometheus Alertmanager configuration"
- [ ] "AlertManager routing and grouping"
- [ ] "AlertManager notification channels email slack"
- [ ] "AlertManager silence and inhibition rules"
- [ ] "Prometheus alerting rules best practices"
- [ ] "Alert fatigue prevention strategies"
- [ ] "SRE alerting philosophy"
- [ ] "On-call management with AlertManager"
Alerting Rules:
# alerts.yml
groups:
- name: application
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.instance }}"
description: "Error rate is {{ $value }} requests per second"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 2m
labels:
severity: warning
annotations:
summary: "High latency on {{ $labels.instance }}"
📋 GIAI ĐOẠN 3: ELK STACK MASTERY (Tháng 5-6)¶
Week 9: Elasticsearch Fundamentals¶
Search Keywords để học:
- [ ] "Elasticsearch installation and configuration"
- [ ] "Elasticsearch cluster setup nodes shards"
- [ ] "Elasticsearch indices and mappings"
- [ ] "Elasticsearch query DSL tutorial"
- [ ] "Elasticsearch aggregations and analytics"
- [ ] "Elasticsearch index lifecycle management"
- [ ] "Elasticsearch security X-Pack authentication"
- [ ] "Elasticsearch performance tuning optimization"
Elasticsearch Setup:
# Elasticsearch installation
docker run -d --name elasticsearch \
-p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" \
-e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
elasticsearch:8.11.0
# Create index with mapping
curl -X PUT "localhost:9200/logs" -H 'Content-Type: application/json' -d'
{
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text" },
"service": { "type": "keyword" },
"trace_id": { "type": "keyword" }
}
}
}'
Week 10: Logstash Configuration¶
Search Keywords để học:
- [ ] "Logstash installation and configuration"
- [ ] "Logstash input plugins file beats http"
- [ ] "Logstash filter plugins grok mutate date"
- [ ] "Logstash output plugins elasticsearch stdout"
- [ ] "Logstash grok patterns for log parsing"
- [ ] "Logstash performance tuning pipeline"
- [ ] "Logstash monitoring and troubleshooting"
- [ ] "Logstash alternatives Fluentd Vector"
Logstash Configuration:
# logstash.conf
input {
beats {
port => 5044
}
file {
path => "/var/log/app/*.log"
start_position => "beginning"
}
}
filter {
if [fields][service] == "nginx" {
grok {
match => {
"message" => "%{NGINXACCESS}"
}
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
mutate {
convert => { "response_code" => "integer" }
convert => { "bytes" => "integer" }
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
stdout { codec => rubydebug }
}
Week 11: Kibana Analytics & Visualization¶
Search Keywords để học:
- [ ] "Kibana installation and setup"
- [ ] "Kibana index patterns and field mapping"
- [ ] "Kibana Discover log searching and filtering"
- [ ] "Kibana Visualize charts graphs and tables"
- [ ] "Kibana Dashboard creation and sharing"
- [ ] "Kibana Canvas for custom presentations"
- [ ] "Kibana alerting and notifications"
- [ ] "Kibana security and user management"
Kibana Query Examples:
// Kibana Query DSL
{
"query": {
"bool": {
"must": [
{
"match": {
"service": "user-service"
}
},
{
"range": {
"timestamp": {
"gte": "now-1h"
}
}
}
],
"filter": [
{
"term": {
"level": "ERROR"
}
}
]
}
}
}
Week 12: Beats & Log Shipping¶
Search Keywords để học:
- [ ] "Elastic Beats overview Filebeat Metricbeat"
- [ ] "Filebeat configuration log shipping"
- [ ] "Metricbeat system metrics collection"
- [ ] "Heartbeat uptime monitoring"
- [ ] "Packetbeat network monitoring"
- [ ] "Beats modules and processors"
- [ ] "Beats output configuration Elasticsearch Logstash"
- [ ] "Beats monitoring and troubleshooting"
Filebeat Configuration:
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
fields:
service: myapp
environment: production
fields_under_root: true
processors:
- add_host_metadata:
when.not.contains.tags: forwarded
- add_docker_metadata: ~
output.logstash:
hosts: ["logstash:5044"]
# Alternative direct to Elasticsearch
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "logs-%{+yyyy.MM.dd}"
logging.level: info
logging.to_files: true
🔍 GIAI ĐOẠN 4: TRACING & APM MASTERY (Tháng 7-8)¶
Week 13: OpenTelemetry Fundamentals¶
Search Keywords để học:
- [ ] "OpenTelemetry overview and architecture"
- [ ] "OpenTelemetry SDK installation setup"
- [ ] "OpenTelemetry auto-instrumentation"
- [ ] "OpenTelemetry manual instrumentation"
- [ ] "OpenTelemetry collector configuration"
- [ ] "OpenTelemetry exporters Jaeger Zipkin"
- [ ] "OpenTelemetry sampling strategies"
- [ ] "OpenTelemetry context propagation"
OpenTelemetry Auto-Instrumentation:
# Python application
from opentelemetry import trace
from opentelemetry.auto_instrumentation import sitecustomize
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
# Manual instrumentation
@tracer.start_as_current_span("process_request")
def process_request(user_id):
span = trace.get_current_span()
span.set_attribute("user.id", user_id)
# Your business logic here
result = database_call(user_id)
span.set_attribute("result.count", len(result))
return result
Week 14: Jaeger Distributed Tracing¶
Search Keywords để học:
- [ ] "Jaeger installation and deployment"
- [ ] "Jaeger architecture collector agent query"
- [ ] "Jaeger UI trace analysis and debugging"
- [ ] "Jaeger sampling strategies configuration"
- [ ] "Jaeger storage backends Elasticsearch Cassandra"
- [ ] "Jaeger performance tuning and scaling"
- [ ] "Jaeger integration with Kubernetes"
- [ ] "Jaeger vs Zipkin comparison"
Jaeger Docker Setup:
# Jaeger all-in-one
docker run -d --name jaeger \
-p 16686:16686 \
-p 14268:14268 \
-p 6831:6831/udp \
-p 6832:6832/udp \
jaegertracing/all-in-one:latest
# Production setup with Elasticsearch
docker run -d --name jaeger-collector \
-p 14267:14267 \
-p 14268:14268 \
-p 9411:9411 \
-e SPAN_STORAGE_TYPE=elasticsearch \
-e ES_SERVER_URLS=http://elasticsearch:9200 \
jaegertracing/jaeger-collector:latest
Week 15: Application Performance Monitoring¶
Search Keywords để học:
- [ ] "APM implementation strategies"
- [ ] "Application tracing best practices"
- [ ] "Performance bottleneck identification"
- [ ] "Database query tracing and optimization"
- [ ] "API performance monitoring"
- [ ] "Error tracking and debugging"
- [ ] "User experience monitoring RUM"
- [ ] "Mobile application performance monitoring"
APM Instrumentation Examples:
// Node.js Express application
const express = require("express");
const { trace, context } = require("@opentelemetry/api");
const app = express();
const tracer = trace.getTracer("my-app");
app.get("/api/users/:id", async (req, res) => {
const span = tracer.startSpan("get_user");
try {
span.setAttributes({
"user.id": req.params.id,
"http.method": req.method,
"http.url": req.url,
});
const user = await getUserFromDatabase(req.params.id);
span.setAttributes({
"user.found": !!user,
"db.query.duration": 150, // ms
});
res.json(user);
} catch (error) {
span.setStatus({
code: trace.SpanStatusCode.ERROR,
message: error.message,
});
span.setAttributes({
"error.type": error.constructor.name,
"error.message": error.message,
});
res.status(500).json({ error: "Internal Server Error" });
} finally {
span.end();
}
});
Week 16: Advanced Observability Patterns¶
Search Keywords để học:
- [ ] "Observability correlation metrics logs traces"
- [ ] "Distributed tracing sampling optimization"
- [ ] "Observability data pipeline architecture"
- [ ] "Cost optimization observability data"
- [ ] "Observability as Code GitOps monitoring"
- [ ] "Service mesh observability Istio Linkerd"
- [ ] "Chaos engineering with observability"
- [ ] "Observability security and privacy"
🚀 GIAI ĐOẠN 5: ADVANCED OBSERVABILITY (Tháng 9-12)¶
Week 17-18: Production Observability Setup¶
Search Keywords để học:
- [ ] "Production monitoring architecture design"
- [ ] "Multi-cluster observability setup"
- [ ] "Observability high availability and disaster recovery"
- [ ] "Cross-region monitoring and alerting"
- [ ] "Observability infrastructure scaling"
- [ ] "Monitoring infrastructure costs optimization"
- [ ] "Compliance and audit logging"
- [ ] "Zero-downtime observability upgrades"
Week 19-20: Advanced Analytics & ML¶
Search Keywords để học:
- [ ] "Machine learning for anomaly detection"
- [ ] "Predictive alerting and forecasting"
- [ ] "Log analysis with machine learning"
- [ ] "Automated root cause analysis"
- [ ] "Behavioral monitoring and profiling"
- [ ] "Pattern recognition in observability data"
- [ ] "AI-powered incident response"
- [ ] "Observability data science applications"
Week 21-22: Observability Automation¶
Search Keywords để học:
- [ ] "Infrastructure as Code for monitoring"
- [ ] "Automated dashboard generation"
- [ ] "Dynamic alerting rule management"
- [ ] "Self-healing systems with observability"
- [ ] "Automated incident escalation"
- [ ] "GitOps for observability configuration"
- [ ] "Policy-driven monitoring deployment"
- [ ] "Observability testing automation"
Week 23-24: Observability Leadership¶
Search Keywords để học:
- [ ] "Observability strategy and governance"
- [ ] "Building observability culture in teams"
- [ ] "Observability ROI measurement"
- [ ] "Vendor evaluation and tool selection"
- [ ] "Observability training and education"
- [ ] "Industry observability best practices"
- [ ] "Future trends in observability"
- [ ] "Observability conference talks and papers"
📚 HANDS-ON PROJECTS¶
📊 Prometheus + Grafana Projects¶
-
Complete Monitoring Stack
-
Multi-service application monitoring
- Custom metrics and dashboards
- Comprehensive alerting setup
-
Search: "Prometheus Grafana monitoring stack"
-
Kubernetes Monitoring
-
Cluster and pod monitoring
- Resource usage tracking
- Application performance monitoring
-
Search: "Kubernetes Prometheus monitoring"
-
Business Metrics Dashboard
- KPI and business metrics
- Real-time analytics
- Executive dashboards
- Search: "Business metrics monitoring Grafana"
📋 ELK Stack Projects¶
-
Centralized Logging Platform
-
Multi-application log aggregation
- Log parsing and enrichment
- Security and audit logging
-
Search: "ELK stack centralized logging"
-
Log Analytics and Investigation
-
Error pattern analysis
- Performance troubleshooting
- Security incident investigation
-
Search: "ELK stack log analysis"
-
Compliance and Audit Logging
- Regulatory compliance logging
- Audit trail management
- Data retention policies
- Search: "ELK compliance audit logging"
🔍 Tracing Projects¶
-
Microservices Tracing
-
End-to-end request tracing
- Performance bottleneck identification
- Error propagation analysis
-
Search: "Jaeger microservices tracing"
-
Database Performance Monitoring
-
Query performance tracing
- Connection pool monitoring
- Database bottleneck analysis
-
Search: "Database tracing OpenTelemetry"
-
API Performance Optimization
- API response time analysis
- Third-party service monitoring
- Performance optimization
- Search: "API performance monitoring tracing"
📋 SKILL MASTERY CHECKLIST¶
Prometheus & Grafana Expertise¶
- [ ] Design and implement monitoring strategy
- [ ] Create complex PromQL queries
- [ ] Build comprehensive dashboards
- [ ] Configure multi-tier alerting
- [ ] Optimize Prometheus performance
- [ ] Implement high availability setup
ELK Stack Mastery¶
- [ ] Design scalable logging architecture
- [ ] Configure complex log processing pipelines
- [ ] Create effective log analysis workflows
- [ ] Implement security and compliance logging
- [ ] Optimize Elasticsearch performance
- [ ] Troubleshoot ELK stack issues
Tracing & APM Skills¶
- [ ] Implement distributed tracing strategy
- [ ] Configure auto and manual instrumentation
- [ ] Analyze complex trace data
- [ ] Optimize application performance
- [ ] Implement sampling strategies
- [ ] Correlate traces with metrics and logs
🎓 CERTIFICATIONS & RESOURCES¶
Recommended Certifications¶
- [ ] Prometheus Certified Associate (If available)
- [ ] Elastic Certified Engineer
- [ ] Grafana Certified Professional
- [ ] AWS/Azure/GCP Monitoring Certifications
Essential Books¶
- "Observability Engineering" - Honeycomb Team
- "Site Reliability Engineering" - Google SRE Team
- "Monitoring with Prometheus" - James Turnbull
- "Learning Elastic Stack" - Pranav Shukla
Online Learning¶
- "Complete Guide to Elasticsearch" - Udemy
- "Prometheus Monitoring" - Linux Academy
- "Grafana Fundamentals" - Official Training
- "OpenTelemetry Workshop" - Cloud Native Computing Foundation
✅ DAILY PRACTICE ROUTINE¶
Morning Check (15 mins)¶
- [ ] Review overnight alerts and incidents
- [ ] Check system health dashboards
- [ ] Validate monitoring pipeline status
- [ ] Plan daily observability tasks
Active Development (60-90 mins)¶
- [ ] Work on monitoring improvements
- [ ] Analyze observability data
- [ ] Optimize queries and dashboards
- [ ] Contribute to observability tools
Evening Review (20 mins)¶
- [ ] Document findings and learnings
- [ ] Update monitoring runbooks
- [ ] Plan next day's priorities
- [ ] Review observability metrics
🎯 MASTERY MILESTONES¶
Month 2: Foundation Complete¶
- [ ] Understand observability principles
- [ ] Basic Prometheus and Grafana setup
- [ ] Simple ELK stack deployment
- [ ] First distributed traces
Month 4: Intermediate Skills¶
- [ ] Complex monitoring setups
- [ ] Advanced dashboard creation
- [ ] Log analysis and troubleshooting
- [ ] Application instrumentation
Month 6: Advanced Implementation¶
- [ ] Production-ready observability stack
- [ ] Performance optimization
- [ ] Advanced analytics and correlation
- [ ] Automated alerting and response
Month 8: Expert Level¶
- [ ] Multi-cluster observability
- [ ] Custom tooling development
- [ ] Advanced troubleshooting
- [ ] Team mentoring and training
Month 12: Leadership¶
- [ ] Strategic observability planning
- [ ] Tool evaluation and selection
- [ ] Industry contributions
- [ ] Thought leadership
Observability Philosophy: "You can't improve what you can't measure, and you can't troubleshoot what you can't observe!"
🚀 Quick Start Action Plan¶
Day 1: Environment Setup¶
- [ ] Install Prometheus, Grafana, and ELK stack locally
- [ ] Create sample application with basic metrics
- [ ] Set up first dashboard and alert
- [ ] Generate sample logs and traces
Week 1 Goals¶
- [ ] Complete basic tutorials for each tool
- [ ] Monitor first real application
- [ ] Create comprehensive dashboard
- [ ] Set up basic alerting
Month 1 Targets¶
- [ ] Production-ready monitoring stack
- [ ] End-to-end observability implementation
- [ ] Performance optimization baseline
- [ ] Team knowledge sharing
Remember: Observability mastery comes from understanding your systems deeply and building the right instrumentation to gain insights into their behavior!