17. Observability
Emit logs, metrics, and traces with consistent context to make debugging fast and systematic.
Question: What are the three pillars of observability and why are they important?
Answer: The three pillars are logs, metrics, and traces.
Logs: Detailed, timestamped records of discrete events. Should be structured (e.g., JSON) for machine readability.
Metrics: Aggregated numerical data over time (e.g., request latency, error rates) that are optimized for storage and querying.
Traces: Show the end-to-end journey of a single request as it flows through a distributed system.
Explanation: Together, these provide a complete picture of a system's health. Metrics tell you that there is a problem (e.g., p99 latency has spiked). Traces tell you where the problem is (e.g., in the database call of the auth service). Logs give you the detailed, low-level context of what happened at that specific point. OpenTelemetry is the emerging standard for instrumenting code to generate all three signals.
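A minimal sketch of instrumenting code for traces with OpenTelemetry, assuming the opentelemetry-sdk package is installed; the service name, span names, and attributes are illustrative, and the console exporter simply prints spans instead of shipping them to a backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that exports finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")

# A parent span for the request, with a nested child span for the database call.
with tracer.start_as_current_span("handle_login") as span:
    span.set_attribute("user_id", "u-123")
    with tracer.start_as_current_span("db_query"):
        pass  # the database call would run here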
Question: How do you implement structured logs with correlation IDs?
Answer: Attach a request ID to log records via contextvars or middleware and log as JSON for easy correlation across services.
Explanation: Include key fields such as trace_id, span_id, and user_id so log lines can be joined with traces and filtered per request or per user.
import logging, json, contextvars

# Per-request correlation ID, set by middleware at the start of each request.
req_id = contextvars.ContextVar("req_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One JSON object per line, with the correlation ID attached to every record.
        return json.dumps({"level": record.levelname, "msg": record.getMessage(), "req_id": req_id.get()})
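One way to wire this up is shown below; the handler setup uses the standard logging API, while handle_request stands in for whatever middleware or framework hook receives the incoming request ID.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger("app")

def handle_request(request_id: str) -> None:
    # Middleware sets the correlation ID once per request, then resets it afterwards.
    token = req_id.set(request_id)
    try:
        logger.info("handling request")  # emitted as JSON with req_id attached
    finally:
        req_id.reset(token)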
Question: What metrics should you emit by default?
Answer: RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources.
Explanation: Use histograms with sensible buckets for latencies; add high-cardinality labels sparingly.
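A sketch of emitting RED metrics with the prometheus_client library (an assumption; any metrics SDK works similarly). The metric names, label set, and bucket boundaries are illustrative.
from prometheus_client import Counter, Histogram

# Rate and Errors come from the counter; Duration comes from the histogram.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request duration in seconds",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def record_request(method: str, status: int, duration_s: float) -> None:
    # Status is a low-cardinality label; avoid labels like user_id or request_id.
    REQUESTS.labels(method=method, status=str(status)).inc()
    LATENCY.observe(duration_s)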