17. Observability
Emit logs, metrics, and traces with consistent context to make debugging fast and systematic.
Question: What are the three pillars of observability and why are they important?
Answer: The three pillars are logs, metrics, and traces.
Logs: Detailed, timestamped records of discrete events. Should be structured (e.g., JSON) for machine readability.
Metrics: Aggregated numerical data over time (e.g., request latency, error rates) that are optimized for storage and querying.
Traces: Show the end-to-end journey of a single request as it flows through a distributed system.
Explanation: Together, these provide a complete picture of a system's health. Metrics tell you that there is a problem (e.g., p99 latency has spiked). Traces tell you where the problem is (e.g., in the database call of the auth service). Logs give you the detailed, low-level context of what happened at that specific point. OpenTelemetry is the emerging standard for instrumenting code to generate all three signals.
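A minimal sketch of instrumenting code for traces with OpenTelemetry, assuming the opentelemetry-sdk package is installed; the service name, span names, and attributes are illustrative, and the console exporter simply prints spans instead of shipping them to a backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that exports finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")

# A parent span for the request, with a nested child span for the database call.
with tracer.start_as_current_span("handle_login") as span:
    span.set_attribute("user_id", "u-123")
    with tracer.start_as_current_span("db_query"):
        pass  # the database call would run here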
Question: How do you implement structured logs with correlation IDs?
Answer: Attach a request ID to log records via contextvars or middleware and log as JSON for easy correlation across services.
Explanation: Include key fields such as trace_id, span_id, and user_id so log lines can be joined with traces and filtered per request or per user.
import logging, json, contextvars

# Per-request correlation ID, set by middleware at the start of each request.
req_id = contextvars.ContextVar("req_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One JSON object per line, with the correlation ID attached to every record.
        return json.dumps({"level": record.levelname, "msg": record.getMessage(), "req_id": req_id.get()})
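One way to wire this up is shown below; the handler setup uses the standard logging API, while handle_request stands in for whatever middleware or framework hook receives the incoming request ID.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger("app")

def handle_request(request_id: str) -> None:
    # Middleware sets the correlation ID once per request, then resets it afterwards.
    token = req_id.set(request_id)
    try:
        logger.info("handling request")  # emitted as JSON with req_id attached
    finally:
        req_id.reset(token)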
Question: What metrics should you emit by default?
Answer: RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources.
Explanation: Use histograms with sensible buckets for latencies; add high-cardinality labels sparingly.
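A sketch of emitting RED metrics with the prometheus_client library (an assumption; any metrics SDK works similarly). The metric names, label set, and bucket boundaries are illustrative.
from prometheus_client import Counter, Histogram

# Rate and Errors come from the counter; Duration comes from the histogram.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request duration in seconds",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def record_request(method: str, status: int, duration_s: float) -> None:
    # Status is a low-cardinality label; avoid labels like user_id or request_id.
    REQUESTS.labels(method=method, status=str(status)).inc()
    LATENCY.observe(duration_s)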