7. Observability for Production

Make systems diagnosable: structured logs, RED metrics, traces, and safe profiling in prod.

Question: What are the "three pillars of observability," and how do you implement them in a Go service?

Answer: The three pillars are logs, metrics, and traces.

  • Logs: Detailed records of specific events. Implemented with a structured logging library (e.g., log/slog, zerolog) to output JSON.

  • Metrics: Aggregatable numerical data about the system's health (e.g., request rates, error rates, duration). Implemented using a library like the Prometheus Go client (prometheus/client_golang).

  • Traces: Show the end-to-end journey of a request through a distributed system. Implemented using the OpenTelemetry SDK.

Explanation: These three pillars provide a complete picture of system behavior. Metrics tell you that a problem is occurring (e.g., p99 latency is high). Traces tell you where in the system the problem is (e.g., a specific downstream service call is slow). Logs provide the detailed, low-level context to understand why it happened. A request ID should be present in all three to correlate them.

// OpenTelemetry Tracing Example
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// The tracer is named after the instrumentation scope (typically the service or package).
var tracer = otel.Tracer("my-service")

func Checkout(ctx context.Context, userID string) {
    // Start a span; the returned ctx carries it to downstream calls.
    ctx, span := tracer.Start(ctx, "checkout")
    defer span.End()

    // Attach searchable attributes for later filtering in the tracing backend.
    span.SetAttributes(attribute.String("user.id", userID))
    // ... business logic ...
}
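
To correlate logs with traces, as noted above, the trace and span IDs can be copied from the span context into log fields. A minimal sketch using log/slog and the OpenTelemetry trace package (the trace_id/span_id key names are a convention, not required by either library):

// Correlating logs with the active trace (sketch)
import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

func logWithTrace(ctx context.Context, logger *slog.Logger, msg string) {
    sc := trace.SpanContextFromContext(ctx)
    if sc.IsValid() {
        // trace_id/span_id are conventional key names for correlation.
        logger = logger.With(
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    logger.InfoContext(ctx, msg)
}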

Question: How do you safely use pprof and continuous profiling in production?

Answer: Expose pprof on a protected admin port or behind auth; never on a public interface. For low-overhead always-on profiling, use continuous profilers (e.g., Pyroscope, Parca).

Explanation: pprof reveals internal state and can be heavy; secure endpoints and sample conservatively.
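
A minimal sketch of exposing pprof on a loopback-only admin listener instead of the public mux (the :6060 address is an assumption, not a requirement):

// pprof on a private admin listener (sketch)
import (
    "net/http"
    "net/http/pprof"
)

func startAdminServer() {
    // Note: importing net/http/pprof also registers handlers on http.DefaultServeMux,
    // so never expose DefaultServeMux publicly.
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

    // Loopback only; put auth or network controls in front if remote access is needed.
    go func() { _ = http.ListenAndServe("127.0.0.1:6060", mux) }()
}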

Question: What RED metrics should every service expose?

Answer: Rate (request throughput), Errors, and Duration (as a histogram). Segment by route/method/result; avoid high-cardinality labels.

Explanation: RED enables quick SLO-based alerting and capacity insights.
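
A sketch of RED instrumentation with prometheus/client_golang; the metric names and the route/method/status labels are illustrative:

// RED metrics with prometheus/client_golang (sketch)
import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Rate and Errors: a counter segmented by low-cardinality labels.
    requests = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    }, []string{"route", "method", "status"})

    // Duration: a histogram so the server can compute percentiles.
    duration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency.",
        Buckets: prometheus.DefBuckets,
    }, []string{"route", "method"})
)

func observe(route, method, status string, elapsed time.Duration) {
    requests.WithLabelValues(route, method, status).Inc()
    duration.WithLabelValues(route, method).Observe(elapsed.Seconds())
}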

Question: How do you choose histogram buckets for latency SLOs?

Answer: Choose buckets around your SLO boundaries (e.g., 50/100/200/400ms for a 200ms p99), with extra resolution near the target.

Explanation: Proper buckets enable accurate percentiles without high memory; avoid overly granular buckets that explode cardinality.
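
For example, a 200ms p99 target might use bounds like the following (the exact values are an assumption chosen to give extra resolution near the SLO):

// Histogram buckets tuned around a 200ms latency SLO (sketch)
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var checkoutDuration = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "checkout_duration_seconds",
    Help: "Checkout latency.",
    // Dense bounds near the 200ms target, coarser tails elsewhere.
    Buckets: []float64{0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.8, 1.6},
})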

Question: How do you trace cross-service calls with OpenTelemetry?

Answer: Propagate context with W3C traceparent/tracestate, inject/extract into HTTP/gRPC headers, and create spans per significant operation.

Explanation: Consistent propagation yields end-to-end traces for latency root-cause analysis.
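
A sketch of injecting W3C trace context into an outgoing HTTP request; in practice, wrapping the client with otelhttp.NewTransport achieves the same without manual injection:

// Propagating trace context on an outgoing HTTP request (sketch)
import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func init() {
    // Register the W3C traceparent/tracestate (and baggage) propagators globally.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, propagation.Baggage{},
    ))
}

func call(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // Inject the current span context into the outgoing headers.
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    return http.DefaultClient.Do(req)
}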

Question: Histograms vs summaries vs counters — when to use which?

Answer: Use counters for monotonic events, gauges for instant values, and histograms for latency distributions. Avoid summaries in Prometheus unless you need client-side quantiles.

Explanation: Histograms enable server-side quantiles and exemplars; choose bucket bounds matching SLOs.
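
A compact illustration of matching the metric type to the signal (names are illustrative):

// Choosing metric types (sketch)
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter: monotonically increasing event count (never decreases).
    jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "jobs_processed_total", Help: "Jobs processed.",
    })

    // Gauge: a point-in-time value that can go up or down.
    queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "job_queue_depth", Help: "Jobs currently queued.",
    })

    // Histogram: distribution of observations; quantiles computed server-side.
    jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "job_duration_seconds", Help: "Job latency.",
        Buckets: prometheus.DefBuckets,
    })
)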

Question: How do you prevent high-cardinality blowups?

Answer: Limit label values, avoid user IDs in labels, cap unique dimensions, and sample logs/traces.

Explanation: High-cardinality metrics explode memory and CPU in TSDBs; prefer structured logs for rare, high-detail values.
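
One common guard is normalizing label values before they reach the metric, e.g., collapsing status codes into classes and labeling by route pattern rather than raw path. A sketch reusing the requests counter from the RED example above (statusClass is a hypothetical helper):

// Bounding label cardinality (sketch)
// Collapse raw status codes into a handful of stable label values.
func statusClass(code int) string {
    switch {
    case code >= 500:
        return "5xx"
    case code >= 400:
        return "4xx"
    case code >= 300:
        return "3xx"
    default:
        return "2xx"
    }
}

// Usage: label by route pattern and status class, never by raw path or user ID.
//   requests.WithLabelValues("/users/{id}", method, statusClass(status)).Inc()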

Question: What are best practices for structured logging?

Answer: Use consistent keys (e.g., trace_id, request_id, user_id), redact PII, include error kind and cause, and prefer JSON output.

Explanation: Consistency enables reliable querying and correlation across systems.
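
A sketch of these conventions with log/slog; the key names and the redaction logic are conventions assumed for illustration, not part of the library:

// Structured logging conventions with log/slog (sketch)
import (
    "log/slog"
    "os"
)

func newLogger() *slog.Logger {
    // JSON output for machine parsing by the log pipeline.
    return slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
            // Redact PII-bearing keys before they leave the process.
            if a.Key == "email" || a.Key == "ssn" {
                a.Value = slog.StringValue("[REDACTED]")
            }
            return a
        },
    }))
}

func example(logger *slog.Logger, requestID string, err error) {
    logger.Error("checkout failed",
        slog.String("request_id", requestID), // consistent correlation key
        slog.String("error_kind", "payment_declined"),
        slog.Any("error", err), // underlying cause
    )
}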