5. Distributed Systems & Reliability
Design for failure: idempotency, outbox, retries with jitter, circuit breakers, DLQs, and sagas.
Question: What is the Transactional Outbox pattern and why is it useful?
Answer: The Transactional Outbox pattern ensures that events are reliably published in response to a database change. It works by saving the business state and the event to be published into an "outbox" table within the same local database transaction. A separate relayer process then reads from this table and guarantees delivery of the event to a message broker.
Explanation: This pattern solves the "dual write" problem. If you first write to your database and then make a separate call to a message broker, the second call might fail, leaving your system in an inconsistent state. By writing the event to an outbox table in the same transaction as the business data, you guarantee that the event will be captured if and only if the business data is successfully saved. The relayer then handles the "at-least-once" delivery to the message broker.
Question: How do you handle idempotency for write operations in an API?
Answer: Idempotency is typically handled by requiring the client to generate and send a unique key (e.g., an Idempotency-Key header) for each state-changing request. The server tracks these keys for a period of time. If a request comes in with a key that has already been processed, the server can safely skip the operation and return the previously generated result.
Explanation: This pattern is crucial for building reliable systems, as it makes retries safe. A client can safely retry a request that timed out without fear of creating duplicate transactions or objects. The idempotency key store is often a fast key-value store like Redis with a TTL on the keys, ensuring they are kept long enough to handle duplicate requests but not forever.
Question: How do you implement resilient retries and circuit breaking?
Answer: Use bounded exponential backoff with jitter for retriable errors, and a circuit breaker to shed load when a dependency is failing persistently.
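Explanation: Backoff reduces load on a struggling dependency, and jitter spreads retries out so that clients do not retry in lockstep (the thundering herd problem). Circuit breakers fail fast on unhealthy paths, which prevents cascading failures.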
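// Requires "math/rand" and "time"; call and isRetryable are defined elsewhere.
const maxAttempts = 5
base := 100 * time.Millisecond
var err error
for attempt := 0; attempt < maxAttempts; attempt++ {
	if err = call(); err == nil {
		break
	}
	if !isRetryable(err) || attempt == maxAttempts-1 {
		return err // permanent error, or retries exhausted
	}
	sleep := base << attempt // 100ms, 200ms, 400ms, 800ms
	jitter := time.Duration(rand.Int63n(int64(sleep / 2)))
	time.Sleep(sleep/2 + jitter) // wait somewhere in [sleep/2, sleep)
}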
Question: How do you classify errors for retries?
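Answer: Classify by cause and semantics: retry timeouts, resource-exhausted errors, and transient 5xx responses (for example 502, 503, and 504); do not retry 4xx (except 409/429, with backoff); and avoid retrying non-idempotent operations unless they are protected by an idempotency key.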
Explanation: Use typed/domain errors and transport codes to drive retry policy.
Question: How do you design Dead Letter Queues (DLQs) and retries for message processing?
Answer: Use bounded retries with exponential backoff and jitter; on permanent failures, route messages to a DLQ with cause metadata and alerting.
Explanation: Separate transient from permanent errors. Provide tooling to replay messages from the DLQ, and keep consumers idempotent so that replays are safe.
Question: What is the Saga pattern, and when do you use it?
Answer: A Saga coordinates a sequence of local transactions with compensating actions for each step to achieve eventually consistent workflows without distributed transactions.
Explanation: Use it for multi-service business processes (e.g., create order → reserve inventory → charge card). Model compensations and idempotency explicitly, because any step or compensation may be retried.
Question: Can you achieve exactly-once processing?
Answer: In practice you implement at-least-once delivery with deduplication and idempotency to get effective exactly-once semantics at the application level.
Explanation: Combine idempotent handlers, dedupe keys (inbox tables), and an outbox to avoid double side effects. The broker may still deliver a message more than once; the consumer makes the repeats harmless.
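The consumer-side dedupe can be sketched as an inbox keyed by message ID. `Inbox` and `Handle` are hypothetical names, and the in-memory map stands in for an inbox table; in a real database the ID insert and the side effect would commit in the same transaction.

```go
package main

import "sync"

// Inbox records the IDs of messages whose side effects were already applied.
type Inbox struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewInbox() *Inbox {
	return &Inbox{seen: make(map[string]bool)}
}

// Handle runs apply for the first successful delivery of id and skips
// redeliveries. It reports whether apply ran and succeeded.
func (in *Inbox) Handle(id string, apply func() error) (bool, error) {
	in.mu.Lock()
	defer in.mu.Unlock()
	if in.seen[id] {
		return false, nil // duplicate delivery: effect already applied
	}
	if err := apply(); err != nil {
		return false, err // not recorded, so a redelivery will try again
	}
	in.seen[id] = true
	return true, nil
}
```

Delivering the same message twice then changes state only once, which is the effective exactly-once behavior described above.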