Insights / Reliability · 2024-11-19 · 5 min read

Retries, timeouts, and the orchestrator's job

An orchestrator's main value isn't drawing a nice DAG. It's handling the moment a task fails — and that only works if your tasks are safe to retry.

People adopt an orchestrator for the dependency graph and the pretty UI. The actual value shows up at 3am, when a task fails and the question is whether retrying it is safe. If it is, the orchestrator quietly recovers and you sleep. If it isn't, you're paged to reason about side effects.

Retries presuppose idempotency

Automatic retries are only safe if running a task twice is harmless. That's why idempotency is upstream of everything: it's the property that lets the orchestrator do its job without a human in the loop. Configure retries, sure, but earn them by making the underlying step replace its partition rather than append to it. A retried append duplicates rows; a retried replace leaves the same result as a single run.
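To make the replace-versus-append difference concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a real warehouse. The `events` table, the `day` partition column, and both function names are assumptions made up for illustration, not any particular orchestrator's API.

```python
import sqlite3


def append_partition(conn, day, rows):
    """Not idempotent: every run adds the rows again."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT INTO events (day, value) VALUES (?, ?)",
            [(day, v) for v in rows],
        )


def replace_partition(conn, day, rows):
    """Idempotent: delete the day's partition and rewrite it in one transaction."""
    with conn:
        conn.execute("DELETE FROM events WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO events (day, value) VALUES (?, ?)",
            [(day, v) for v in rows],
        )


def count_rows(conn, day):
    cur = conn.execute("SELECT COUNT(*) FROM events WHERE day = ?", (day,))
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, value INTEGER)")

# Simulate a retry: the same task runs twice for the same partition.
replace_partition(conn, "2024-11-19", [1, 2, 3])
replace_partition(conn, "2024-11-19", [1, 2, 3])
replaced_count = count_rows(conn, "2024-11-19")

append_partition(conn, "2024-11-20", [1, 2, 3])
append_partition(conn, "2024-11-20", [1, 2, 3])
appended_count = count_rows(conn, "2024-11-20")

print(replaced_count, appended_count)
```

Running the replace version twice leaves three rows; running the append version twice leaves six. Only the first is safe to hand to an automatic retry policy.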

Timeouts prevent the silent hang

The nastier failure isn't a crash. It's a task that hangs forever, holding a slot, blocking everything behind it, alerting no one. Set timeouts on tasks so a stuck job fails and frees the pipeline, rather than wedging it until someone notices the dashboard went stale.

A crash is loud and recoverable. A hang is silent and contagious. Timeouts turn the second into the first.
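Most orchestrators expose a per-task timeout setting; the sketch below shows the underlying idea with only the Python standard library. It assumes the task runs as a child process, and `run_with_timeout` and `TaskTimedOut` are hypothetical names chosen for this example.

```python
import subprocess
import sys


class TaskTimedOut(Exception):
    """Raised when a task exceeds its time budget and is killed."""


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process; kill it and raise if it runs too long."""
    try:
        # subprocess.run kills the child before raising TimeoutExpired.
        return subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired as exc:
        raise TaskTimedOut(
            f"task exceeded {timeout_s}s and was killed: {cmd}"
        ) from exc


# A stand-in for a stuck job: it sleeps far longer than its budget.
hanging_task = [sys.executable, "-c", "import time; time.sleep(60)"]

try:
    run_with_timeout(hanging_task, timeout_s=0.5)
    outcome = "succeeded"
except TaskTimedOut as err:
    outcome = "failed loudly"
    print(err)
```

The hung process is killed and the caller gets an ordinary exception, so the failure can flow into the same retry and alerting path as a crash.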

Make failure legible

When a task does fail for good, the message should name what broke (the table, the partition, the check that failed) and what to do next. An orchestrator that retries safely, times out hangs, and fails with a useful message turns most incidents into non-events.
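One way to get a legible failure is to carry those fields on the exception itself. This is a sketch under assumptions: the `TaskFailure` class, the `require_min_rows` check, and the `orders_daily` table are invented for illustration.

```python
class TaskFailure(Exception):
    """Terminal failure that names what broke and what to do next."""

    def __init__(self, table, partition, check, action):
        self.table = table
        self.partition = partition
        self.check = check
        self.action = action
        super().__init__(
            f"check '{check}' failed on {table} partition {partition}; "
            f"next step: {action}"
        )


def require_min_rows(table, partition, row_count, minimum):
    """Hypothetical data check: fail if a partition has too few rows."""
    if row_count < minimum:
        raise TaskFailure(
            table=table,
            partition=partition,
            check=f"row_count >= {minimum} (got {row_count})",
            action="confirm the upstream export finished, then rerun this partition",
        )


try:
    require_min_rows("orders_daily", "2024-11-19", row_count=0, minimum=1)
    message = ""
except TaskFailure as err:
    message = str(err)

print(message)
```

Keeping the table, partition, and check as attributes as well as in the message text means an alerting hook can route on them without parsing strings.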
