2025-09-09

Designing for late-arriving data

Events don't always show up on time. A device was offline, a partner's batch was delayed, a retry landed a day later. If your pipeline assumes today's data is complete tonight, it's quietly wrong.

Most pipelines are built as if each day's data is final the moment the day ends. Reality is messier: some events for Monday arrive on Tuesday, or Friday. Treat the day as closed too early and you under-count; treat it as forever open and you never get stable numbers. The job is to handle lateness on purpose instead of pretending it doesn't happen.

Separate event time from processing time

The single most useful distinction is between when something happened (event time) and when you saw it (processing time). Partition and aggregate by event time, but record processing time too, so you can tell which days received late rows and need reprocessing. A late event for Monday should land in Monday's partition, not in the partition for the day it arrived.
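As a rough illustration of that split, here is a minimal sketch in Python. The `Event` fields and the helper names (`partition_key`, `group_by_event_day`, `is_late`) are hypothetical, and it assumes timestamps are timezone-aware and that a "day" means a UTC calendar day; your pipeline may draw the boundary elsewhere.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass(frozen=True)
class Event:
    event_id: str
    event_time: datetime      # when it happened (set by the source)
    processed_time: datetime  # when the pipeline first saw it


def partition_key(event: Event) -> date:
    """Partition by the UTC day the event happened, not the day it arrived."""
    return event.event_time.astimezone(timezone.utc).date()


def is_late(event: Event) -> bool:
    """Late here means: processed on a later UTC day than it occurred."""
    processed_day = event.processed_time.astimezone(timezone.utc).date()
    return processed_day > partition_key(event)


def group_by_event_day(events):
    """Group events into per-day partitions keyed by event time."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event)].append(event)
    return dict(partitions)
```

An event that happened late on 2025-09-08 but was processed on 2025-09-09 is grouped under 2025-09-08, and `is_late` flags it so that day can be picked up for reprocessing.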

Use a reprocessing window

Don't reprocess all history every night, and don't freeze a day the moment it ends. Pick a window, say re-running the last N days each night, wide enough to catch the realistic tail of late arrivals. This only works if your steps are idempotent: each re-run must overwrite a day's totals rather than add to them. With that in place, re-running those days is safe and simply corrects the totals.
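One way to picture the nightly re-run is the sketch below. Everything in it is an assumption made for illustration: the function names, the `event_day` field, the in-memory `totals` dict standing in for a real table, and the choice to count the day that just ended as part of the window.

```python
from datetime import date, timedelta


def days_to_reprocess(run_date: date, window_days: int) -> list[date]:
    """The last `window_days` days ending with `run_date`, oldest first."""
    return [run_date - timedelta(days=n) for n in range(window_days - 1, -1, -1)]


def recompute_day(day: date, events: list[dict]) -> int:
    """Recount a day from raw events; the result depends only on the inputs."""
    return sum(1 for e in events if e["event_day"] == day)


def nightly_run(run_date: date, window_days: int, events: list[dict], totals: dict) -> dict:
    """Overwrite (never increment) the total for each day in the window."""
    for day in days_to_reprocess(run_date, window_days):
        totals[day] = recompute_day(day, events)
    return totals
```

Because each day's total is recomputed from the raw events and assigned rather than incremented, running `nightly_run` twice on the same input leaves the totals unchanged, and days outside the window are never touched.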

A daily number that keeps nudging for three days and then settles isn't a bug — it's an honest pipeline admitting that data arrives late.

Tell people which numbers are still moving

Mark recent partitions as provisional. A dashboard that flags "last 3 days may still update" prevents someone from screenshotting a number that will change by morning. Lateness handled openly builds trust; lateness hidden erodes it the first time a "final" figure moves.
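A provisional flag can be derived from the partition date alone. The sketch below is a hypothetical helper, not a prescribed API: it assumes `window_days` matches whatever reprocessing window you actually run, and that the window includes the current day.

```python
from datetime import date, timedelta


def is_provisional(partition_day: date, today: date, window_days: int = 3) -> bool:
    """True while the partition is still inside the reprocessing window."""
    return partition_day > today - timedelta(days=window_days)


def label(partition_day: date, total: int, today: date, window_days: int = 3) -> str:
    """Render a dashboard row, marking totals that may still change."""
    suffix = " (provisional)" if is_provisional(partition_day, today, window_days) else ""
    return f"{partition_day.isoformat()}: {total}{suffix}"
```

With `today` set to 2025-09-10 and a three-day window, the partitions for 2025-09-08 through 2025-09-10 carry the provisional marker, while 2025-09-07 and earlier are shown as settled.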
