Insights / Migration · 2026-06-15 · 6 min read

Backfills without the reconciliation spreadsheet

If every reprocessing job ends with someone hand-building a spreadsheet to prove the numbers match, the proof is in the wrong place. It belongs in the pipeline.

A backfill is just your pipeline run over old data. It should be boring. It usually isn't, because afterward teams can't easily answer one question: does the new output match what it should be? So someone exports both versions, drops them into a spreadsheet, and eyeballs the diffs at midnight.

Make idempotency the precondition

You cannot safely backfill a pipeline whose steps aren't idempotent. If re-running a partition appends instead of replacing, every backfill risks doubling data. So the first move is always the same: make each step own a partition and overwrite it atomically. Once that holds, a backfill is a loop over dates calling the exact same code path as production — no special "backfill mode" with its own bugs.
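
As a minimal sketch of that idea, the example below uses Python's built-in sqlite3 module as a stand-in for a real warehouse. The table name `facts`, the function names, and the `compute_rows` transform are all hypothetical. Each run deletes and rewrites one day's partition inside a single transaction, and the backfill is nothing more than a loop over that same function.

```python
import sqlite3
from datetime import date, timedelta

def compute_rows(day):
    # Hypothetical transform: stands in for the real pipeline step.
    return [(day.isoformat(), "orders", 10.0), (day.isoformat(), "refunds", -2.5)]

def run_partition(conn, day):
    """Replace one day's partition; the delete and insert commit together."""
    rows = compute_rows(day)
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM facts WHERE day = ?", (day.isoformat(),))
        conn.executemany(
            "INSERT INTO facts (day, metric, value) VALUES (?, ?, ?)", rows
        )

def backfill(conn, start, end):
    """Call the production entry point once per day, inclusive of both ends."""
    day = start
    while day <= end:
        run_partition(conn, day)
        day += timedelta(days=1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (day TEXT, metric TEXT, value REAL)")

backfill(conn, date(2026, 1, 1), date(2026, 1, 3))
backfill(conn, date(2026, 1, 1), date(2026, 1, 3))  # re-run adds no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
print(row_count)  # 3 days x 2 rows = 6, no matter how many times it runs
```

Delete-then-insert in one transaction is only one way to get an atomic overwrite. Engines with native partition replacement would use that instead, but the property to preserve is the same: running a partition twice leaves exactly one copy.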

Put reconciliation in the pipeline

Instead of comparing old and new by hand, write the comparison as a step. After the backfill writes to a staging location, run a check that compares row counts and a few aggregate sums against the live table, partition by partition, and fails loudly on a mismatch beyond tolerance.

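-- promote only if the backfill reconciles: this query must return no rows
-- (:tolerance is a bind parameter; limit both tables to the backfilled dates)
WITH d AS (
  SELECT date, abs(s.total - l.total) AS gap
  FROM staging s FULL OUTER JOIN live l USING (date)
)
SELECT date, gap FROM d
WHERE gap > :tolerance
   OR gap IS NULL;  -- partition missing on one side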

The spreadsheet exists because the proof was manual. Move the proof into code and it runs every time, for free.
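
Here is one way the check from this section could look as a pipeline step, again sketched with sqlite3. The `staging` and `live` table layouts, the `amount` column, the default tolerance, and the function names are assumptions made for illustration. The step compares row counts and a sum for each expected day, so a partition that is missing on either side shows up as a count mismatch rather than passing silently.

```python
import sqlite3

class ReconciliationError(Exception):
    """Raised when staging and live disagree beyond tolerance."""

def find_mismatches(conn, days, tolerance=0.01):
    """Return (day, new_n, old_n, new_total, old_total) for each day that differs."""
    # Table names are fixed constants here, never user input.
    query = "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {} WHERE day = ?"
    mismatches = []
    for day in days:
        new_n, new_total = conn.execute(query.format("staging"), (day,)).fetchone()
        old_n, old_total = conn.execute(query.format("live"), (day,)).fetchone()
        if new_n != old_n or abs(new_total - old_total) > tolerance:
            mismatches.append((day, new_n, old_n, new_total, old_total))
    return mismatches

def reconcile(conn, days, tolerance=0.01):
    """Fail loudly: raise if any partition is out of tolerance."""
    mismatches = find_mismatches(conn, days, tolerance)
    if mismatches:
        raise ReconciliationError(f"{len(mismatches)} partition(s) differ: {mismatches}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE live (day TEXT, amount REAL)")
conn.execute("CREATE TABLE staging (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO live VALUES (?, ?)",
    [("2026-01-01", 10.0), ("2026-01-01", 5.0), ("2026-01-02", 7.0)],
)
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [("2026-01-01", 10.0), ("2026-01-01", 5.0), ("2026-01-02", 7.5)],
)

reconcile(conn, ["2026-01-01"])  # matches, so nothing is raised
try:
    reconcile(conn, ["2026-01-01", "2026-01-02"])
except ReconciliationError as exc:
    print(exc)  # reports the 2026-01-02 partition
```

Driving the check from an explicit list of expected days, rather than from whatever happens to be in staging, is a deliberate choice in this sketch: it catches a backfill that wrote nothing for a day it was supposed to cover.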

Promote, don't overwrite in place

Backfill into a staging table, reconcile, then swap. If reconciliation fails, production is untouched and you investigate calmly. If it passes, the swap is atomic and instant. The held-breath weekend disappears — not because backfills got rarer, but because they stopped being a leap of faith.
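
The promote step can be sketched the same way. This version relies on SQLite allowing table renames inside a transaction; other engines have their own swap mechanisms, so treat the rename approach, the `live_previous` name, and the `totals_match` check as assumptions rather than a recipe.

```python
import sqlite3

def totals_match(conn, tolerance=0.01):
    """Hypothetical reconciliation check: compare overall sums."""
    new_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM staging").fetchone()[0]
    old_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM live").fetchone()[0]
    return abs(new_total - old_total) <= tolerance

def promote(conn, reconciles=totals_match):
    """Swap staging into place only if the reconciliation check passes."""
    if not reconciles(conn):
        return False  # live is untouched; staging stays around for inspection
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE live RENAME TO live_previous")
        conn.execute("ALTER TABLE staging RENAME TO live")
        conn.execute("COMMIT")  # both renames become visible together
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True

def make_db(staging_rows):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE live (day TEXT, amount REAL)")
    conn.execute("CREATE TABLE staging (day TEXT, amount REAL)")
    conn.execute("INSERT INTO live VALUES ('2026-01-01', 15.0)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", staging_rows)
    return conn

good = make_db([("2026-01-01", 10.0), ("2026-01-01", 5.0)])
bad = make_db([("2026-01-01", 99.0)])

promoted_good = promote(good)  # True: live now holds the two staged rows
promoted_bad = promote(bad)    # False: live still holds its original row
```

Keeping the old table around as `live_previous` is optional, but it makes rolling back a promotion a rename rather than another backfill.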
