Insights

Field notes.

Short, practical writing on the parts of data engineering that don't make conference talks but do keep your numbers honest. 16 pieces and counting, since 2023.

2026-06-15

Backfills without the reconciliation spreadsheet

If every reprocessing job ends with someone hand-building a spreadsheet to prove the numbers match, the proof is in the wrong place. It belongs in the pipeline.

Migration
2026-05-18

Data contracts that survive contact with reality

Why most contract initiatives stall halfway, and the much smaller version that actually ships and keeps shipping.

Practice
2026-03-02

Make your pipeline safe to run twice

Idempotency is the cheapest reliability you can buy. A concrete pattern for batch jobs, with the failure cases it prevents.

Engineering
2026-01-20

One definition per metric

"Active users" was 41,000 in the board deck, 38,500 in the growth dashboard, and 44,200 in the email report. None were wrong. That's the problem.

Modeling
2025-11-24

An alert nobody mutes

The three data-quality checks worth wiring up first, and how to tune their thresholds so the team keeps trusting them.

Reliability
2025-09-09

Designing for late-arriving data

Events don't always show up on time. A device was offline, a partner's batch was delayed, a retry landed a day later. If your pipeline assumes today's data is complete tonight, it's quietly wrong.

Practice
2025-07-15

Slowly changing dimensions, without the dogma

A customer changes plan. A product changes category. Do your historical reports use the old value or the new one? That single question is what slowly changing dimensions are really about.

Modeling
2025-04-28

Partition for how you query, not how you load

Teams partition by load date because that's how the data arrives. Then every analytical query scans the whole table, because nobody queries by load date. Partition for the reader.

Engineering
2025-02-11

Tests you'll still trust in a year

Most data tests start strict and end ignored. The ones that survive are few, specific, and tied to something a human actually cares about.

Practice
2024-11-19

Retries, timeouts, and the orchestrator's job

An orchestrator's main value isn't drawing a nice DAG. It's handling the moment a task fails, and that only works if your tasks are safe to retry.

Reliability
2024-08-27

Schema changes that don't wake anyone up

A column gets renamed upstream. Somewhere downstream a join silently returns nulls, a metric drops, and three days later someone asks why the number looks wrong. Schema changes are the most common quiet break there is.

Engineering
2024-05-14

The quiet cost of the nightly full refresh

It worked when the table had a million rows. Now it has eight hundred million, the nightly job rebuilds all of it, and the warehouse bill has a hockey-stick shape nobody can explain.

Cost
2024-02-20

Where bad records should go

One malformed row shouldn't fail a million-row job. But it also shouldn't vanish. The records you can't process need a destination, not a silent drop.

Engineering
2023-10-03

Lineage you can actually use during an incident

Lineage diagrams look impressive in a slide. The test is whether, at 2am with a wrong dashboard, you can answer one question fast: what feeds this, and what does this feed?

Practice
2023-06-13

Streaming is not a default

Streaming is exciting, and for a narrow set of problems it's the only right answer. For most analytics, it's a large bill and an operational burden bought to solve a latency problem nobody had.

Architecture
2023-02-28

Naming things in the warehouse

Two of the hard problems in computing are cache invalidation and naming things. In a data warehouse, the second one quietly costs you more than you'd think.

Modeling