Insights / Reliability · 2025-11-24 · 5 min read

An alert nobody mutes

The fastest way to ruin data observability is to turn on every check at once. The second-fastest is to set thresholds so tight that the channel becomes noise. Here's where to start instead.

Every data team that adopts monitoring goes through the same arc. They install a tool, enable hundreds of checks, get flooded with alerts, and within a month everyone has muted the channel. The monitoring is technically running and practically useless. The goal isn't coverage — it's a small set of alerts the team still trusts a year later.

Start with three checks

On your most important tables — not all of them — wire up exactly these, in this order:

1. Freshness

Did the data arrive on time? This is the single highest-value check, because stale data is the most common failure and the easiest to miss. A dashboard showing yesterday's numbers looks fine right up until someone makes a decision on it.

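-- alert if the newest event is older than expected
-- (Postgres-style syntax; the 2-hour threshold is a placeholder to tune per table)
SELECT
  max(event_ts) AS latest,
  coalesce(now() - max(event_ts) > interval '2 hours', true) AS is_stale
FROM critical_table;
-- page when is_stale is true; an empty table also counts as stale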

2. Volume

Did roughly the expected number of rows show up? A partition that's 90% smaller than usual almost always means an upstream break — a failed source, a filter gone wrong, a partial load. Compare against a trailing window, not a fixed number, so the check survives normal growth.
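A trailing-window comparison can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `volume_alert`, the choice of the median as the baseline, and the 50% drop threshold are all assumptions to adjust per table.

```python
from statistics import median


def volume_alert(today_rows: int, trailing_rows: list[int], max_drop: float = 0.5) -> bool:
    """Return True when today's row count falls too far below the trailing baseline.

    trailing_rows holds the row counts of recent partitions (for example the
    last 7 days). The median is used here so one odd day does not skew the
    baseline; that choice and the max_drop default are assumptions.
    """
    if not trailing_rows:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = median(trailing_rows)
    if baseline == 0:
        return False  # an empty baseline cannot express a relative drop
    drop = 1 - today_rows / baseline
    return drop > max_drop


last_week = [1000, 1040, 980, 1100, 1060, 1020, 1080]
print(volume_alert(100, last_week))  # a partition roughly 90% smaller than usual
print(volume_alert(990, last_week))  # ordinary day-to-day variation
```

Because the baseline is recomputed from recent partitions, the same check keeps working as the table grows.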

3. Distribution on one key column

Did the shape of the data stay sane? A null rate that jumps from 0% to 30%, or a categorical column that suddenly has a new value, usually points to an upstream schema or logic change, and catching it early keeps it from corrupting anything downstream. You don't need this on every column. Pick the one or two the business actually depends on.
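Both signals (a null-rate jump and an unseen category) can be checked against a stored baseline. The sketch below assumes the column's values are already loaded as a Python list with `None` for nulls; the names `null_rate` and `distribution_alerts` and the 5-point tolerance are hypothetical.

```python
def null_rate(values: list) -> float:
    """Fraction of values that are None; 0.0 for an empty sample."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def distribution_alerts(values, baseline_null_rate, known_categories, max_null_increase=0.05):
    """Return human-readable alerts for one key column.

    baseline_null_rate and known_categories come from earlier, trusted loads.
    max_null_increase is an assumed tolerance, in absolute terms.
    """
    alerts = []
    rate = null_rate(values)
    if rate - baseline_null_rate > max_null_increase:
        alerts.append(f"null rate rose from {baseline_null_rate:.0%} to {rate:.0%}")
    unseen = {v for v in values if v is not None} - set(known_categories)
    if unseen:
        alerts.append("new values: " + ", ".join(sorted(str(v) for v in unseen)))
    return alerts


sample = ["card", "card", None, "paypal", "crypto", None, None, "card", "card", "card"]
for alert in distribution_alerts(sample, 0.0, {"card", "paypal"}):
    print(alert)
```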

Freshness, volume, distribution. Three checks on the tables that matter tend to catch most real incidents, and they're cheap enough to keep trustworthy.

Tune for trust, not coverage

The thresholds matter more than the checks. A few principles that keep a channel alive:

Every alert must be actionable. If the on-call engineer can't do anything about it, it shouldn't page — it should go in a digest, or nowhere.
Compare to a baseline, not a constant. Hard-coded thresholds rot as the business grows. Trailing windows adapt.
Tune false positives aggressively in week one. The first noisy alert that turns out to be nothing is the moment people start learning to ignore the channel. Fix it immediately.
One owner per alert. An alert that pages "the team" pages no one.
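Two of these principles, actionability and single ownership, are easy to enforce in whatever code routes alerts. The following is an illustrative sketch only: `AlertRule`, `route`, and the list of rejected owner names are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass

# Owner values that really mean "nobody in particular" (an assumed list).
GROUP_OWNERS = {"team", "the team", "everyone"}


@dataclass(frozen=True)
class AlertRule:
    name: str
    owner: str        # one named person or rotation
    actionable: bool  # can the on-call engineer do something about it now?


def route(rule: AlertRule) -> str:
    """Return "page" for actionable alerts and "digest" for everything else."""
    if not rule.owner.strip() or rule.owner.strip().lower() in GROUP_OWNERS:
        raise ValueError(f"{rule.name}: needs a single named owner")
    return "page" if rule.actionable else "digest"


print(route(AlertRule("orders_freshness", "alice", actionable=True)))
print(route(AlertRule("orders_row_count_trend", "bob", actionable=False)))
```

Rejecting a group owner at configuration time means the gap is found when the alert is written, not when it fires.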

Expand only when the first three are trusted

Once the team has acted on a freshness alert that saved them and has stopped getting paged for nothing, add more checks and more tables. Coverage earned this way sticks. Coverage installed all at once gets muted. We'd rather leave a team with three checks they believe than thirty they ignore.
