At Google, 84% of pass-to-fail test transitions in CI turned out to be flaky rather than real bugs. Here’s how to identify, classify, and fix flaky tests without re-running and hoping.
The top causes are: timing/race conditions (tests interact with the page before it is ready), non-deterministic test data (random IDs, unstable ordering), environment drift (staging differs from production), and third-party dependency changes (Stripe, OAuth providers).
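For the first cause, a common mitigation is to wait for an explicit condition instead of sleeping for a fixed time. The sketch below is one minimal way to do that in plain Python; `wait_until` and its parameters are hypothetical names, not part of any test framework, and browser tools such as Selenium and Playwright ship their own waiting mechanisms that are usually the better choice.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    `condition` is any zero-argument callable (hypothetical helper, for illustration).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

A test would then call something like `wait_until(lambda: page_is_ready())` before interacting, where `page_is_ready` stands in for whatever readiness check the application exposes.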
Track the percentage of pass-to-fail test transitions that revert to passing when the same commit is re-run. Google reported that 84% of such transitions involved a flaky test. Many CI platforms (Buildkite and CircleCI, for example) expose the per-test history needed to compute this.
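If your CI platform only gives you raw run history, the metric can be computed by hand. The following is a rough sketch under stated assumptions: the `Run` record and its fields are hypothetical, a "transition" is taken to mean the last result on the previous commit passed and the first attempt on the next commit failed, and "reverted" means any later attempt on that same commit passed.

```python
from collections import defaultdict
from typing import NamedTuple


class Run(NamedTuple):
    test: str          # test identifier
    commit_index: int  # position of the commit in history (0 = oldest)
    attempt: int       # 1 for the first run, 2+ for re-runs of the same commit
    passed: bool


def flaky_transition_rate(runs):
    """Fraction of pass-to-fail transitions that pass again on re-run."""
    # test -> commit_index -> pass/fail results ordered by attempt
    history = defaultdict(lambda: defaultdict(list))
    for run in sorted(runs, key=lambda r: (r.test, r.commit_index, r.attempt)):
        history[run.test][run.commit_index].append(run.passed)

    transitions = 0
    reverted = 0
    for commits in history.values():
        ordered = [commits[i] for i in sorted(commits)]
        for previous, current in zip(ordered, ordered[1:]):
            # Pass-to-fail: previous commit ended green, first attempt here is red.
            if previous[-1] and not current[0]:
                transitions += 1
                # Reverted on re-run: a later attempt on the same commit passed.
                if any(current[1:]):
                    reverted += 1
    return reverted / transitions if transitions else 0.0
```

A rate near 1.0 suggests most red builds are noise; a rate near 0.0 suggests failures are mostly real regressions.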
If a test has been quarantined for 2+ sprints with no fix, delete it. A permanently quarantined test provides zero value and clutters your suite. Better to have no test than a test everyone ignores.
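The deletion rule above is easy to automate if you record when each test entered quarantine. This is a minimal sketch, assuming two-week sprints and a hypothetical mapping of test name to quarantine date; neither assumption comes from a specific tool.

```python
from datetime import date, timedelta

SPRINT_LENGTH = timedelta(weeks=2)  # assumption: two-week sprints
MAX_QUARANTINE_SPRINTS = 2          # the "2+ sprints" threshold


def overdue_quarantines(quarantined, today):
    """Return names of tests quarantined for at least MAX_QUARANTINE_SPRINTS sprints.

    `quarantined` maps test name -> date it was quarantined (hypothetical format).
    """
    limit = SPRINT_LENGTH * MAX_QUARANTINE_SPRINTS
    return sorted(
        name for name, since in quarantined.items() if today - since >= limit
    )
```

Running this in a scheduled CI job and failing the job when the list is non-empty is one way to force a decision: fix the test or delete it.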