Glossary
Diff-aware testing is an approach where code changes in a pull request are analyzed to identify affected user flows and coverage gaps. In Zerocheck today, that analysis suggests tests for review while the PR check runs the active approved suite. The result is clearer coverage feedback without letting generated tests gate merge before approval.
Most CI pipelines run the entire E2E suite on every PR regardless of what changed. A CSS tweak to the settings page triggers hundreds of tests, including checkout, onboarding, and admin flows that are completely unrelated. This wastes CI time (a 45-minute full suite versus a 5-minute targeted run), generates irrelevant flaky failures that erode trust, and makes engineers wait for feedback that tells them nothing useful about their change. Diff-aware testing eliminates this waste by running only what matters.
Most teams don’t have diff-aware testing at all. They either run the full suite (slow, noisy) or manually tag tests with labels and select subsets via CI config (fragile, gets out of date). Launchable and Buildkite Test Analytics offer predictive test selection based on historical failure data, but they optimize for speed, not coverage. They skip tests that usually pass rather than selecting tests that are relevant to the change. No mainstream tool maps code changes to affected user flows.
Zerocheck reads the PR diff, identifies affected or uncovered user flows, and saves generated tests as suggestions for review. The current PR check runs the active approved suite; newly suggested tests become runnable coverage only after approval.
Diff-aware testing starts with a simple question: which user flows does this code change actually touch? Answering that question requires three steps.
First, the system reads the PR diff to identify every file that changed. In a typical pull request, that might be one component file, one API route, and a CSS module. The diff tells you exactly what was modified, added, or deleted.
Second, the system maps those changed files to the user flows they affect. This is the hard part. A file like src/components/CheckoutForm.tsx doesn't exist in isolation. It's rendered inside a checkout page, which is part of an "add to cart, enter payment, confirm order" flow. The mapping layer needs to understand that changing CheckoutForm.tsx means the checkout flow, the payment flow, and the order confirmation flow all need testing. But the user settings flow, the admin dashboard, and the onboarding wizard are unaffected and can be skipped.
Third, the system selects or generates only the tests that cover those affected flows. Instead of running all 200 tests in the suite, it runs the 5 tests that exercise checkout-related paths. Those 5 tests run in 3 to 5 minutes. The other 195 tests don't run at all because the code change can't possibly affect them.
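The three steps can be sketched in a few lines of Python. This is a minimal illustration, not Zerocheck's implementation; the flow map, file paths, and function names here are hypothetical:

```python
from typing import Dict, List, Set

# Hypothetical flow map (step 2): which source files trigger which
# user flows, and which tests cover each flow.
FLOW_MAP: Dict[str, Dict[str, List[str]]] = {
    "checkout": {
        "triggers": ["src/components/CheckoutForm.tsx", "src/api/payments.ts"],
        "tests": ["add-item-to-cart", "complete-checkout-flow"],
    },
    "user-settings": {
        "triggers": ["src/components/SettingsPanel.tsx"],
        "tests": ["update-profile-info"],
    },
}

def select_tests(changed_files: List[str]) -> Set[str]:
    """Steps 1-3: take the diff's changed-file list, find the flows
    whose triggers intersect it, and return only their tests."""
    selected: Set[str] = set()
    for flow in FLOW_MAP.values():
        if any(path in flow["triggers"] for path in changed_files):
            selected.update(flow["tests"])
    return selected

# A PR touching only CheckoutForm.tsx selects the two checkout tests;
# the user-settings tests are skipped entirely.
print(sorted(select_tests(["src/components/CheckoutForm.tsx"])))
# → ['add-item-to-cart', 'complete-checkout-flow']
```

The whole mechanism reduces to set intersection once the mapping exists, which is why the mapping layer is where all the real difficulty lives.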
Here's what a conceptual flow mapping configuration looks like:
# flow-map.yml - maps source files to user flows
flows:
  checkout:
    triggers:
      - src/components/CheckoutForm.tsx
      - src/components/CartSummary.tsx
      - src/api/payments.ts
      - src/hooks/useCart.ts
    tests:
      - add-item-to-cart
      - apply-discount-code
      - complete-checkout-flow
      - payment-error-handling
      - order-confirmation-email
  user-settings:
    triggers:
      - src/components/SettingsPanel.tsx
      - src/api/user-preferences.ts
    tests:
      - update-profile-info
      - change-password
      - toggle-notifications
  onboarding:
    triggers:
      - src/components/OnboardingWizard.tsx
      - src/api/onboarding.ts
    tests:
      - complete-onboarding-flow
      - skip-optional-steps
      - onboarding-progress-persistence

# PR changes: src/components/CheckoutForm.tsx
# Result: only "checkout" tests run (5 tests, ~4 min)
# Skipped: user-settings tests, onboarding tests (195 tests)

There are three main approaches to deciding which E2E tests to run on a pull request. Each makes a different trade-off between speed, coverage, and relevance.
Full suite execution is the default for most teams. Every PR triggers every test. For a 200-test suite, that means roughly 45 minutes of CI time per PR. You get 100% test coverage on every change, which sounds good in theory. In practice, 95% of those tests are checking code paths that the PR never touched. The result: long wait times, high CI costs, and a flood of flaky failures from unrelated tests that erode engineer trust in the pipeline. When a checkout test flakes on a PR that only changed the settings page, nobody investigates because everyone knows it's noise.
Predictive test selection (offered by tools like Launchable and Buildkite Test Analytics) takes a statistical approach. These tools analyze historical test results to identify which tests tend to fail and which almost always pass. They then skip the tests that "usually pass" to save time. A 200-test suite might drop to 60 tests and finish in 15 minutes. The problem: this approach optimizes for speed, not relevance. It skips tests based on their historical pass rate, not based on whether the current code change affects them. A test that rarely fails might be exactly the test you need when someone modifies the code it covers. Predictive selection is better than running everything, but it can miss real regressions on stable code paths.
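To make that failure mode concrete, here is a toy sketch of pass-rate-based selection. The numbers and threshold are invented, and this is not how Launchable or Buildkite actually score tests; the point is only that the PR diff never enters the decision:

```python
from typing import Dict, Set

# Invented historical pass rates: test name -> fraction of recent runs passed.
PASS_RATES: Dict[str, float] = {
    "complete-checkout-flow": 0.999,  # very stable
    "payment-error-handling": 0.92,   # fails often
    "update-profile-info": 0.998,     # very stable
}

def predictive_select(pass_rates: Dict[str, float],
                      threshold: float = 0.995) -> Set[str]:
    """Keep only tests that fail often enough to seem 'worth running'.
    Note: the PR diff is never consulted."""
    return {name for name, rate in pass_rates.items() if rate < threshold}

# The stable checkout test is skipped even if this PR changed checkout code.
print(predictive_select(PASS_RATES))  # → {'payment-error-handling'}
```

The stable `complete-checkout-flow` test gets dropped regardless of what changed, which is exactly the regression-on-stable-code-paths gap described above.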
Diff-aware test selection takes a fundamentally different approach. Instead of asking "which tests usually fail?", it asks "which tests are affected by this specific change?" It analyzes the PR diff, maps changed files to user flows, and runs only the tests that exercise those flows. A 200-test suite might drop to 5 to 10 tests and finish in 3 to 8 minutes. Every test that runs is directly relevant to the change. If all 5 tests pass, you have high confidence that the change didn't break anything it touches. If a test fails, it's almost certainly a real issue, not a flaky coincidence.
The trade-offs are worth stating honestly. Full suite execution is simple to set up and guaranteed to run everything, but it's slow and noisy. Predictive selection is faster and requires no code-to-flow mapping, but it optimizes for the wrong thing. Diff-aware selection is the fastest and most relevant, but it requires understanding how your codebase maps to user flows. It can also miss indirect dependencies: if changing a shared utility function affects a flow that isn't explicitly mapped, a naive diff-aware system might skip the relevant test. Good diff-aware implementations handle this by tracking transitive dependencies, but it's a real limitation that teams should understand.
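The indirect-dependency gap can be closed by expanding the changed-file set through the import graph before matching flows. A minimal sketch, assuming a hypothetical reverse-import map (a real system would derive this from the module graph rather than hard-coding it):

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical reverse-import edges: file -> files that import it.
IMPORTED_BY: Dict[str, List[str]] = {
    "src/utils/currency.ts": [
        "src/components/CheckoutForm.tsx",
        "src/components/CartSummary.tsx",
    ],
}

def expand_changes(changed: Set[str]) -> Set[str]:
    """BFS over reverse imports: a change to a shared utility counts as
    a change to every file that transitively depends on it."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in IMPORTED_BY.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

# A change to the shared currency util now implicates both checkout
# components, so the checkout flow's triggers still match.
print(sorted(expand_changes({"src/utils/currency.ts"})))
```

Only after this expansion does the trigger matching run, so a utility edit still lights up every flow that transitively renders it.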
The time savings from diff-aware testing are straightforward to calculate, and they compound quickly across a team.
Consider a mid-size engineering team with a 200-test E2E suite. Each full suite run takes 45 minutes. The team merges roughly 50 PRs per week, and each PR triggers at least one full suite run (often two or three after rebases and review changes, but let's be conservative with one).
With full suite execution, that's 50 PRs multiplied by 45 minutes, which equals 2,250 minutes (37.5 hours) of CI time per week just for E2E tests. Each engineer waits 45 minutes for feedback on every PR. In practice, they context-switch to something else and come back later, which research from Microsoft suggests adds 15 to 25 minutes of cognitive ramp-up time on top of the raw CI wait.
With diff-aware testing, the average PR touches 2 to 3 files and maps to 5 to 10 relevant tests. Those tests run in 5 to 8 minutes. Using 6 minutes as the average, that's 50 PRs multiplied by 6 minutes, which equals 300 minutes (5 hours) of CI time per week. That's an 87% reduction in E2E CI time.
Here's the ROI calculation in concrete terms:
# Weekly CI time comparison
# Team: 10 engineers, 50 PRs/week, 200-test E2E suite
Full suite: 50 PRs x 45 min = 2,250 min/week (37.5 hrs)
Diff-aware: 50 PRs x 6 min = 300 min/week ( 5.0 hrs)
---------------------
Time saved: 1,950 min/week (32.5 hrs)
# Developer wait time impact
Full suite: 45 min feedback loop + ~20 min context-switch tax
Diff-aware: 6 min feedback loop (short enough to wait for it)
# Annual CI cost savings (assuming $0.08/min for cloud CI)
Full suite: 2,250 min x 52 weeks x $0.08 = $9,360/year
Diff-aware: 300 min x 52 weeks x $0.08 = $1,248/year
-----------
Savings: $8,112/year
# That's just the CI compute cost. The real savings are in
# developer time: 32.5 hours/week of faster feedback loops
# across the team, every single week.

Given the obvious benefits, you might wonder why diff-aware test selection isn't already standard in every CI pipeline. The answer is that it's a genuinely hard technical problem, and the difficulty lies in the mapping layer.
Traditional E2E test frameworks (Playwright, Cypress, Selenium) know nothing about your codebase's architecture. A Playwright test opens a browser, navigates to a URL, clicks buttons, and checks results. It has no idea which source files are involved in rendering that page or handling that click. The test and the code exist in completely separate worlds.
To build diff-aware selection, you need to bridge that gap. You need a system that can answer: "If src/api/payments.ts changed, which E2E tests need to run?" That requires understanding the application's architecture, which components render on which pages, which API routes serve which features, and which shared utilities are used across which flows.
Some teams try to solve this manually by adding tags or labels to their tests and maintaining a mapping file. This works initially but falls apart quickly. As the codebase evolves, the manual mapping gets stale. New files are added without updating the map. Shared components get used in new places without anyone realizing the test coverage implications. Within a few months, the mapping is so out of date that teams either abandon it or assign someone to maintain it full-time.
Static analysis tools can partially automate this by tracing import graphs and identifying which test files transitively depend on which source files. But E2E tests don't import application code directly. They interact with a running application through a browser. There's no import chain to follow from a Playwright test to the React component it exercises.
Zerocheck's approach sidesteps this. Because Zerocheck reads the PR diff and understands what the code change does at a semantic level, it can map changes to affected user flows without requiring a manually maintained mapping file or a static import graph. It sees that a change to CheckoutForm.tsx affects the checkout flow because it understands what CheckoutForm does in the context of the application. This is the same reason Zerocheck can generate tests from PR context rather than requiring engineers to write them manually.
The result is diff-aware test selection that stays accurate as the codebase evolves, without any additional maintenance burden on the engineering team. The mapping updates itself because it's derived from understanding the code, not from a static configuration file that someone has to remember to update.