Engineering

Flaky tests: why they happen, how to find them, how to fix them

By Sergei Pustovalov · 9 May 2026 · 8 min read

Flaky tests are the single largest reason teams stop trusting their test suites. A test that passes 9 times out of 10 isn't telling you anything useful. It's a coin flip with extra steps, and after a few weeks of the dashboard going amber for no reason, the team learns to ignore failures. That's how real regressions slip through.

This article covers the five common root causes of flakiness, how to find which of your tests are actually flaky (versus broken), and how to decide between fixing and deleting each one.

Five root causes

The vast majority of flaky tests trace back to one of these five categories:

1. Timing / race conditions

The test asserts on something before the page is ready. The element is in the DOM but not yet visible, or visible but not yet hydrated, or hydrated but the click handler hasn't attached. The test passes when the page happens to be fast and fails when it isn't.

Symptom: the test fails on slow CI runners but passes locally, or fails roughly 1 run in 10 with no discernible pattern.

Fix: replace fixed timeouts (await sleep(2000)) with explicit waits for the actual condition (await expect(button).toBeVisible()). Modern Playwright and Cypress auto-wait for elements to become actionable; use that instead of sleeping.
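
A minimal before/after in Playwright. The Save button and the relative URL are illustrative, and the relative URL assumes a configured baseURL:

    import { test, expect } from '@playwright/test';

    test('save button becomes clickable', async ({ page }) => {
      await page.goto('/settings');

      // Bad: guesses how long hydration takes; fails on slow runners.
      // await page.waitForTimeout(2000);

      // Good: waits for the actual condition, up to the configured timeout.
      const save = page.getByRole('button', { name: 'Save' });
      await expect(save).toBeVisible();
      await save.click(); // click() also auto-waits for actionability
    });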

2. Network instability

The test depends on a real network call (your API, a third-party service, a CDN). Sometimes the request takes 200ms, sometimes 4 seconds, sometimes it fails entirely. The test fails when the network is slow.

Symptom: failures correlate with deploys, third-party outages, or specific times of day.

Fix: increase timeouts to realistic values (5-30 seconds for navigation, 5-10 seconds for API calls). For external services, mock them. For your own API, run the test against a known-stable staging environment, not prod.
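
For third-party calls, Playwright's request interception takes the network out of the equation entirely. A sketch; the geolocation service, URL pattern, and payload are invented for illustration:

    import { test, expect } from '@playwright/test';

    test('checkout works when the geo API is slow or down', async ({ page }) => {
      // Intercept the third-party call and answer it locally, so the test
      // no longer depends on that service's latency or uptime.
      await page.route('**/geo.example.com/**', (route) =>
        route.fulfill({
          status: 200,
          contentType: 'application/json',
          body: JSON.stringify({ country: 'DE' }),
        })
      );

      await page.goto('/checkout');
      await expect(page.getByText('Shipping to Germany')).toBeVisible();
    });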

3. Test order dependency

Test A leaves state that Test B implicitly relies on. When A is skipped or runs after B, B fails. Common with shared databases, shared user accounts, shared cache.

Symptom: the test fails in CI but not locally, or fails when you run it in isolation, or passes/fails depending on test execution order.

Fix: every test should set up its own state and clean up after itself. If two tests share a user, give them different users. If they share a project, give them different projects. The cost of isolation is real but smaller than the cost of intermittent flakes.
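
In practice, isolation means every test provisions its own data. A sketch, assuming a configured baseURL; createUser and the /api/users endpoint are hypothetical stand-ins for your own setup helpers:

    import { test, expect, type APIRequestContext } from '@playwright/test';
    import { randomUUID } from 'node:crypto';

    // Hypothetical helper: provisions a throwaway user via your own API.
    async function createUser(request: APIRequestContext): Promise<string> {
      const email = `e2e-${randomUUID()}@example.test`;
      await request.post('/api/users', {
        data: { email, password: 'test-password' },
      });
      return email;
    }

    test('project list starts empty for a fresh user', async ({ page, request }) => {
      const email = await createUser(request); // this test's own user, not a shared one
      await page.goto('/login');
      await page.getByLabel('Email').fill(email);
      await page.getByLabel('Password').fill('test-password');
      await page.getByRole('button', { name: 'Log in' }).click();
      await expect(page.getByText('No projects yet')).toBeVisible();
    });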

4. State pollution

The test leaves cookies, localStorage, or browser state that affects later runs. Or the test runs against a database that wasn't reset between runs.

Symptom: the first run after a long break passes, subsequent runs fail. Or vice versa.

Fix: clear browser state before each test (most frameworks support this with a one-line config). Reset the test database between runs, or use ephemeral test data with a unique-per-run identifier (UUID prefix on user emails, project names, etc.).
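
Playwright already gives every test a fresh browser context, so cookies and localStorage don't leak between tests by default. If your setup reuses a long-lived session for speed, explicit cleanup looks roughly like this; the /api/test/reset endpoint is a hypothetical stand-in for your own database reset:

    import { test } from '@playwright/test';

    test.beforeEach(async ({ context, page, request }) => {
      // Wipe browser-side state left by earlier tests.
      await context.clearCookies();
      await page.goto('/'); // localStorage is only reachable on a real page
      await page.evaluate(() => {
        localStorage.clear();
        sessionStorage.clear();
      });

      // Hypothetical: reset server-side fixtures before each test.
      await request.post('/api/test/reset');
    });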

5. Resource contention

The test runner is sharing CPU, memory, or browser instances with other workloads. The browser runs slowly, animations don't complete, JavaScript timers fire late.

Symptom: failures correlate with parallel test execution or busy CI.

Fix: reduce parallelism if your CI runner is undersized. Or upsize the runner. Don't run tests on the smallest free tier of your CI provider.
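
In Playwright, both knobs live in the config. A sketch; the worker count is a starting guess to tune against your runner size:

    // playwright.config.ts
    import { defineConfig } from '@playwright/test';

    export default defineConfig({
      // Cap parallelism on undersized CI runners; use all cores locally.
      workers: process.env.CI ? 2 : undefined,
      // Give slow runners room before declaring a timeout.
      timeout: 30_000,
    });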

How to find them

A test that fails once is not flaky; neither is one that has failed twice. Flakiness is a pattern: a test is flaky when it fails sometimes and passes other times with no relevant code change between runs.

The way to identify flakes is to look at run history, not single runs. If you have 30 runs of a test and 4 failed across the same week, with no commit on the affected code path, that's a flake. If 30 runs all passed and the latest one failed, that might be a real regression.

Heuristics that work in practice:

  • Pass rate over the last 30 runs. Below 90% with no recent code change means flaky (a sketch of this check follows the list).
  • Failure mode consistency. Real regressions fail the same way every time. Flakes fail in different ways (timeout here, assertion mismatch there).
  • Time-of-day correlation. Flakes that always fail at 2am are usually network-related.
  • CI vs local divergence. If a test only fails in CI, it's almost certainly a flake (timing, resource, or environment issue).
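
If your CI exposes per-test run history, the core heuristic fits in a few lines. A minimal sketch; fetching the history and the "changed recently" signal from your CI and VCS is assumed, not shown:

    // Classify a test from its recent run history (true = pass).
    type Verdict = 'healthy' | 'regressed' | 'flaky' | 'review';

    function classify(last30: boolean[], changedRecently: boolean): Verdict {
      const passRate = last30.filter(Boolean).length / last30.length;
      const onlyLatestFailed =
        last30.slice(0, -1).every(Boolean) && !last30.at(-1);

      if (passRate === 1) return 'healthy';
      // Clean history until the newest run: likely a real regression.
      if (onlyLatestFailed) return 'regressed';
      // Intermittent failures with no commit on the code path: a flake.
      if (passRate < 0.9 && !changedRecently) return 'flaky';
      // Borderline or recently changed: needs a human look.
      return 'review';
    }

For the example above (4 failures across 30 runs, no commit on the code path), classify returns 'flaky'; 29 passes followed by a single fresh failure returns 'regressed'.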

Most testing tools either don't track this or surface it badly. Regresco classifies flaky tests automatically based on run history and tags them in the dashboard so you don't have to triage manually. More on the classification approach here.

The delete-vs-fix decision

Once you've identified a flaky test, the question is: fix it or delete it?

Fix it if:

  • It covers a critical user path (login, checkout, signup)
  • The flakiness is one of the five categories above and the fix is obvious
  • You can fix it in under an hour

Delete it if:

  • It tests a low-traffic flow that customers won't notice if it breaks
  • You've tried to fix it twice and it stayed flaky
  • The fix would require redesigning the test from scratch and you don't have time
  • The test has been skipped or quarantined for more than two months

Most teams keep too many flaky tests around because deleting feels wasteful. It isn't. A flaky test is worse than no test, because it trains the team to ignore failures. If you can't fix it, kill it. The team's trust in the dashboard is the real asset, not the test count.

Prevention

The best flaky-test strategy is to write fewer of them in the first place. Practices that help:

  • Use auto-waiting selectors. Never sleep for a fixed duration.
  • Test against a stable staging environment, not production.
  • Isolate test state. Each test should bring its own data.
  • Run new tests 10 times in a row before merging. If they pass 10/10, ship them. If 9/10, fix the flake before merging (a burn-in sketch follows the list).
  • Track pass rates per test over time. New tests that drop below 95% in their first month get auto-quarantined for review.
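
With Playwright, the burn-in run is one flag on the command line, or one config option if you want a dedicated pre-merge job. A sketch; the spec path and config file name are hypothetical:

    // One-off burn-in from the command line:
    //   npx playwright test tests/new-feature.spec.ts --repeat-each 10

    // Or a dedicated config for the pre-merge burn-in job
    // (playwright.burnin.config.ts, a hypothetical file name):
    import { defineConfig } from '@playwright/test';

    export default defineConfig({
      repeatEach: 10, // run every matched test 10 times in a row
      retries: 0,     // no retries: a single failure should surface, not be masked
    });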

Per-test pass-rate tracking is the most underrated item on that list. Teams ship a new test, it joins the suite, and no one notices when its pass rate drops below an acceptable level. Pass-rate alerting on individual tests catches the decay before it pollutes the dashboard.

Tired of debating "is it real or flaky?"

Regresco classifies failures as regression, broken locator, or flaky based on run history, so the dashboard tells you which red runs actually need your attention. The free plan includes 5 runs a month.