Engineering

How Regresco classifies test failures: regression vs broken locator vs flaky

By Sergei Pustovalov · 9 May 2026 · 7 min read

A red regression dashboard tells you nothing on its own. Three failures could mean three real product bugs, three brittle selectors that need updating, three flaky steps that will pass on the next run, or any combination. Most teams treat all three the same way: glance, sigh, ignore. That's how regressions reach production.

The default user experience of test runners is "did it pass or fail?" That's the wrong primary signal for a small team without dedicated QA. The right primary signal is "what kind of failure is this and does it need my attention?"

This is how we built Regresco's classifier. It's not magic; it's heuristics applied consistently. Most teams could implement something similar in their own setup if they invested the time.

The three categories

Every failed step gets classified into one of:

REGRESSION

The product actually broke. The test was reasonable, the selectors were stable, and on this run the expected outcome didn't happen. This is the category that matters: real bugs that need a fix before promoting to production.

BROKEN_LOCATOR

The product is fine, but the test was looking for a CSS class, ID, or text that no longer exists. The recent UI refactor renamed something. The fix is to update the selector, not roll back the deploy.

FLAKY

The step has been failing intermittently across runs with no relevant code change. Timing race condition, network instability, state pollution. More on flake categories here. The right action depends on severity and the importance of the path being tested.
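
In code form, the taxonomy is just a three-value union. A minimal TypeScript sketch (the type and field names here are illustrative, not Regresco's actual schema):

    type FailureCategory = 'REGRESSION' | 'BROKEN_LOCATOR' | 'FLAKY';

    interface ClassifiedFailure {
      stepId: string;
      category: FailureCategory;
      rawError: string; // always kept alongside the verdict so triage can override
    }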

How we decide

The classifier looks at three signals: the error message, the failure mode, and the run history of that specific step.

Error message patterns

Browser-test errors come in recognizable shapes. Some are deterministic signals:

  • "strict mode violation: locator resolved to N elements" → BROKEN_LOCATOR (the selector is matching the wrong thing now)
  • "net::ERR_CONNECTION_REFUSED" or SSL errors → FLAKY (network, not a real product break)
  • "Target page closed" or "browser crashed" → FLAKY (transient browser instability)
  • "Timeout exceeded waiting for selector" with no recent matching DOM → BROKEN_LOCATOR (element disappeared, not a regression in behavior)
  • Assertion mismatch (expected X, got Y) with no error keywords → REGRESSION (the product behaved unexpectedly, the test was right)

Each pattern was distilled from real failure logs we've seen. The list grows as we encounter new shapes.
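
As a rough sketch, the pattern table can be an ordered list of regexes checked against the raw error, first match wins. The regexes below are simplified from the bullets above; a production table carries more patterns and extra conditions (the timeout rule, for instance, also checks whether the selector recently matched the DOM):

    type FailureCategory = 'REGRESSION' | 'BROKEN_LOCATOR' | 'FLAKY';

    // Ordered: the first matching pattern decides the verdict.
    const ERROR_PATTERNS: Array<[RegExp, FailureCategory]> = [
      [/strict mode violation: locator resolved to \d+ elements/i, 'BROKEN_LOCATOR'],
      [/net::ERR_CONNECTION_REFUSED|ERR_SSL/i, 'FLAKY'],
      [/target page.*closed|browser.*crashed/i, 'FLAKY'],
      [/timeout.*waiting for (selector|locator)/i, 'BROKEN_LOCATOR'],
    ];

    function classifyByError(error: string): FailureCategory {
      for (const [pattern, category] of ERROR_PATTERNS) {
        if (pattern.test(error)) return category;
      }
      // No recognizable error shape: treat as an assertion mismatch,
      // i.e. the product behaved unexpectedly and the test was right.
      return 'REGRESSION';
    }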

Failure mode + step type

Some patterns only make sense relative to what the step was trying to do. A "click" step that fails with "not visible" usually means the element is now hidden behind a modal or overlay (BROKEN_LOCATOR). A "fill" step that fails with the same error usually means the form changed structure (also BROKEN_LOCATOR). An "assert" step that fails with the expected text not appearing means the page rendered something different (REGRESSION).

The classifier checks step type and adjusts the verdict.
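
A simplified sketch of that adjustment (the step type names and error phrases are illustrative):

    type FailureCategory = 'REGRESSION' | 'BROKEN_LOCATOR' | 'FLAKY';
    type StepType = 'click' | 'fill' | 'assert';

    function adjustForStepType(
      verdict: FailureCategory,
      step: StepType,
      error: string,
    ): FailureCategory {
      // "not visible" on an interaction step points at the test, not the
      // product: the element moved or is now covered by a modal/overlay.
      if ((step === 'click' || step === 'fill') && /not visible/i.test(error)) {
        return 'BROKEN_LOCATOR';
      }
      // An assert step whose expected text never appeared means the page
      // rendered something different: that's the product, not the selector.
      if (step === 'assert' && /expected .*(contain|have) text/i.test(error)) {
        return 'REGRESSION';
      }
      return verdict;
    }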

Run history

The strongest signal is what this specific step has done across the last 30 runs. Three patterns:

  • Always passed, fails this run, recent code change on the affected path → REGRESSION (high confidence)
  • Passes most runs, fails 1-3 times in a window → FLAKY (mark with severity badge)
  • Started failing consistently from a specific run onwards → either REGRESSION (if behavior changed) or BROKEN_LOCATOR (if the test setup changed). Cross-checked with the error pattern.

History resolves ambiguous cases that the error message alone can't. A timeout error on a step that has timed out 6 times in the last 20 runs is FLAKY. A timeout on a step that has never timed out before, in the same week as a deploy that touched that area, is REGRESSION until proven otherwise.
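
In code, the history check is little more than a failure count over the window. A simplified sketch; the field names and exact structure are ours, only the thresholds come from the description above:

    type FailureCategory = 'REGRESSION' | 'BROKEN_LOCATOR' | 'FLAKY';

    interface StepHistory {
      results: boolean[];         // pass/fail for the last 30 runs, oldest first
      deployTouchedPath: boolean; // did a recent deploy touch the affected path?
    }

    // Returns null when history alone can't decide; the error pattern breaks the tie.
    function classifyByHistory(h: StepHistory): FailureCategory | null {
      const past = h.results.slice(0, -1); // everything before the current, failing run
      const pastFailures = past.filter(passed => !passed).length;

      // Clean history, sudden failure, deploy in the area: regression
      // until proven otherwise.
      if (pastFailures === 0 && h.deployTouchedPath) return 'REGRESSION';

      // A few scattered failures in the window: flaky, badged by rate.
      if (pastFailures >= 1 && pastFailures <= 3) return 'FLAKY';

      // Failing consistently from a specific run onwards is ambiguous on
      // history alone; cross-check with the error pattern.
      return null;
    }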

Why three categories and not five

Earlier versions had separate buckets for "network error", "browser crash", "selector ambiguity", and "assertion mismatch". User feedback was unanimous: too many categories, hard to know what to do with each.

The actual decision a triage person makes is binary at first ("does this need my attention right now?") and, if the answer is yes, binary again ("do I fix the product or fix the test?"). Three categories map cleanly to three actions: roll back the deploy (REGRESSION), update the test (BROKEN_LOCATOR), or wait and see if it self-resolves (FLAKY). More granularity is just noise at this stage.

What we don't classify

Two things the classifier explicitly doesn't try to do:

  • Severity ranking within a category. A REGRESSION on the login flow and a REGRESSION on the help page are both labeled REGRESSION. The triage person decides which is more urgent. We don't try to prioritize across user paths.
  • Fix suggestions for REGRESSION. If we knew what the bug was, we could fix it. The classifier's job is to flag the failure as worth investigating, not to do the investigation.

For BROKEN_LOCATOR specifically, we do go further: the auto-heal layer tries alternative selectors at runtime, and if one passes, the step gets marked as auto-healed with the suggested replacement surfaced in the editor. But that's a separate system from the classifier.
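
The core of the fallback loop is easy to sketch, though. Assuming a Playwright-style locator API (candidate generation and the healed-step bookkeeping are where the real work is):

    import { Page } from 'playwright';

    interface HealResult {
      healed: boolean;
      usedSelector?: string; // surfaced in the editor as the suggested replacement
    }

    // Try alternative selectors in order; the first one that resolves to
    // exactly one element heals the step for this run.
    async function tryAutoHeal(page: Page, candidates: string[]): Promise<HealResult> {
      for (const selector of candidates) {
        if (await page.locator(selector).count() === 1) {
          return { healed: true, usedSelector: selector };
        }
      }
      return { healed: false };
    }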

When the classifier is wrong

The classifier is right most of the time, not always. Edge cases that trip it up:

  • A real regression that happens to manifest as a network error (an upstream service started returning 500s because of a deploy)
  • A flaky step that always fails the same way, looking like a deterministic regression
  • A BROKEN_LOCATOR that's actually a regression (the missing element wasn't refactored, it failed to render)

For these, the dashboard shows the classifier's verdict but doesn't hide the raw error and step history, so the triage person can override. The cost of a wrong verdict on any single failure is small as long as the classifier is right on the bulk of them and the override is one click.

See the classifier in your own dashboard

Free plan, 5 runs a month, no card. Run a few flows, intentionally break one, and watch the classification badges update across runs.