Flaky tests erode trust in your CI pipeline. Here's our systematic approach to quarantining, root-causing, and eliminating non-deterministic failures.
A flaky test is one that passes and fails intermittently without any code change. Sounds minor — until you realize what it does to your team. When tests flake, developers stop trusting the suite. They re-run failures "just to check." They merge despite red builds. Eventually, the entire CI pipeline becomes background noise that nobody watches.
The real cost isn't the flaky test itself — it's the erosion of the testing culture you worked so hard to build.
The moment a test is identified as flaky, move it to a quarantine suite. It still runs, but it doesn't block the build. This preserves CI trust while you investigate.
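One way to picture the quarantine rule is as a gate on the build result. The sketch below is illustrative only: `Outcome`, `build_should_fail`, and `quarantined_failures` are hypothetical names rather than part of any particular CI tool, and the test names are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Outcome:
    """Result of one test in a CI run (hypothetical record type)."""
    name: str
    passed: bool


def build_should_fail(results: Iterable[Outcome], quarantined: Set[str]) -> bool:
    """Fail the build only when a test outside the quarantine list fails."""
    return any(not r.passed and r.name not in quarantined for r in results)


def quarantined_failures(results: Iterable[Outcome], quarantined: Set[str]) -> List[str]:
    """Quarantined tests still run, so their failures are reported, not hidden."""
    return [r.name for r in results if not r.passed and r.name in quarantined]


if __name__ == "__main__":
    run = [Outcome("test_login", True), Outcome("test_checkout_total", False)]
    quarantine = {"test_checkout_total"}  # assumed quarantine list
    print("build fails:", build_should_fail(run, quarantine))
    print("quarantined failures:", quarantined_failures(run, quarantine))
```

The second function matters as much as the first: a quarantined test that fails silently is just a deleted test with extra steps.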
Run the flaky test 50 to 100 times in isolation. If it fails at a steady rate (e.g., 15% of runs), you have a reproducible pattern to work with. Then classify the root cause: timing, shared state, environment, or data.
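Measuring the failure rate can be as simple as a loop around the test. Here is a minimal sketch, assuming the test is a plain callable that raises on failure; `measure_failure_rate` and `make_every_nth_failure` are hypothetical helpers, and the second one only exists to simulate a flaky test for the demo.

```python
from typing import Callable


def measure_failure_rate(test: Callable[[], None], runs: int = 100) -> float:
    """Run `test` repeatedly and return the fraction of runs that raised."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:  # AssertionError and any other test error count as a failure
            failures += 1
    return failures / runs


def make_every_nth_failure(n: int) -> Callable[[], None]:
    """Build a stand-in 'flaky' test that fails on every n-th call."""
    calls = {"count": 0}

    def test() -> None:
        calls["count"] += 1
        assert calls["count"] % n != 0, "simulated intermittent failure"

    return test


if __name__ == "__main__":
    rate = measure_failure_rate(make_every_nth_failure(4), runs=100)
    print(f"failure rate: {rate:.0%}")
```

A real flaky test will not fail on a neat schedule like the stand-in does, so treat the measured rate as an estimate and rerun the loop if the number moves around a lot.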
Adding retry logic or increasing timeouts is a band-aid, not a fix. For timing issues, use proper wait-for-condition patterns. For shared state, ensure test isolation. For environment variance, containerize your CI runner.
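For the timing case, a wait-for-condition helper polls for the state the test actually needs instead of sleeping for a fixed time and hoping. The sketch below is one possible shape, not a reference implementation; `wait_until` and its default `timeout` and `interval` values are assumptions for illustration.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> None:
    """Poll `condition` until it returns True, or raise TimeoutError.

    Returns as soon as the condition holds, so a fast system is not
    slowed down and a slow one gets the full timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    started = time.monotonic()
    # Stand-in for "the job finished" or "the row appeared in the database".
    wait_until(lambda: time.monotonic() - started > 0.2, timeout=2.0)
    print("condition met")
```

The difference from a blanket timeout increase is that the wait is tied to an observable condition, so a timeout here means the condition never became true, which is a real failure worth reading.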
After the fix, track the test's pass rate for two weeks before promoting it back to the main suite. If it flakes again, the root-cause analysis was incomplete.
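The promotion rule can be written down as a small check. This sketch assumes a strict reading of the paragraph above: the test must have run with zero failures since the fix, and at least 14 days must have passed. `ready_to_promote` and the shape of `runs` (pairs of run date and pass/fail) are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple


def ready_to_promote(runs: Iterable[Tuple[date, bool]],
                     fixed_on: date,
                     today: date,
                     observation_days: int = 14) -> bool:
    """Return True when a quarantined test may rejoin the main suite.

    Assumed policy: at least one run since the fix, no failures since
    the fix, and a full observation window elapsed.
    """
    since_fix = [passed for day, passed in runs if day >= fixed_on]
    observed_long_enough = today - fixed_on >= timedelta(days=observation_days)
    return observed_long_enough and bool(since_fix) and all(since_fix)


if __name__ == "__main__":
    fixed = date(2024, 1, 1)  # made-up date for the demo
    history = [(fixed + timedelta(days=i), True) for i in range(14)]
    print(ready_to_promote(history, fixed, today=fixed + timedelta(days=14)))
```

If your policy tolerates a small residual failure rate, swap the `all(...)` check for a threshold, but be honest that you are then accepting a known flake back into the blocking suite.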
A healthy suite has a flake rate below 0.5%. Above 2%, you have a systemic problem that needs dedicated investment. Above 5%, your CI pipeline is effectively decorative.
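Those thresholds are easy to turn into a dashboard check. In the sketch below, the flake rate is assumed to be the percentage of test runs that were flaky (failed, then passed on an unchanged commit); the article does not pin the definition down, so adjust it to whatever your team measures. The label for the 0.5% to 2% band is also an assumption, since only the other three bands are named above.

```python
def flake_rate_percent(flaky_runs: int, total_runs: int) -> float:
    """Flake rate as a percentage of all runs (assumed definition)."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return 100 * flaky_runs / total_runs


def classify_flake_rate(rate_percent: float) -> str:
    """Map a flake rate (in percent) to the bands described above."""
    if rate_percent > 5:
        return "decorative"  # CI signal is effectively ignored
    if rate_percent > 2:
        return "systemic"    # needs dedicated investment
    if rate_percent < 0.5:
        return "healthy"
    return "elevated"        # assumed label for the unnamed 0.5%-2% band


if __name__ == "__main__":
    rate = flake_rate_percent(flaky_runs=12, total_runs=1000)  # made-up numbers
    print(f"{rate:.1f}% -> {classify_flake_rate(rate)}")
```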
The best flake strategy is writing stable tests from the start. Our frameworks enforce patterns that prevent the most common flake causes: auto-waiting selectors, isolated test contexts, deterministic test data factories, and containerized execution environments. Prevention costs less than every triage cycle you'll ever run.
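To make one of those patterns concrete, here is a rough sketch of a deterministic test data factory. It is a generic illustration, not code from any specific framework: `User` and `UserFactory` are hypothetical, and the idea is simply that IDs come from a counter and "random" fields come from a private, seeded generator, so every run builds the same data.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    email: str
    age: int


class UserFactory:
    """Builds the same sequence of users for the same seed."""

    def __init__(self, seed: int = 1234) -> None:
        # A private Random instance keeps tests independent of the
        # global random state and of each other.
        self._rng = random.Random(seed)
        self._next_id = 1

    def build(self) -> User:
        user_id = self._next_id
        self._next_id += 1
        return User(
            id=user_id,
            email=f"user{user_id}@example.test",
            age=self._rng.randint(18, 90),
        )


if __name__ == "__main__":
    first = [UserFactory(seed=7).build() for _ in range(1)]
    second = [UserFactory(seed=7).build() for _ in range(1)]
    print(first == second)
```

Creating a fresh factory per test keeps one test's data from depending on how many records an earlier test happened to build.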