Understanding E2E Test Flakiness and Strategies for Improvement

This article looks at a common challenge in building an end-to-end (E2E) test suite: flakiness. Flaky tests are tests that fail unexpectedly even though the code they exercise works as intended. Given the inherent complexity of E2E tests, they are unlikely to ever be as stable as unit tests; the goal is to keep flakiness low enough that the tests remain useful.

The Challenge

Instability in E2E tests can lead to various issues. On one hand, it undermines the advantages these tests are intended to provide; on the other hand, it raises the maintenance costs of the suite. Let’s examine some of the ways this instability manifests.

Frustration

One of the most immediate effects is the frustration it causes among team members who must repeatedly confront random E2E failures. Depending on your continuous integration (CI) setup, such failures might necessitate manual test restarts, disrupt subsequent integration steps, or simply slow down the entire process. This ongoing annoyance can make it harder to persuade developers to create and maintain E2E tests.

Random Variability

Beyond mere annoyance, random variability in E2E tests complicates the identification of issues that arise in a nondeterministic fashion. For example, if a feature in your application fails only once every 20 attempts, E2E tests should ideally catch this problem. However, if your team tends to rerun tests until they succeed, you may miss the subtle signals hidden within the noise.

Trust Erosion

When tests are frequently rerun due to errors, it becomes challenging to trust the results. Flaky tests breed skepticism about their reliability, causing the team to regard failing E2E CI jobs as mere nuisances. This undermines the very benefits of integrating automated tests into your workflow.

Statistical Overview

Before we continue, let’s revisit the foundational statistics of E2E tests. Similar to medical testing, an E2E suite can be seen as a mechanism for detecting bugs:

Positive result: Some tests fail, indicating a regression in the application.
Negative result: All tests pass, suggesting no issues.

False Positives

Flaky tests produce false positives: they fail even though no regression has occurred. We can quantify flakiness as a probability, estimated as the ratio of false positive results to the total number of test runs. By running the tests many times on a stable branch, where every failure must be a false positive, we can estimate this value for both:

The overall test suite
Individual tests

For simplicity, we can ignore the inverse scenario where tests pass randomly despite expected failures, typically indicative of inadequate test coverage, which can be remedied by adding more test cases.
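
As a rough illustration of estimating these ratios, here is a minimal sketch in Python. It assumes you can export the results of repeated runs on an unchanged branch as one dictionary per run; the function name, the data shape, and the test names are all hypothetical.

```python
from collections import Counter


def estimate_flakiness(runs):
    """Estimate false-positive rates from repeated runs on a stable branch.

    `runs` is a list of dicts mapping a test name to True (passed) or
    False (failed). Every failure is assumed to be a false positive,
    because the code under test did not change between runs.
    """
    if not runs:
        raise ValueError("at least one run is required")

    failures = Counter()
    executions = Counter()
    failed_suites = 0

    for run in runs:
        if not all(run.values()):
            failed_suites += 1  # one failing test fails the whole suite
        for test_name, passed in run.items():
            executions[test_name] += 1
            if not passed:
                failures[test_name] += 1

    per_test = {name: failures[name] / executions[name] for name in executions}
    suite = failed_suites / len(runs)
    return per_test, suite


# Hypothetical results of four runs of a two-test suite.
example_runs = [
    {"login": True, "checkout": True},
    {"login": True, "checkout": False},
    {"login": True, "checkout": True},
    {"login": False, "checkout": True},
]
per_test_rates, suite_rate = estimate_flakiness(example_runs)
```

In this made-up sample, each test fails in one run out of four, while the suite fails in two runs out of four. A real estimate would need far more runs to be meaningful.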

The “Paradox”

Generally, a test suite is considered failed if even a single test fails. This leads to an unintuitive relationship between the stability of individual tests and the stability of the overall suite. For instance, if each test fails randomly once in six runs, we can liken a random failure to rolling a one on a die. With just one test, the calculations are straightforward:

5/6 - tests pass as intended (true negative)
1/6 - random failure of the suite (false positive)

When we have two tests, we can liken the scenario to rolling two dice: if either die shows a one, the corresponding test fails, and the suite fails with it. The suite behaves as expected only when both tests pass. Assuming the tests are independent, we multiply the individual pass probabilities, 5/6 × 5/6, to find the likelihood of both tests passing. The final results would be:

25/36, about a 0.69 chance of true negative results
1 - 25/36 = 11/36, or approximately a 0.31 chance of a false positive

As demonstrated, adding a new test significantly increases the likelihood of false positive results.

Probability Insights

The general formula for the false positive rate of the suite is as follows:

Ps = 1 - (1 - Pt) ^ N

Where:

Ps = probability of the suite failing randomly
Pt = probability of one test failing randomly
N = number of tests

With this formula, we can observe that the flakiness of individual tests considerably affects the overall stability of the test suite.
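
To make the dice reasoning concrete, here is a small Python sketch. It assumes, as the dice example does, that all tests are independent and equally flaky; the function name is my own. The suite passes only when every test passes, so the chance of a random suite failure is one minus the chance that all N tests pass.

```python
from fractions import Fraction


def suite_false_positive_rate(pt, n):
    """Probability that at least one of n independent tests fails randomly."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 - (1 - pt) ** n


# The two-dice example: each test fails randomly once in six runs.
two_tests = suite_false_positive_rate(Fraction(1, 6), 2)  # Fraction(11, 36)

# A much more stable test, scaled to a larger suite.
large_suite = suite_false_positive_rate(0.001, 300)  # roughly 0.26
```

Even at one random failure per thousand runs, a suite of 300 such tests would fail randomly about a quarter of the time under these assumptions.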

Causes of Flakiness

Now that we understand the relationship between the number of tests and their stability, let’s explore potential reasons for random test failures.

Lack of Isolation

Depending on your system architecture, achieving complete isolation of tests can be challenging, especially when interfacing with external systems. For example, the application I work on includes the following backends:

A modern server running from a Docker container
A legacy server that remains outside of a container
Data processing scripts that generate static files utilized by the application
Various third-party integrations, some direct and others via a proxy from the modern server

Any non-isolated server accessed during E2E tests can create complications:

If the server is down, tests may fail for reasons unrelated to code changes.
If the server maintains state, running tests concurrently can lead to random failures due to data collisions.

Solution: Isolation and Mocking

To ensure necessary isolation from external systems, consider the following options:

Run more of the infrastructure specifically for each E2E job, which is easy to do with Docker.
Implement dummy proxies on your backend, ensuring these mock implementations are only utilized during tests.
Mock backend requests using your E2E framework.

Option one allows for comprehensive testing of both backend and frontend. Options two and three enable frontend testing without needing to validate the backend, diverging from traditional E2E testing principles but sometimes necessary.
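
Option two could look something like the following Python sketch. Everything here is hypothetical: the geocoding service, the class names, and the `E2E_MOCKS` environment variable are placeholders for illustration, not part of the application described above. The point is only that the dummy is selected by an explicit switch that is set in E2E jobs and nowhere else.

```python
import os


class RealGeocoder:
    """Placeholder for a client that would call a third-party service."""

    def lookup(self, address):
        raise NotImplementedError("would perform a network request")


class DummyGeocoder:
    """Test-only stand-in that returns a fixed, predictable answer."""

    def lookup(self, address):
        return {"address": address, "lat": 0.0, "lon": 0.0}


def make_geocoder(env=None):
    """Return the dummy only when E2E_MOCKS is explicitly set to "1"."""
    env = os.environ if env is None else env
    if env.get("E2E_MOCKS") == "1":
        return DummyGeocoder()
    return RealGeocoder()
```

Defaulting to the real client when the variable is missing is a deliberate choice in this sketch: a misconfigured environment then fails loudly instead of silently serving fake data.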

Data Leakage

Data shared among tests can cause unexpected failures, particularly when tests run in parallel or in arbitrary order and leave residual data behind. Typically, I strive to clean up any data I create. For instance, when testing creation and deletion functionalities, I incorporate them into a single test. For other operations, I aim to revert changes within the test itself.

Dynamic Data Creation in Tests

Previously, I ran multiple instances of a test runner against the same backend and database. Any data sharing between these tests led to random failures. To resolve this, I modified my tests to rely primarily on dynamically created data right before execution. Although this migration was time-consuming, it allowed for parallel testing within a single CI job.
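
One way to keep dynamically created data from colliding is to give every record a unique name. The Python sketch below shows the idea; the helper names and the user fields are hypothetical, and a real test would still have to send this payload to the backend through whatever API the application exposes.

```python
import uuid


def unique_name(prefix):
    """Return a name that is very unlikely to collide across parallel runners."""
    return f"{prefix}-{uuid.uuid4().hex[:12]}"


def build_test_user():
    """Build a fresh user payload for a single test (field names are hypothetical)."""
    username = unique_name("e2e-user")
    return {"username": username, "email": f"{username}@example.test"}


first = build_test_user()
second = build_test_user()
```

Because each call produces a different username, two runners creating users at the same time should not step on each other's data, and a recognizable prefix makes leftover records easy to find and clean up.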

Random Test Issues

A few years back, writing E2E tests posed challenges due to inadequate state tracking by testing tools, necessitating manual waits to ensure the runner didn’t interact with the application while data was still loading. Modern tools, like Cypress, have improved in this regard, but random issues can still arise from the tests themselves. Some examples include:

Timeouts in slower application segments
Inadequate wait conditions, such as:
- Complicated loading logic not handled by default waits
- Failing to wait for manual cleanup at the test's conclusion, resulting in premature reloads during subsequent tests
CI launching a test before the server and database are fully prepared
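
For the last item in the list above, one common remedy is to poll for readiness before starting the test runner. Here is a minimal, generic sketch in Python; the helper name and its defaults are my own, and the readiness check itself (for example, a request to a health endpoint) is left to the caller.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns True when the condition was met and False on timeout.
    Exceptions raised by `condition` count as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if condition():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is still booting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A CI script could call this with a function that checks the server and the database, and abort the job with a clear message when it returns False, so a slow start is reported as an infrastructure problem and not as a test failure.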

Random Application Issues

Most critically, random application failures can occur. This scenario is frustrating for both users and developers, as automated tests may require multiple runs to detect the issue. Such inconsistencies can confuse users, who expect consistent outcomes from identical actions. Consequently, bug reports may lack clarity, complicating troubleshooting efforts.

Addressing E2E instability is essential to identify and rectify these issues before they impact customers. The payoff for this effort is an application that users continue to perceive as reliable.

Improvement Strategies

We have several strategies to enhance the stability of our E2E tests.

Ensuring High Quality

As the probability calculations suggest, even tests that fail only once per thousand runs can make a suite unstable once it grows to hundreds of tests. Fortunately, my practical experience suggests that instability is rarely uniformly distributed. Typically, a small number of unstable tests are responsible for most suite failures. Thus, focusing on resolving the most frequently failing tests can substantially improve overall stability.
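
To find those few tests, it helps to count how often each test appears among recent failures. The Python sketch below assumes you can collect, for every failed suite run, the names of the tests that failed; the function name and the sample history are hypothetical.

```python
from collections import Counter


def rank_flaky_tests(failed_tests_per_run):
    """Rank tests by how many suite runs they failed in.

    `failed_tests_per_run` is a list with one entry per failed suite run;
    each entry lists the names of the tests that failed in that run.
    Returns (test name, failure count, share of failed runs) tuples,
    most frequent first.
    """
    counts = Counter()
    for failed_tests in failed_tests_per_run:
        counts.update(set(failed_tests))  # count each test once per run
    total_runs = len(failed_tests_per_run)
    return [
        (name, count, count / total_runs)
        for name, count in counts.most_common()
    ]


# Hypothetical history of five failed CI runs.
history = [
    ["checkout"],
    ["checkout", "search"],
    ["checkout"],
    ["profile"],
    ["checkout"],
]
ranking = rank_flaky_tests(history)
```

In this made-up history, a single test is involved in four of the five failed runs, so fixing it first would remove most of the noise.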

Segregating Tests in CI

Recently, I transitioned my project from running all E2E tests in one job to executing separate jobs for various E2E-related tasks across different application segments. This change resulted in several benefits:

Parallelization is cleaner since backends and databases are not shared between distinct E2E runners, eliminating data leakage risks.
You can pinpoint which component is failing directly within the CI interface, simplifying the assessment of whether a test failure is valid or spurious.
Rerunning tests becomes straightforward; only jobs that failed need to be retried, not the entire suite.

Cautionary Approaches

In addition to the recommended solutions, a few strategies seem more like temporary fixes.

Automated Reruns

I have consistently opposed automatic test reruns. My primary concern is that this practice encourages developers to overlook nondeterministic test failures, leading to unresolved E2E issues and genuine code problems that may affect users.

Targeted E2E Testing

When developing, I manually select which E2E tests to execute—those most likely to be impacted by my changes. As test suites expand, execution times lengthen, exacerbating stability issues. It can be tempting to implement smarter CI processes to identify which tests to run based on anticipated changes. However, this approach carries several risks:

Code or tests may deteriorate without direct modifications—unforeseen changes in browsers or library updates can affect behavior unexpectedly. Continuous testing helps catch these issues promptly.
Running tests on unchanged code aids in evaluating overall test stability and identifying problematic tests.
Attempting to optimize which tests to run may lead to occasional misjudgments, bypassing quality control.

Quick Failures

If a test suite fails upon the first test failure, is it necessary to continue testing after that point? Faster failure could allow for quicker reruns and conserve CI resources. However, I prefer not to halt E2E testing after a single failure for several reasons:

When troubleshooting tests, having the complete context is crucial—especially to determine whether one or many tests require fixing.
I want to understand overall failure rates and interdependencies. Early failures might obscure instability in tests executed later in the suite.
I once encountered a situation where the database initialized in one of two states, leading to one set of tests failing under one condition and another set under a different condition. Due to early failures, I failed to recognize the underlying connection, leading to prolonged investigations into seemingly unrelated issues.

Originally published at https://how-to.dev.
