Glossary

Agentic Testing: What It Is and Why It Replaces Self-Healing

Definition

Agentic testing is a testing approach where an AI agent autonomously generates, executes, and maintains end-to-end tests by interacting with an application the way a human would: visually, through rendered UI, using intent rather than code-level selectors. This is an architectural distinction, not a marketing label. Traditional test automation (Playwright, Cypress, Selenium) works by locating DOM elements via CSS selectors, XPath expressions, or data-testid attributes, then performing actions on those elements programmatically. Self-healing tools (Mabl, Testim, Healenium) sit one layer above: they still capture selectors, but use ML models to guess replacement selectors when the originals break. Agentic testing removes selectors from the equation entirely.

The clearest way to understand the difference is the intent-based vs selector-based spectrum. On one end, a Playwright test says: click the element matching 'button.primary-cta[data-testid="submit-order"]'. If a designer renames that class or a developer removes the data-testid, the test breaks. A self-healing tool might recover by finding a nearby button with similar attributes, but it is still searching the DOM for selector matches. An agentic test says: click the Submit Order button. The agent renders the page, identifies the button visually (by its label, position, and context), and clicks it. No selector is ever created, stored, or repaired.
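
To make the contrast concrete, here is a minimal sketch. The first block is a standard Playwright test pinned to a selector; the comment beneath it restates the same flow as intent. The URL and selectors are illustrative, and the intent steps are written as a generic plain-English spec rather than any specific tool's syntax.

```typescript
import { test, expect } from '@playwright/test';

test('submit an order (selector-based)', async ({ page }) => {
  await page.goto('https://example.com/checkout'); // illustrative URL
  // Breaks if the class or data-testid changes, even though the button still exists.
  await page.locator('button.primary-cta[data-testid="submit-order"]').click();
  await expect(page.locator('[data-testid="order-confirmation"]')).toBeVisible();
});

// The same flow expressed as intent (generic plain-English spec, no selectors):
//   1. Go to the checkout page
//   2. Click the "Submit Order" button
//   3. Check that the order confirmation appears
```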

This matters because the failure mode is fundamentally different. Selector-based tools fail when the DOM changes. Self-healing tools fail silently when they guess wrong, clicking the wrong element and reporting a pass. Agentic tools fail when they cannot confidently identify the intended element, which maps directly to what would confuse a real user. A button that moved from the header to a sidebar will still be found by an agent that reads the page. A button whose label changed from "Submit" to "Place Order" will trigger a confidence check, not a silent guess.

Why it matters

The testing industry spent a decade building and marketing self-healing as the answer to test maintenance. The pitch was simple: selectors break, our ML fixes them. But the track record has created a trust crisis. A 2023 Tricentis survey found that 46% of developers distrust AI-driven testing accuracy. That is not skepticism about AI in general; it is a specific response to tools that silently modify test behavior and report green when they should report red.

The core economic argument for agentic testing is maintenance cost. The World Quality Report consistently finds that teams spend 60-70% of their test automation budgets on maintenance, not new test creation. Most of that maintenance is selector repair: updating CSS paths after a UI refactor, adding new data-testid attributes when component libraries change, rewriting XPath expressions when the DOM restructures. Selenium users report 80% of their effort goes to maintenance and only 20% to writing new tests. This is a structural tax on engineering velocity that compounds as suites grow.

Agentic testing eliminates the primary maintenance driver by never creating selectors. But the shift is bigger than maintenance savings. It changes who can write and maintain tests. Selector-based tests require someone who can read and write CSS selectors, understands DOM structure, and can debug locator failures. Intent-based tests describe what a user does: go to the pricing page, click Upgrade, fill in the credit card form, confirm the payment. This means product managers can review test specs, designers can validate that flows match their intent, and junior engineers can contribute coverage on day one.

Moving from selector management to intent-based interaction also changes how teams think about coverage. When writing a test takes 5 minutes instead of 45, and maintaining it costs near zero instead of hours per sprint, teams test more flows. The bottleneck moves from "can we afford to test this" to "what should we test next."

How teams handle it today

The E2E testing landscape in 2025 spans four distinct tiers, each with real tradeoffs.

Manual selector-based tools (Playwright, Cypress, Selenium) remain the default choice for most engineering teams. Playwright, in particular, has earned strong adoption with features like auto-waiting, trace viewers, and codegen. These tools give you full control and zero magic. The tradeoff: every test is a hand-coded artifact that must be manually updated when the UI changes. Teams with 200+ E2E tests routinely report that a single UI refactor can break 30-50 tests at once, each requiring individual selector repair.

Self-healing via ML locators (Mabl, Testim, Healenium) attempts to solve selector brittleness by training models to find replacement selectors when originals break. Mabl's approach records browser interactions and uses multiple locator strategies (CSS, XPath, visual) with fallback logic. Testim builds a weighted selector model that adapts over time. These tools do reduce maintenance for simple UI changes like class name renames. The failure mode is false confidence: when the tool "heals" to the wrong element, the test passes but validates the wrong thing. This is the "heals away real bugs" problem that surfaces repeatedly in developer discussions.

Intent-based and vision-based tools (testRigor, Momentic, Spur) represent the newer agentic category. testRigor uses plain-English specs and interacts with the app through a combination of visual and accessibility-tree analysis. Momentic uses a vision model to identify elements on rendered pages. Spur takes a similar screen-based approach. These tools remove selectors entirely, which solves the maintenance problem. The common concern is transparency: when the agent makes a decision, engineers often cannot see why. Momentic has started adding execution traces, but the industry standard is still closer to "trust the AI" than "verify the AI."

Managed testing services (QA Wolf) take a different approach entirely: human QA engineers, augmented by AI tooling, write and maintain your tests as a service. You get coverage without building a team. The tradeoff is cost (starting around $3,000/month) and dependency on an external team's velocity. For startups moving fast, the latency of communicating test changes to an external team can be a bottleneck.

How Zerocheck approaches it

Zerocheck is an agentic testing tool built around a specific thesis: autonomy without transparency is worse than no automation at all. The agent interacts with your application visually, reads rendered pages, and executes tests written in plain English. No selectors, no DOM dependencies, no data-testid attributes required.

What differentiates Zerocheck from other agentic tools is the confidence scoring and "fails closed" design. Every interaction the agent takes includes a confidence score. When the agent clicks a button, it reports how confident it is that it found the right element (e.g., 0.97 confidence for a clearly labeled "Submit Order" button, 0.72 for an ambiguous icon-only button in a redesigned nav). When confidence drops below a configurable threshold, the test fails instead of guessing. This is the opposite of the self-healing pattern, where low confidence triggers a guess and a green checkmark.
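
This glossary does not show Zerocheck's actual spec format, so the snippet below is only a hedged sketch of how a confidence threshold might be expressed; the 'zerocheck' module, defineTest, and its options are hypothetical names used for illustration.

```typescript
// Hypothetical sketch: the import and API below are illustrative, not the real SDK.
import { defineTest } from 'zerocheck';

export default defineTest({
  name: 'checkout happy path',
  // Below this confidence, the agent fails the step instead of guessing.
  confidenceThreshold: 0.85,
  steps: [
    'Go to the pricing page',
    'Click the Upgrade button',
    'Fill in the credit card form with a test card',
    'Confirm the payment and check that a receipt is shown',
  ],
});
```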

Failing closed means Zerocheck will never silently pass a test it is uncertain about. If your designer swaps a labeled button for an icon-only design and the agent cannot confidently match it to the original intent, the test flags it for human review. This catches real regressions that self-healing tools would paper over. It also means the first few runs after a major redesign might require human review of flagged steps, but that review takes minutes and prevents the false-confidence problem that erodes trust in AI testing.

Every agent decision is logged with a visual trace: what the agent saw, what it identified, what it clicked, and why. Engineers can review exactly how the agent interpreted each step, making the AI auditable rather than opaque. For teams with SOC 2 or compliance requirements, these traces serve as evidence artifacts attached directly to CI runs.
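
The exact trace format is not documented here, so the following is a hedged sketch of the kind of record an auditable agent decision could produce; every field name is an assumption for illustration.

```typescript
// Illustrative only: field names are assumptions, not Zerocheck's actual trace schema.
interface AgentStepTrace {
  step: string;               // the plain-English instruction, e.g. "Click the Submit Order button"
  screenshot: string;         // reference to what the agent saw before acting
  candidates: Array<{
    description: string;      // e.g. "button labeled 'Submit Order' in the checkout footer"
    confidence: number;       // 0..1
  }>;
  chosen: string;             // description of the candidate the agent acted on
  confidence: number;         // confidence of the chosen action
  outcome: 'passed' | 'flagged' | 'failed';
}
```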

Agentic testing vs self-healing: the architecture difference

The terms "agentic testing" and "self-healing testing" are sometimes used interchangeably in marketing copy, but they describe very different architectures. Understanding the distinction matters because it determines the failure modes you will encounter.

Self-healing operates as a repair layer on top of selector-based automation. The underlying test still defines interactions in terms of DOM elements: click the element at #checkout-form > button.submit, fill the input at [data-testid='email-field']. The self-healing layer monitors these selectors and, when one fails to match, searches for a replacement using heuristics: nearby elements with similar attributes, elements with matching text content, elements in similar DOM positions. Tools like Healenium maintain a database of selector histories and score the similarity between old and new DOM snapshots to pick a replacement. Mabl uses a multi-locator strategy, trying CSS, XPath, visual position, and accessibility attributes in sequence.
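
Stripped down, the fallback pattern these tools implement looks roughly like the sketch below. This is illustrative logic, not any vendor's actual code; real implementations add similarity scoring, selector history, and visual matching.

```typescript
import { Page, Locator } from '@playwright/test';

// Illustrative multi-locator fallback: try each strategy in order and return the
// first that matches anything. Note the fail-open shape: it proceeds with the
// first match even if that match is the wrong element.
async function healedLocator(page: Page, strategies: string[]): Promise<Locator | null> {
  for (const selector of strategies) {
    const candidate = page.locator(selector);
    if (await candidate.count() > 0) {
      return candidate.first();
    }
  }
  return null; // only gives up when every fallback misses
}

// e.g. healedLocator(page, [
//   '#checkout-form > button.submit',
//   'button:has-text("Submit")',
//   'form button',
// ]);
```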

This architecture has a ceiling. It handles simple changes well: a class rename from .btn-primary to .cta-button, an added wrapper div that shifts element depth. It handles complex changes poorly: a form that gets split into a multi-step wizard, a navigation that moves from a sidebar to a top bar, a checkout flow that replaces inline fields with a third-party iframe. In these cases, the self-healing model either fails to find a match (causing a test failure, which is at least honest) or matches the wrong element (causing a false pass, which is dangerous).

Agentic testing has no selector layer to repair. The agent receives an instruction like "fill in the email field with test@example.com" and interprets the current page to find the email field. It uses the same signals a human would: field labels, placeholder text, position relative to other elements, and visual context. If the email field moves from a single-page form to the second step of a wizard, the agent navigates to that step and fills it in, because it is following the intent, not a DOM path.

The practical difference shows up most clearly during UI refactors. A team that redesigns their checkout page with a self-healing tool will see a mix of healed tests (some healed correctly, some not) and broken tests. They will need to manually verify every healed test to confirm it is still testing the right thing. A team using an agentic tool will see most tests pass normally (the agent adapts to the new layout) with a few flagged for review where the agent's confidence was low. The review burden is proportional to the actual ambiguity in the redesign, not to the number of selectors that changed.

The trust problem in AI testing

The 46% developer distrust statistic from Tricentis is not surprising when you look at how self-healing has been deployed. The core complaint, which shows up consistently in developer forums and team retrospectives, is: "How do I know the AI didn't heal away a real bug?"

Here is the concrete scenario. A developer changes a form validation rule, accidentally breaking the error message display. The submit button still exists, the form still submits, but error messages no longer appear. A selector-based test that checks for the error message element would catch this, assuming the selector still matches. A self-healing test that was checking for the error message might "heal" by finding a different text element on the page and reporting a pass. The real bug ships to production.
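
For reference, the selector-based check that would catch this regression is ordinary Playwright; the URL, labels, and error-message selector are illustrative.

```typescript
import { test, expect } from '@playwright/test';

test('shows a validation error for an invalid email', async ({ page }) => {
  await page.goto('https://example.com/signup'); // illustrative URL
  await page.getByLabel('Email').fill('not-an-email');
  await page.getByRole('button', { name: 'Submit' }).click();
  // Fails if the error message stops rendering, which is exactly the regression
  // described above. A self-healing layer that "heals" this locator onto some
  // other text element would report a pass and mask the bug.
  await expect(page.locator('.form-error')).toContainText('valid email');
});
```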

This is not a theoretical concern. Teams that adopt self-healing tools without strong review processes report discovering production bugs that their "passing" test suites should have caught. The problem compounds because each false pass reduces the team's trust in the suite, which leads to less attention paid to test results, which leads to more bugs escaping.

Transparency is the fix, not better AI models. An agent can be wrong, but an agent that shows its work lets engineers catch mistakes. This means three things in practice. First, every agent decision should be logged with enough context to understand why: what the agent saw on the page, which elements it considered, which one it chose, and its confidence level. Second, low-confidence decisions should be surfaced proactively, not buried in logs. If the agent is 73% confident it found the right element, that should be a yellow flag in the test report, not a silent pass. Third, engineers need the ability to pin specific interactions when they want deterministic behavior. If a test step is critical (like confirming a payment amount), the engineer should be able to say "this field must contain exactly $49.99" rather than relying on the agent's interpretation.
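
As a hedged sketch of the third point, a pinned step might look like this; the syntax and the pin helper are hypothetical, continuing the illustrative spec format from the earlier example.

```typescript
// Hypothetical syntax: 'zerocheck', defineTest, and pin are illustrative names.
import { defineTest, pin } from 'zerocheck';

export default defineTest({
  name: 'upgrade to the paid plan',
  steps: [
    'Go to the pricing page and click Upgrade',
    'Fill in the credit card form with a test card',
    // A pinned step: the exact value is asserted literally, with no agent interpretation.
    pin('The order total must read exactly "$49.99"'),
    'Confirm the payment',
  ],
});
```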

The tools teams actually adopt won't be the ones with the most sophisticated AI. They'll be the ones that make the AI's reasoning visible and give engineers control over the confidence threshold. This is why the "fails closed" pattern matters more than model accuracy: a tool that admits uncertainty is more trustworthy than a tool that always claims confidence.

When agentic testing fails closed

"Fails closed" is a concept borrowed from security engineering. A firewall that fails closed blocks all traffic when it encounters an error, rather than allowing all traffic through. Applied to testing, failing closed means: when the agent is not confident in its action, the test reports a failure rather than guessing and reporting a pass.

This is the opposite of how most self-healing tools operate. Self-healing tools fail open: when the primary selector breaks, they try alternatives and, if any alternative matches with reasonable similarity, they proceed and report success. The assumption is that healing is usually correct, so it is better to keep tests green than to generate false failures. In practice, this optimizes for CI dashboard aesthetics over actual test reliability.

Failing closed has a concrete workflow. When an agentic test encounters a step where confidence drops below threshold (say, below 0.85), several things happen. The test step is marked as uncertain. The test run reports a warning or failure, depending on team configuration. The visual trace shows exactly what the agent saw and why it was uncertain: maybe the button label changed, maybe two elements matched the description, maybe the element was partially obscured. An engineer reviews the flagged step, confirms or corrects the agent's interpretation, and the test resumes with updated understanding.
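
Reduced to its essentials, the fail-closed decision is a small piece of logic. The sketch below follows the workflow just described (a threshold of 0.85, with warning versus failure as team configuration) and is illustrative rather than any tool's implementation.

```typescript
type StepResult = { action: string; confidence: number };
type Verdict = 'pass' | 'flag-for-review' | 'fail';

// Fail-closed: anything below the threshold is surfaced to a human instead of guessed.
function judgeStep(result: StepResult, threshold = 0.85, strict = false): Verdict {
  if (result.confidence >= threshold) return 'pass';
  return strict ? 'fail' : 'flag-for-review';
}

// A fail-open tool would instead return 'pass' whenever any fallback matched,
// however weak the match.
```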

The objection to failing closed is: "Won't this create too many false failures after a big UI change?" In practice, the answer depends on the confidence threshold. A threshold of 0.85 will flag genuinely ambiguous situations (icon-only buttons, duplicate labels, significantly restructured layouts) while passing straightforward adaptations (moved elements, renamed classes, added containers). Teams typically see 5-15% of test steps flagged after a major redesign, with most requiring 30 seconds of review to confirm.

The alternative, failing open, has a hidden cost that is harder to measure: steady erosion of trust. When a tool silently heals and reports green, engineers stop looking at test results carefully. When they stop looking carefully, they miss the one time the healing was wrong. Failing closed keeps engineers in the loop for ambiguous situations while handling the clear cases automatically. The goal is not zero human involvement. The goal is human involvement only where it adds value.

Where the industry is heading

The term "agentic testing" barely registered on Google Trends twelve months ago. As of early 2025, it has measurable search volume and shows up in 10 out of 10 Google Autocomplete suggestions for related queries. This is not organic developer curiosity alone. Enterprise vendors are actively investing in the category.

UiPath, which built its business on robotic process automation (RPA), has been expanding into AI-driven test automation with agentic capabilities. Their pitch connects process automation agents to test automation agents, a natural extension of their platform. Mabl has been repositioning from "self-healing" to "intelligent test automation" with agent-based features. Perfecto (now part of Perforce) is integrating agentic capabilities into their mobile and web testing platform. Tricentis, the largest pure-play testing vendor, has been making acquisitions in the AI testing space.

Gartner's 2024 Hype Cycle for Software Testing placed AI-augmented testing in the "Slope of Enlightenment" phase, meaning the initial hype has faded and practical, production-viable implementations are emerging. Their recommendation: teams should evaluate agentic tools for new projects starting in 2025, rather than waiting for the technology to fully mature.

The developer adoption pattern is following a predictable path. Early adopters (2023-2024) were startups with small teams, no existing test infrastructure, and a willingness to try new tools. Early majority (2025-2026) will be mid-market companies with painful selector maintenance burdens and enough test suite complexity to justify switching. Late majority adoption will follow once agentic tools demonstrate reliability at enterprise scale, with SOC 2 evidence trails, audit logs, and integration with existing CI/CD platforms like Jenkins, CircleCI, and GitHub Actions.

The interesting competitive dynamic is that selector-based tools are not going away. Playwright, in particular, continues to improve and has a loyal community. The likely equilibrium is that Playwright becomes the "control" option for teams that want full determinism, while agentic tools handle the 80% of tests where maintenance cost outweighs the value of explicit selector control. Teams will use both, the same way they use unit tests and integration tests for different purposes.

Related terms