Legacy apps automated

Legacy test automation with AI refers to testing applications that expose no accessible element IDs by means of a visual model. An AI tool analyzes screenshots and finds elements from natural-language descriptions, without anyone having to cut out reference images. In the Deutsche Bahn project described here, this reduced a manual test cycle of several weeks to a few hours.

Key Takeaways

Manual regression testing of Deutsche Bahn’s mobile checkout app took two to three people over two weeks. The 60 test cases automated so far with AskUI run in around three hours.
AskUI recognizes UI elements purely visually via screenshot and AI model without accessing element IDs, which makes the tool usable for applications that classic frameworks such as Appium or Selenium cannot automate.
If you don’t anchor element IDs in the code, you block later test automation. This is why Deutsche Bahn now makes automatability a requirement for service providers on all new or purchased applications.
The biggest open problem with AI-based UI testing is the speed of execution: each step requires an inference call against a model, which makes the suite significantly slower than conventional frameworks.

Why legacy software fails in test automation

Legacy applications often cannot be controlled with classic automation tools because they lack the technical anchors. Selenium, Appium or TestComplete address elements through IDs that the application exposes. If these IDs are missing, the entire approach fails.

This is exactly what happened with Deutsche Bahn’s mobile checkout. Staff on long-distance trains use this application to sell coffee, beer or other items and to book card payments. The software is Android-based, runs on .NET 6 and was purchased, not developed in-house. Nobody thought about automation at the time.

The result: no clearly addressable elements and no standard tool that works. Several proofs of concept with open-source tools failed simply because the application did not provide any element IDs.

Manual testing costs two to three weeks

Manual regression testing of the mobile checkout takes two to three people over two weeks. That is the pain point from which the project started.

The problem worsens with every defect found. If bugs appear after those two weeks and a new version is released, the regression has to be run again. This cycle is hardly sustainable in practice if releases are to follow each other in quick succession.

Deutsche Bahn now starts earlier with newly purchased or self-developed software. Service providers and teams must ensure quality at every level, from unit testing to interface testing and beyond, before acceptance testing begins. This foundation does not exist for legacy applications such as the mobile checkout, hence the problem.

How visual AI selectors drive legacy interfaces

AskUI’s approach dispenses with element IDs entirely and works from screenshots instead. A controller, the so-called AgentOS, is connected to an inference service that bundles several models: OCR, image recognition, an LLM and multimodal models.

The tool takes a screenshot of the operating system at runtime and finds the control elements on it. There is no element matching and no click recording as with older tools.

The difference from earlier image-based approaches is important. Old tools cut out individual element images and compared them pixel by pixel. As soon as something changed on the screen, the entire test flow broke. AskUI does not cut out any elements.

Instead, the model has learned in advance what a login button looks like. You only describe the action, not the appearance. Jonas Menesklou puts it in a nutshell:

“The way ChatGPT understands text, we understand images.”
Jonas Menesklou
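
To illustrate why the older pixel-by-pixel approach was so brittle, here is a minimal sketch in plain Python. It is not taken from any specific tool: the tiny grayscale "screens", the `find_exact` function and the `make_screen` helper are all made up for illustration. The cut-out button is found on the old release, but a shade change of a single gray level on the new release is enough to make the match fail.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels, 0-255 (illustrative toy format)


def find_exact(screen: Image, template: Image) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the first pixel-exact match, or None."""
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            # Every row of the template must equal the screen slice exactly.
            if all(screen[r + i][c:c + tw] == template[i] for i in range(th)):
                return (r, c)
    return None


def make_screen(button_value: int) -> Image:
    """Hypothetical 4x6 screen with a 2x2 button at rows 1-2, columns 3-4."""
    screen = [[0] * 6 for _ in range(4)]
    for r in (1, 2):
        for c in (3, 4):
            screen[r][c] = button_value
    return screen


# Reference image cut out of an old screenshot.
button = [[200, 200],
          [200, 200]]

old_release = make_screen(200)
new_release = make_screen(201)  # same button, marginally different shade

print(find_exact(old_release, button))  # (1, 3)
print(find_exact(new_release, button))  # None
```

A description-based approach avoids this failure mode because the test never stores the reference pixels in the first place.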

From test case to execution: two paths

There are two ways to build test cases, depending on the technical requirements. Deutsche Bahn uses the code path via a framework with TypeScript and Node.js for the mobile checkout.

In the code, the AskUI selector is just another selector in the library. Instead of an element ID, you write a natural-language description: click on the green login button in the top left corner. This description becomes the selector at runtime. The library can be integrated into pytest, TypeScript test runners and other runners.
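
The idea of a description acting as the selector can be sketched as follows. This is explicitly not the AskUI API: `VisualLocator`, `StubLocator`, `Device` and `click` are hypothetical names, and the stub simply looks up coordinates in a table where a real setup would send the screenshot and the description to a vision model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple

Point = Tuple[int, int]


class VisualLocator(Protocol):
    """Anything that can turn a screenshot plus a description into coordinates."""

    def locate(self, screenshot: bytes, description: str) -> Point: ...


@dataclass
class StubLocator:
    """Stand-in for the vision model: a fixed description-to-coordinates table."""

    known: Dict[str, Point]

    def locate(self, screenshot: bytes, description: str) -> Point:
        try:
            return self.known[description]
        except KeyError:
            raise LookupError(f"no element matches: {description!r}") from None


@dataclass
class Device:
    """Hypothetical device under test that records taps instead of performing them."""

    taps: List[Point] = field(default_factory=list)

    def screenshot(self) -> bytes:
        return b"fake-screenshot"

    def tap(self, point: Point) -> None:
        self.taps.append(point)


def click(device: Device, locator: VisualLocator, description: str) -> Point:
    """Take a screenshot, resolve the description to a point, then tap it."""
    point = locator.locate(device.screenshot(), description)
    device.tap(point)
    return point


device = Device()
locator = StubLocator({"green login button in the top left corner": (40, 32)})
click(device, locator, "green login button in the top left corner")
print(device.taps)  # [(40, 32)]
```

The test code only ever contains the description. Where the element actually sits on the screen is resolved anew on every run.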

For less technical users, there is a no-code option via a CSV. If the test cases are in a test management tool with a test case ID and natural language step-by-step description, an LLM reads this description and converts each step into an action. You upload the file and start the suite.
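
A rough sketch of what such a CSV-driven flow could look like is shown below. The column names, the sample rows and the `Action` type are assumptions for illustration, and `interpret_step` is a deliberately simple rule-based stand-in for the LLM that would interpret free-form steps in the real setup.

```python
import csv
import io
import re
from typing import Dict, List, NamedTuple


class Action(NamedTuple):
    kind: str
    target: str
    value: str = ""


def interpret_step(step: str) -> Action:
    """Rule-based stand-in for the LLM that maps a step description to an action."""
    text = step.strip()
    match = re.fullmatch(r'type "(.+)" into (.+)', text, flags=re.IGNORECASE)
    if match:
        return Action("type", match.group(2), match.group(1))
    match = re.fullmatch(r"click (?:on )?(.+)", text, flags=re.IGNORECASE)
    if match:
        return Action("click", match.group(1))
    raise ValueError(f"cannot interpret step: {step!r}")


def load_suite(csv_text: str) -> Dict[str, List[Action]]:
    """Group the interpreted steps by test case ID, keeping file order."""
    suite: Dict[str, List[Action]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        suite.setdefault(row["test_case_id"], []).append(
            interpret_step(row["description"])
        )
    return suite


# Hypothetical export from a test management tool.
CSV_EXPORT = '''test_case_id,step,description
TC-001,1,Click on the green login button
TC-001,2,"Type ""1234"" into the PIN field"
TC-002,1,Click the coffee tile
'''

suite = load_suite(CSV_EXPORT)
for case_id, actions in suite.items():
    print(case_id, actions)
```

Each resulting action would then be handed to the visual selector, with `target` serving as the description of the element.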

The selector works like a virtual tester with an understanding of interfaces. You describe the element as you would explain it to a person seeing the screen for the first time.

Maintenance via training instead of code conversion

If the interface changes, the adaptation effort is minimal because only a small part of the code has to be touched. The tool automatically picks up the new screens and makes some of the changes itself. The test logic is adapted only for the case that changed.

There is a separate training tool for detection errors. If, for example, an element appears blurred in a screenshot and the model does not recognize “registration”, you can teach the tool exactly this mapping. This allows you to manually extend the underlying model when automatic recognition fails.

This ability to refine the model yourself sets the approach apart from pure black box tools. You are not dependent on the manufacturer covering every special case.

What figures from practice show

The mobile checkout comprises 270 longer test cases at acceptance level, 60 of which are currently automated. These 60 cases run automatically in around three hours. Manually, a tester needs at least eight hours for around 50 cases.

The goal is much more ambitious. If 210 to 220 of the 270 test cases are automated and run in parallel across several devices, the entire suite should run through in one to one and a half hours. This would save around half of the current resources.
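
The figures above allow a back-of-the-envelope check of that target. The sketch below assumes that the average runtime per automated case stays at the current level and that cases distribute evenly across devices. Both are simplifications, and the midpoint of 215 cases is chosen only for illustration.

```python
import math

# Figures from the text.
automated_cases = 60
automated_runtime_min = 3 * 60   # around three hours
manual_cases = 50
manual_runtime_min = 8 * 60      # at least eight hours, so a lower bound

# Assumption: midpoint of the 210 to 220 cases targeted for automation.
target_cases = 215

auto_min_per_case = automated_runtime_min / automated_cases   # 3.0
manual_min_per_case = manual_runtime_min / manual_cases       # 9.6
sequential_min = target_cases * auto_min_per_case             # 645.0


def devices_needed(total_min: float, target_min: float) -> int:
    """Devices required if the cases split evenly and run in parallel."""
    return math.ceil(total_min / target_min)


print(auto_min_per_case, manual_min_per_case)  # 3.0 9.6
print(sequential_min / 60)                     # 10.75 hours on one device
print(devices_needed(sequential_min, 90))      # 8 devices for 1.5 hours
print(devices_needed(sequential_min, 60))      # 11 devices for 1 hour
```

Under these assumptions, a single device would need close to eleven hours for the full automated suite, so the target of one to one and a half hours depends on running roughly eight to eleven devices in parallel, on faster steps, or on both.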

The mobile checkout cannot be fully automated. It uses peripheral devices: a printer for the receipt and a device for reading the credit card. These peripheral tests are not yet covered, but they make up the smaller part.

Railway-specific devices with their own certificates are an additional hurdle. The app does not run on arbitrary hardware, and testing against emulators also fails because of these certificates.

Speed remains the open issue

The biggest weakness of the AI-supported approach is performance. Because each step processes a screenshot and visually finds a selector, execution is noticeably slower than with a classic tool. Umar Usman Khan puts it clearly: if the application could have been automated with Appium, he would have used Appium, even though AskUI requires less code.

The team is actively working on this. The models run on Deutsche Bahn’s own inference infrastructure instead of in the cloud, which provides data control but costs communication time between the server and the tool. A newly built caching system keeps commands locally instead of sending every command to the server.
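
How a local cache can save round trips to the inference server can be sketched like this. The actual design of the caching system is not described here, so everything below is an assumption: `CachingLocator` is a hypothetical wrapper that keys results on a hash of the screenshot plus the description and only calls the remote function on a cache miss.

```python
import hashlib
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]


class CachingLocator:
    """Hypothetical wrapper that answers repeated lookups from a local cache."""

    def __init__(self, remote_locate: Callable[[bytes, str], Point]) -> None:
        self._remote_locate = remote_locate
        self._cache: Dict[str, Point] = {}
        self.remote_calls = 0

    @staticmethod
    def _key(screenshot: bytes, description: str) -> str:
        # Identical screenshot bytes plus identical description give the same key.
        return f"{hashlib.sha256(screenshot).hexdigest()}:{description}"

    def locate(self, screenshot: bytes, description: str) -> Point:
        key = self._key(screenshot, description)
        if key not in self._cache:
            # Cache miss: this is the expensive request to the inference server.
            self.remote_calls += 1
            self._cache[key] = self._remote_locate(screenshot, description)
        return self._cache[key]


def fake_remote(screenshot: bytes, description: str) -> Point:
    """Stand-in for the inference server."""
    return (10, 20)


locator = CachingLocator(fake_remote)
locator.locate(b"screen-A", "login button")
locator.locate(b"screen-A", "login button")   # served locally
locator.locate(b"screen-B", "login button")   # new screen, new remote call
print(locator.remote_calls)  # 2
```

An exact hash only helps when the screenshot is byte-identical, which is rare on a live device. A practical cache would need a more tolerant key, and that trade-off is beyond this sketch.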

Other levers include compression, smaller models and as few inference calls as possible per run. The overall direction: process as much as possible locally and reduce request-response traffic.

Why the partnership with a startup works here

Close cooperation with a young provider pays off for both sides. Missing features or problems from operations flow back directly, and the provider delivers fixes and new functions faster than would be usual with an established tool.

For something like this to work, it needs backing within the company. Proofs of concept require investment, and someone has to champion that investment internally. This was exactly the case here, with support from team leaders and department heads.

For you as a tester, this means that a new, as yet unfinished tool can be the right way to go if the established tooling simply fails in the application. The leap from two to three weeks of manual effort to a few hours justifies actively helping to shape a growing product instead of waiting for the perfect off-the-shelf solution.