Test automation of mobile apps

Test automation for mobile apps succeeds if it pursues two clear goals: non-technical testers can write test cases themselves, and device diversity is systematically covered. A keyword-driven approach with Robot Framework makes both possible. It is also crucial to build a stable infrastructure before chasing quantity.

Key Takeaways

Test automation that creates more effort than it saves and cannot be trusted is better stopped completely than continued.
Before the second attempt, Felix Doppel and his team drew up a requirements catalog with weighted criteria and required a proof of concept with at least three test cases before commissioning a service provider.
Keyword-driven testing with Robot Framework allows manual testers at HUK-Coburg to write and execute test cases themselves without any development knowledge.
Device diversity on Android cannot be solved by more testers, but only by a physical mobile device cloud with prioritized device packages based on real user data.
Isolated test automation teams fail because the domain knowledge of the manual testers and the technical knowledge of the automation experts only produce working test cases when combined.

Why the first test automation failed

Test automation that generates more effort than it saves has failed to achieve its purpose. This is exactly what happened when testing the mobile telematics app of HUK-Coburg. The team started automation shortly after the start of development in 2018 without thinking about the tool, approach and goals beforehand.

The result was green test runs that nobody trusted. While the automation ran through completely, the manual regression testing found 20 to 25 critical errors. So the tests checked something, but not what was important.

As a result, both promises of automation were broken. It was supposed to relieve manual QA and provide confidence. Instead, it tied up additional capacity and did not provide a reliable signal. Felix Doppel, tester at HUK-Coburg, attributes part of this to the team's level of maturity: it started too early, before it had found its roles and tasks.

Pulling the plug is sometimes the more honest step

Keeping a failed automation system running just because a lot of money has already gone into it is an expensive reflex. The team decided otherwise and stopped the automation completely in mid-2022. Four years of work were deliberately ended.

The trigger was a sober cost-benefit calculation. In an agile project, the point is not to carry automation “on the banner”, but for it to deliver value. As the manual testers were strong, the team was simply better off without automation.

That wasn’t easy. As expected, management asked about the money already invested. The discussions were tough and it was painful to admit that the investment had yielded nothing at that point.

The hard cut was followed by a deliberate break of around four to five months without any automation activity. The team used this time to come to terms with the failure instead of immediately starting the next attempt.

First goals, then tools: the second attempt

The second attempt did not start with a tool, but with a catalog of requirements. This contradicts the agile reflex of trying things out quickly, but was necessary given the scope and cost of automation. It was precisely this preliminary work that was missing from the first attempt.

In internal workshops, the team asked the various roles directly: what does a product manager need, what does a test manager need, what do manual testers need, what does development need? This surfaced some competing objectives that had to be resolved and redefined. In one workshop, several groups even designed mock solutions from scratch, as if on a greenfield site.

In the end, there were two clear goals. Firstly, a non-technical user should be able to specify and execute test cases themselves so that manual QA can actively participate. Secondly, the range of devices covered by the tests should grow in order to ensure quality.

The target level for the degree of automation was deliberately low. Instead of striving for as much as possible, the team aimed for 60 to 70 percent, but stable and on around ten different devices.
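A requirements catalog with weighted criteria, as mentioned in the key takeaways, can be reduced to a simple weighted score per candidate. The following is only a minimal sketch: the criteria, weights, rating scale and both candidates are hypothetical and not taken from the team's actual catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int  # hypothetical scale: 1 (nice to have) to 5 (must have)


def weighted_score(criteria, ratings):
    """Return the weighted average rating of one candidate.

    `ratings` maps criterion name to a rating from 0 to 10.
    A missing rating raises KeyError so gaps are not silently scored as zero.
    """
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * ratings[c.name] for c in criteria) / total_weight


# Hypothetical criteria, loosely modelled on the two goals named above.
CRITERIA = [
    Criterion("usable by non-technical testers", 5),
    Criterion("runs on real Android and iOS devices", 5),
    Criterion("readable failure reports", 3),
]

# Hypothetical ratings for two made-up candidates.
candidate_a = {
    "usable by non-technical testers": 8,
    "runs on real Android and iOS devices": 9,
    "readable failure reports": 6,
}
candidate_b = {
    "usable by non-technical testers": 4,
    "runs on real Android and iOS devices": 9,
    "readable failure reports": 9,
}

score_a = weighted_score(CRITERIA, candidate_a)
score_b = weighted_score(CRITERIA, candidate_b)
print(f"A: {score_a:.2f}, B: {score_b:.2f}")
```

In this made-up comparison, the candidate that serves non-technical testers well ranks higher even though its reporting is rated lower, because that criterion carries more weight.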

Why device diversity becomes a risk with an insurance app

With the telematics app, every functional error affects an insurance product, which makes device diversity a real risk. The app uses a sensor to record driving behavior and evaluates speed, acceleration, braking and steering. Those who drive safely save on premiums.

If the driving recording does not work, users are notified immediately. If the recording does not work for around three weeks, this escalates via customers and support to the department heads.

iOS is still easy to cover: a new model every year, few display sizes, usually the latest operating system. Android is a different story. There is a three-digit number of combinations of devices, device types and operating system versions. That many combinations cannot be tested manually, no matter how good the testers are.

Why keyword-driven was a better fit than behavior-driven

Which automation approach fits is decided by the team, not the textbook. The first attempt used Cucumber with Gherkin syntax, i.e. a behavior-driven approach. The second relies on Robot Framework and a keyword-driven approach.

The team did not know in advance that the keyword-driven approach would be more practicable for their people. This insight could only be gained through trial and error.

A common mistake: behavior-driven development is often introduced enthusiastically, and then all acceptance criteria are forced into Gherkin without asking who on the team can actually write it. Not every task can be meaningfully formulated in this way.

Robot Framework had a practical advantage for this environment. Android and iOS share the same test flow, with differences only at the lowest level. Above that level, the business keywords are identical for both platforms, and test cases can be written in natural language and in a data-driven way without testers having to program methods.
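The layering described here can be sketched in Python, the language in which Robot Framework keyword libraries are commonly written. This is a minimal sketch under stated assumptions, not the team's implementation: `FakeDriver`, the locator strings, the `TelematicsKeywords` class and the login flow are all hypothetical, and the fake driver only records actions instead of talking to a device.

```python
class FakeDriver:
    """Stand-in for a real device driver; it only records the requested actions."""

    def __init__(self):
        self.actions = []

    def type_text(self, locator, text):
        self.actions.append(("type", locator, text))

    def tap(self, locator):
        self.actions.append(("tap", locator))


# Lowest level: the only place where the platforms differ (hypothetical locators).
LOCATORS = {
    "android": {
        "email": "id=login_email",
        "password": "id=login_password",
        "submit": "id=login_button",
    },
    "ios": {
        "email": "accessibility_id=loginEmail",
        "password": "accessibility_id=loginPassword",
        "submit": "accessibility_id=loginButton",
    },
}


class TelematicsKeywords:
    """Business keywords shared by Android and iOS."""

    def __init__(self, platform, driver):
        self._locators = LOCATORS[platform]
        self._driver = driver

    def log_in_as(self, email, password):
        """Same steps on every platform; only the locators are looked up per platform."""
        self._driver.type_text(self._locators["email"], email)
        self._driver.type_text(self._locators["password"], password)
        self._driver.tap(self._locators["submit"])


def run_login_flow(platform):
    """Run the shared business keyword against a fresh fake driver."""
    driver = FakeDriver()
    TelematicsKeywords(platform, driver).log_in_as("tester@example.com", "secret")
    return driver.actions


android_actions = run_login_flow("android")
ios_actions = run_login_flow("ios")
print([action[0] for action in android_actions])
print([action[0] for action in ios_actions])
```

Robot Framework exposes the public methods of such a library class as keywords, so `log_in_as` would be available to testers as `Log In As`. The test case itself then reads the same for both platforms.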

This is how the proof of concept went

A tool proves itself not on paper, but on the team's own test cases. Instead of adapting to the technology as they did the first time, the team reversed the order and requested real test cases first.

The external tender had two conditions: The catalog of requirements had to be met, and there had to be a proof of concept with at least three submitted test cases. The team only gave the green light after reviewing these test cases.

The contract was awarded to the service provider imbus. Christoph Singer, consultant at imbus, describes the initial situation as unusually advanced because the requirements catalog and test cases were already at a high level of maturity. In many projects, this groundwork first has to be caught up on together with the customer, who has often simply picked the first tool that a Google search turned up.

At the end of 2022, a larger package followed as a stress test: around 30 test cases within about two months. This made it possible to check whether the approach would work not only for three examples, but also at scale.

Quality is created in a team, not in an isolated automation team

From HUK-Coburg’s point of view, an isolated automation team that merely has test cases handed to it was one of the main reasons for the first failure. If automation engineers work in isolation and only receive finished test cases “over the fence”, there is no feedback from the domain side.

The leverage lies in the interplay of strengths. An automation specialist is technically strong, but often does not understand the subject matter deeply enough. A manual tester understands the test case, having executed it a hundred times, but lacks the technical skills. Bringing these two sides together is the real success factor.

You have to take the entire team with you. If you isolate that, the test automation engineers feel isolated and they don’t support each other.
Felix Doppel

For the manual testers, the keyword-driven approach did not mean a major break because their test cases were already structured in this direction anyway. In a joint workshop, they created their first test cases, and with the existing keywords, they quickly got a feel for what was already there and what was still missing. If a keyword was missing, the testers passed the technical ball back to imbus.

Close communication was important right from the start. Interim results were presented regularly and checked for comprehensibility instead of presenting a finished result at the end.

How far automation has come today

Around a third of the regression testing is currently automated, some test cases fully and others only partially. The regression testing comprises around 150 test cases. The team believes that the target of 60 to 70 percent is achievable by the end of the year.
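The figures in this section can be checked with a few lines of arithmetic. The sketch below uses only the rounded numbers from the text (around 150 test cases, around a third automated, a target of 60 to 70 percent), so the results are rough estimates rather than exact counts.

```python
REGRESSION_TEST_CASES = 150  # "around 150" in the text

# "Around a third" is currently automated.
automated_now = REGRESSION_TEST_CASES // 3

# Target corridor of 60 to 70 percent, in integer arithmetic to avoid rounding noise.
target_low = REGRESSION_TEST_CASES * 60 // 100
target_high = REGRESSION_TEST_CASES * 70 // 100

# Test cases still to be automated to reach the corridor.
remaining_low = target_low - automated_now
remaining_high = target_high - automated_now

print(f"automated now: about {automated_now}")
print(f"target: {target_low} to {target_high} test cases")
print(f"still to automate: {remaining_low} to {remaining_high}")
```

On these rounded figures, reaching the target means roughly doubling the number of automated test cases.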

It is deliberately not the sheer number of cases that counts. 80 automated but unstable (flaky) test cases would be a nice number for management, but still worthless. Stability beats quantity.
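One common way to make "unstable" measurable is to flag a test as flaky when it both passes and fails on the same build. The following is a minimal sketch of that idea; the result format, test names and build identifiers are hypothetical and not taken from the project.

```python
from collections import defaultdict


def find_flaky_tests(results):
    """Return the sorted names of tests that both passed and failed on one build.

    `results` is an iterable of (test_name, build_id, passed) tuples.
    """
    outcomes = defaultdict(set)
    for test_name, build_id, passed in results:
        outcomes[(test_name, build_id)].add(passed)
    # Two different outcomes for the same test on the same build mean instability.
    return sorted({name for (name, _build), seen in outcomes.items() if len(seen) == 2})


# Hypothetical results from repeated runs.
results = [
    ("Log In", "build-101", True),
    ("Log In", "build-101", True),
    ("Show Trip List", "build-101", True),
    ("Show Trip List", "build-101", False),
    ("Show Score", "build-101", False),
    ("Show Score", "build-102", True),
]

flaky = find_flaky_tests(results)
print(flaky)
```

A test that fails on one build and passes on the next is deliberately not flagged, because the change in outcome may come from a change in the app rather than from the test.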

Fully automated means that the manual test case can be completely eliminated. However, many things can only be partially automated. Some features are left out entirely: neither the journey recording itself nor the pairing of the telematics sensor can be meaningfully tested automatically.

In addition to the test cases, the infrastructure around them was also created: interface testing and the connection to a mobile device cloud. Maintenance is also a permanent item. The 30 test cases from the end of 2022 already had to be maintained again in spring 2023.

Real devices instead of simulators, controlled via three packages

Automation runs on physical devices in a mobile device cloud, not in the simulator. The simulator makes many things easier, but side effects and reliable results can only be seen on real devices.

The device selection is prioritized into three packages:

Package	Devices	Requirement
Prio 1	most frequently used devices (especially Android)	tests must run here
Prio 2	still frequently used devices	tests should run here
Prio 3	less frequently used devices	tests run sporadically

The selection is based on real usage data, not chance. The team monitors in the background which devices the users are using and looks at this on a monthly basis. This determines which devices end up in which package.
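How usage data could drive the package assignment can be sketched as follows. This is only one possible rule, assuming packages are cut by cumulative share of users. The thresholds of 60 and 90 percent, the device names and the usage counts are hypothetical, since the text does not state how the team draws the line between packages.

```python
def assign_packages(usage_counts, prio1_share=0.60, prio2_share=0.90):
    """Map each device to "Prio 1", "Prio 2" or "Prio 3" by cumulative usage share.

    Devices are processed from most to least used. A device lands in Prio 1 while
    the devices ranked before it cover less than `prio1_share` of all users, in
    Prio 2 while they cover less than `prio2_share`, and in Prio 3 otherwise.
    """
    total = sum(usage_counts.values())
    packages = {}
    covered = 0
    ranked = sorted(usage_counts.items(), key=lambda item: item[1], reverse=True)
    for device, count in ranked:
        share_before = covered / total
        if share_before < prio1_share:
            packages[device] = "Prio 1"
        elif share_before < prio2_share:
            packages[device] = "Prio 2"
        else:
            packages[device] = "Prio 3"
        covered += count
    return packages


# Hypothetical monthly usage figures (active users per device).
monthly_usage = {
    "Device A": 400,
    "Device B": 250,
    "Device C": 150,
    "Device D": 120,
    "Device E": 50,
    "Device F": 30,
}

packages = assign_packages(monthly_usage)
for device, package in packages.items():
    print(device, package)
```

Re-running such a rule on each month's data would show when a device should move between packages.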

The insurer's security requirements also come into play here. Test devices are updated automatically, which is why there are no longer any physical test devices for Android 7, for example. The app supports Android 7 and higher as well as iOS 14 and higher, and older versions can be covered specifically via the cloud.

One practical advantage of the mobile device cloud is the reporting. There is a video recording for every failed test case, which can be used to trace where things went wrong. With the previous, purely technical solution, all that remained was a stack trace, which ended up back with the developers.

Introduce slowly instead of losing trust early on

The automated tests are not yet fully integrated into the regression testing. They run in Jenkins, currently nightly, but not yet automatically with every new build. This is a conscious decision.

If the team were to activate the 50 to 60 or so automated test cases immediately, they would have to remove them from the manual regression test. This step will only be taken once the infrastructure is complete, i.e. once the connection to Jenkins and Jira is in place.

The reason for this is the experience gained from the first attempt. If the test cases are not stable enough, the mood in the team will quickly turn again. That’s why automation should only be put into full operation when the surrounding infrastructure is reliable.

The frequency of execution also requires a sense of proportion. Running on every change drives up maintenance effort. Well-chosen points in time and releases with a suitable focus matter more than running constantly.

Automation is already having an effect. The regular runs have already found two to three critical errors that would otherwise only have been noticed during regression testing. Finding errors earlier, in turn, saves money.