Game Testing

Game testing for an MMORPG means testing the quality of a game that has grown over decades with new patches every week. Exploratory testing stands alongside checklists, god spells replace tedious leveling, and twice a year up to 70,000 real players test new features in a separate test environment before the official update is released.

Key Takeaways

In testing the MMORPG Tibia, exploratory testing matters more than simply checking against requirements documents, because many risks are hidden between the lines of the design documents.
A bug that unfairly kills a player’s character is not a cosmetic problem: in Tibia you lose progress that cost a week of hard play, which directly drives players away.
During external testing with up to 70,000 invited players, the critical mass of users uncovers bugs and balancing issues that the internal team could not foresee.
Without in-game testing tools, so-called god spells, it would be practically impossible to create specific game states and test quests in the middle of their sequence without having to play for hours beforehand.
The release decision at Tibia is not based solely on green checkmarks in checklists, but also on the personal gut feeling of the testers who have tested the respective feature.

Why games need to be tested

An online game without testing puts its players at high risk, and that ends up costing trust and money. In a massively multiplayer game like Tibia, which has been running since 1997, every mistake weighs heavily because players invest real time in their characters.

If a character dies because of a bug, it’s costly in the game. Oliver Heldt describes it like this: “You spend a week playing towards something and then you die because there’s a bug in the code. Anyone who experiences this loses interest in the game.”

The reasoning is therefore both economic and emotional. A studio wants to keep its players and deliver a product with reliable quality. The argument “a game is not vital, so you don’t need to test it” falls short as soon as a community is attached to a product for years.

What makes a 25-year-old game testable and what doesn’t

The biggest challenge with an old game is the sheer amount of grown content that has to be considered with every change. Tibia receives a weekly patch, plus interim releases and major updates such as a summer update.

Such an update brings new adventures, new features and new game logic that have to work in parallel with the existing ones. The test team works on a major update for around seven weeks and divides up the work so that the necessary quality remains achievable during this time.

The central question is not “have we tested everything”, but “how deeply do we test the old when something new is added”. You can’t completely re-test a game that has grown over 25 years with every release. So you consciously decide where the risk lies and how far you go into it.

Exploratory testing beats stubborn ticking off

Exploratory testing is very important in game testing because risks can rarely be fully identified from a requirement. There are basic design and implementation documents that are tested against, but that is only one side.

There are many exploratory tests for a single requirement. The real work consists of reading between the lines of the documents and finding out where a risk is hidden. Simply checking whether something is red or green is not enough for a living online game.

Product knowledge helps, but is no substitute for intuition and fresh perspectives. New colleagues bring a different perspective, and it is precisely this input that uncovers things that a well-established team overlooks.

How checklists act as inspiration instead of a compulsory program

Checklists serve primarily as inspiration during playtesting, not as a compulsory program to be completed. For clearly defined cases, such as a new outfit or a new mount, the team uses experience-based checklists that cover typical sources of error.

Smoke test lists also exist, but their purpose lies elsewhere. They are intended to suggest how deeply a new feature should be tested. It’s not about collecting as many green ticks as possible at the end.

Once the checklist for a new outfit has been run through cleanly, the test is not yet over. It provides a good basis, nothing more. The rest is decided by the question of the desired test depth.

A good bug report needs a repro that your grandma understands

The most important part of a defect report is a reproducible step-by-step guide that works without follow-up questions. Oliver Heldt states his own standard clearly.

“It would be cool if you could give it to your grandma and she would know roughly how to reproduce it.” (Oliver Heldt)

Such a repro can be extensive, because an error sometimes involves several players who perform certain actions with a certain timing. Pressing a button three times is not enough as a description.

The report includes step-by-step lists and screenshots, rarely videos. Videos tend to come from the players. The standard remains the same: the developer should have to ask as few questions as possible, and another tester should also be able to reproduce the error. An “internal error on the website” as the only information does not help anyone.
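The standard described above can be sketched as a small data structure with a completeness check. This is only an illustration: the class name, the field names and the rules in `problems` are assumptions made for this example, not the format the Tibia team actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class BugReport:
    """Hypothetical defect report; all field names are illustrative."""
    title: str
    location: str
    steps: list[str] = field(default_factory=list)
    expected: str = ""
    actual: str = ""
    screenshots: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons a developer would likely have to ask back."""
        issues = []
        if len(self.steps) < 2:
            issues.append("repro needs several steps, not a one-line summary")
        if not self.expected or not self.actual:
            issues.append("expected and actual behavior must both be stated")
        if not self.location:
            issues.append("location is missing")
        return issues


# A report consisting only of an error message fails every check.
vague = BugReport(title="Internal error on the website", location="")
print(vague.problems())
```

A report that lists each player's actions in order, names the place and states expected versus actual behavior comes back with an empty list, which is roughly the point at which another tester could reproduce it without asking.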

God spells: why testers are allowed to cheat in the game

Testers work with so-called god spells, i.e. built-in intervention options that make the test practicable in the first place. From the player’s point of view, this would be cheating; in testing, each of these spells has a clear purpose.

They range from teleporting instead of long runs to setting a certain level and querying and setting quest states. This way, you can quickly get a character into the situation you want to test and save a lot of time.

Without these tools, some tests would not even be possible. In an area full of monsters, a tester would simply die because they don’t have the level of experienced players, some of whom have been playing Tibia for ten or twenty years. The test is not about being better than them, but about being able to examine a piece of game logic in a controlled way.

There are also test characters and log files. A character with an extremely large number of items is well suited for a first run, because more can go wrong there. Prepared characters are available for targeted testing. Everything that helps to evaluate a behavior as right or wrong belongs to the same category: tools that manipulate the game state or provide information about it.
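How such intervention tools shorten test setup can be shown with a minimal sketch. Everything here is hypothetical: the `Character` and `GodSpells` classes, their method names and the quest identifier are invented for illustration and do not reflect Tibia's real commands.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    """Simplified stand-in for a test character."""
    name: str
    level: int = 1
    position: tuple[int, int, int] = (0, 0, 0)
    quest_states: dict[str, int] = field(default_factory=dict)


class GodSpells:
    """Hypothetical tester-only commands, not the game's actual interface."""

    def __init__(self, character: Character):
        self.character = character

    def teleport(self, x: int, y: int, z: int) -> None:
        # Replaces a long walk to the test location.
        self.character.position = (x, y, z)

    def set_level(self, level: int) -> None:
        # Replaces hours of leveling.
        if level < 1:
            raise ValueError("level must be at least 1")
        self.character.level = level

    def get_quest_state(self, quest: str) -> int:
        # Unknown quests count as not started (state 0).
        return self.character.quest_states.get(quest, 0)

    def set_quest_state(self, quest: str, state: int) -> None:
        # Jumps straight into the middle of a quest.
        self.character.quest_states[quest] = state


# Put a fresh character directly into the situation under test.
tester = Character("Tester")
god = GodSpells(tester)
god.set_level(250)
god.teleport(100, 200, 7)
god.set_quest_state("example_quest", 3)
```

Three calls replace what would otherwise be hours of play, and the query method lets the tester confirm the state before and after the step being examined.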

How 50,000 players become part of the test

Real players enter the game after the internal test, in an external test lasting three to four weeks. Between 50,000 and 70,000 players are invited twice a year, out of a total of around 500,000 active players.

There is a separate test server and website for this external test. The players are not given a checklist; they are asked to play the game as they normally would. From within the game, they create a bug report by right-clicking, which directly provides the location and description of the bug.

Mostly small things come back: a spelling mistake, a spot on a house wall where something is wrong, or a point where you get stuck. These are precisely the kinds of errors that were classed internally as minor bugs. However, the critical mass of players also uncovers things that nobody had on their radar during internal testing.

The external test is also used for balancing. Is a boss too strong, are there too many monsters, does a new mechanic make one of the four vocations superior? Such questions are difficult to evaluate internally, but the key figures from the external test make it possible to adjust before the release.

Players try everything, so you have to think about everything

Players are actively looking for ways to outsmart the system, and it is precisely these attempts that the test must anticipate. What can be done, will be done.

For example, when an NPC wants to take an item from the player, players try to throw the item on the ground beforehand so that the NPC can’t take it. Players try such tricks to get money or advantages they are not entitled to.

These deviations from how things would work in the real world are what make tests pass or fail. Often the reaction is honest: nobody had thought of that. It’s part of the job.
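The NPC trick can be turned into a small negative test. The sketch below assumes a simplified model in which the NPC pays a reward in exchange for the item; the `Player` class, the `hand_over` function and the reward itself are assumptions for illustration, not Tibia's actual logic.

```python
class Player:
    """Minimal stand-in for a player with an inventory and some gold."""

    def __init__(self, items):
        self.inventory = list(items)
        self.gold = 0

    def drop(self, item):
        # Throwing the item on the ground removes it from the inventory.
        self.inventory.remove(item)


def hand_over(player, item, reward):
    """Hypothetical NPC step: pay only if the item can actually be taken."""
    if item not in player.inventory:
        return False  # nothing to take, so no reward
    player.inventory.remove(item)
    player.gold += reward
    return True


# The exploit attempt: drop the item first, then talk to the NPC.
cheater = Player(["relic"])
cheater.drop("relic")
assert hand_over(cheater, "relic", reward=100) is False
assert cheater.gold == 0
```

The design choice worth testing is the order of operations: the check for the item and the payout belong to the same step, so there is no moment in which the reward exists without the item having been taken.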

When a release is ready enough

“Done” in game testing is a state of data and confidence, not complete coverage. There is a due date, and usually by then the quality is at the desired level, with every feature tested as deeply as planned.

The team makes the decision together, but it is based heavily on experience. The team leader explicitly asks the testers who have worked on a feature how they feel about it. This feeling counts alongside the green checkmarks and closed issues.

If there is not enough trust, there is a retest. Then the team has to be honest with developers and product managers and say that it needs a few more days because something was overlooked in the estimate. Because testing is done internally for your own product, this step is possible without having to argue against an external customer.

Even after 25 years, there is still enough technical testing to be done

An old game does not stand still: new technology is regularly added and brings its own test tasks. Tibia remains graphically in 2D, and a leap to new graphics is not planned. Nevertheless, the technical scope is growing.

Examples include a dedicated app with game information and the complete soundtrack for the game, which previously had no sound. This soundtrack was a major test task because sound doesn’t just go on and off.

The sound follows its own logic and has to match the location: it is different on a coast than at a well. Players can control the sound granularly, for example by switching off only the weapon sounds or only the spell sounds of other players. In testing, you then work a lot with log files and manipulation to evaluate whether the sound is the right one at this point, without having to wait hours for the next piece of music.
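A log-based check of this kind might look like the following sketch. The log line format, the location types and the track names are all invented for this example; the real client logs are not described in the source.

```python
# Assumed mapping from location type to the ambience expected there.
EXPECTED_AMBIENCE = {"coast": "waves", "well": "dripping_water"}


def mismatches(log_lines):
    """Return (location, track) pairs where the logged sound does not fit."""
    wrong = []
    for line in log_lines:
        # Assumed line format: "sound location=<type> track=<name>"
        fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
        location = fields.get("location")
        track = fields.get("track")
        if location in EXPECTED_AMBIENCE and track != EXPECTED_AMBIENCE[location]:
            wrong.append((location, track))
    return wrong


log = [
    "sound location=coast track=waves",
    "sound location=well track=waves",
]
print(mismatches(log))
```

Reading the decision from a log instead of listening means the tester can teleport through many locations in a row and evaluate the results afterwards.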

Well-known bug classics and what they teach us

Even carefully tested games don’t manage to catch every bug because not everything can be tested. This basic rule of testing is illustrated by specific cases from production.

A whoopee cushion could be stacked. If a player stacked too many on top of each other, a data type overflowed and the client or server crashed.
A cube allowed a player to hit another player on the head and then disappear instead of being punished as usual.
An event island that had been worked on for weeks was suddenly no longer accessible after a change to its access permissions. This was only noticed during the external test phase.

These cases have a common pattern: a sensible change in one place breaks another that nobody thought of at the time of the change. Areas touched late in the release can no longer be completely retested, so something can slip through. The weekly patch rhythm cushions this because such errors can be rectified promptly.
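The whoopee cushion case is a classic boundary problem and can be reproduced in miniature. The source does not say which data type overflowed, so the unsigned 8-bit counter and the limit of 255 below are assumptions chosen purely for illustration.

```python
import ctypes

MAX_STACK = 255  # assumed upper limit of an unsigned 8-bit counter


def add_unchecked(count, amount):
    """Wrap around silently, like a fixed-width counter without a guard."""
    return ctypes.c_uint8(count + amount).value


def add_checked(count, amount):
    """Reject the stacking operation instead of letting the counter overflow."""
    if amount < 0 or count + amount > MAX_STACK:
        raise ValueError("stack limit exceeded")
    return count + amount


# One item too many: the unguarded counter wraps to zero.
print(add_unchecked(255, 1))
```

A boundary test for any stackable item would then cover the values just below, at and just above the limit, which is exactly the kind of case a checklist for new items can carry forward.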