How AI can support accessibility

Accessibility tests check whether web applications meet the criteria of the Web Content Accessibility Guidelines (WCAG), an international catalog of standards with over 80 test points. Around 20 to 25 of these can be automated using traditional tools. AI language models can additionally take on language-related criteria, such as whether headings match the page content. Clustering algorithms help to select meaningful test representatives from a large number of pages.

Key Takeaways

Accessibility tests today are carried out with an unproductive mix of tools (Word, Excel, screen reader and browser), which should be replaced by an integrated test environment that does not force testers to switch between applications.
Of over 80 WCAG criteria, only around 20 to 25 can be tested automatically with existing standard tools; the rest still requires manual work.
GPT-4 can reliably evaluate linguistic WCAG criteria such as the consistency of heading and text content, while visual checks such as recognizing broken layouts do not yet work sufficiently well with current models.
A crawler combined with clustering algorithms reduces the manual effort involved in accessibility audits by automatically grouping pages according to relevant attributes such as embedded PDFs or videos and suggesting representatives.

Accessibility testing will be mandatory for many companies from 2025

From June 2025, the European Accessibility Act will also force private companies to make their digital offerings accessible. This obligation has been in force in the public sector for some time. Banks and insurance companies have been preparing for this for years because violations of the rules have consequences.

The effort involved is considerable. Testing contracts on the order of many person-days appear on tender portals. Anyone who has to test a large web application is not dealing with a few random samples, but with a structured process across many pages.

The technical basis is provided by WCAG, an internationally recognized catalog of criteria. With over 80 criteria, it describes what must be fulfilled for a website to be considered accessible. National regulations and EU law also build on this standard.

Which accessibility criteria can be tested automatically?

Of the more than 80 WCAG criteria, around 20 to 25 can currently be automated using standard tools. There are open source tools that cover these tests and are already being integrated by many teams.

These tools mainly work with data that can be extracted from the rendering of a page. Contrast values between text and background are a typical example: they result directly from the displayed layout and can be checked objectively.
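The contrast check can be made concrete with a minimal sketch. It follows the relative luminance and contrast ratio definitions from WCAG 2.x and the 4.5:1 threshold for normal-size text at level AA; the function names are illustrative and not taken from any particular tool.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Weighted sum of the linearized red, green and blue channels."""
    r, g, b = (_linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple[int, int, int],
                   background: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: tuple[int, int, int],
                          background: tuple[int, int, int]) -> bool:
    """Level AA requires at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

A real tool additionally has to determine the effective foreground and background colors from the rendered page, which is the harder part and is not shown here.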

The rest remains manual work. Realistically, no process will be able to automatically check all of the more than 80 criteria in the foreseeable future. Some of them require human judgment, for example when it comes to interpretation and context.

How an accessibility audit works

An audit follows a defined process model that exists as a supplement to the WCAG and describes in rough steps how to proceed. The process is straightforward; there is nothing magical about it.

The individual steps:

**Get an overview** of how the application is structured.
**Identify technologies:** Check which technologies are used on the site.
**Identify page types:** Find out what types of pages there are, for example about 20 page types rather than 10,000 individual pages.
**Select representatives:** Draw a sample per page type and document each decision so that the result can be traced later.
**Check criteria:** Go through the criteria catalog for each representative.
**Document:** Write down the result.

The documentation runs through the entire process. Which page types were selected, and why, must be recorded; otherwise the audit is not verifiable.
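One way to keep the sampling step traceable is to draw the sample with a recorded seed and store the decision alongside it. The following is a minimal sketch under that assumption; `SelectionRecord` and `select_representatives` are hypothetical names, and the source does not prescribe this mechanism.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionRecord:
    """One documented sampling decision for a single page type."""
    page_type: str
    candidates: int
    selected: tuple[str, ...]
    seed: int
    reason: str


def select_representatives(pages_by_type: dict[str, list[str]],
                           per_type: int = 1,
                           seed: int = 0) -> list[SelectionRecord]:
    """Draw a reproducible sample per page type and record how it was drawn."""
    records = []
    for page_type in sorted(pages_by_type):
        urls = sorted(pages_by_type[page_type])
        # Seeding per page type keeps each draw independent of the others.
        rng = random.Random(f"{seed}:{page_type}")
        chosen = rng.sample(urls, min(per_type, len(urls)))
        records.append(SelectionRecord(
            page_type=page_type,
            candidates=len(urls),
            selected=tuple(chosen),
            seed=seed,
            reason=f"random sample of {len(chosen)} from {len(urls)} pages",
        ))
    return records
```

Because the seed is stored in each record, the same selection can be reproduced later when the audit is reviewed.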

The biggest pain is in the jumble of tools, not in the audit itself

Anyone testing accessibility today is juggling a multitude of tools. Findings end up in Word, page lists in Excel, and the audit itself requires a screen reader, an open source test tool and, of course, the browser. Screenshots are taken, copied and pasted.

These constant switches between tools cost time and patience. Especially when many pages need to be checked, the back and forth between the applications becomes the real obstacle, not the technical check.

A comparison with software development makes the problem clear. Developers have a similar starting point: code, compiler, test frameworks, execution environment, database tools. Modern development environments bundle all of this into one interface via plug-in architectures.

It is precisely this principle that is missing in accessibility testing. An integrated test environment in which testers can work without switching applications, link tools and add new functions without leaving the working context would provide the greatest leverage. Incidentally, most users take a dark mode for granted.

AI helps where language understanding is required

Artificial intelligence plays to its strengths in accessibility tests, especially in linguistic tasks. Current language models already handle language well, and this is where their use really pays off.

One concrete example is the criterion that a heading must match the associated text. This test is neither trivial nor unambiguous, even for humans, and there is room for interpretation.

A practicable approach looks like this: A tool extracts the headings for a given page together with the associated text blocks and sends a corresponding prompt to a language model. The model evaluates how well the heading and text fit together. The results of such evaluations are consistently good enough to be used in production.
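The extraction and prompting step described above might be sketched as follows. This is only an illustration: the prompt wording, the names `extract_sections`, `build_prompt` and `evaluate_headings`, and the `ask_model` callable are assumptions, and no real model API is called. The caller supplies `ask_model` as a function that takes a prompt string and returns the model's answer.

```python
from html.parser import HTMLParser
from typing import Callable

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
IGNORED_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "li", "br", "td", "section", "article"}


class HeadingSectionParser(HTMLParser):
    """Collects each heading together with the text that follows it."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[list[str]] = []  # [heading text, body text]
        self._in_heading = False
        self._ignored_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignored_depth += 1
        elif tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append(["", ""])
        elif tag in BLOCK_TAGS and self.sections:
            # Keep text from adjacent block elements from running together.
            self.sections[-1][1] += " "

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self._ignored_depth = max(0, self._ignored_depth - 1)
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        # Text before the first heading belongs to no section and is dropped.
        if self._ignored_depth or not self.sections:
            return
        self.sections[-1][0 if self._in_heading else 1] += data


def extract_sections(html: str) -> list[tuple[str, str]]:
    """Return (heading, following text) pairs with whitespace normalized."""
    parser = HeadingSectionParser()
    parser.feed(html)
    parser.close()
    return [(" ".join(h.split()), " ".join(b.split())) for h, b in parser.sections]


def build_prompt(heading: str, body: str) -> str:
    """Assumed prompt wording; a real tool would tune this."""
    return (
        "You are reviewing a web page for accessibility.\n"
        "Does the heading describe the topic of the text below it?\n"
        "Answer with 'yes', 'partly' or 'no' and one sentence of reasoning.\n\n"
        f"Heading: {heading}\n"
        f"Text: {body}"
    )


def evaluate_headings(html: str,
                      ask_model: Callable[[str], str]) -> list[tuple[str, str]]:
    """Ask the model about every heading and return (heading, answer) pairs."""
    return [(heading, ask_model(build_prompt(heading, body)))
            for heading, body in extract_sections(html)]
```

Keeping the model behind a plain callable makes it possible to swap the hosted model for a local one later without touching the extraction logic.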

Despite all the enthusiasm, it pays to be cautious. AI is not an all-purpose solution for every testing problem. Anyone who applies AI to everything today is more likely to arouse suspicion, because the topic carries inflated expectations. The better question is: What specific problem should the tool solve at this point?

Why visual error detection using AI is not yet viable

Having an AI model reliably recognize whether a page is visually broken did not work on the first attempt. The idea was appealing: take a screenshot and let the model report a broken layout or other visual defects.

Such models first have to be trained, and this requires labeled data. Artificially broken pages were shown to customers in a small web application over several weeks, with the question of whether the page looked broken or not. The labeling ran as a competition with a prize draw.

Despite this training data, the recognition accuracy of this first attempt was not sufficient. The goal behind it remains attractive: a sensor that runs in the background during every test execution and notices that a tested application is visually broken. Most automated tests do not check such non-functional aspects on a large scale.

How an intelligent crawler speeds up page selection

Gaining an overview of all pages of an application is one of the most time-consuming steps in the audit, and this is where automation helps most directly. Manually creating a map of all pages is tedious.

A crawler that runs through the application starting from one or more start URLs takes over this mapping. Crawlers are not a new invention, and the old problems remain: How do you recognize that you have landed on the same page again, even though only a date, a time or a displayed advertisement has changed? This state abstraction is the core difficulty, as is the handling of forms and form data.
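One simple form of state abstraction is to fingerprint a page by its tag structure and ignore the text, so that a changed date or time does not produce a new state. The sketch below illustrates that idea only; it is an assumption, not the approach used in the source, and an advertisement that changes the markup would still defeat it. The names `structural_fingerprint` and `is_new_page` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TagSequence(HTMLParser):
    """Records the order of opening tags and ignores all text content."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_fingerprint(html: str) -> str:
    """Hash of the tag sequence; pages differing only in text hash the same."""
    parser = _TagSequence()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(" ".join(parser.tags).encode("utf-8")).hexdigest()


def is_new_page(html: str, seen: set[str]) -> bool:
    """Return True and remember the page if its structure was not seen before."""
    fingerprint = structural_fingerprint(html)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```

This abstraction is deliberately coarse: all pages built from the same template collapse into one state, which suits the search for page types but would be too aggressive if every individual page had to be visited.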

Instead of making the crawler smarter and smarter, a combination with clustering helps. The crawler runs first, then a clustering algorithm groups the pages found based on attributes that are relevant from an accessibility perspective, such as whether a page contains a PDF or a video.

This makes it possible to find meaningful representatives without testing ten near-identical pages. It saves work and provides the tester with prepared groups: here all pages with PDFs, there all with video. Collecting and pre-sorting in this phase takes a lot of work away from the human tester, precisely because manual testing remains afterwards anyway.
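The grouping idea can be illustrated with a deliberately simple stand-in for a clustering algorithm: pages are grouped by an exact match on a small set of accessibility-relevant features. The source does not name the algorithm or the feature set, so the three features below (`pdf`, `video`, `form`) and all function names are assumptions.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _FeatureParser(HTMLParser):
    """Detects a few accessibility-relevant features in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.features: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self.features.add("video")
        elif tag == "form":
            self.features.add("form")
        elif tag in {"a", "embed", "object", "iframe"}:
            target = (attributes.get("href") or attributes.get("src")
                      or attributes.get("data") or "")
            # Ignore query string and fragment when checking the extension.
            path = target.split("#")[0].split("?")[0]
            if path.lower().endswith(".pdf"):
                self.features.add("pdf")


def page_features(html: str) -> tuple[str, ...]:
    """Sorted tuple of the features found, usable as a dictionary key."""
    parser = _FeatureParser()
    parser.feed(html)
    parser.close()
    return tuple(sorted(parser.features))


def group_pages(pages: dict[str, str]) -> dict[tuple[str, ...], list[str]]:
    """Group URLs by their feature signature (maps URL -> HTML as input)."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for url in sorted(pages):
        groups[page_features(pages[url])].append(url)
    return dict(groups)


def suggest_representatives(
        groups: dict[tuple[str, ...], list[str]]) -> dict[tuple[str, ...], str]:
    """Suggest the first URL of each group as its representative."""
    return {signature: urls[0] for signature, urls in groups.items()}
```

With richer feature vectors, a distance-based clustering method would replace the exact-match grouping, but the output for the tester stays the same: a handful of groups, each with a suggested representative.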

Local language models still fail due to data privacy requirements

OpenAI's models are currently the gold standard for linguistic checks because no other model handles these tasks as well. This brings with it two problems.

First, the API connection is not stable enough. If the provider's status page turns red, this has a direct impact on your own application. Second, data privacy quickly raises the question of where the data is actually sent, especially if customers require their data to stay within Germany.
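A common way to soften the first problem is to wrap the model call in a retry with exponential backoff. The sketch below shows the pattern in general terms; `call_with_retry` is a hypothetical helper, `request` stands for whatever function actually contacts the model, and the source does not say that this technique was used.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(request: Callable[[], T],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], object] = time.sleep) -> T:
    """Call request() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            return request()
        except Exception as error:  # a real client should catch only transient errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"request failed after {attempts} attempts") from last_error
```

Retries only bridge short outages. They do nothing for the data privacy question, which is why the local model remains the more fundamental answer.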

This is why local models are an important topic for the future. The hope is to fine-tune a general-purpose open source language model such as Llama for the specific use case. The first attempts have not yet worked well; OpenAI is clearly ahead here.

Deliver iteratively and involve users early on

The most sensible approach is to deliver an initial set of functioning tools and incorporate user feedback directly. Instead of waiting for the big, finished system, four or five additional automatable criteria are first put into the hands of the testers.

The message to the user is: here is a set of tools that works. These criteria can now be tested automatically in addition to what was previously possible. Work with it and tell us how it feels. Everything else will be added step by step.

This approach keeps development close to users’ real problems. Before adding any AI function, the question is what the tool actually does better; the technology is not an end in itself. One open question is how transparent the use of AI should be: Are users simply happy to see a button called “Analyze”, or do they want to know what is happening in the background?