Fair, good AI?

Fair AI refers to the requirement that AI systems demonstrably do not discriminate and that they uphold socially accepted values. Because fairness measures can be mathematically contradictory, fairness cannot be solved by purely technical means. Instead, an assurance case framework structures the requirements: from a main assertion to testable sub-assertions to verifiable evidence such as test results or software functionality.

Key Takeaways

Fairness measures in AI can be mathematically contradictory: optimizing one typically worsens another, which is why computer scientists must not make this decision alone.
The assurance case approach from safety engineering can be transferred to fairness: a main assertion is broken down step by step into testable sub-assertions and provable evidence.
In a real industry project with a medical rotation planner, the framework showed that around a quarter of the necessary fairness measures were not even in the development plan.
Stakeholder surveys uncover fairness requirements that go far beyond non-discrimination, including transparency, autonomy and practical aspects such as low waiting times between departments.
A structured argument in the form of an assurance case protects developers legally: if allegations of discrimination arise, they can show that they proceeded to the best of their knowledge and according to the state of the art.

Fair AI means first defining what fair means

In computer science, fairness cannot be defined by developers alone. It is a social requirement that must be translated into testable criteria, and this translation step does not belong in the hands of programmers.

The reason lies in how widely the technology has spread. Software has always been embedded in social processes, but AI is being used in so many areas that requirements are suddenly emerging that no one has thought about before. People who build AI typically have five to seven years of computer science studies behind them, but no ethical or social training and no domain knowledge in fields such as medicine.

Marc Hauer describes the core problem using the example of personnel selection. In the past, no one looked at every single decision made by HR staff, and there was no need to: a human might make ten or twenty such decisions a month. An AI applies the same logic at a completely different order of magnitude. It is precisely this scaling effect that forces us to clarify the concept of fairness in advance.

Why there is no objective fairness

Fairness cannot be unambiguously resolved mathematically because different notions of fairness can contradict each other. Optimizing one measure runs counter to another.

A simple example makes this clear. In the case of child benefit, a distribution is considered fair if everyone receives the same amount. In social welfare, it is considered fair if everyone reaches a minimum level after the allocation. Both seem intuitively fair, but with a limited budget and people starting from different levels, the two principles generally cannot both be fulfilled at the same time.
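The conflict between the two principles can be made concrete with a small calculation. The sketch below is illustrative only: the incomes, the budget and the minimum level are invented numbers, and `equal_split` and `top_up_to_minimum` are hypothetical helper names, not real benefit rules.

```python
def equal_split(budget, n_people):
    """Child-benefit style: everyone receives the same amount."""
    return [budget / n_people] * n_people


def top_up_to_minimum(budget, incomes, minimum):
    """Social-welfare style: raise everyone to at least `minimum`."""
    payments = [max(0, minimum - income) for income in incomes]
    if sum(payments) > budget:
        raise ValueError("budget too small to reach the minimum for everyone")
    return payments


# Invented example values (assumptions, not real figures).
incomes = [0, 500, 2000]
budget = 900
minimum = 600

equal = equal_split(budget, len(incomes))
after_equal = [income + pay for income, pay in zip(incomes, equal)]

top_up = top_up_to_minimum(budget, incomes, minimum)
after_top_up = [income + pay for income, pay in zip(incomes, top_up)]

print("equal split:", equal, "->", after_equal)
print("top-up:     ", top_up, "->", after_top_up)
print("equal split reaches the minimum for all:", min(after_equal) >= minimum)
print("top-up pays everyone the same amount:", len(set(top_up)) == 1)
```

With these numbers, the equal split leaves the first person below the minimum, while the top-up reaches the minimum only by paying unequal amounts. Which of the two counts as fair is a normative choice that the arithmetic cannot make.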

The computer scientist Jon Kleinberg and his co-authors have shown this tension formally for three clearly defined measures of fairness: except in special cases, they cannot all be satisfied at the same time. This is precisely why it should not be up to computer scientists to decide what is fair in a specific case.
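A toy calculation shows how such a conflict arises. The sketch below does not reproduce Kleinberg's three measures or his proof. It swaps in a simpler trio of commonly used measures (selection rate, true positive rate and false positive rate per group) and uses invented labels for two hypothetical groups whose base rates differ.

```python
from fractions import Fraction


def selection_rate(preds):
    """Share of people who receive a positive decision."""
    return Fraction(sum(preds), len(preds))


def rate_among(labels, preds, label_value):
    """Share of positive decisions among people whose true label is `label_value`."""
    chosen = [p for y, p in zip(labels, preds) if y == label_value]
    return Fraction(sum(chosen), len(chosen))


def gaps(labels_a, preds_a, labels_b, preds_b):
    """Absolute between-group gaps for three common measures."""
    return {
        "selection_rate": abs(selection_rate(preds_a) - selection_rate(preds_b)),
        "true_positive_rate": abs(
            rate_among(labels_a, preds_a, 1) - rate_among(labels_b, preds_b, 1)
        ),
        "false_positive_rate": abs(
            rate_among(labels_a, preds_a, 0) - rate_among(labels_b, preds_b, 0)
        ),
    }


# Invented data: 5 of 10 people in group A are truly positive, 2 of 10 in group B.
labels_a = [1] * 5 + [0] * 5
labels_b = [1] * 2 + [0] * 8

# Option 1: a perfectly accurate classifier (predictions equal the labels).
accurate = gaps(labels_a, labels_a, labels_b, labels_b)

# Option 2: select 5 people in group B as well, to equalize selection rates.
parity_preds_b = [1] * 5 + [0] * 5
parity = gaps(labels_a, labels_a, labels_b, parity_preds_b)

for name, result in [("accurate", accurate), ("equal selection", parity)]:
    print(name, {measure: str(gap) for measure, gap in result.items()})
```

The accurate classifier has no gap in error rates but a selection-rate gap of 3/10. Closing that gap forces three false positives in group B, which opens a false-positive-rate gap of 3/8. No setting of the predictions removes both gaps on this data, so someone has to decide which measure matters in the given context.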

There is another pitfall. Anyone who is simply given the task of building a fair system can end up calculating all available fairness measures and will almost certainly find one that delivers a good value. This is not a reliable statement about fairness.

How the assurance case breaks down fairness into testable requirements

The approach comes from safety engineering and is called the Assurance Case Framework. It turns an abstract assertion into a structured argument that can be proven and challenged.

It starts with a main assertion. In safety engineering this is “My system is safe”; in the fairness context it is “Our system is fair”. This is followed by an argumentation level that asks: which sub-assertions must be fulfilled for the main assertion to be valid?

These sub-assertions are broken down further until they are small enough to be substantiated. Depending on the case, the evidence may be

an existing functionality of the software,
test results and comprehensible test processes,
references to scientific publications,
existing technical processes, such as the ability to contact the responsible planner.

The structure also includes openly stated assumptions. An assumption can be that you work exclusively with fairness measures and do not consider anything else. If such an assumption is openly noted, a reviewer can challenge it. This is the real value: the argumentation can be checked, criticized and sharpened from the outside.
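The structure described above maps naturally onto a small tree. The following is a minimal sketch, not the notation of any particular assurance case tool: the class name `Assertion`, its fields, the helper functions and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Assertion:
    """One node of the argument: a claim with optional children and support."""
    text: str
    sub_assertions: list["Assertion"] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


def unsupported(node):
    """Return the leaf assertions that have no evidence attached yet."""
    if not node.sub_assertions:
        return [] if node.evidence else [node.text]
    missing = []
    for child in node.sub_assertions:
        missing.extend(unsupported(child))
    return missing


def outline(node, depth=0):
    """Render the argument as an indented outline that a reviewer can read."""
    pad = "  " * depth
    lines = [f"{pad}{node.text}"]
    lines += [f"{pad}  [assumption] {a}" for a in node.assumptions]
    lines += [f"{pad}  [evidence] {e}" for e in node.evidence]
    for child in node.sub_assertions:
        lines += outline(child, depth + 1)
    return lines


# Hypothetical example content, invented for this sketch.
case = Assertion(
    "Our system is fair",
    assumptions=["Only the agreed fairness measures are considered"],
    sub_assertions=[
        Assertion(
            "The system does not discriminate",
            evidence=["Test results for the agreed fairness measure"],
        ),
        Assertion(
            "The system is transparent",
            sub_assertions=[
                Assertion(
                    "Affected people can view the result",
                    evidence=["Existing result view in the software"],
                ),
                Assertion("Affected people can contact the responsible planner"),
            ],
        ),
    ],
)

print("\n".join(outline(case)))
print("Still unsupported:", unsupported(case))
```

Keeping assumptions as explicit fields rather than comments is the design choice that matters here: a reviewer sees them in the same outline as the evidence and can challenge them directly.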

Stakeholder survey instead of metric optimization

Fairness becomes concrete as soon as you ask the people affected and the users what they would actually recognize as fair. The result usually goes far beyond non-discrimination.

Tobias Krafft has tested the framework outside the safety world in an industry project for a medical rotation planner. Medical students rotate through different hospitals and departments for set periods of time. An AI component was to create the plans for these rotations.

Stakeholder interviews revealed several requirements that the stakeholders considered to be part of fairness:

Non-discrimination
Transparency
Autonomy and control over their own plan
Short waiting times between departments
Having to move as rarely as possible

Each of these requirements became a sub-assertion under the main assertion “Our system is fair”. For the requirements of transparency and control, this specifically meant that students can view their plan, see KPIs such as average waiting times, challenge their plan and express criticism, and contact the planner who ultimately decides on the plan.
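A KPI such as the average waiting time only becomes testable evidence once its calculation is pinned down. The sketch below shows one possible definition, not the one used in the project: the data layout (department, first day, last day), the function names and the example plan are assumptions, and waiting time is taken to mean the idle calendar days between the last day of one rotation and the first day of the next.

```python
from datetime import date


def waiting_days(rotations):
    """Idle calendar days between consecutive rotations.

    `rotations` is a list of (department, first_day, last_day) tuples,
    where `last_day` is the final day spent in that department.
    """
    ordered = sorted(rotations, key=lambda r: r[1])
    return [
        (nxt[1] - prev[2]).days - 1
        for prev, nxt in zip(ordered, ordered[1:])
    ]


def average_waiting_days(rotations):
    """Mean idle time per transition; 0.0 if there is no transition."""
    gaps = waiting_days(rotations)
    return sum(gaps) / len(gaps) if gaps else 0.0


# Invented example plan for one student (not project data).
plan = [
    ("Surgery", date(2024, 1, 8), date(2024, 3, 1)),
    ("Internal medicine", date(2024, 3, 4), date(2024, 4, 26)),
    ("Paediatrics", date(2024, 5, 13), date(2024, 7, 5)),
]

print("waiting days per transition:", waiting_days(plan))
print("average waiting days:", average_waiting_days(plan))
```

Whether weekends count as waiting time, or whether a threshold applies, is exactly the kind of detail that stakeholders rather than developers would need to settle.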

The assurance case uncovers gaps before the system is ready

The greatest practical benefit became apparent during development: the assurance case makes it possible to see which requirements have not yet been planned.

Stakeholder acceptance was the most important goal in the rotation project, which is why fairness was assured during development, not afterwards. At the time of publication around nine months ago, around half of the evidence could not yet be provided because the software was not yet ready.

The second finding was more striking: around a quarter of the necessary evidence was not even in the development plan. The assurance case was used to identify new requirements and functionalities, which were then included in the plan. One example was the ability to compare two alternatively generated plans in parallel and see directly where one is better than the other.
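This kind of gap analysis amounts to tagging each piece of required evidence with a status and counting. The sketch below is a hypothetical illustration: the `Status` values, the item names and their statuses are invented, and the eight items were chosen only so that the shares mirror the rough proportions mentioned above. They are not the project's actual evidence list.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    PROVIDED = "provided"        # evidence exists today
    PLANNED = "planned"          # in the development plan, not yet built
    NOT_PLANNED = "not planned"  # required by the argument, missing from the plan


# Invented evidence list (assumption, for illustration only).
evidence_items = [
    ("Student can view own plan", Status.PROVIDED),
    ("Test results for the agreed fairness measure", Status.PROVIDED),
    ("Contact option to the responsible planner", Status.PROVIDED),
    ("Documented test process", Status.PROVIDED),
    ("KPI view with average waiting times", Status.PLANNED),
    ("Workflow for challenging a plan", Status.PLANNED),
    ("Side-by-side comparison of two generated plans", Status.NOT_PLANNED),
    ("Report on how often students have to move", Status.NOT_PLANNED),
]


def status_shares(items):
    """Share of evidence items per status."""
    counts = Counter(status for _, status in items)
    return {status: counts[status] / len(items) for status in Status}


def new_requirements(items):
    """Evidence that the argument needs but the plan does not contain."""
    return [name for name, status in items if status is Status.NOT_PLANNED]


shares = status_shares(evidence_items)
open_share = shares[Status.PLANNED] + shares[Status.NOT_PLANNED]

print(f"not yet provided: {open_share:.0%}")
print(f"not in the plan:  {shares[Status.NOT_PLANNED]:.0%}")
print("candidates for the development plan:", new_requirements(evidence_items))
```

The list returned by `new_requirements` is what feeds back into planning: each entry is either added to the development plan or the corresponding sub-assertion has to be revised.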

AI makes discrimination visible that was already there before

The debate about fair AI often overlooks the fact that human decisions were rarely better beforehand. Humans had their biases and their tunnel vision, but nobody examined them systematically.

When AI is used, a closer look is taken at whether an outcome is fair or discriminatory. This scrutiny reveals problems that would simply go unnoticed in the purely human process. The key difference is that with AI, you can identify the levers, understand them and challenge the result.

The media debate paints a one-sided picture. There are reports of discrimination, denied social assistance and reclaimed money. Cases where AI is useful rarely make the news. This imbalance can lead to economic losses as soon as attention is focused on an incident.

A documented argument protects the developers

A state-of-the-art assurance case serves as a basis for argumentation if an allegation of discrimination arises. It shifts responsibility from the shoulders of the developers to the place where it belongs.

“Until now, it has been said: You are responsible for your product. But as a computer scientist, I have learned to develop; I have no ethical training and no domain knowledge in medicine, and I don’t even know what consequences this can have.” (Marc Hauer)

With the documented version of the assurance case and all its evidence and assumptions, you can say in the event of an incident: we acted to the best of our knowledge and belief, these specific steps were taken, and as things stand today, no better way is known. This lowers the barrier to using AI in areas where fairness plays a role.

Not every application requires this effort. If an AI checks on the assembly line whether a screw meets the standard, non-functional requirements such as fairness are irrelevant. This is exactly where fast and agile work is needed.

Regulation by risk instead of one size fits all

The European approach regulates AI according to risk rather than treating every application the same. This shifts the effort to where a decision becomes critical.

The AI Act provides for risk classes. A closer look is taken where an AI is involved in decisions that affect human rights or other fundamental issues. Where the added value outweighs the critical consequences, there is room to work quickly.

AI never operates in a legal vacuum anyway. If an AI supports a doctor, the existing set of rules for medical practice continues to apply. In this respect, risk assessment does not necessarily lead to overregulation, but rather carries existing legal rules over to a new tool.

Templates and open source as the way forward

An assurance case cannot be transferred one-to-one from one project to the next, but useful templates can be prepared for certain areas of application. This is the lever for making the method usable beyond individual projects.

As for how far the method has spread so far: there are two scientific publications and a guidebook that explains the approach for a broader audience. The method is anchored as a proposal in the AI standardization roadmap. A working group on the fairness of AI in financial services is developing a template for fairness aspects in lending systems.

The biggest practical hurdle is getting the right stakeholders around the table. This is difficult within the narrow confines of a single project, and an open call to the community brings together many competing ideas.

As a long-term goal, Tobias Krafft outlines a common task framework along the lines of image recognition: the community receives a sufficiently specified use case, builds assurance cases for it, shares and improves them, and selects the best one after one or two years. It is known from the safety sector that the method works, and this experience could be widely shared in an open source model.