Static analysis with AI

AI-supported correction of findings from static code analysis means having detected quality problems automatically corrected by a large language model. The affected code and the problem description are passed to the model, which generates a correction proposal. In around two thirds of cases, current models deliver usable results; in the remaining third, the output is faulty or invalid.

Key Takeaways

AI-supported fixes based on static analysis findings work well enough in around two thirds of cases to be adopted directly or with minor adjustments.
The remaining third produces unusable or simply incorrect code, and the effort of checking it is just as high as correcting the finding manually from the outset.
Large language models almost never refuse to answer: in a benchmark of one hundred findings, the models only said “I don’t know” twice and preferred to produce faulty code instead.
Closed-source code performs significantly worse in LLM-based security analyses than publicly available code because it is simply not included in the training data.

Static analysis delivers too many findings to be processed by hand

Static analysis produces so many findings on large systems that fixing them all by hand fails on sheer volume alone. A typical rule set comprises hundreds to over a thousand rules. The result: a few thousand, a few tens of thousands, in some cases millions of findings in a single code base.

These findings vary greatly in severity. Some are trivial, such as a missing comment on a method. Others are serious, for example a potential SQL injection or a possible null pointer dereference.

In practice, this flood often leads to avoidance. Benjamin Hummel describes two common patterns: teams switch off rules, or they no longer look at the results at all. Of the two, switching off rules is the more constructive option. The recommendation is to deliberately tackle the serious findings first and not to be overwhelmed by the sheer mass.
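The "serious findings first" recommendation can be sketched as a simple triage step. This is a minimal illustration, assuming a made-up three-level severity scale and hypothetical rule names; real analysis tools define their own categories.

```python
from dataclasses import dataclass

# Hypothetical severity scale; real tools use their own categories.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    rule: str
    file: str
    severity: str  # one of the keys in SEVERITY_RANK


def triage(findings: list[Finding], limit: int) -> list[Finding]:
    """Return at most `limit` findings, most severe first."""
    ranked = sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
    return ranked[:limit]


# Illustrative findings only; rule and file names are invented.
findings = [
    Finding("missing-method-comment", "Tax.java", "minor"),
    Finding("sql-injection", "Login.java", "critical"),
    Finding("null-dereference", "Cart.java", "major"),
]

for finding in triage(findings, limit=2):
    print(finding.severity, finding.rule)
```

Capping the list with `limit` reflects the idea of working through a manageable batch instead of facing the whole flood at once.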

Why large language models help to eliminate findings in the first place

Language models shift the remediation from the pure syntax level to a level that understands code and natural language at the same time. This was hardly possible with classic static analysis.

Static analysis is good at parsing code, building syntax trees and tracing data flows. However, this stays at the level of syntax and programming-language semantics. The natural language contained in comments, names and descriptions was previously beyond its reach.

One example makes the leap tangible: a missing method comment. Earlier documentation generators took the method name, inserted spaces and produced largely meaningless texts. Today, if you put the entire method content into a language model, you get a useful summary with meaningful details. The understanding of code is astonishingly advanced here.

The real leverage therefore lies not in generating new code, but in improving existing code. Instead of starting from scratch, you get a suggestion that you only need to check and refine.

How an AI-supported fix is actually created

The naive and surprisingly viable approach: you give the model the affected piece of code, the description of the problem from the static analysis and, if available, the explanatory text on why it is a problem and what typical fixes look like. The model then delivers the corrected code.
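The three inputs described above can be assembled into a prompt roughly as follows. This is a sketch under assumptions: the field names, the wording of the prompt and the function `build_fix_prompt` are hypothetical, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    message: str         # problem description from the static analysis
    code: str            # the affected piece of code
    rule_help: str = ""  # optional: why it is a problem, typical fixes


def build_fix_prompt(finding: Finding) -> str:
    """Combine the finding's parts into one prompt string."""
    parts = [
        "A static analysis tool reported the following problem:",
        finding.message,
    ]
    # The explanatory text is only included if the analyzer provides one.
    if finding.rule_help:
        parts += ["Background on this rule and typical fixes:", finding.rule_help]
    parts += [
        "Affected code:",
        finding.code,
        "Return the corrected code only.",
    ]
    return "\n\n".join(parts)


# Illustrative input; the finding text is invented.
example = Finding(
    message="Unused import 'java.util.List'.",
    code="import java.util.List;\n\nclass Tax {}",
)
print(build_fix_prompt(example))
```

The resulting string would then be sent to whichever model is being evaluated, and the reply treated as a proposal to review, not as a finished fix.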

This works across a broad spectrum of problem classes. It ranges from added comments and the removal of unused imports to more complex refactorings.

Longer methods in particular show how effective this is. A model suggests extracting two meaningfully named methods, and the result is coherent. This is no small feat, because even deciding manually where to cut a method requires a lot of thought.

The hit rate is two thirds, the rest is noise

A systematic benchmark shows that in around two thirds of cases, the AI-supported correction works really well. The remaining third is noise.

For the evaluation, around 100 findings were taken from a larger project and run through various models. Because the models change every month, several of them had to be compared. The evaluation itself was manual work: each of the 100 cases per model was checked by hand to see whether the solution made sense, whether the code remained valid and whether it still did what it did before.
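The three manual checks can be recorded per case and summarized as a share of usable fixes. The following sketch assumes that a fix counts as usable only if all three checks pass; that rule, the class `Verdict` and the sample values are assumptions for illustration and not the benchmark's actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    makes_sense: bool         # does the proposed solution make sense?
    still_valid: bool         # is the code still valid?
    behavior_preserved: bool  # does it still do what it did before?

    @property
    def usable(self) -> bool:
        # Assumption: a fix is usable only if all three checks pass.
        return self.makes_sense and self.still_valid and self.behavior_preserved


def usable_share(verdicts: list[Verdict]) -> float:
    """Fraction of reviewed fixes that passed all three checks."""
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    return sum(v.usable for v in verdicts) / len(verdicts)


# Made-up verdicts purely to show the mechanics, not benchmark results.
sample = [
    Verdict(True, True, True),
    Verdict(True, True, True),
    Verdict(True, True, True),
    Verdict(True, False, False),
]
print(f"{usable_share(sample):.2f}")
```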

The good two thirds in detail: the useful suggestions can be adopted directly or reused with minor adjustments, such as renaming. This saves a lot of work.

The bad third is trickier. Here the result is unusable code, sometimes code that is not even valid. The effort involved in reviewing such a proposal and hunting for subtle errors is often just as high as, or higher than, fixing the spot yourself.

Models want to please instead of admitting that they don’t know what to do

A central problem: language models are strongly trained to present a solution, even if they clearly cannot handle the problem. Even if the prompt is worded openly and explicitly allows the model to decline with an “I don’t know”, this hardly ever happens.

Across all experiments, there were only two cases in which a model admitted that it did not know the solution. Instead, the models try to produce a solution by force.

It is precisely this behavior that causes more problems than an honest refusal would. If the model said “I don’t know”, you would know immediately, and without further thought, that you have to do the job by hand.
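An explicit way out can be offered in the prompt and checked for in the reply, roughly as sketched below. The sentinel `I_DONT_KNOW`, the instruction text and the function `parse_reply` are hypothetical. As described above, models rarely take this option, so the check is no safeguard against wrong fixes; it only routes the few honest refusals straight to manual work.

```python
from typing import Optional

# Hypothetical sentinel agreed on in the prompt.
ABSTAIN_TOKEN = "I_DONT_KNOW"

# Appended to the fix prompt to explicitly allow a refusal.
ABSTAIN_INSTRUCTION = (
    "If you are not confident that you can fix the problem correctly, "
    f"reply with exactly {ABSTAIN_TOKEN} and nothing else."
)


def parse_reply(reply: str) -> Optional[str]:
    """Return the proposed code, or None if the model declined or sent nothing."""
    text = reply.strip()
    if not text or text == ABSTAIN_TOKEN:
        return None
    return text


for reply in ["  I_DONT_KNOW\n", "int total = 0;"]:
    proposal = parse_reply(reply)
    print("fix by hand" if proposal is None else "review proposal")
```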

The models are very strongly trained to present a solution, even if you have the impression that they can’t do it at all. They want to please.
Benjamin Hummel

Mainstream languages work, exotic languages and closed source fall behind

The quality of the suggestions depends directly on the amount of training data for the respective language. Mainstream languages such as Java or JavaScript work well because they are abundantly represented in the training data.

The picture is different for niches. For areas such as ABAP-related SAP development, there is significantly less open source training data than for Python, for example. Poorer results are to be expected there. This is consistent with the behavior of code generators in general: they produce noticeably weaker results for exotic languages.

The effect is even more pronounced for in-house, non-public code. In an ongoing thesis on security analysis with language models, two data sets were compared: a common benchmark and our own code, which was guaranteed not to be included in training data. The results were significantly worse on the in-house code.

This gap is practically relevant. A large proportion of company software is developed as closed source and therefore does not appear in training data. The approach is weakest precisely where it would be used in everyday work.

AI review tools often only repackage the results of classic tools

Many tools that promise automatic code reviews on platforms such as GitHub have a classic open-source linter running in the background. The findings are then reformulated linguistically by a language model.

This creates the impression that the AI is performing the actual analysis. In fact, it mainly repackages the linter’s findings. The “AI” label has become a selling point.

The general quality of real AI findings remains mixed. Problems are reported that are not problems, and many real problems remain undetected. However, AI shares the lack of completeness with any static analysis; none of them is complete.

Fix findings where you are working on the code anyway

AI is not the obvious answer for old, established systems. The tried and tested strategy remains: Fix problems exactly where work is being done anyway.

If, for example, you touch a module for calculating taxes because of a change in the law, it makes sense to clean up the findings in precisely this module. Cleaning up parts that nobody touches is hardly worthwhile. The exception is high-risk topics such as SQL injection, which should be tackled in a targeted way regardless of where they occur. With this step-by-step method, you gradually end up with a better system.
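This selection rule can be sketched as a filter: keep findings in files that are being changed anyway, plus high-risk findings wherever they are. The rule identifiers, file names and the function `findings_to_fix` are hypothetical; in practice the list of touched files would come from the version control system.

```python
from dataclasses import dataclass

# Hypothetical rule identifiers treated as high risk regardless of location.
HIGH_RISK_RULES = {"sql-injection"}


@dataclass(frozen=True)
class Finding:
    rule: str
    file: str


def findings_to_fix(findings: list[Finding], touched_files: list[str]) -> list[Finding]:
    """Keep findings in touched files, plus high-risk findings anywhere."""
    touched = set(touched_files)
    return [
        f for f in findings
        if f.file in touched or f.rule in HIGH_RISK_RULES
    ]


# Illustrative findings only; rule and file names are invented.
all_findings = [
    Finding("missing-method-comment", "tax/TaxCalculator.java"),
    Finding("long-method", "legacy/Report.java"),
    Finding("sql-injection", "legacy/Login.java"),
]

for f in findings_to_fix(all_findings, ["tax/TaxCalculator.java"]):
    print(f.file, f.rule)
```

In this example the long method in the untouched legacy report is left alone, while the SQL injection is selected even though nobody is working on that file.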

The probability of errors speaks against proactive, large-scale clean-up with AI. Even if 99.9 percent of fixes are correct, you will still introduce 100 new errors when cleaning up 100,000 findings. Nobody wants that.
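The arithmetic behind this figure is a single multiplication: the number of findings times the share of incorrect fixes. A short sketch, using exact fractions to avoid floating-point rounding; the function name `expected_new_errors` is made up for illustration.

```python
from fractions import Fraction


def expected_new_errors(findings: int, fix_accuracy: Fraction) -> Fraction:
    """Expected number of faulty fixes when every finding is fixed once."""
    return findings * (1 - fix_accuracy)


# 99.9 percent correct fixes applied to 100,000 findings.
print(expected_new_errors(100_000, Fraction(999, 1000)))  # prints 100
```

The same function shows why volume matters: at the same accuracy, cleaning up only 1,000 findings in the modules being worked on would be expected to introduce a single new error.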

For the worst areas, i.e. highly nested modules that are architecturally almost impossible to keep track of, AI is currently of no help. The only option here is to make the hard decision to rebuild such parts and validate them with testers.

Generated code is not new, but AI undermines trust in it

AI-generated code raises an old question with new urgency: How much readability is needed in code that no human looks into anymore? Code from generators has been around for a long time, such as parsers from grammars or application code from domain-specific languages. Such generated code is subject to different readability requirements because the original model is adapted and regenerated when changes are made.

The decisive difference to AI lies in trust. Classic generators and compilers are deterministic: same input, same result. They are developed and tested over a long period of time, which is why hardly anyone looks for the error in the compiler first. Language models, on the other hand, hallucinate, make mistakes and deliver different outputs for the same input because they are probabilistic.

This results in a concrete risk. If the AI builds a system that you as a human can no longer understand, and you hand every change back to the AI, you have nothing to fall back on at the point where the AI can no longer make progress. This is exactly where testers become more important, because the question of whether the system is still doing the right thing remains open.