Why agentic engineering changes everything

Agentic engineering refers to the systematic approach to developing software with AI agents, where architecture, quality principles and environment design are more important than writing individual prompts. Good principles accelerate progress; bad principles accelerate regression. Specialized cleanup agents continuously handle tasks such as deduplication, refactoring and closing gaps in test coverage.

Key Takeaways

If you have bad principles and use AI, you get worse faster; if you have good principles, you get better faster because AI simply accelerates the existing course.
Agentic engineering replaces prompt engineering as the central skill: your own prompt only accounts for around 20 out of 20,000 tokens in the context; the configuration of the agent environment is crucial.
A permanently running cleanup agent, which checks for duplicated code, missing tests and quality characteristics after every change, replaces manual refactoring after the first draft.
Five developers who all work agentically at the same time step on each other’s toes because agents immediately touch the entire stack; the consequence is smaller, cross-functional teams with a clearly separated architecture.
When switching to a new model, it is worth throwing away all previous skills and configurations because outdated settings worsen the new standard behavior of the model instead of improving it.

Vibecoding or agentic engineering: why the term makes all the difference

Agentic engineering describes AI-supported software development more precisely than the popular term vibecoding. Andrej Karpathy, who coined vibecoding a good year ago on X, later corrected himself and now prefers the term agentic engineering.

The difference is in the word. Engineering brings architecture, quality assurance and engineering discipline back into the discussion. It’s not just about creating code, but about an engineering activity with principles and methods.

Existing knowledge about architecture, paradigms and quality does not lose its value. On the contrary, it becomes more valuable. The task shifts to systematizing this knowledge and transferring it into agents and their configuration instead of just keeping it in the heads of the developers.

Why the prompt decides less today than expected

In agentic environments, the wording of the prompt makes little difference. Classic prompt engineering, which was the focus a year ago, is still relevant for direct model interaction, but it is becoming less important in agentic setups.

The reason lies in the context. An agent already sends around 20,000 tokens via its system prompt, in which it introduces itself to the model and describes its tools, for example which files it can read and edit. A hand-written prompt of 20 tokens hardly carries any weight by comparison.
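To make the proportion concrete, here is a small back-of-the-envelope calculation. The two token counts are the rough figures from the text, not measurements; real values vary by agent and model.

```python
# Illustrative figures taken from the text; real values vary by agent and model.
SYSTEM_PROMPT_TOKENS = 20_000  # what the agent harness sends on its own
USER_PROMPT_TOKENS = 20        # a short hand-written task prompt


def prompt_share(user_tokens: int, system_tokens: int) -> float:
    """Return the fraction of the total context taken up by the user prompt."""
    total = user_tokens + system_tokens
    return user_tokens / total


share = prompt_share(USER_PROMPT_TOKENS, SYSTEM_PROMPT_TOKENS)
print(f"User prompt share of context: {share:.2%}")
```

With these numbers the hand-written prompt makes up roughly 0.1 percent of what the model sees.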

The actual task of the prompt is to enable the agent to obtain the necessary information itself. The better the environment is configured, the less the prompt needs to contain. Known problems are written in the prompt; solved problems belong in the configuration.

Reverse engineering instead of better prompts

The better lever is not the optimized prompt, but the observation of the default behavior. You deliberately start with a bad prompt and no configuration and see how the model behaves.

The correction is derived from this observation. Instead of asking what a better prompt should look like, you ask what the agent was missing in its context file or its tools so that it would have solved the task with the bad prompt alone.

This works via a guided conversation. You steer the agent manually, correct its direction, and if the result is correct in the end, you let it evaluate the conversation: Where was it right? Where did it need to rework? Where did you intervene? These findings end up in the configuration file so that the same corrections are not necessary the next time.

The procedure is similar to a retrospective in agile work. You ask what went well and what went badly and record the essence. Vendors have now recognized this pattern as best practice and incorporated it into their tools, for example as a command in Claude Code that evaluates the conversation and writes down the findings.
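The retrospective step can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it: the `Retrospective` structure, the `record_findings` helper and the file name `AGENT_CONTEXT.md` are all hypothetical, and in practice the agent itself would produce the findings.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Retrospective:
    """Findings from one guided conversation (hypothetical structure)."""
    went_well: list[str] = field(default_factory=list)
    needed_rework: list[str] = field(default_factory=list)
    human_interventions: list[str] = field(default_factory=list)

    def to_rules(self) -> list[str]:
        # Only rework and interventions become rules: they mark gaps in the context.
        return [f"- {item}" for item in self.needed_rework + self.human_interventions]


def record_findings(context_file: Path, retro: Retrospective) -> int:
    """Append new rules to the context file, skipping rules already present."""
    existing = context_file.read_text(encoding="utf-8") if context_file.exists() else ""
    known = set(existing.splitlines())
    new_rules = [rule for rule in dict.fromkeys(retro.to_rules()) if rule not in known]
    if new_rules:
        # Make sure appended rules start on their own line.
        prefix = "" if existing == "" or existing.endswith("\n") else "\n"
        with context_file.open("a", encoding="utf-8") as handle:
            handle.write(prefix + "\n".join(new_rules) + "\n")
    return len(new_rules)


# Usage with a temporary directory; the file name is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    context = Path(tmp) / "AGENT_CONTEXT.md"
    retro = Retrospective(
        went_well=["Found the right module on its own"],
        needed_rework=["Run the test suite before reporting success"],
        human_interventions=["Reuse the existing modal component"],
    )
    first = record_findings(context, retro)   # writes two rules
    second = record_findings(context, retro)  # nothing new the second time
    written = context.read_text(encoding="utf-8")
```

Only the points where the agent had to rework or the human had to intervene are persisted, because those are the ones that indicate missing context.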

A cleanup crew cleans up what the first draft leaves behind

Instead of making the code perfect on the first attempt, separate agents take over the cleanup in the background. One such agent loops over all changes and looks for opportunities to consolidate and simplify.

This division of labor can be specialized. One agent checks for security, another for quality characteristics, a third agent only does deduplication. If it recognizes that someone has built a new modal window even though a component already exists, it refactors the code to use that component. If a test for a new function is missing, it adds it.
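What such specialized checks look for can be illustrated with two deterministic stand-ins. This is only a sketch under simplifying assumptions: a real cleanup agent would use a model and also apply the fix, whereas the hypothetical helpers `find_duplicates` and `find_untested` merely detect functions with identical bodies and functions without a `test_<name>` counterpart.

```python
import ast
from collections import defaultdict


def function_nodes(source: str) -> list[ast.FunctionDef]:
    """Collect all function definitions in a piece of Python source."""
    return [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]


def find_duplicates(source: str) -> list[list[str]]:
    """Group functions whose parameters and bodies are structurally identical."""
    groups = defaultdict(list)
    for node in function_nodes(source):
        key = ast.dump(node.args) + "".join(ast.dump(stmt) for stmt in node.body)
        groups[key].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_untested(source: str, test_source: str) -> list[str]:
    """List functions that have no test named test_<function> (naming is an assumption)."""
    tested = {n.name.removeprefix("test_") for n in function_nodes(test_source)}
    return sorted(n.name for n in function_nodes(source) if n.name not in tested)


CHANGED_CODE = '''
def net_price(amount, rate):
    return amount * (1 + rate)

def gross(amount, rate):
    return amount * (1 + rate)

def discount(amount, rate):
    return amount * (1 - rate)
'''

TEST_CODE = '''
def test_net_price():
    assert net_price(100, 0.5) == 150
'''

duplicates = find_duplicates(CHANGED_CODE)
untested = find_untested(CHANGED_CODE, TEST_CODE)
print("Duplicate implementations:", duplicates)
print("Functions without tests:", untested)
```

Each check stays narrow on purpose, mirroring the idea that one agent handles deduplication while another handles test coverage.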

This shifts the requirement. It remains better to build cleanly from the outset, but the feature does not have to be perfect on the first pass if another agent cleans up afterwards. This is similar to the usual approach of many developers, who ship something first and then refactor it, only now they no longer do the refactoring themselves.

Why the choice of model becomes an architectural decision

Changing the model devalues the painstakingly developed configuration because each model has different defaults and different system prompts. The intuition gained through reverse engineering is always tied to a specific model and a specific harness.

It is recommended to work on the latest state-of-the-art models and to always focus on one model. It makes sense to divide the work by task: create the plan with an expensive model and the implementation with a cheaper, faster model because the plan has already found the necessary context.

When a new model appears, it is worth throwing away your own skills and configurations and starting again. Otherwise, old skills will lead the new model in the wrong direction and restrict it. That would be the worst-case scenario: a new, more capable model that is pushed down to the level of its predecessor by the old configuration.

It is important to note that new models do not usually become smarter in the sense of having more knowledge. They often have the same training cut-off as their predecessor. Instead, they behave more closely to what a human would expect for the task.
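One way to make the advice about discarding old configuration hard to forget is to record which model a configuration was tuned for. The following is a minimal sketch; `AgentConfig`, `config_for` and the model names are hypothetical and not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    tuned_for: str                 # model the rules were derived against
    rules: tuple[str, ...] = ()    # findings from earlier retrospectives


def config_for(model: str, stored: AgentConfig) -> AgentConfig:
    """Reuse stored rules only for the model they were tuned on; otherwise start empty."""
    if stored.tuned_for == model:
        return stored
    return AgentConfig(tuned_for=model)


stored = AgentConfig("model-v1", ("Always run the linter before committing",))
same_model = config_for("model-v1", stored)   # rules are kept
new_model = config_for("model-v2", stored)    # rules are dropped, observation starts over
```

Starting from an empty configuration forces a fresh look at the new model's default behavior before any rule is added back.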

The production line replaces the individual feature

The next stage no longer builds the feature, but the production line that produces features. Instead of implementing a search feature yourself, you create a production line that generates the implementation from incoming feature requests in the context of the specific system and company.

This approach also solves the model issue more cleanly. The appropriate model can be hardwired into a production line for each step and tested in a targeted manner. Manual work is less process-heavy, whereas a line allows the explicit assignment of a model to each step.
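The explicit assignment of a model to each step could look roughly like this. It is a sketch under assumptions: the step names, model names and the `call_model` callback are placeholders, and `fake_model` only echoes its input so that the example runs without any provider.

```python
from dataclasses import dataclass
from typing import Callable

# (model name, input text) -> output text; a stand-in for a real model client.
ModelCall = Callable[[str, str], str]


@dataclass(frozen=True)
class Step:
    name: str
    model: str         # hardwired per step, so it can be tested in isolation
    instruction: str


# Hypothetical line: plan with an expensive model, implement and clean up with cheaper ones.
LINE = (
    Step("plan", "expensive-planner", "Write an implementation plan for"),
    Step("implement", "cheap-coder", "Implement this plan"),
    Step("cleanup", "cheap-cleaner", "Deduplicate and add missing tests for"),
)


def run_line(feature_request: str, call_model: ModelCall,
             steps: tuple[Step, ...] = LINE) -> list[tuple[str, str, str]]:
    """Pass the artifact from step to step and record which model produced what."""
    artifact = feature_request
    trace = []
    for step in steps:
        artifact = call_model(step.model, f"{step.instruction}: {artifact}")
        trace.append((step.name, step.model, artifact))
    return trace


def fake_model(model: str, prompt: str) -> str:
    """Echo stub so the sketch runs without a real model."""
    return f"[{model}] {prompt}"


trace = run_line("add product search", fake_model)
models_used = [model for _, model, _ in trace]
```

Because every step names its model, swapping the model for one step is a one-line change that can be evaluated on its own.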

A different kind of abstraction, not a new compiler

AI is not simply the next level of abstraction, the way Java sits above assembler. The comparison is flawed because there is no compiler that guarantees a deterministic result.

With a compiled language, the generated assembler is of no interest because the compiler translates reliably. With AI, this guarantee no longer applies. If you no longer want to worry about the code, you need other safeguards, such as the cleanup agent, instead of being able to rely on a deterministic translation.

This results in a shift in skills. The code moves into the background, while the principles and approaches under which good code is created move to the forefront. This is architecture work, and it is more architecture work than before.

Those who use AI accelerate their existing principles

Bad principles plus AI lead to worse results faster; good principles plus AI lead to better results faster. Recent studies from the DORA context have already factored in the effects of AI and show precisely this mechanism.

The consequence has two sides. Teams that now blindly rely on AI can hit the wall faster. And the gap between high-performing and low-performing organizations is widening, not narrowing.

The antidotes are not new. Unit testing, integration testing, shift left and a functioning pipeline are homework that needs to be done anyway. If you produce pull requests 500 times as fast without them and then route everything through a central QA department, you will immediately lose the productivity you have gained.

If you have bad principles and use AI, you will get worse faster. And if you have good principles, you get better faster.
Benedikt Stemmildt

Why teams step on each other’s toes when working agentically

Most experience reports come from individuals; hardly anyone has solved working as a team. This is exactly where the biggest problems arise.

Five developers who all work agentically get in each other’s way. In classic planning, the work is divided up and everyone takes a section. With agents, on the other hand, everyone develops the entire stack at once. Merge conflicts are the minor problem, and the AI resolves them itself. It is harder to keep track of how the product is developing at all.

The answer is smaller units. Product managers, product owners and developers are increasingly becoming one role, the product engineer. Three such people plus an agent make a team that can program well as an ensemble.

Organization and architecture must fit together

A cross-functional team layout is a prerequisite; separate backend and frontend teams cannot be meaningfully shrunk for agent-based work. Five frontend and five backend teams do not result in a sustainable structure.

Simply restructuring the organization is not enough. Conway’s Law and socio-technical architecture are intertwined here. The architecture must be partitioned in such a way that the agents do not step on each other’s toes in the code.

In practical terms, this means self-contained systems or a modulith rather than microservices, where teams risk blocking each other again. End-to-end self-contained units give the agents and teams working in parallel the space they need.
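Whether parallel agents stay out of each other's way can be checked mechanically once the units are visible in the directory layout. The sketch below assumes that the first path segment names the self-contained unit and that paths are non-empty and relative; the task and file names are invented for illustration.

```python
from itertools import combinations
from pathlib import PurePosixPath


def module_of(path: str) -> str:
    """Treat the first path segment as the self-contained unit (an assumption)."""
    return PurePosixPath(path).parts[0]


def find_collisions(tasks: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Report every pair of tasks that touches the same unit."""
    touched = {task: {module_of(p) for p in paths} for task, paths in tasks.items()}
    collisions = []
    for first, second in combinations(sorted(touched), 2):
        for module in sorted(touched[first] & touched[second]):
            collisions.append((first, second, module))
    return collisions


# Hypothetical parallel agent tasks and the files they changed.
tasks = {
    "agent-search": ["search/api.py", "search/ui.tsx"],
    "agent-checkout": ["checkout/cart.py", "search/ui.tsx"],
    "agent-account": ["account/profile.py"],
}

collisions = find_collisions(tasks)
print(collisions)
```

Frequent collisions on the same unit are a hint that either the architecture or the way work is assigned to it needs to be partitioned differently.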

AI-containing systems are created differently than deterministic software

As soon as a system itself contains AI, such as an agent that reads through a regulation, the development process changes fundamentally. Such systems are not deterministic like the software that has been built so far.

This requires smaller iterations and significantly more trial and error. With traditional software, a developer has an idea of the solution and implements an algorithm. With AI-containing systems, you have to experiment a lot more because the behavior cannot be determined in advance.
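In practice, experimenting with such a system often means measuring how often it behaves acceptably instead of asserting a single exact output. The sketch below assumes a hypothetical `pass_rate` helper and an acceptance threshold of 0.7; the stub cycles through canned answers to stand in for a system whose output varies from run to run.

```python
from itertools import cycle
from typing import Callable


def pass_rate(system: Callable[[str], str], check: Callable[[str], bool],
              prompt: str, runs: int) -> float:
    """Run the system repeatedly and return the share of acceptable answers."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(system(prompt)))
    return passed / runs


# Stub standing in for an AI-containing system with varying output.
_answers = cycle(["compliant", "compliant", "non-compliant", "compliant"])


def stub_system(prompt: str) -> str:
    return next(_answers)


ACCEPTANCE_THRESHOLD = 0.7  # assumed value; would be tuned per use case

rate = pass_rate(stub_system, lambda answer: answer == "compliant",
                 "Does the regulation apply to this case?", runs=8)
acceptable = rate >= ACCEPTANCE_THRESHOLD
print(f"pass rate {rate:.2f}, acceptable: {acceptable}")
```

Each small iteration then changes one thing, such as the context or the model, and compares the measured rate before and after.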

This means that old control models reach their limits. Requirements and functional specifications are out of the question anyway, but even agile principles are no longer sufficient for this type of development.