How do I survive a cloud migration?

A cloud migration spans three interdependent levels: infrastructure, software architecture and organization. In technical terms, it usually begins with lift and shift, i.e. moving an application to the cloud unchanged, followed by step-by-step refactoring along functional domains. Without automated CI/CD pipelines, resilience testing and organizational change, the migration will fail.

Key Takeaways

Greenfield rewrites regularly fail: teams that build a new system in parallel to the monolith usually abandon the project after around two years.
A functioning CI/CD pipeline must be in place before any migration work, not after, because a system with many microservices and databases is simply not maintainable without automation.
Cloud migration changes job profiles in concrete terms: DBAs lose tasks when Oracle is replaced by a managed service, and developers who only deliver JAR files have to expand their area of responsibility.
Resilience tests are missing from traditional test suites because container failures, increased latency and brief unavailability hardly ever occur with on-premise systems, but are part of normal operation in the cloud.

Why companies are moving to the cloud

Three motives drive most cloud migrations: the hope of lower costs, flexibility, and the pressure that comes from everyone else doing it. The first two are tangible; the third is a hype effect that develops its own momentum.

When it comes to costs, many expect to eliminate their own infrastructure and the staff that runs it. The calculation looks good on paper. In practice, the picture is often different because new complexity arises elsewhere.

Flexibility is the stronger argument. The large hyperscalers operate data centers all over the planet. If you want to address customers in Australia, you can’t sensibly do so from Switzerland or Germany. Hardly any individual company has this geographical reach in its own hands.

A VM in the cloud is just a computer located somewhere else

From a technical perspective, a virtual machine in the cloud is initially just that: a computer in a different location. This sober view helps to dispel a common illusion, namely that the cloud automatically solves the old problems.

The switch brings to light problems that have always existed locally: security, access protection, update cycles. Suddenly it matters how often a VM is booted, or what day-two operations look like for a database system or a Kubernetes cluster.

These questions also existed on-premise. Linux systems had to be updated every two weeks there as well. The cloud only makes these obligations visible because they can no longer be hidden in your own data center. This, along with the cultural issues, is exactly what makes migration difficult.

Three construction sites: Infrastructure, architecture, organization

Every cloud migration breaks down into three areas, and only one of them is purely technical.

Infrastructure: The path from local to cloud infrastructure. Cloud providers offer an API with which this can be set up relatively quickly. However, the managed services behind it bring their own complexity, which makes the whole process more difficult.
Architecture: Monoliths that have grown over many years are often poorly structured. What many overlook is that a monolith can have an architecture too, for example via cleanly separated Java packages. The task is to split up such monoliths in a sensible way.
Organization: Over the years, organizational units have developed around on-premise databases that no longer fit the new world. This is about personal responsibility and agile working methods.

Where the main problems lie varies from company to company. Banks and insurance companies have particularly long-lived systems and equally long-lived organizations. You start at the weakest point.

How a migration works step by step

The sensible way to start is with lift and shift: you take the existing application, containerize it if necessary and start it in a cloud VM. The main purpose of this first step is to gain experience.

Further stages build on this. During re-hosting, the monolith moves to the cloud, which already requires a CI/CD pipeline to be in place. During re-platforming, the application is placed in a Kubernetes cluster, and logging and monitoring run via the cloud provider’s services.

Only then does the refactoring take place. This is the actual architecture work, in which the structure is improved, if you choose to do it. There are certainly reasons not to take this last step.

Greenfield rebuilds fail, step-by-step cutting helps

The attempt to completely rebuild a monolith with a separate greenfield team usually goes wrong in practice. The pattern is familiar: three teams maintain the monolith, a fourth is supposed to reimplement it with clean cuts. Such projects are usually abandoned after around two years.

The viable path leads via the existing monolith. You work out the technical domains step by step, spread over several iterations, not in a single sprint.

As soon as the interactions between the domains are minimal, there are clean interfaces and the database schema is separated, a domain can be cut out and packed into its own container. The cut is functional, not technical. This is the basic principle of microservices.
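One way to make "minimal interactions between the domains" measurable is to count the calls that cross a domain boundary. The following is a minimal sketch, not a finished tool: the module names, the domain mapping and the call list are hypothetical, and in a real project they would come from a dependency analysis of the monolith.

```python
from collections import Counter

def cross_domain_calls(module_domain, calls):
    """Count calls that cross a domain boundary, keyed by (caller domain, callee domain)."""
    counts = Counter()
    for caller, callee in calls:
        src = module_domain[caller]
        dst = module_domain[callee]
        if src != dst:
            counts[(src, dst)] += 1
    return counts

def coupling_of(domain, counts):
    """Total number of boundary-crossing calls that touch the given domain."""
    return sum(n for (src, dst), n in counts.items() if domain in (src, dst))

# Hypothetical module-to-domain mapping and call list, for illustration only.
MODULE_DOMAIN = {
    "onboarding.form": "onboarding",
    "onboarding.risk_profile": "onboarding",
    "accounts.ledger": "accounts",
    "accounts.statements": "accounts",
    "payments.transfer": "payments",
}
CALLS = [
    ("onboarding.form", "onboarding.risk_profile"),
    ("onboarding.risk_profile", "accounts.ledger"),
    ("payments.transfer", "accounts.ledger"),
    ("payments.transfer", "accounts.statements"),
    ("accounts.statements", "accounts.ledger"),
]

counts = cross_domain_calls(MODULE_DOMAIN, CALLS)
for domain in ("onboarding", "accounts", "payments"):
    print(domain, coupling_of(domain, counts))
```

In this made-up data set, onboarding touches the rest of the system through a single call, which would make it the most plausible first candidate for a cut.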

An example from the banking environment is onboarding. A customer comes in, their data is recorded and initial products are offered depending on the investment amount and risk appetite. This onboarding process forms a self-contained domain that you can start with.

The database is the hardest construction site

Many companies operate a large Oracle platform, and the DBAs who guard it often block architectural changes. This issue is both technical and organizational.

An Oracle database can perhaps technically be ported to a large hyperscaler, but this is usually not what people want. Replacing it is precisely one of the biggest tasks of any migration.

A data migration from one database system to another must ensure that the data is still the same afterwards. This starts with character sets and field lengths and goes further. When migrating from an old system to PostgreSQL, there will certainly be discrepancies that have to be caught before going live.
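The checks for character sets, field lengths and "is the data still the same" can be sketched in a few lines. This is an illustrative sketch under assumptions: rows are plain dictionaries, the column limits and the target encoding are invented, and a real migration would read both sides from the actual databases.

```python
import hashlib

def check_row(row, max_lengths, encoding="utf-8"):
    """Return a list of problems found in one source row before it is loaded."""
    problems = []
    for column, value in row.items():
        if not isinstance(value, str):
            continue
        try:
            value.encode(encoding)
        except UnicodeEncodeError:
            problems.append(f"{column}: not representable in {encoding}")
        limit = max_lengths.get(column)
        if limit is not None and len(value) > limit:
            problems.append(f"{column}: {len(value)} characters exceed limit {limit}")
    return problems

def row_fingerprint(row):
    """Stable hash of a row, used to compare source and target after the load."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def diff_tables(source_rows, target_rows, key):
    """Return keys that are missing in the target and keys whose content differs."""
    target = {r[key]: row_fingerprint(r) for r in target_rows}
    missing = [r[key] for r in source_rows if r[key] not in target]
    changed = [r[key] for r in source_rows
               if r[key] in target and target[r[key]] != row_fingerprint(r)]
    return missing, changed

# Hypothetical sample data: row 2 was truncated during the load, row 3 was lost.
SOURCE = [
    {"id": 1, "name": "Müller"},
    {"id": 2, "name": "Łukasz Kowalski"},
    {"id": 3, "name": "Ng"},
]
TARGET = [
    {"id": 1, "name": "Müller"},
    {"id": 2, "name": "Łukasz Kow"},
]

print(check_row(SOURCE[1], {"name": 10}, encoding="latin-1"))
print(diff_tables(SOURCE, TARGET, key="id"))
```

Running the pre-load check and the post-load comparison as a pipeline stage means such discrepancies show up in a test report rather than in production.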

A microservice system is not maintainable without a CI/CD pipeline

Before you migrate anything, create a continuous delivery pipeline. This is the first step in moving to the cloud, not the last.

A system consisting of 40 or 50 microservices with replicas and underlying databases is no longer maintainable without testing and a CI/CD pipeline. Automation must be built in right from the start.

“When you build a microservice like this, you always build the CI/CD pipeline first and only then start writing the code.”
Christopher Schmidt

The pipeline includes unit testing, component testing and cleanly tested microservices. This is exactly where the errors that are guaranteed to arise during data migration come to light. The pipeline itself must also be subject to a process. If you haven’t thought about a clear procedure beforehand, you are building something that is constantly breaking.
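The idea of a pipeline with a fixed procedure can be illustrated as an ordered list of gates, where the first failing stage blocks the deployment. This is only a sketch: the stage names and the placeholder checks are assumptions, and a real pipeline would run in a CI system and call the actual test runners.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]

def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure. Returns a log of results."""
    log = []
    for name, check in stages:
        ok = check()
        log.append(f"{name}: {'passed' if ok else 'FAILED'}")
        if not ok:
            log.append("deployment blocked")
            return log
    log.append("deployment allowed")
    return log

# Hypothetical stage checks; a real pipeline would invoke the test runners here.
def unit_tests() -> bool:
    return True

def component_tests() -> bool:
    return False  # simulate a failing component test

def build_image() -> bool:
    return True

STAGES = [
    ("unit tests", unit_tests),
    ("component tests", component_tests),
    ("build image", build_image),
]

result = run_pipeline(STAGES)
print("\n".join(result))
```

Because the component stage fails in this example, the image is never built and nothing is deployed.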

Migration changes people, not just machines

A cloud migration cannot be carried out as a submarine project, that is, under the radar of management. If a single team decides that local IT is too slow and quietly moves to the cloud, it won’t last long. Because of the organizational consequences, management must support such projects.

The classic silo structure does not fit well in the new world: an infrastructure team, an ops team, and developers who neither know base images nor write deployment manifests because ops takes care of that. This separation needs to be dissolved, and that can only be done by agreeing on adapted targets and incentives. That takes time.

Specific roles are changing. Today, DBAs are responsible for query performance and optimize queries directly in the database system. With the switch to a managed service such as Amazon RDS, they are no longer needed to the same extent. Developers whose self-image ends with “my artifact is a JAR, I’m not interested in anything else” also have to change.

At the same time, not everyone can know everything in a complex environment. Core competencies must remain in the teams at individual points. The claim that everyone has full access to all knowledge does not work.

Resilience testing belongs in every cloud pipeline

The cloud adds a test category that traditional functional testing does not cover: resilience to latency and failures. You probably already had functional, unit and component testing before. That’s no longer enough.

In a cluster, containers are temporarily unavailable, for example because a node has rebooted or died. An API fails briefly, latencies increase. A monolith becomes a set of separate processes, distributed across different machines and sometimes different data centers. This results in latencies and error situations that did not exist before.

This is particularly noticeable in the hybrid cloud scenario. If part of the application is in the cloud while service or database systems remain on-premise, the connection runs via VPNs or SSL. The latency then jumps from around ten to one hundred milliseconds. This is often enough to exhaust thread pools.
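Why a jump from ten to one hundred milliseconds hits thread pools so hard can be estimated with Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The pool size and the request rate below are assumed values for illustration, not measurements.

```python
def threads_in_use(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x time per request."""
    return requests_per_second * latency_ms / 1000

POOL_SIZE = 20              # assumed thread pool size
REQUESTS_PER_SECOND = 300   # assumed steady load

for latency_ms in (10, 100):
    busy = threads_in_use(REQUESTS_PER_SECOND, latency_ms)
    status = "fits" if busy <= POOL_SIZE else "pool exhausted"
    print(f"{latency_ms} ms: {busy:.0f} threads busy on average ({status})")
```

With these assumed numbers, the same load that kept 3 threads busy at ten milliseconds needs 30 at one hundred milliseconds, which is more than the pool provides.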

Such resilience tests belong in the CI/CD pipeline. If you leave them out, you will later wonder why parts of the new software regularly crash. There may have been something similar in your own data center, but you ignored it or it happened less frequently.
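What such a resilience test might look like in its simplest form: a test double injects a short outage and some latency, and the test asserts that the caller recovers or fails in a controlled way. The class and function names are hypothetical, and the retry helper is a deliberately simple stand-in for whatever resilience mechanism the application actually uses.

```python
import time

class FlakyService:
    """Test double that fails the first `failures` calls, then answers after `delay_s`."""
    def __init__(self, failures: int, delay_s: float = 0.0):
        self.failures = failures
        self.delay_s = delay_s
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected outage")
        time.sleep(self.delay_s)  # injected latency
        return "ok"

def call_with_retry(operation, attempts: int = 3, backoff_s: float = 0.01):
    """Retry on ConnectionError with a linearly growing pause; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)

def test_survives_short_outage():
    service = FlakyService(failures=2, delay_s=0.05)
    assert call_with_retry(service.fetch, attempts=3) == "ok"
    assert service.calls == 3

def test_gives_up_on_long_outage():
    service = FlakyService(failures=5)
    try:
        call_with_retry(service.fetch, attempts=3)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the outage to surface")
    assert service.calls == 3
```

Tests of this kind run without a cluster, so they fit into the same pipeline stage as the component tests. Fault injection against a real environment would be a later, separate stage.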

Maintenance window or quick response

Managed services allow update scenarios to be packed into maintenance windows, for example on Saturdays or Sundays. This sounds convenient, but it comes at a price that you have to consciously weigh up.

If you schedule the window for the weekend, a known vulnerability can remain exposed until then. Fixed vulnerabilities are listed in the release notes, and anyone who wants to break in will check the known Java or Linux standard problems to see whether a system is still vulnerable.
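The price of a weekend window can be expressed as exposure time: the span between the moment a fix is published and the next window in which it is applied. The dates and the Saturday 22:00 window below are made-up values for illustration.

```python
from datetime import datetime, timedelta

def next_window(after: datetime, weekday: int, hour: int) -> datetime:
    """First window start strictly after `after` (weekday: Monday=0 ... Sunday=6)."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - after.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)
    return candidate

fix_published = datetime(2024, 6, 3, 9, 0)               # hypothetical: Monday, 09:00
window = next_window(fix_published, weekday=5, hour=22)  # assumed window: Saturday, 22:00
exposure = window - fix_published

print(f"next window: {window}")
print(f"exposure: {exposure.total_seconds() / 3600:.0f} hours")
```

In this example the system stays vulnerable for 133 hours, more than five days, although the fix was available on Monday morning.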

The recommendation is clear: work without a maintenance window and deploy quickly. An application that can deal with short outages and higher latencies does not need the planned window to the same extent anyway. This shifts the question back to the resilience of the architecture.