Four architectures, one system

Infrastructure waits quietly - until everyone needs it at once.

In early 2015, I built a scheduling system that was supposed to replace a shared spreadsheet. It is still running today. During those eleven years, the software was rewritten four times. Not because it failed, not because new technology appeared, and not because performance demanded it, but because reality revealed that the problems worth solving were different from the ones I had originally optimized for.

This is the story of how a small prototype for one group of doctors evolved into a long-running production system, and why every rewrite made it simpler.

A small problem

The project began with a complaint.

My brother, a physician participating in the Ärztlicher Notdienst (the regional on-call medical service) in a German federal state, was dissatisfied with how shifts were assigned. The regional organization generated rosters using a round-robin scheme based on an alphabetically ordered list of doctors. Administratively simple, but practically frustrating: vacations, family events, and professional obligations were ignored. Doctors spent significant effort trading shifts afterward to make schedules workable.

He asked whether doctors could organize the roster themselves. The first users were a single region - a few hundred physicians coordinating emergency duty around one hospital network. The problem looked small. The first attempt resembled little more than a structured alternative to a shared Google spreadsheet.

I expected a hobby project.

Instead, adoption spread quickly. Other regions joined after hearing how much organizational friction disappeared. The most surprising discovery was not technical at all: a relatively small piece of software solved a coordination problem that had persisted inside a large institution for years.

Architecture 1: a prototype that escaped

To build something quickly, I reused a CMS platform I had co-developed years earlier as founder and CTO of a startup. The company had closed in 2002, but the software (a Java-based CMS) remained privately maintained.

It was available, familiar, and adaptable. That was enough.

The initial system:

  • CMS-based application
  • interpreted environment
  • SQL database backend
  • single tenant

Speed mattered more than architecture. The goal was validation.

The prototype worked well enough that it entered production and stayed there for two years. Growth exposed its limits. Additional regions required isolation of users, schedules, and workflows. Multi-tenancy was unavoidable.

The first real rewrite began.

Architecture 2: doing it properly

In March 2016, I rebuilt the system using a conservative and well-understood stack:

  • Kotlin on the JVM
  • an early Ktor version
  • Tomcat application server
  • JDBC persistence
  • MySQL database

The objective was straightforward: transform a successful prototype into a maintainable multi-tenant system. The rewrite itself was uneventful. The system entered production and operated reliably for about a year.

Then the system encountered its first real failure.

The incident

At the start of a booking phase - the busiest minutes of the entire scheduling cycle - the database became corrupted. The shift table became unreadable. A backup system existed and functioned correctly. Recovery took roughly fifteen minutes.

Technically acceptable.

Operationally disastrous.

During booking openings, hundreds of doctors attempt to reserve preferred shifts simultaneously. Fairness depends on responsiveness in those first minutes. A fifteen-minute outage destroyed that assumption. Shortly after recovery, the corruption happened again.

The failure revealed something unexpected: The problem was not availability. The problem was recovery time.

Architecture 3: removing the database

In April 2017, persistence moved from a relational database to the filesystem. Recovery became an atomic file replacement. Instant.

This eliminated the dominant failure mode but transferred responsibility for consistency into the application itself. Access to tenant data had to be synchronized:

  • reads remained concurrent,
  • writes were serialized per tenant,
  • requests occasionally waited during heavy activity.

Latency increased slightly but remained below 500 ms even during peak load for the largest region with approx. 500 participants.
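
As a rough illustration of this pattern, here is a minimal sketch (hypothetical names, not the production code): one read/write lock per tenant, reads run concurrently, and each write lands in a temporary file that atomically replaces the primary data file, so a crash never leaves a half-written file behind.

```kotlin
import java.nio.file.*
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantReadWriteLock
import kotlin.concurrent.read
import kotlin.concurrent.write

// Hypothetical sketch: one lock per tenant, concurrent reads,
// serialized writes, atomic file replacement on every write.
class TenantStore(private val dataDir: Path) {
    private val locks = ConcurrentHashMap<String, ReentrantReadWriteLock>()

    private fun lockFor(tenant: String) =
        locks.computeIfAbsent(tenant) { ReentrantReadWriteLock() }

    fun read(tenant: String): ByteArray =
        lockFor(tenant).read {
            Files.readAllBytes(dataDir.resolve("$tenant.json"))
        }

    fun writeAtomically(tenant: String, content: ByteArray) {
        lockFor(tenant).write {
            val target = dataDir.resolve("$tenant.json")
            val tmp = Files.createTempFile(dataDir, tenant, ".tmp")
            Files.write(tmp, content)
            // Atomic replacement: recovery after a crash always sees a complete file.
            Files.move(tmp, target,
                StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING)
        }
    }
}
```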

The optimization target had changed.

Before                          After
maximize concurrency            minimize recovery time
database-managed consistency    application-managed consistency
complex restore                 instant recovery

The filesystem was not superior in general. It would have been the wrong choice for tens of thousands of users. But it was superior for this workload.

History as a side effect

Each modification wrote a new backup file alongside the primary data file. Over a booking season:

  • approx. 500 doctors participate
  • roughly 3,000 shifts exist
  • a little more than 4,000 history files accumulate through cancellations and rebookings

The database version also logged activity - but those logs vanished along with the corrupted data. Filesystem backups, copied asynchronously to independent storage, survived runtime failures.

The improvement was not absolute safety but separation of failure domains. Backup moved outside the request path. The system unintentionally gained a transparent history of decisions.
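
A minimal sketch of how such a history can accumulate (hypothetical file layout and names, not the actual implementation): before each write, the current primary file is copied to a timestamped history file, and an independent job mirrors those files to separate storage outside the request path.

```kotlin
import java.nio.file.*
import java.time.Instant

// Hypothetical sketch: every modification first archives the current state,
// so cancellations and rebookings leave roughly one history file each.
fun archiveBeforeWrite(dataDir: Path, tenant: String) {
    val primary = dataDir.resolve("$tenant.json")
    if (Files.exists(primary)) {
        val stamp = Instant.now().toEpochMilli()
        val history = dataDir.resolve("history").resolve("$tenant-$stamp.json")
        Files.createDirectories(history.parent)
        Files.copy(primary, history, StandardCopyOption.COPY_ATTRIBUTES)
    }
    // A separate scheduled job (e.g. an rsync run) mirrors dataDir/history
    // to independent storage, keeping backup out of the request path.
}
```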

Stability

After the filesystem rewrite, the system stabilized.

It runs today on a Netcup Root Server (4 cores, 16 GB RAM, Ubuntu Linux), operating below 10% CPU during peak booking phases; memory usage stays under 8 GB. Users rarely noticed deployments: interfaces remained stable, updates were scheduled during off-hours, and writes paused briefly while reads continued uninterrupted. Major incidents disappeared. Only one long-uptime Ktor session bug required introducing a scheduled reboot every couple of weeks.

Otherwise, the system simply ran. For years.

Architecture 4: when time becomes the problem

The next rewrite was not triggered by failure but by aging dependencies.

The Kotlin/Ktor/Tomcat stack depended on early ecosystem libraries. Some repositories disappeared. Builds survived only because I had backed up my local Maven .m2 cache.

The application became locked to:

  • Ktor 1.2
  • older Kotlin versions
  • Tomcat 8

Upgrading further was impossible. The system was stable - but frozen. Evolvability had vanished.

In October 2025, I performed the final rewrite:

  • modern Kotlin
  • current Ktor
  • embedded Netty server
  • standalone deployment

Tomcat disappeared. WAR packaging disappeared. The dependency surface shrank dramatically. Deployment became a simple scripted restart of a systemd service. Functionality remained unchanged. The rewrite restored the ability to evolve.
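
The overall shape of such a standalone deployment, sketched as a minimal, hypothetical entry point (port and route are placeholders, not the actual application): the server embeds Netty and starts from a plain main function, so deployment reduces to restarting a single JVM process.

```kotlin
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*

// Hypothetical minimal entry point: no Tomcat, no WAR - the application
// embeds Netty and runs as one JVM process, e.g. under systemd.
fun main() {
    embeddedServer(Netty, port = 8080) {
        routing {
            // Placeholder route; the real application serves the scheduling UI and API.
            get("/health") { call.respondText("ok") }
        }
    }.start(wait = true)
}
```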

What changed and what didn’t

Across eleven years, nearly every technical layer changed. The domain did not. Doctors booking shifts fairly among themselves remained constant. The core model survived four complete rewrites almost untouched.

Each rewrite removed a layer whose original benefit had expired:

Rewrite                   Removed constraint
CMS → Kotlin              framework limitations
Database → filesystem     slow recovery
Tomcat → Netty            dependency lock-in

The system did not become more sophisticated. It became smaller.

Eleven years

Three things surprised me.

First, how a small coordination problem could spread organically once solved well. Second, that serious failures appeared where I least expected them - not in complex logic, but in infrastructure assumptions. Third, that longevity rewards simplicity more than sophistication.

Long-lived systems optimize for properties that are easy to underestimate at the beginning:

  • recoverability matters more than uptime,
  • operational simplicity matters more than architectural completeness,
  • evolvability matters more than technological modernity,
  • smaller dependency surfaces age better.

The project began as an alternative to a shared spreadsheet for one group of doctors. It still runs today - simpler than before, and closer to the problem it was meant to solve.

Software longevity, I learned, is less about choosing the right architecture at the beginning and more about removing what turns out not to matter.