Yet Another Day in the Life: When Four Problems Pretend to Be One

Sometimes the universe decides that one production incident isn’t enough. It needs to nest them like Russian dolls, each one revealing another surprise when you think you’ve cracked it.

Last night I got the message every platform engineer dreads: “The demo isn’t working.” Not just any demo, mind you. A customer demo scheduled for 8:30 the next morning. The API was returning 500 errors across the board, and the clock was ticking.

Here’s the thing: I was coming into this pretty cold. This wasn’t a system I’d built or one I knew intimately. The developer who’d flagged the issue had no idea what was wrong. The previous developer who’d originally built the system? Also no idea. The infrastructure had evolved, configurations had drifted, and the knowledge had walked out the door.

So before I could fix anything, I had to map the dependencies on the fly. Which instance hosts what? Where does nginx route traffic? What services need to be running? Which environment files actually get loaded?

That kind of rapid system archaeology is half the job sometimes. You’re building the map while you’re navigating the territory. DevOps to the rescue, I suppose.

What followed was a three-hour masterclass in why production systems love to humble you.

The First Layer: The Obvious Culprit

Initial investigation pointed to the instance being down. Fair enough. We have a Lambda function that stops non-critical instances at 18:00 UTC to save costs. Sensible in theory. I fired up AWS Systems Manager (genuinely brilliant for remote debugging without SSH keys) and got the instance back online.
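For anyone who hasn’t used it, the recovery itself is a couple of commands once Systems Manager is set up. The instance ID here is a placeholder, obviously:

```bash
# Bring the instance back up, then get a shell on it without SSH keys.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ssm start-session --target i-0123456789abcdef0
```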

API responding. Excellent.

Still returning 500 errors. Less excellent.

The Second Layer: The Silent TypeError

Digging into the logs revealed a Python TypeError lurking in the code. The endpoint accepted a URL parameter that could be None, but the code downstream assumed it would always be a string. When the frontend sent an empty request body, Python’s str.replace() method threw its toys out of the pram because you can’t replace a substring with None.

The fix? A humble url or "" to convert None to an empty string. A handful of characters that would have saved hours of debugging if they’d been there from the start.
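The actual endpoint isn’t something I can share, but the failure mode is easy to sketch with made-up names:

```python
def build_query(url: str | None, template: str) -> str:
    # str.replace() raises TypeError if its second argument is None,
    # which is exactly what an empty request body produced here.
    url = url or ""  # the humble fix: coerce None to an empty string
    return template.replace("{url}", url)
```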

This is the kind of thing a developer would spot immediately if they knew to look at the logs. But knowing where the logs are, how to access a running instance, and how to interpret what systemd is telling you? That’s the platform engineering bit. The developers were stuck because they couldn’t get eyes on the problem.
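None of it is exotic, either. On a systemd-managed box it comes down to a couple of commands (unit name invented for illustration):

```bash
# Is the service up, and what is it actually complaining about?
sudo systemctl status api.service
sudo journalctl -u api.service --since "1 hour ago" --no-pager
```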

Service restarted. Feeling optimistic.

Still failing. Now with 401 errors from the LLM provider.

The Third Layer: The Configuration Maze

Here’s where it gets interesting. The application was loading environment variables from a systemd environment file, not from the project’s .env file. The systemd service file specified the environment file location, and the API key stored there was invalid.
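If you’ve never been bitten by this one, the offending directive looks something like this. Paths and names are invented, but the shape is right:

```ini
# /etc/systemd/system/api.service (illustrative paths, not the real ones)
[Service]
# The project's .env lives in the working directory but is never read.
WorkingDirectory=/opt/app
# This is the file the running process actually sees.
EnvironmentFile=/etc/app/api.env
ExecStart=/opt/app/venv/bin/python -m app
```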

This is the kind of configuration sprawl that creeps into systems over time. Someone sets up the initial deployment one way, documentation gets lost, and six months later you’re grepping through systemd unit files at 10pm trying to figure out why your perfectly valid API key isn’t being used.

Neither the current nor previous developer knew about this. They’d been updating the .env file in the project directory, completely unaware that systemd was loading configuration from somewhere else entirely. Classic case of “it works on my machine” meeting “production is a different beast.”

Coming in cold, I had no idea where the configuration actually lived either. But I knew how to find out. Trace it backwards from the running process, through systemd, to the environment files. That dependency mapping I mentioned earlier? This is where it pays off.
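The trace itself is mechanical once you know the steps. Roughly, with the process and unit names standing in for the real ones:

```bash
# Which systemd unit owns the running process?
systemctl status "$(pgrep -f 'python -m app' | head -n 1)"
# What does that unit actually load, drop-ins included?
systemctl cat api.service
systemctl cat api.service | grep -i EnvironmentFile
```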

New key deployed. Service restarted. Main demo working.

Then the user reports one of the client deployments still isn’t working.

The Fourth Layer: The Forgotten Service

The client-specific backend service simply wasn’t running. It wasn’t enabled to start at boot, so when the instance came back up, it stayed dead. Nginx was dutifully routing requests to the correct port, but nothing was listening.
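The fix was one command, plus a check that it would survive the next shutdown cycle. Unit name invented, as before:

```bash
# Start it now and make sure it comes back after the next boot.
sudo systemctl enable --now client-backend.service
systemctl is-enabled client-backend.service
```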

The architecture here is actually quite elegant: nginx routes based on headers and URL paths to different backend ports, each serving a different client. But elegance doesn’t help when one of those backends is inactive and there’s no monitoring to tell you about it.
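To give a flavour of it (ports, header name and client names all invented), the routing pattern looks something like this:

```nginx
# Inside the http block: choose a backend port per client from a request header.
map $http_x_client_id $client_backend {
    default   127.0.0.1:8001;
    acme      127.0.0.1:8002;
    globex    127.0.0.1:8003;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://$client_backend;
    }
}
```

The real config is more involved than that, of course, but that’s the shape of it.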

By this point I’d built up a mental model of the whole system. Seven different backend services, each on its own port, nginx doing the routing based on various headers and URL patterns. That map didn’t exist in any documentation. The developers didn’t have it. I built it from grepping config files and checking what was listening where.
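Building that map came down to a handful of commands, repeated until the picture held together:

```bash
# What is nginx routing, and what is actually listening?
grep -rn "proxy_pass" /etc/nginx/
sudo ss -tlnp
systemctl list-units --type=service --state=running
```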

What This Actually Taught Me

Beyond the immediate fixes, this incident crystallised something I’ve been thinking about for a while. The technical debt wasn’t in the code itself. It was in the gaps between components:

The gap between where we think configuration lives and where it actually lives. The gap between services being deployed and services being monitored. The gap between code handling the happy path and code handling edge cases. The gap between a service existing and a service being resilient.

Each of these gaps is tiny. Individually, they’re the kind of thing that gets pushed to “we’ll fix it later” because there’s always something more urgent. But stack them together on a Wednesday evening before a Thursday morning demo, and you’ve got a three-hour incident that traces through four different root causes.

And when the developers don’t have visibility into the infrastructure, these gaps become chasms. The code was fine. The deployment was the problem. And that’s a different skill set entirely.

The Uncomfortable Truth

Twenty years in this industry and I still get caught out by the basics. Defensive coding. Centralised configuration. Services that restart themselves. Monitoring that tells you when things break before your customers do. Documentation that tells the next person where to look.

None of this is cutting edge. None of it requires fancy tooling or expensive platforms. It just requires the discipline to do the boring stuff properly, every time, even when you’re rushing to ship a feature.

The incident got resolved. The demo went ahead. But I’ve added a few items to the backlog that aren’t optional anymore: health check endpoints, proper alerting, a configuration audit, and yes, some actual documentation so the next person doesn’t have to build the map from scratch.
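None of those items need to be clever, either. A minimal sketch of the kind of health check I mean, assuming a FastAPI service and an LLM_API_KEY variable (both assumptions rather than the real code):

```python
import os

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/healthz")
def healthz(response: Response) -> dict:
    # Surface last night's failure mode before a customer does:
    # is the credential the service depends on actually loaded?
    checks = {"llm_api_key_present": bool(os.environ.get("LLM_API_KEY"))}
    healthy = all(checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "ok" if healthy else "degraded", "checks": checks}
```

Point the alerting at something like that and most of last night’s detective work goes away.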

Because the next time something breaks at 7pm, the developers should be able to fix it themselves. That’s the real win.

Some days you’re the engineer. Some days you’re the one learning the lesson again for the first time.


If you’ve ever spent an evening peeling back layers of a production incident like some kind of infrastructure archaeologist, I’d love to hear your war stories. Sometimes the best thing about this job is knowing you’re not the only one.