Who knows, you may find something useful here!

I don’t know who you are. I don’t know what you want. If you are looking for specific solutions I can tell you I don’t have any, but what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a problem-solver for people like you.
Some approaches and strategies come so intuitively to me that I don’t even think about them. I wanted to write down a more detailed explanation than just “my heart guides me.” I hope this will be useful to others.
⚠️ I work on this on the side, so expect it to be incomplete for a while! ⚠️
“We cannot choose our external circumstances, but we can always choose how we respond to them.” ― Epictetus
Before you start touching everything, you need to diagnose the general situation. Ask yourself questions like:
| Main Question | Follow-up Questions | Expected Insights |
|---|---|---|
| How bad is it? | What is the affected environment? | |
| | Does this impact team delivery? | |
| | Can we live with the issue? | |
| What could be a possible cause? | Naming | Unless you are prepared for the unexpected and have ways to avoid depending on the names of resources, some changes may cause disruption, as they could prevent you from accessing the right resource. |
| | Paths | |
| | Versions | A minor update may not cause issues, but you never know. Best to be sure. |
| | Policies | |
| | Rules | |
| | Making plans on a Friday | Horrible idea if you want a relaxed weekend. |
Reasoning through the scope and range of the issue will reduce the stress caused by the abrupt interruption of your day. And by knowing what it actually is, you can at least prepare some talking points for management while you figure things out.
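As a rough illustration, the three "How bad is it?" questions from the table can be turned into a tiny triage helper. This is only a sketch: the `Triage` fields, the environment names, and the urgency labels are all hypothetical, and your own thresholds will differ.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Answers to the 'How bad is it?' questions (field names are made up)."""
    environment: str       # e.g. "production", "staging", "dev"
    blocks_delivery: bool  # does this impact team delivery?
    tolerable: bool        # can we live with the issue for now?


def urgency(t: Triage) -> str:
    """Map the answers to a hypothetical urgency label."""
    if t.environment == "production" and not t.tolerable:
        return "drop everything"
    if t.blocks_delivery:
        return "fix today"
    return "schedule it"


if __name__ == "__main__":
    print(urgency(Triage("production", blocks_delivery=True, tolerable=False)))
    print(urgency(Triage("staging", blocks_delivery=True, tolerable=True)))
    print(urgency(Triage("dev", blocks_delivery=False, tolerable=True)))
```

Even a crude label like this gives you a first talking point for management while the real investigation continues.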
You can think of this as panicking ahead of time so you will know what to do when things go wrong. Preparation is key, and it is normal to ignore things that are “working” and don’t add value to day-to-day work. But look at it this way: you are a prepper, and that will pay off eventually.
“Expect the best, prepare for the worst.” - Muhammad Ali Jinnah
Considering worst-case scenarios, like making the issue worse or the existential dread of a 2 AM escalation, is a powerful motivator to set boundaries when troubleshooting. If you find yourself making things worse, it is a clear sign that those boundaries could have prevented the situation. The consequences of not setting them can be severe.
Examples of sensible limits:
Some of these may seem counterintuitive, but tech debt makes our work more exciting and, above all, more dangerous. Discovering those flaws in a critical moment will not make your life easier.
And if you do mess up despite all that, that is a great story for a barbecue someday.
If your anxiety is not already working overtime, a good use of calm time is understanding the dependencies of the applications and services you work on.
You should be able to answer questions like:
As a side note: I always thought Chaos Engineering was about deliberately breaking things and hardening them. But that is only about 20% of it. The rest is about learning your systems and their dependencies, because in large architectures, many components are not under your control.
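One way to get that knowledge out of your head is a small dependency map you can query during calm time. The sketch below assumes a hand-written dictionary of "service depends on" edges; the service names are invented for illustration, and in practice you would fill it in from whatever inventory you actually have.

```python
from collections import deque

# Hypothetical map: each service lists what it depends on directly.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["database", "auth"],
    "auth": ["database"],
    "database": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Reverse the edges: for each dependency, who relies on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the reversed edges.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)


if __name__ == "__main__":
    print(impacted_by("database", DEPENDS_ON))  # ['api', 'auth', 'web']
    print(impacted_by("auth", DEPENDS_ON))      # ['api', 'web']
```

The useful part is less the code than the exercise of writing the map down: every edge you cannot fill in is a question to ask before the next incident.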
“When you stare into the abyss, the abyss doubles it and gives it to the next person.” - TikTok Nietzsche
Trying to figure it out alone is rarely worth it. It may look cool, but it is usually better to rely on others, like:
Everyone will have a piece of information that helps you uncover the mystery, and by that, I mean the tech debt.
Be aware that sometimes team members will not have all the answers. It may be that they did not work on that part or, as often happens in technology, things change fast and people remember parts of multiple different architectures.
This is a long way of saying: if they do not know, that is understandable. If you do not know anything about the system you own, that is worth fixing.
Remember:
This may not apply to every company, but the pattern is common when working with large teams. A useful mental chain to follow is:
Infrastructure > Development > Additional Teams > Dependencies > Blame the Cloud Provider
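Read as a procedure, the chain means "rule out each layer in order, then move on." A minimal sketch of that idea follows; the function name `next_stop` is made up, and the list simply mirrors the chain above.

```python
# The chain from the text, in the order it should be walked.
ESCALATION_CHAIN = [
    "Infrastructure",
    "Development",
    "Additional Teams",
    "Dependencies",
    "Blame the Cloud Provider",
]


def next_stop(ruled_out):
    """Return the first layer that has not been ruled out yet, or None."""
    for layer in ESCALATION_CHAIN:
        if layer not in ruled_out:
            return layer
    return None  # Everything ruled out: time to revisit your assumptions.


if __name__ == "__main__":
    print(next_stop(set()))                              # Infrastructure
    print(next_stop({"Infrastructure", "Development"}))  # Additional Teams
```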
We will revisit this in the Proper Escalation section. But before jumping into each type, let’s talk techniques.
“Where should I start? What can I check? Why me? Why on a Friday?” - Engineer
It is normal to question where to begin. Without a structured approach, you may find yourself wandering aimlessly, which wastes time and adds stress. As you grow in your career, having a clear direction for tackling issues becomes one of your most valuable skills.
I will eventually go deeper into each of these, but consider the following as a suggested approach.
If everything has been quiet with no recent changes, start here:
If you are in the middle of implementation or testing, use the classic approach:
If the problem does not seem to be going anywhere soon, you will definitely need to try any or all of the following:
Making the process easy to follow is considerate to anyone who joins mid-way. There is nothing worse than feeling like you are starting from scratch in an ongoing incident. If things are truly at a dead end, a fresh perspective is welcome, but only if there is enough context to get up to speed quickly.
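A lightweight way to provide that context is a running log of what was observed and what was tried. The sketch below is one possible shape, not a prescribed tool: the `IncidentLog` class, the entry kinds, and the example notes are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Timestamped notes that a latecomer can read top to bottom."""
    title: str
    entries: list = field(default_factory=list)

    def note(self, kind, text, when=None):
        # `kind` is a free-form tag such as "observed", "tried", or "decided".
        if when is None:
            when = datetime.now(timezone.utc)
        self.entries.append((when, kind, text))

    def summary(self):
        lines = [f"Incident: {self.title}"]
        for when, kind, text in self.entries:
            lines.append(f"{when:%H:%M} UTC [{kind}] {text}")
        return "\n".join(lines)


if __name__ == "__main__":
    log = IncidentLog("checkout errors")
    log.note("observed", "500s on the payment endpoint")
    log.note("tried", "restarted the API, no change")
    print(log.summary())
```

Pasting the summary into the incident channel whenever someone new joins is usually enough; the point is that nobody has to reconstruct the timeline from scrollback.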
Knowing when and how to escalate is a skill in itself. The goal is not to cover yourself, but to get the right people involved before the situation gets worse.
Start here when the application was working before and nothing in the code has changed. Things to check:
If infrastructure is clean, move on.
If the infra layer is not the problem, look at the application. Things to check:
`git log` and `git diff` are your friends here.
When neither infrastructure nor code is the culprit, look outward. Things to check:
The catch-all category. Sometimes the problem is not where you expect:
“If you’re the smartest person in the room, you’re in the wrong room.” - Richard P. Feynman
There is no shortcut here. Every incident you work through adds to your pattern library. The engineer who has seen a certificate expire three times will spot the fourth one before the monitoring does. The one who debugged a race condition at 2 AM will recognize its symptoms in a code review.
Experience is not just about fixing things faster. It is about knowing which questions to ask first, which assumptions to challenge, and when to stop troubleshooting and escalate.
“When people put you up on a pedestal, don’t come off it acting humble. Stay up there, because if they put you there, that’s showing you how high they can see. Stay there and pull them up.” - Victor Wooten
Document your findings. Write runbooks. Create a post-mortem even for incidents no one else noticed. Do not be the single point of failure who holds all the critical knowledge in their head.
Sharing knowledge is what turns a good engineer into a great colleague. The next person to face a similar problem might be you, six months from now, at 1 AM, with no memory of how you solved it the first time.
When nothing is working and pressure is mounting, resist the urge to keep trying random things. That is how small incidents become big ones.
Sometimes the right Plan B is admitting you need more time, more information, or more people. Knowing when to say that is not a weakness. It is good engineering.
After enough incidents, patterns emerge. Before going deep into a complex investigation, always check the classics first: