Ben Franklin once said that nothing is certain except death and taxes. Today, I would add IT incidents to the list.
Here at Geko we’ve had to go through our share of operations, IT and security incidents. They’re just bound to happen for a number of reasons (which we’ll get into later), and we’ve come to adopt a lot of habits and practices that make incidents of all kinds far easier to quickly detect, pinpoint, handle and resolve. It all becomes a lot more manageable once you assume these things will happen and prepare for them. You can’t ever be too cautious when you’re talking about critical infrastructure that your entire team works on, or that your clients access: it’s a “can’t afford to fail” scenario, and you need to be ready for it.
There are some countermeasures and checks you absolutely need to build into your infrastructure and ecosystem that make all of this easier to handle:
- A robust monitoring platform
- A sensible alerting plan
- Service failover
- Data snapshotting and backups
- A disaster recovery plan
Let’s go through each one of these and explain why it matters.
How do I know when it’s down?
The very first step towards knowing your infrastructure has failed is knowing what it looks like when it is not failing. You need to build a system that constantly checks every important part of your ecosystem, so it notices deviations from that “working correctly” state. Status drift will absolutely be the first step of your fail state. Everything just works, then kinda works. Then one day, it doesn’t, and it’s deviated so much from the original state that you need to rebuild everything almost from scratch. You do not want to get here. So you monitor for system abnormalities. If something has to move, monitor it. If something doesn’t have to move, monitor it in case it moves.
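To make that a bit more concrete, here’s a minimal sketch of what “constantly check every important part” can look like. The endpoints, paths and threshold are made-up placeholders for your own setup, and in real life you’d run something like this from a scheduler or, better, lean on a dedicated monitoring platform rather than a hand-rolled script:

```python
# Minimal monitoring sketch (Python). URLs, hosts and thresholds below are
# hypothetical placeholders; swap in whatever actually matters in your ecosystem.
import shutil
import requests

CHECKS = {
    "web frontend": "https://app.example.com/health",       # placeholder endpoint
    "internal API": "https://api.internal.example.com/ping",  # placeholder endpoint
}

DISK_THRESHOLD = 0.80  # flag anything above 80% usage


def check_http(url: str) -> bool:
    """Return True if the service answers with a 2xx within 5 seconds."""
    try:
        return requests.get(url, timeout=5).ok
    except requests.RequestException:
        return False


def check_disk(path: str = "/") -> bool:
    """Return True if disk usage on `path` is below the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < DISK_THRESHOLD


if __name__ == "__main__":
    for name, url in CHECKS.items():
        print(f"{name}: {'OK' if check_http(url) else 'DOWN'}")
    print(f"root disk: {'OK' if check_disk() else 'FILLING UP'}")
```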
Do I just have someone sit there looking at metrics?
A monitoring stack isn’t much use if it doesn’t yell at you when something is going haywire. So as you set up the monitoring infrastructure, add alerting at the same pace. How you do this is up to you and depends on the urgency of the task. A server has 70% of its disk full? Maybe send a Slack message about it to your IT team. Is the company’s application server not responding to pings? You probably want it to call someone immediately so they can bring it back up as soon as possible. It all depends on the system’s urgency and how big the problem indicator is. Your environment, your priorities.
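A rough sketch of that severity-based routing, assuming a Slack incoming webhook for the low-urgency stuff and some paging mechanism for the emergencies (the webhook URL and the paging hook are placeholders, not real endpoints):

```python
# Route alerts by severity: Slack for warnings, a page/call for critical issues.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def notify_slack(message: str) -> None:
    """Low urgency: drop a message in the IT team's Slack channel."""
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)


def page_oncall(message: str) -> None:
    """High urgency: wake someone up (PagerDuty, Opsgenie, an actual phone call...).

    Stubbed out here; plug in whatever your on-call tooling provides.
    """
    raise NotImplementedError("hook this up to your paging provider")


def handle_alert(check: str, severity: str, detail: str) -> None:
    message = f"[{severity.upper()}] {check}: {detail}"
    if severity == "critical":
        page_oncall(message)   # app server not answering pings -> call someone
    else:
        notify_slack(message)  # disk at 70% -> a Slack message is enough


# handle_alert("disk /var", "warning", "70% full")
# handle_alert("app server", "critical", "not responding to ping")
```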
But that doesn’t keep the service running, does it?
Do you absolutely need a way to keep the service up even in the event of failure? Then keep a failover system. Maybe you can skip the call about the ping-failing server, or move that “change the drive” task down the priority list, because you’ve got another one in the RAID holding things up for now. Two is one, one is none. Keep a failover for basically anything important, even if it’s a manual failover system. Have something lined up that you can quickly switch to and keep the service running.
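As a tiny illustration of “have something lined up to switch to”: try the primary first, fall back to a standby if it’s unreachable. The hostnames are hypothetical, and real failover is usually handled by a load balancer, keepalived, DNS or your database’s own replication tooling rather than a script like this:

```python
# Pick the first healthy upstream; fall back to the standby if the primary is down.
import requests

UPSTREAMS = [
    "https://primary.example.com",  # main service (placeholder)
    "https://standby.example.com",  # warm standby, kept in sync (placeholder)
]


def pick_upstream() -> str:
    """Return the first upstream that answers its health check."""
    for url in UPSTREAMS:
        try:
            if requests.get(f"{url}/health", timeout=3).ok:
                return url
        except requests.RequestException:
            continue
    raise RuntimeError("no healthy upstream left -- this is your incident")
```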
What do I do when I get a fail alarm?
Let’s get into a disaster scenario for a second. Say you ignored that S.M.A.R.T. alert for one day too long. Your main machine is gone, you can’t just press the power button to bring it back up and forget about it, and you somehow had your failover on that same machine too; say, a scenario where your Proxmox host bit the dust. Congratulations, you’re facing an incident. For this case, which is bound to happen, you’ve (hopefully) prepared backups, so you can swap that drive and restore onto it. You may lose a day of work, but that’s nothing compared to losing everything and spending hundreds of hours rebuilding your company from scratch. It is especially important to remember the basic rule of backups, generally known as the 3-2-1 rule:
- 3 copies of your data
- On 2 different types of storage media
- At least one of them in an offsite location
That way, even in the case of an especially bad incident, like a fire, you can just restore from the offsite backup (even though that may take substantially more time depending on your solution). Also, check that your backups work. Please. A backup is not a backup until you test it.
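Testing a backup can be as simple as restoring it somewhere disposable and checking that the things you care about are actually in there. A minimal sketch, assuming your backups are plain tar.gz archives; the paths and the list of expected files are placeholders for your own setup:

```python
# Restore a backup archive into a scratch directory and sanity-check the result.
import tarfile
import tempfile
from pathlib import Path

EXPECTED = ["db/dump.sql", "config/app.conf"]  # files that must exist after restore


def test_restore(archive: str) -> bool:
    """Return True if the archive restores cleanly and the key files look sane."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for rel in EXPECTED:
            restored = Path(scratch) / rel
            if not restored.is_file() or restored.stat().st_size == 0:
                print(f"MISSING OR EMPTY: {rel}")
                return False
    return True


# Wire this into your alerting so a failed restore test pages someone:
# if not test_restore("/backups/nightly-latest.tar.gz"):
#     handle_alert("backup restore test", "critical", "restore failed")
```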
Also, as an extra, do not pull a Michael Scott on your team.
But I can’t plan for everything that’ll happen, can I?
There are a lot of things that can happen, and unfortunately you can’t predict all of them. Anything can happen, and the more complex your infrastructure setup is, the more points of failure there are in it, so you need to prepare as much as possible. Maybe you can’t get all of them nailed down, but the more you plan for, the better: if something happens, your on-call staff can just walk through a runbook in your documentation and fix the issue without much trouble. This plan usually includes the scenarios you consider possible based on your infrastructure setup, the elements affected by each one, and how to fix it.
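If it helps to picture it, here’s a sketch of what a runbook entry might look like if you keep it as data next to your documentation. The scenario names, affected elements and steps are invented examples; the point is the structure, not the content:

```python
# Runbook entries: what broke, what it affects, how you notice, what to do.
RUNBOOK = {
    "primary-db-disk-full": {
        "affects": ["orders API", "reporting"],
        "detect": "disk usage alert on db01 above 90%",
        "steps": [
            "1. Rotate and compress old database logs",
            "2. If still above 90%, extend the volume or fail over to the replica",
            "3. Open a ticket to review retention settings",
        ],
    },
    "hypervisor-down": {
        "affects": ["all internal VMs"],
        "detect": "ping/agent checks failing for every guest at once",
        "steps": [
            "1. Check power and out-of-band management",
            "2. If the host is dead, restore the VMs from last night's backups",
            "3. Point DNS at the standby host",
        ],
    },
}


def print_runbook(scenario: str) -> None:
    """What the on-call person follows at 3 a.m. instead of improvising."""
    entry = RUNBOOK[scenario]
    print(f"Affected: {', '.join(entry['affects'])}")
    print(f"Detected by: {entry['detect']}")
    for step in entry["steps"]:
        print(step)
```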
Sounds like I need one of those.
If an important element of your infrastructure fails, would you get a call, an email, or a Slack notification? Are you sure your backups work? How resistant is your product to a drive failure? If these scenarios sound like a problem in your case, maybe you’d find it useful to Contact us and we’ll talk about getting you into a better state. You just have to make a choice: are you doing it now, or are you waiting until after your next IT incident? Remember Picard’s words on management: