We have about 12 different sites or parts of sites that could have outages and two of us to manage them. Some of these have been up for years, and some are newer. Some applications require special installation or debugging, and some must be on differently-configured servers.
When one of those goes down, we both need to know how to diagnose and fix it as soon as possible. So how do we manage that?
There are a lot of ways to get outage alerts for sites. If you want basic up/down notifications via SMS or Slack, you can use Uptime Robot, which we did for years (and still use now for one service). If you're willing to pay for a service, you can use something like Pingdom.
If you want to be called, however (and both of us would certainly sleep through an SMS in the middle of the night), you need something more robust.
There are, again, paid services that are probably very good and feature-packed, but we found they cost far more than the solution we put together. To keep our costs down, we're using Cabot. It's a sweet open source Python application that we're running on a 1 GB DigitalOcean droplet (10 USD per month). It can do a variety of checks to determine whether a site/service is running correctly, including:
- Checking for a 200 response from a URL,
- Checking for certain text on the page, or
- Verifying the success of a Jenkins job (which we use for builds and cron jobs).
Each of our sites generally has at least a check for a 200 response and for certain text on the page.
We connected Cabot with Slack and Twilio, which we loaded with 20 USD worth of voice calls. The calls we get are very simple and short:
This is an urgent message from Arachnys monitoring. Service "FPG Website" is erroring. Please check Cabot urgently.
In six months of use, we gone through $10 of that initial $20.
The final bit of outage reporting is to tell Cabot who to call, which can be done through a simple Google Calendar. We alternate days, running from 08:00 to 08:00, so our calendar consists of two repeating events. If we need to change who's on call that day, we can just adjust the Google calendar and Cabot will update.
Storing the Knowledge
When we have an outage, we want two pieces of info:
- What do we need to tell users (and how)?
- How do we fix it?
Both of these sets of info need to be somewhere searchable and well-structured. We're currently using a wiki. MediaWiki is very popular and well-supported, but also... hefty. We're using Waliki, another Python app, but this one is deployed to our cheap shared hosting provider. It's a place for internal documentation, solution logs, reference documents, etc. The kind of thing you could keep in Google Docs... if you remember to share most of your documents.
Why bother with the part above about notifying users? With Exploit: Zero Day being a paid game we host ourselves, it's great if people can find out very quickly that the game or the forums are down. In addition to social media posts, we post a banner or forum post on the site itself, depending on what's down. There's no equivalent for the FPG website, however, so it's just social media for updates to outages with it.
Different games and different sites have slightly different communication needs (including "none" for some). In the midst of an outage, it's easy to panic and forget, and it feels really crappy to be told by a customer that your stuff is broken... especially when you already know.
That communication plan doesn't need to be complicated, but it does need be written down and shared. A wiki is great for that.
Documenting server info is something that can evolve as you learn what can break. We have a parent page/category called "Servers", with individual sites or servers as sub-pages of that. Some example questions to answer for each of those sites:
- Who's the hosting provider?
- How do you login via SSH?
- What type of app is on the server (Wordpress? Django? Discorse)?
- Does the app deviate in its setup from similar apps you run?
- If the site uses multiple servers, how do I access each?
- Where are log files?
- Where are backups?
A Bit of Glue
In each service in Cabot, you can link to a "runbook"—put the URL for the site's wiki page there. Whenever an alert occurs, whoever is on call can quickly get to the wiki page with info to help resolve the issue.
Are we missing any cool ways to connect these to help with disaster management? Are there budget-friendly tools we should be using in conjunction with these? Want us to go into more depth about any of these components? Let us know in the comments!