Monitoring and Uptime

Some fun notes about monitoring and uptime.

I’ll be giving shout-outs to some products which have really helped me out. I’m not getting anything from them (in fact, my company is paying them money). However, they’re making my life a bit easier.

So, in a world not so long ago, our monitoring situation was pure Nagios and a physical pager. The pager was a bit like the hot potato in that it marked that you were on call. May as well have been an albatross for all it was worth. The service was cheap enough, but of course we had to come up with creative ways to create the on-call schedule, distribute it to the folks who needed it, etc. Now there are probably cool ways to page with Nagios, and I’m sure people are using them; however, we decided to outsource it to PagerDuty. It’s one of those companies that makes every sysadmin say, “Gah, why didn’t I think of that?!”

PagerDuty takes care of the on-call schedule and rotations, lets you easily tweak the rotation because your on-call guy wants to take vacation that week or is otherwise unavailable, and handles the actual alerting. It can call you, SMS you, e-mail you, or all three at once. You get to pick. On top of that, their rates are very reasonable. If you’re using a real pager, take a serious look at PagerDuty. I’ve got about 35 folks on it, with multiple on-call rotations, and it’s working quite nicely.

The second shout-out is for Keynote. You may be using Nagios (or Icinga) to monitor your uptime, and hey, they have nifty performance metrics too! But if you’re trying to figure out what your availability is and you’re monitoring from the same datacenter that your website is in, you’re doing it wrong.

Consider this (real) scenario. You’re tasked with telling marketing, sales, or whomever how awesome your site uptime is. Maybe it’s 100% and you don’t even do any outside monitoring, in which case you’re lucky or a liar. One day the power goes out in your datacenter, the whole place is dark for a bit, and of course it takes you a while to get your systems back up. Your monitoring system is blissfully unaware that there’s any site downtime because it’s down too. You go to pull up your reports and you still have 100% (or close to it) uptime.
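To put rough numbers on that, here’s a small Python sketch of why the report still says 100%. The figures are made up for illustration (one check per minute and a 90-minute outage are my assumptions, not what actually happened), and the function names are hypothetical, not anything Nagios provides. The point is only that a monitor which dies with the site records no failures, so dividing by the checks it did record hides the outage.

```python
def reported_uptime(results):
    """Percentage of *recorded* checks that passed (what a naive report shows)."""
    if not results:
        raise ValueError("no check results recorded")
    return 100.0 * sum(results) / len(results)


def uptime_counting_gaps(results, expected_checks):
    """Same data, but every check that should have run and didn't counts as down."""
    if expected_checks < len(results) or expected_checks <= 0:
        raise ValueError("expected_checks must cover all recorded results")
    return 100.0 * sum(results) / expected_checks


# Assumed numbers: one check per minute for a day, power out for 90 minutes.
expected = 24 * 60                   # 1440 checks should have run
recorded = [True] * (expected - 90)  # the monitor was dark too, so no failures logged

print(reported_uptime(recorded))                 # 100.0
print(uptime_counting_gaps(recorded, expected))  # 93.75
```

An outside monitor gets you the second number without having to guess, because it keeps running (and keeps recording failures) while your datacenter is dark.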

The biggest issue is that Nagios is great for introspection: making sure the gears in your machine are operating normally. It’s pretty crappy for monitoring what your customers on the internet are experiencing, and that’s what Keynote is good at. The other nice thing Keynote does is dig through multiple clicks of a website with a simulated or real browser, again emulating the customer experience as closely as possible.
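If you want a feel for what a multi-click check looks like, here’s a very rough Python sketch. To be clear, this is not how Keynote works and it is nowhere near a real browser (no JavaScript, no cookies, no rendering); it just walks a list of pages in order and stops at the first one that fails, the way a customer would give up. The function names, the URLs, and the “expected text” idea are all my own assumptions for illustration, and you’d have to run it from somewhere outside your own datacenter for it to mean anything.

```python
import time
import urllib.request


def http_fetch(url, timeout=10):
    """Fetch a URL and return (status_code, body_text). Errors propagate to the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def run_journey(steps, fetch=http_fetch):
    """Run each (url, expected_text) step in order, stopping at the first failure."""
    results = []
    for url, expected_text in steps:
        started = time.monotonic()
        try:
            status, body = fetch(url)
            ok = status == 200 and expected_text in body
        except Exception:
            # Timeouts, DNS failures, and HTTP errors all look like "down" to a customer.
            ok = False
        results.append({"url": url, "ok": ok, "seconds": time.monotonic() - started})
        if not ok:
            break  # a customer who hits a broken page never reaches the next click
    return results


if __name__ == "__main__":
    # Hypothetical journey; swap in pages and marker text from your own site.
    journey = [
        ("https://www.example.com/", "Example Domain"),
    ]
    for step in run_journey(journey):
        print(step)
```

The `fetch` argument is only there so the logic can be exercised with a fake fetcher instead of the real network.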

Hopefully this helps some folks out there. I know this took me longer than it should have to get right.