Be careful what you ask for; a lesson in monitoring in code
This morning was like any other (wake up late, shower, get stuck in traffic), but with a twist. As soon as I hit traffic, which was really bad even by I-85-in-midtown-Atlanta standards, my smartwatch started to buzz every few seconds. Trio's servers are unreachable, according to our monitoring service. It's going to be one of those mornings. When our monitoring service tells me something is wrong, I have a mental checklist, and I began working through it, more or less in order:
- Check NewRelic for performance issues.
- Check Heroku's status page for outages.
- Check AWS' status page.
- Submit support ticket to Heroku, if applicable.
- Look into any stop-gaps I can put in place to lessen the impact.
- Inform users of potential issues.
I was drafting a user notice to pop up in-app when something I had seen hit me: NewRelic also has monitoring turned on, and those alarms weren't going off. Why not?
Eureka!
There is a fundamental difference in how NewRelic and Pingdom perform their monitoring: Pingdom follows redirects and tests the final destination, whereas NewRelic (by default) accepts a redirect as a success. This little detail is important because a while back we decided to throw away our single-page teaser homepage and simply redirect users to our iTunes store page. So ever since we made that change, our monitoring service hasn't actually been monitoring our website; it's been monitoring the iTunes store. Oops.
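To make the difference concrete, here is a rough Python sketch of the two strategies using only the standard library. The function and class names are my own for illustration, not anything Pingdom or NewRelic exposes, and the code only approximates the behavior described above.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect (illustrative name)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError carrying the 3xx status.
        return None


def check_following_redirects(url, timeout=5.0):
    """Follow redirects and judge the final destination."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts.
        return False


def check_accepting_redirects(url, timeout=5.0):
    """Treat a 2xx or 3xx answer from the first server as success."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError as err:
        # A redirect lands here because NoRedirect declined to follow it.
        return 300 <= err.code < 400
    except (urllib.error.URLError, OSError):
        return False
```

If the homepage redirects to a third-party site that is having a bad day, the first check fails even though our own server is healthy, while the second one happily passes. Neither is telling you much about your application.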
With a quick fix and a flick of the wrist, Pingdom is now fetching our robots.txt file. This has two benefits: it's actually monitoring that our server is alive and accessible, and it's fetching a static page, so there is very little overhead on the server.
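If you wanted to roll the same kind of check yourself, a minimal sketch might look like the following. The function name, the default path, and the same-host guard are my own assumptions rather than how Pingdom works internally; the guard is there so that a redirect to somebody else's server can never count as "up".

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin, urlsplit


def static_file_is_up(base_url, path="/robots.txt", timeout=5.0):
    """Return True only if our own server answers 200 for the static file."""
    target = urljoin(base_url, path)
    try:
        with urllib.request.urlopen(target, timeout=timeout) as resp:
            # urlopen follows redirects, so confirm we ended up on the same host.
            same_host = urlsplit(resp.geturl()).netloc == urlsplit(target).netloc
            return resp.status == 200 and same_host
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections, and timeouts all count as down.
        return False
```

Calling `static_file_is_up("https://example.com")` (a placeholder host) requests `https://example.com/robots.txt` and reports failure if the response is anything other than a 200 from that same host.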
TL;DR: When you're setting up HTTP monitoring, create a static file and monitor that file's URL.