Feb 13, 2011

Checkvist downtime postmortem

The Checkvist service was unavailable on Feb 13 from 04:13 UTC to 07:40 UTC.

The total downtime was 3 hours 27 minutes. Users couldn't view or modify their information, but no data corruption occurred.

We apologize to everyone who was unable to access their data during that time. We've already taken measures to prevent such problems in the future; see the details below.

What went wrong

  • On Saturday, Feb 12, we updated software on the production server (ruby, passenger, nginx).
  • At 04:13 UTC on Feb 13, the daily server maintenance procedure started. Some parts of the procedure were incompatible with the newly installed software, and access to Checkvist broke.
  • E-mail and SMS notifications were sent to us. We didn't see the e-mail notification because we were away from our computers, and the SMS notification went to an obsolete phone number :( So the problem went unnoticed until the morning, when we checked our e-mail.

Problem resolution

Measures to prevent similar problems in the future

  • The daily maintenance procedure was corrected.
  • We've rescheduled the daily maintenance so that it takes place in the daytime.
  • We've updated our monitoring service: the phone number for SMS notifications was corrected, and we added a second phone number for the notifications. We've also installed an iPhone app that can notify us about service failures.

And finally, why all these explanations? I think it's better to be transparent and honest about problems than to make people guess about the system's reliability.