The total downtime was 3 hours 27 minutes. Users could not view or modify their information, but no data corruption occurred.
We apologize to everyone who was unable to access their data during that time. We have already taken measures to prevent such problems in the future; see the details below.
What went wrong
- On Saturday, Feb 12, we updated the software on the production server (Ruby, Passenger, nginx).
- At 04:13 UTC, the daily server maintenance procedure started. Some parts of the procedure were incompatible with the newly installed software, and access to Checkvist broke.
- E-mail and SMS notifications were sent to us. We didn't see the e-mail notification because we were away from our computers, and the SMS notification went to an obsolete phone number :( So the problem went unnoticed until the morning, when we checked our e-mail.
- We posted to our @checkvist_news Twitter account that the problem was being resolved.
- We saved the server logs for investigation.
- After finding the cause of the problem, we restored the server to working order.
Measures to prevent similar problems in the future
- The daily maintenance procedure was corrected.
- We moved the daily maintenance to a different time, so that it takes place in the daytime.
- We updated our monitoring service. The existing phone number for SMS notifications was corrected, and we added a second phone number for notifications. We also installed an iPhone app that can notify us about service failures.
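
The idea behind the last measure (several independent notification channels, so that one stale contact cannot silence an alert) can be sketched roughly as follows. This is only an illustration in Python, not the configuration of the monitoring service we use; the names `notify_all`, `enough_coverage`, and the channel labels are hypothetical.

```python
from typing import Callable, Dict, List

# A notifier is any callable that takes the alert text and raises on failure.
Notifier = Callable[[str], None]


def notify_all(notifiers: Dict[str, Notifier], message: str) -> List[str]:
    """Try every channel and return the names of those that succeeded.

    A failure in one channel (for example, an obsolete phone number)
    does not stop the remaining channels from being tried.
    """
    delivered = []
    for name, send in notifiers.items():
        try:
            send(message)
            delivered.append(name)
        except Exception as exc:
            # Record the failure, then keep going with the other channels.
            print(f"alert channel {name!r} failed: {exc}")
    return delivered


def enough_coverage(delivered: List[str], minimum: int = 2) -> bool:
    """Return True if at least `minimum` channels accepted the alert."""
    return len(delivered) >= minimum
```

With two phone numbers and a push notification registered as separate channels, an alert still gets through when one of them is out of date, and `enough_coverage` gives a simple way to flag a setup in which too few channels are working.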