The data center that houses Nirvana’s front-end web servers suffered a prolonged power outage last night. It took a lot longer than anyone expected (or wanted) to bring the thousands of VMs at the facility, including ours, back online. The high-level sequence of events is available here.
We’re back online now and everything is running normally.
Please accept our sincere apologies if you were unable to log in this morning.
We don’t usually talk much about our infrastructure (as we figure it’s not that interesting to most GTD’ers), but given recent events I suppose this is one of those times when people might like to know more.
Over the past few months we’ve been incrementally migrating our infrastructure to a geographically distributed and fault-tolerant cloud architecture, hosted on Amazon Web Services (AWS). They are truly amazing, and we are in good company.
Our databases are already running as multi-AZ replicated RDS instances, so they were unaffected by the power outage.
In a frustrating twist of fate, we had planned on moving the remainder of our web servers to AWS yesterday, but decided to push the migration back a week (as we’ve been working a lot of Saturday nights lately and we kinda wanted a break). Then Linode, where we’ve been happily hosted for years, wound up having a major outage: the exact type of event we have been working hard to mitigate by moving to our new architecture. Arrrrgh.
Having our servers auto-scaling and load-balanced across multiple data centers will significantly reduce the chances of outages in the future. In light of last night’s events, it can’t come soon enough. Thanks for sticking with us, and sorry again for the unexpected downtime.