As you may be aware, we experienced a brief Cloud Data outage on Monday, September 10, at approximately 9 a.m. The outage lasted only about 15 minutes, and no data was harmed or lost.
The outage resulted from a misconfigured test server. We were setting up the server to investigate ways of making Cloud Data replication faster. Unfortunately, we mistakenly included some configuration information for the primary Cloud Data pool, which caused an issue with the replication and brought down the entire system. As you can imagine, we have learned several lessons from this episode, all of which will contribute to making Cloud Data even more resilient. In the meantime, we’ve configured the test server correctly, which should also help us find some tweaks to make Cloud Data even more responsive.
The backbone of Cloud Data is a replication system spanning five sites worldwide. This system provides an immediate backup of all data, along with instant fail-safe recovery in the event of even the most minor Internet outages. It also comes with a certain degree of complexity: when the replication system itself has an issue, bringing the servers back online takes a fixed amount of time (about 10 minutes). And yes, this too is something we hope to improve with our new test servers.
This incident was only the second major Cloud Data outage we have experienced over the past year, and our recovery procedure has limited the outage to 15 minutes or less each time. We want to reassure users that, thanks to the backup and redundancy capabilities built into Cloud Data, this is an extremely rare event. As a reminder, with Local Data, even the slightest hardware or software failure would result in one to three days of downtime, with considerable effort necessary to restore all data. As you undoubtedly noticed after this brief outage, all your data was available and intact, up to the very last click.
Should an outage like this occur in the future, you will be able to check our website or our Twitter and Facebook accounts for status reports and the estimated time until service is restored.