Making the Cloud Safe for IT
Author: Ashish C. Morzaria   Time: 1:00 pm   In: General | News
On July 2nd, Google App Engine experienced about six hours of downtime in what many would defend as an inevitable reality of Cloud Computing. Even the App Engine Status Page was down for at least four hours, so people looking for an update were left without a confirmation or denial of the problem.
Hosting provider Rackspace also had a massive outage right after Michael Jackson’s death - undoubtedly due to the influx of additional traffic on their customers’ sites - hey, even Google thought they were getting a denial of service attack that day. Wikipedia has an entry calling this the “Michael Jackson” effect.
Enterprises are understandably concerned about uptime and availability if they rely on applications or systems in the Cloud. What if something goes wrong? At least with on-premise software, there is usually a poor IT person who can provide some visibility into the problem, its severity, and the potential downtime expected. However, as Google App Engine users found out yesterday, there is no visibility into the problem if you cannot even connect to your provider.
What we sometimes fail to remember is that traditional on-premise software goes down all the time too - whether that is Microsoft Exchange, your Oracle database, or even your server operating system (Windows or Linux, I don’t care - bugs are bugs). That’s what clusters are for, and even that does not move you past the mythical “5 nines” reliability metric.
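To put “5 nines” in perspective, here is a rough back-of-the-envelope sketch of how little downtime each availability target allows. The function names are my own, and it assumes a 365-day year and, for the last line, that a single six-hour outage is the only downtime a service has all year.

```python
# Downtime budget implied by an availability target (assumes a 365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes: float) -> float:
    """Yearly availability if total downtime is `downtime_minutes`."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.2f} minutes of downtime per year")

    # Hypothetical: one six-hour outage and nothing else for the whole year.
    six_hours = 6 * 60
    print(f"Six hours down in a year is about {availability_percent(six_hours):.3f}% availability")
```

Under those assumptions, five nines leaves a little over five minutes of downtime per year, so a single six-hour outage puts a service at roughly three nines for that year.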
When was the last time you went to google.com and got a “Service Unavailable” message? If you do get any kind of error, 99.999% of the time, the problem is on your end - because Google is that good. How many people verify they have a working Internet connection by going to google.com?
A look at the recent history of the App Engine Status Page shows that on any given day, there are at least a few “areas of investigation” which are probably service-affecting, but not service-disrupting - life goes on. Compare that with a certain SMS-like messaging site that has some kind of actual service-affecting problem every few days and “Sev-1” issues that last for weeks (when “Following” someone is a core feature of your product, it had better be fixed now, not when you get around to it).
Even the apparently infallible Google succumbs to the odd outage. How your vendor deals with the outage is very important - are they going to put every available person on it, or are they going to take it easy because at least some part of their service is still running?
(I actually know of at least one vendor that has an SLA that they do not commit to because their upstream Cloud provider’s SLA isn’t enforced either - I fail to see the purpose of these “SLA”s if there is no “A”.)
In the end, the Cloud world presents the same problems and pitfalls as the on-premise world. So expect Google Apps, Amazon Web Services, Salesforce.com, Azure, Twitter, and even Facebook to have outages - how they deal with their outages (and their customers) is far more important than the length of the outage itself.