The Downtime Dilemma: Reliability in the Cloud
When I’m not blogging for Software Advice, I like to do a little personal writing of my own. I use Google’s Blogger as my platform for reflection. A couple of weeks ago, I tried to create a new post, but like thousands of other Blogger-ites, I was unable to do so. After a quick search on Twitter and various user boards, I realized Blogger was down.
The application was unavailable for about 20 hours. This outage is just one in what seems to be a string of recent cloud failures. Amazon’s EC2 is probably the biggest fail story lately. But Microsoft’s BPOS hosted bundle also experienced a significant amount of downtime recently. And, earlier this week, little monsters everywhere went gaga when Amazon released a digital copy of “Born This Way” for 99 cents, causing Amazon to experience another unfortunate crash.
These incidents have been covered extensively on the major tech news outlets, leading the technorati to once again question the reliability of cloud computing. One contributor wrote on the Microsoft Service Forum:
“Back to in-house servers we go, I suppose. This string of incidents will set the cloud/off-site model back months, if not years, I fear…”
When things go awry in the cloud, many companies are affected. Because these periods of downtime are public knowledge, they create a misconception that cloud computing is unreliable and should be avoided. When things falter with on-premise systems, however, the failure stays hidden behind the corporate curtain.
Despite its proven track record and growing popularity as a cost-effective solution, cloud computing still manages to get a bad rap. Even with these highly visible incidents in the media recently, is this bashing of cloud computing really warranted?
Downtime in the cloud
Anyone who has ever purchased a cloud-based software system is familiar with the Service Level Agreement (SLA). In the SLA, the provider commits to a percentage of up-time, or amount of time the system can be expected to run without interruption. Ideally this would be 100%, but as with most technology, hiccups in service delivery are inevitable.
When creating the SLA, vendors take into account regularly scheduled maintenance, as well as unplanned outages or downtime. Even after accounting for both, most cloud companies can still quote about 99.9% up-time. That looks pretty impressive and seems to be in line with the kind of performance we have come to expect from SaaS vendors. Unfortunately, naysayers still like to harp on that 0.1%.
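To put that 0.1% in perspective, here is a minimal sketch that turns an up-time percentage into a downtime budget. The function name and the 30-day month are my own assumptions for illustration; real SLAs define their measurement periods and exclusions in their own terms.

```python
# Sketch: how much downtime does a given up-time percentage actually allow?
# Assumption: a 30-day month unless another period is passed in.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: float = 30) -> float:
    """Return the downtime budget, in minutes, for an up-time percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% up-time allows about {minutes:.1f} minutes down per 30-day month")
```

Under those assumptions, 99.9% works out to roughly 43 minutes a month, or just under nine hours over a 365-day year.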
Even though cloud systems are recognized as the cash-flow-friendly alternative to on-premise systems, we still have the traditionalists who refuse to embrace the cloud. Many prefer instead to dwell on the “what ifs.” What if the host’s servers go down? What if mission-critical data is lost? While these are clearly valid questions, for many on-site purists, what it really comes down to is control. Users feel more secure when they are in control of the system. However, Walter Scott, CEO of GFI Software, offers a reminder:
“Cloud-based solution vendors not only have the latest technology, the latest firewalls, the best data centers and the highest levels of redundancy possible but they will apply multiple layers of [in-depth defense] that your average business (a Fortune 500 company may be an exception) can never have.”
Downtime on the ground
Like their cloud computing counterparts, on-premise systems make promises on up-time. The difference is that when outages occur inside organizations, we typically don’t hear about them. As a result, the perception of the always-on on-premise model is skewed.
This lack of coverage also makes it difficult to track down any data regarding the performance of on-premise systems. However, the Radicati Group conducted a study in 2008 on on-premise email solutions that exposes some interesting points.
Most notable in the findings is that among the most popular email systems (Microsoft Exchange, IBM Lotus Notes, etc.), there was an average of 30-60 minutes of unscheduled downtime per month. On top of that, there was an average of 36-90 minutes of scheduled downtime. That stands in stark contrast to Gmail’s total downtime of 10-15 minutes per month.
Clearly, based on these findings, servers can and will fail on occasion no matter where they are being hosted. And from this chart, one might deduce that cloud companies are more efficient at getting back online than companies that host their own servers.
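For a rough side-by-side, the monthly downtime figures quoted above can be converted into up-time percentages. This is only a back-of-the-envelope sketch: the 30-day month is an assumption, the names are made up for illustration, and adding the low and high ends of the scheduled and unscheduled ranges together is a simplification of my own, not something taken from the study.

```python
# Sketch: convert the quoted monthly downtime ranges into up-time percentages.
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def uptime_percent(downtime_minutes: float) -> float:
    """Return the up-time percentage for a month with the given downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_MONTH)


# (low, high) minutes of downtime per month, as quoted in this post
on_premise_unscheduled = (30, 60)
on_premise_scheduled = (36, 90)
gmail_total = (10, 15)

# Simplification: add the range endpoints to get a total on-premise range.
on_premise_total = (
    on_premise_unscheduled[0] + on_premise_scheduled[0],
    on_premise_unscheduled[1] + on_premise_scheduled[1],
)

if __name__ == "__main__":
    for label, (low, high) in (
        ("On-premise email", on_premise_total),
        ("Gmail", gmail_total),
    ):
        worst, best = uptime_percent(high), uptime_percent(low)
        print(f"{label}: {low}-{high} min down, {worst:.3f}% to {best:.3f}% up-time")
```

On those assumptions, the on-premise totals of 66 to 150 minutes land below the 99.9% mark, while Gmail’s 10 to 15 minutes stay above it.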
Getting to 100%
There is one foreseeable upside to this negative press: it puts a fire under the backsides of cloud computing vendors to constantly improve and stay on the leading edge of technology. I spoke with Denis Pombriant of Beagle Research Group about an article he wrote recently in which he discusses reliability in cloud computing in terms of what users expect from vendors:
“You have to be always up, always available. So, what does that mean? It means that you can’t have a single point of failure.”
That is a tall order, but it’s what the user requires. So, how can we achieve this standard? For starters, Pombriant proposes better system modeling in the cloud. In other words, the architecture needs to be improved.
“If you’re going to have a truly robust and reliable infrastructure, you’re going to have to build much greater reliability into your systems,” he says. “Take electric utilities these days. They all have more generating capacity than is online at any one time because they take plants down, and then they put them up. They eliminate all of the obvious possibilities for failure. That’s what cloud computing has to evolve towards.”
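The spare-capacity idea in that quote can be illustrated with a little arithmetic. The sketch below uses the standard formula for components running in parallel, where the service stays up as long as any one copy is up. It assumes failures are independent, which real data centers often violate (shared power, shared software bugs), so treat the numbers as an upper bound. The function name is hypothetical.

```python
# Sketch: availability of a service backed by several redundant copies.
# Assumption: each copy fails independently of the others.


def combined_availability(single: float, copies: int) -> float:
    """Return the chance that at least one of `copies` components is up.

    `single` is the availability of one component, as a fraction from 0 to 1.
    """
    if not 0 <= single <= 1:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        availability = combined_availability(0.99, n)
        print(f"{n} copies at 99% each: {availability * 100:.4f}% available")
```

With one 99% component there is a single point of failure. Under the independence assumption, a second copy takes the figure to 99.99%, which is the same logic as a utility keeping more generating capacity than it has online.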
Denis makes a really great point, although the current cloud infrastructure is probably already about 10 times as redundant as most on-premise systems. I think the cloud is simply suffering from the consequences of fame. On-premise systems experience the same failures as cloud systems – probably more – but cloud is the “celebrity” model right now, so it gets all the attention, good and bad.
Think about it. Arnold Schwarzenegger isn’t the first guy, or politician for that matter, to have a child with a mistress, but because he was a governor and, more importantly, the Terminator, we hear about it when he does. I do apologize to the cloud for that comparison. The cloud is far more reliable than Arnold, but you catch my drift.