Like many other sites hosted on AWS, all of The Conversations Network’s websites went down at 1:41am PDT on April 22, 2011. It would be 64.5 hours until our sites and other servers would be fully restored. A lot has been written about this outage, and I’m sure there’s more to come. Don MacAskill, another early adopter of AWS, has posted a good explanation of SmugMug’s experiences during the outage. Phil Windley and I are hoping to interview our friend Jeff Barr from AWS for Phil’s Technometria podcast once the dust has settled at Amazon.
Many pundits have suggested this event highlights a fundamental flaw in the concept of cloud computing. Others have forecast doom and gloom for AWS in particular. I disagree with both arguments. While it certainly was the most significant failure of cloud computing to date, I predict this event will become not much more than a course correction and a “teachable moment” for Amazon, their competitors, all cloud architects and of course us here at The Conversations Network. For the geeks in the audience, I’m going to describe our architecture, the AWS services we utilize, and give a bit of an explanation about what happened and what we learned.
The Conversations Network utilizes three basic AWS services, plus a few more that aren’t really pertinent to this episode. Our servers are actually instances of AWS Elastic Compute Cloud (EC2) servers. The root filesystem for each server is stored in a small (15GB) AWS Elastic Block Storage (EBS) volume. Not only are these volumes faster than local storage, they’re also persistent. So if/when an EC2 instance stops, the root filesystem for that instance remains intact and will continue to be usable if the instance is re-started. [EC2 instances are booted from Amazon Machine Images (AMIs). In our case, these are based on Fedora 8 (Linux) customized to our standards. The AMIs are identical for all our servers, but the EBS root filesystems, which change dynamically once a server is booted, are unique to each server.]
We also use EBS volumes for non-relational storage. For example, we have one large EBS volume for IT Conversations and other podcast filesystems. This holds all the audio files and images used on the website. We have another for SpokenWord.org, and so on. These EBS volumes are each mounted to one EC2 instance, which in turn shares them with the other servers via NFS. Finally, we use the Relational Database Service (RDS) for our MySQL databases. Like EBS, this is a true service as opposed to a “box” or physical server.
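Concretely, the sharing arrangement looks something like this (the device names, mount points and network addresses below are invented for illustration — a sketch, not our actual config):

```shell
# On the EC2 instance that owns the media volume:
# /etc/fstab -- mount the attached EBS volume locally
/dev/sdf   /mnt/itc-media   ext3   defaults,noatime   0 0

# /etc/exports -- share it with the other servers over NFS
/mnt/itc-media   10.0.0.0/24(rw,sync,no_subtree_check)

# On each of the other servers' /etc/fstab -- mount the share
nfs-host:/mnt/itc-media   /mnt/itc-media   nfs   defaults   0 0
```

The point of this arrangement is that each large EBS volume attaches to exactly one instance, so the NFS-exporting instance becomes the single point through which the other servers reach that data — a detail that matters later in this story.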
One very important feature of EBS is that you can take snapshots at any time. For example, we make a snapshot each night of each EBS volume. We keep all snapshots of all volumes (other than the EC2 root filesystems) for the past seven days, plus the weekly snapshots for the past four weeks and the monthly snapshots for the past year. The cost of keeping a snapshot is based only upon the incremental differences since the previous snapshot, so it’s quite a reasonable backup strategy even for large volumes so long as they don’t have changes that are both major and frequent.
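A sketch of that rotation in Python — note that the exact tier rules (Sundays for the weeklies, the first of the month for the monthlies) are my assumptions for illustration, not necessarily how anyone's backup script defines them:

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today):
    """Pick which EBS snapshots to retain under a 7-daily /
    4-weekly / 12-monthly rotation (tier details are assumptions)."""
    keep = set()
    for d in snapshot_dates:
        age = (today - d).days
        if 0 <= age < 7:                     # daily tier: past seven days
            keep.add(d)
        elif age < 28 and d.weekday() == 6:  # weekly tier: Sundays, past four weeks
            keep.add(d)
        elif age < 365 and d.day == 1:       # monthly tier: first of month, past year
            keep.add(d)
    return keep
```

Because EBS snapshots are incremental, pruning to this schedule keeps storage costs roughly proportional to how much the volume actually changes, not to how many snapshots you retain.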
Designing any server architecture, cloud-based or otherwise, requires that you consider the failure modes. What can fail? What will you lose when that happens? How will you recover? Automatically or manually? How long will recovery take for each failure mode? It’s not about eliminating failures — you can’t really do that. Rather, it’s about planning to deal with them. And like traditional architectures, the cost of the configuration increases geometrically as you increase the reliability (ie, decrease the amount of time it will take to recover from a failure).
We’ve been using AWS for more than four years. During the period when IT Conversations was part of GigaVox Media, we were the basis of one of the first case studies published by Amazon. [Here’s a diagram of one of our AWS-based configurations.] Because The Conversations Network (a non-profit) runs on a shoestring budget and can’t afford the level of redundancy deployed by some commercial enterprises (eg, SmugMug), we’re not looking for a particularly high-reliability architecture. Until last week, we had EC2 instances that hadn’t stopped in well over a year. We can’t tolerate any significant loss of data, so we need the redundant storage of EBS, but 99.9% uptime is good enough for us, and that’s what we’d had from AWS until now. Because of our experience with the high reliability of AWS, we never got around to automating the re-launching of EC2 instances in case of failure. We do use two separate monitoring services, and there are two of us (me and Senior Sysadmin Tim) who can restart servers, etc., if something does go wrong.
AWS operates in five regions around the world. We happened to pick US East in Virginia instead of US West (northern California) for no particular reason. Within each region there are multiple physical locations called availability zones. These are probably separate data centers within a metropolitan area. The availability zones within a region are connected by very high-speed fiber. This means you can have some degree of geographic redundancy by deploying servers in multiple availability zones, or achieve even greater protection by also deploying duplicate systems in multiple regions. The latter is far more complex, since the connectivity between regions is not as good as between availability zones. Our needs are humble, so all of The Conversations Network EC2 instances, EBS volumes and RDS databases are located in the us-east-1a availability zone. And of course, that’s where last week’s failures occurred.
Amazon hasn’t yet said what the original failure was. All of our EC2 instances were running and they could communicate with the RDS databases. I think the problem might have been the association between the EC2 instances and the EBS volumes. The volumes used as root filesystems were reachable, but not the others that contained our site-specific files.
After a few hours of downtime, I decided to re-boot our EC2 instances and that’s when things went from bad to worse. All of our EC2 instances entered the Twilight Zone. They were stuck in the “stopping” state. The operating system halted (no SSH access) but the servers didn’t release their EBS volumes. I could have launched all-new EC2 instances, but I wouldn’t be able to connect them to the volumes and hence, no websites.
Because of our backup strategy, however, we did have one more option: We had snapshots of our EBS volumes. I could have created all-new EBS volumes from the daily snapshots, and I could have done so in a different availability zone to get away from the problems. But there was one gotcha. We make the backup snapshots at 2am Pacific time each night. The failure occurred 19 minutes before that, which means our snapshots lacked the most-recent 24 hours of activity: new programs, audio and image files, logs, etc. As with the few previous problems we’ve had with AWS (mostly of our own making), we thought this outage would be fixed quickly. It was a tradeoff: It seemed better to wait an hour or two rather than to re-launch with day-old data.
Of course “an hour or two” dragged on. Soon the outage was 24 hours old; then 48. It always seemed that the fix was imminent, so we delayed the restart process. Eventually, we decided to go ahead, and that’s when we discovered our one real mistake. Remember that we make snapshots of our EBS volumes every night? Well, it turned out that we weren’t making those snapshots of all of our volumes. There was one volume that we somehow missed. The only snapshot we had of that volume was from the date it was created, more than a year ago. That meant we would have had to launch our sites with some very old data, and once we finally regained access to the most-recent data (on the in-limbo EBS volumes), it would have been difficult to reconcile the two. In the end, we decided just to wait it out. Finally, after 64.5 hours, the one EC2 instance that was holding our last EBS volume hostage stopped. We were then able to re-attach that volume to a newly-launched instance. We brought up all-new EC2 instances, attached all the then-current volumes and we were up and running, still in availability zone us-east-1a.
So what did we learn from all this? We re-learned that you have to think through these architectures carefully and understand the failure modes. But most importantly, we learned that once you have a good plan, you have to follow through with it. If we had been making nightly snapshots of that one remaining EBS volume all along, we would have been able to re-start the websites with day-old data at any time, regardless of the problems AWS was having disconnecting EBS volumes from running EC2 instances.
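That “follow through” lesson generalizes into a small audit one could run after every backup cycle: given each volume’s newest snapshot date, flag any volume the nightly job is quietly skipping. A minimal sketch (the volume IDs are hypothetical):

```python
from datetime import date

def stale_volumes(latest_snapshot, today, max_age_days=1):
    """Return volume IDs whose newest snapshot is older than
    max_age_days -- or missing entirely. These are the volumes a
    nightly backup job is silently failing to cover."""
    return sorted(
        vol for vol, snap in latest_snapshot.items()
        if snap is None or (today - snap).days > max_age_days
    )
```

Run daily against the full list of attached volumes (not just the ones the backup script already knows about), a check like this would have caught our forgotten volume the morning after it was created.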
I also have a new strategy for deciding when to stop waiting for AWS to recover and instead switch to the snapshots: Once the length of the outage exceeds the age of the backups, it makes more sense to switch to the backups. If the backups are six hours old, then after six hours of downtime, it makes sense to restart from backups. In this case, we should have done that after the first 24 hours.
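That rule of thumb is simple enough to write down:

```python
def should_restore_from_backup(outage_hours, backup_age_hours):
    """Once the outage has lasted as long as the backups are old,
    restoring from backup loses no more data than the waiting has
    already cost -- so switch to the backups."""
    return outage_hours >= backup_age_hours
```

By this rule, with 24-hour-old nightly snapshots, the 64.5-hour outage crossed the switch-over threshold long before it ended.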
But we still know we don’t have ultimate redundancy: We still have to re-start things manually. So long as we accept the downtime, we can survive the total failure of the us-east-1a availability zone and even the entire US East region. That’s because all EBS volumes are first replicated to multiple availability zones within the region, and our nightly snapshots are stored in Amazon’s Simple Storage Service (S3), which is replicated across multiple regions. So our current data can survive a failure within a region and our day-old data can survive a failure of our entire region.
We still have a few things to clean up and repair from this experience, but all in all we remain fairly happy with how things turned out. We didn’t, after all, lose any data. And while we aren’t proud that our sites were down for nearly three days, the world as we knew it did not come to an end. Maybe our team is even glad to have had a few days off. (Too bad we couldn’t have told them in advance.) We still have one EC2 instance that refuses to stop, but it’s one of those that used NFS to reach EBS volumes attached to another server. Amazon says, “We’re working on it.” Other than that, we’re now better prepared for the next failure — so long as it’s just like this one. Actually, I think we’re in pretty good shape for most events I can foresee. AWS continues to be a great platform for us.