March 22, 2017

One Wrong Click, One Massive Outage

On February 28, 2017, an Amazon Web Services (AWS) engineer performing maintenance made an accidental change that took the Simple Storage Service (S3) offline in the US-EAST-1 (Northern Virginia) region. The employee, Amazon reported later, was debugging an issue with the S3 billing system. Instead of taking offline only the few servers in question, the engineer mistakenly removed a much larger set. A domino effect ensued, taking down two other S3 subsystems as well. This mistake impacted a large portion of the Internet. Because AWS is the largest cloud provider, it’s no surprise that organizations such as Netflix, Reddit, Imgur, Giphy, Medium, Slack, Quora and many others were affected.

Those companies felt the impact of this mistake because they rely heavily, or even entirely, on AWS for their cloud services. Their workloads are not fully distributed across other regions or clouds, so when an outage as massive as this one occurs, they can be hit as hard as the service at the center of the failure.

So what does this all mean for your business? To start, we can learn a great deal from every outage and disaster recovery incident. The AWS outage, given its scale, serves as an especially good example. Here are some lessons to consider when it comes to understanding and making the best use of the cloud:

  • Moving to the cloud does not automatically provide disaster recovery.
  • Organizations must architect their solutions for high availability in the cloud. This includes, but is not limited to, redundant servers, multiple availability zones, multiple regions and multiple clouds where possible.
  • Moving to the cloud requires forethought and planning with a proper design to ensure consistent, and perhaps even constant, availability.
  • Planning for availability and disasters should be part of any cloud migration plan.
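
To make the multi-region point concrete, here is a minimal sketch of a read path that falls back to a second region when the first one fails. It is an illustration under stated assumptions, not a production design: the function names (fetch_with_failover, fake_fetch), the AllRegionsFailed exception and the region list are hypothetical, and the sketch assumes the data has already been replicated to every region listed.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request (hypothetical name)."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order and return (region, data) from the first that answers.

    `fetch(region, key)` is a caller-supplied function (an assumption of this
    sketch) that returns the object's bytes or raises an exception on failure.
    """
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[region] = exc
    raise AllRegionsFailed(f"no region could serve {key!r}: {errors}")


def fake_fetch(region: str, key: str) -> bytes:
    """Stand-in for a real storage client; simulates an outage in one region."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"{key} from {region}".encode()


if __name__ == "__main__":
    region, body = fetch_with_failover(
        "logo.png", ["us-east-1", "us-west-2"], fake_fetch
    )
    print(region, body.decode())
```

In a real deployment the fetch function would wrap an actual storage client, and the harder work lies in keeping the secondary region's copy of the data current. The sketch only shows that the application itself has to know about more than one region before a failover can happen.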

Although the root cause of the Amazon outage was identified as human error, it is clear that large swaths of the Internet now depend on a handful of services like AWS. The tech industry's lack of diversification and proper availability planning is largely to blame for the scale of this incident, and other Internet-wide crises will occur if something doesn't change quickly. While AWS and its rivals make cloud utilization convenient and relatively affordable, relying on a small number of systems without additional redundancy planning across regions or vendors is an inherent weakness in many of today’s largest cloud deployments. This concentrated reliance increases the chances that a failure similar to last month’s will occur again and again.

Employing cloud migration services like those offered by CGS can diversify your risk and help you recover faster if an outage occurs, while also offering the personal touch of a smaller, scalable cloud services provider. Trust CGS with your company's well-being to help you stay clear of outages like AWS’s and rest easy.
