Amazon Web Services (AWS) is the most widely adopted cloud service and hosting platform, serving more of the world wide web than any other provider, including Microsoft Azure and Google’s cloud efforts. Ironically, one of the main reasons people choose AWS is its reliability, a reputation that took a knock recently when a large chunk of Amazon’s servers went down. Many websites were left without imagery, and sites hosted entirely on Amazon did not load at all.
Amazon acknowledged the issue on its support status website hours after it became apparent, but the problem wasn’t fixed until much later. Amazon has blamed a “typo” for the massive cloud-computing outage, which caused problems for thousands of websites and apps. The company has naturally apologised for the five-hour outage of some Amazon Web Services, which affected popular online services such as Slack, Trello and Medium as well as a host of smaller players.
Amazon has now revealed exactly what went wrong, explaining that an incorrectly typed command during routine debugging of its billing system was to blame, the kind of mistake often called a “fat finger” typo.
The employee “intended to remove a small number of servers”, but the typo instead caused unprecedented performance problems for thousands of companies that rely on Amazon’s cloud-computing service.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Restarting and shutting down banks of servers is an essential part of system maintenance, but you have to admit there is something comical about so much disruption stemming from such a simple mistake, and about how easily it slipped through undetected inside Amazon. That said, we’re sure Amazon will do everything it can to prevent a repeat.
Amazon defended the time it took to bring the service back to life, pointing out that its Simple Storage Service has “experienced massive growth” over the past few years and adding that “the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected”.
Amazon CEO Jeff Bezos has also confirmed, as we expected, that the company is “making several changes” to prevent similar incidents in the future. The company also issued this statement apologising to affected customers.
“We want to apologise for the impact this event caused for our customers. We know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
When such a high-profile service goes wrong, it really does highlight just how much we rely on being online all the time. Hopefully no further issues will emerge, or at least not soon; problems can never be avoided entirely, and nothing in this world is 100% reliable.