Friday, July 28, 2017

Netflix and the Chaos Monkey

From Safal Niveshak, July 19:

Dealing with Failure in Life and Investing: Lessons from the Chaos Monkey
Amazon Web Services (AWS) is the Titanic of cloud hosting. It provides on-demand cloud computing platforms to individuals, companies, and governments on a paid subscription basis. The platform is designed with backups for its backups’ backups, to keep hosted websites and applications, including some of the largest in the world, from failing.

Yet, like the Titanic, AWS crashed in April 2011, taking down popular websites such as Reddit, Quora, Foursquare, HootSuite, and the New York Times, among many others, for four days.

It suffered another major outage in February 2017, which again brought a large number of key websites to their knees.

There was, however, one site that kept chugging along through both of these outages, despite being hosted on AWS on both occasions.

This was Netflix, the world’s leading video streaming service, which accounts for a dominant share of downstream Internet traffic in North America during peak evening hours: almost 35%, double YouTube’s share.

Before we look at how Netflix survived these Internet debacles, let’s understand a bit about the cloud.

The cloud is all about redundancy and fault tolerance. Since no single component can guarantee 100% uptime (even the most expensive hardware eventually fails), companies need to design a cloud architecture in which individual components can fail without affecting the availability of the entire system. In effect, a company’s cloud architecture needs to be stronger than its weakest link. And the company must constantly test the system’s ability to survive “once in a blue moon” failures like the AWS outages.
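To make the idea of "components that can fail without taking down the system" concrete, here is a minimal Python sketch, assuming a toy service backed by a few redundant replicas with simple failover. The chaos_monkey function imitates the spirit of Netflix's Chaos Monkey by randomly terminating a healthy replica; the real tool terminates instances in production, and every class and function name below is hypothetical, not Netflix's code or API.

```python
import random


class Instance:
    """Hypothetical server instance; `healthy` becomes False once terminated."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Hypothetical service that stays up as long as at least one replica is healthy."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("service unavailable: all replicas down")


def chaos_monkey(service, rng):
    """Terminate one randomly chosen healthy replica; return its name, or None if none are left."""
    healthy = [r for r in service.replicas if r.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim.name


rng = random.Random(42)
service = Service([Instance(f"node-{i}") for i in range(3)])
for request in range(2):
    killed = chaos_monkey(service, rng)
    # With three replicas, the service keeps serving after two random terminations.
    print(f"killed {killed}; {service.handle(request)}")
```

The design choice being illustrated is that failures are injected deliberately and routinely, so a missing failover path shows up in testing rather than during a real outage: with three replicas the sketch survives two terminations, and only a third one makes the service unavailable.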

Despite the 99.99 percent availability that AWS’s service agreement promises, when you are on the cloud you must believe in Murphy’s Law: “Anything that can go wrong will go wrong.”
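To put a figure like 99.99 percent in perspective, a short calculation helps. This sketch takes the excerpt's availability number at face value and simply converts an availability percentage into the maximum downtime it allows; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year length, accounting for leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime (in minutes) consistent with an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_minutes


for pct in (99.9, 99.99):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes of downtime per year")
```

By this arithmetic, 99.99 percent availability leaves only about 52.6 minutes of downtime a year (roughly 4.3 minutes in a 30-day month), so an outage lasting days exceeds that budget many times over.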

So, what helped Netflix survive these outages when other large sites hosted on AWS faced blackouts?
It was, seemingly, Netflix’s deep faith in Murphy’s Law, which led to the creation of the Chaos Monkey and a whole “simian army” of similar tools....
...MORE

HT Alpha Ideas