Monday, October 06, 2014
Last week's widely reported server reboot by cloud provider Amazon caused little trouble for Netflix, according to Data Center Dynamics. The streaming service kept running despite more than 200 of its nodes being rebooted and more than 20 of them failing, which shows how well Netflix has prepared itself for unanticipated problems. Go-to strategies like using remote console servers to back up data at multiple facilities are still ideal, but IT teams could be doing more. In a recent blog post, members of the Netflix IT team note that resiliency problems like node failures are inevitable. The best way for data centers to prepare for a shutdown is to stage controlled failures of their own. Testing the limits of a facility is critical to identifying the true resiliency of a data center.
Taking node failure on the chin
Netflix manages nearly 3,000 nodes, so one of the company's main goals in developing a resiliency strategy was to reduce the need for human intervention. Last year, according to the Netflix website, the company began devoting resources to automating the recovery of failed Cassandra nodes. Eventually, it devised an automated process that starts bootstrapping a replacement node as soon as a failure is detected and located. The company then tested the effectiveness of its automated process by exposing the system to random node failures via Chaos Monkey.
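The detect-and-replace loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Node` and `Cluster` classes and the `recover` function are hypothetical names made up for this example, and nothing here reflects Netflix's actual tooling or any Cassandra API. It simply shows the idea of swapping in a fresh node for every failed one without a person stepping in.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical cluster node with a simple health flag."""
    node_id: int
    healthy: bool = True


class Cluster:
    """A toy in-memory cluster; not a real Cassandra client."""

    def __init__(self, size):
        self._ids = itertools.count(1)  # source of fresh node ids
        self.nodes = {}
        for _ in range(size):
            self._add_node()

    def _add_node(self):
        node = Node(next(self._ids))
        self.nodes[node.node_id] = node
        return node

    def fail(self, node_id):
        """Mark a node as failed, as a reboot or a chaos test might."""
        self.nodes[node_id].healthy = False

    def failed_nodes(self):
        """Detection step: list every node that is no longer healthy."""
        return [n for n in self.nodes.values() if not n.healthy]

    def replace(self, node):
        """Recovery step: drop the failed node and bring up a new one."""
        del self.nodes[node.node_id]
        return self._add_node()


def recover(cluster):
    """Replace every failed node; return (failed_id, replacement_id) pairs."""
    return [(node.node_id, cluster.replace(node).node_id)
            for node in cluster.failed_nodes()]
```

In a real deployment the detection step would come from monitoring and the replacement step would provision and bootstrap an actual machine. The point of the sketch is that the cluster returns to full size with no manual step in between.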
Greater resiliency thanks to Chaos Monkey
Chaos Monkey is just one wrench in Netflix's box of resiliency tools, known as the "Simian Army." Other Simian Army applications include Chaos Gorilla, which simulates the outage of an entire Amazon Web Services availability zone, and Janitor Monkey, which searches out unused resources and removes them. Chaos Monkey in particular was useful for simulating the problems caused by the AWS reboot because the program is designed to randomly disable the virtual machines that Amazon hosts for Netflix. Netflix used these programs to develop exercises for its IT professionals, and those exercises were essential in preparing the team for Amazon's reboot.
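The idea of randomly disabling virtual machines can be sketched in a few lines. This is a toy simulation, not the real Chaos Monkey: the function name `unleash_chaos_monkey`, the `groups` mapping and the `probability` parameter are all assumptions made up for this illustration, and no cloud API is called.

```python
import random


def unleash_chaos_monkey(groups, probability, rng):
    """Randomly terminate at most one instance per group.

    groups      -- dict mapping a group name to a set of instance ids
                   (hypothetical names; modified in place)
    probability -- chance, between 0.0 and 1.0, that a group loses an instance
    rng         -- a random.Random instance, passed in so runs can be seeded
    Returns a dict mapping group name to the terminated instance id.
    """
    terminated = {}
    for name, instances in groups.items():
        # Skip empty groups, then roll the dice for this group.
        if instances and rng.random() < probability:
            # Sort first so a seeded run picks the same victim every time.
            victim = rng.choice(sorted(instances))
            instances.remove(victim)
            terminated[name] = victim
    return terminated
```

For example, calling `unleash_chaos_monkey({"api": {"i-1", "i-2"}}, 0.5, random.Random())` removes one of the two instance ids about half the time. Running something like this on a schedule, with automated recovery in place, is what turns a surprise outage into a rehearsed event.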
Perle's wide range of 1 to 48 port Perle Console Servers provides data center managers and network administrators with secure remote management of any device with a serial console port. Plus, they are the only truly fault-tolerant Console Servers on the market with the advanced security functionality needed to easily perform secure remote data center management and out-of-band management of IT assets from anywhere in the world.