Netflix explains how "Zuul" keeps streams uninterrupted while self-healing the servers

Cal Jeffrey

In context: Traffic congestion can cause severe problems for online content providers. There are several ways to address backend loads, but handling them becomes more complicated with streaming services. Earlier this year, Netflix implemented a load filter called "Zuul" that dynamically handles server requests in real time, prioritizing loads in a way that allows the backend to "self-heal" when there is a problem.

On Monday, Netflix engineers posted an engaging explainer of how the company uses "prioritized load shedding" to keep users' viewing experience as uninterrupted as possible. As recently as last year, the streaming service suffered outages caused by load congestion. It now has a "priority throttling filter" that can shed unnecessary server requests in real time whenever there is a problem on the backend.

In a nutshell, the filter, which Netflix dubbed "Zuul," prioritizes traffic based on how much a user needs it for playback. The system uses three buckets to categorize server requests: non-critical, degraded experience, and critical.

Non-critical items include logs and background requests, and according to engineers, they make up a large portion of system throughput. Even so, these requests can usually be ignored once the server load reaches a certain threshold.

Degraded-experience items are not necessary for playback of content but are used to improve the user experience. Stop and pause markers, language selection in the player, and viewing history are examples of server requests that can be shed when problems arise on the backend. Most of the time, users will not even notice that these items are missing, particularly while watching content.

The critical bucket is for traffic that affects users' ability to play content. If these requests go down, trying to play a movie or show will result in an error message.
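The three buckets can be pictured as a simple lookup from request type to category. The following is only a minimal sketch, not Netflix's code: the request names, the `Bucket` enum, and the `classify` helper are all hypothetical, chosen to mirror the examples mentioned above.

```python
from enum import Enum


class Bucket(Enum):
    """The three categories described in the article."""
    NON_CRITICAL = "non-critical"
    DEGRADED_EXPERIENCE = "degraded experience"
    CRITICAL = "critical"


# Hypothetical request kinds, loosely based on the article's examples.
REQUEST_BUCKETS = {
    "log": Bucket.NON_CRITICAL,
    "background": Bucket.NON_CRITICAL,
    "pause_marker": Bucket.DEGRADED_EXPERIENCE,
    "language_selection": Bucket.DEGRADED_EXPERIENCE,
    "viewing_history": Bucket.DEGRADED_EXPERIENCE,
    "playback": Bucket.CRITICAL,
}


def classify(request_kind: str) -> Bucket:
    """Return the bucket for a request kind.

    Assumption for this sketch: unknown kinds are treated as critical,
    so nothing unrecognized is shed by accident.
    """
    return REQUEST_BUCKETS.get(request_kind, Bucket.CRITICAL)
```

Treating unknown request kinds as critical is a design choice made for the sketch; the article does not say how Netflix handles uncategorized traffic.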

As a first step, Zuul scores each of these items between 1 and 100. If problems develop on the backend, or even with Zuul itself, the filter can throttle loads with the lowest priority first. Serving playback content always gets preferential treatment over everything else, so when there are hiccups, they go largely unnoticed by most viewers.
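To illustrate the idea of shedding the lowest-priority traffic first, here is a rough sketch of a score-based cutoff. It is not Netflix's implementation. The article does not say which end of the 1-to-100 scale is most important, so this sketch assumes a higher score means higher priority. The `shed_threshold` and `should_serve` functions, the percentage-based load figure, and the 80 percent starting point are all hypothetical.

```python
def shed_threshold(load_percent: int, start_percent: int = 80) -> int:
    """Return the minimum score a request needs to be served.

    Assumptions for this sketch: load is a utilization percentage,
    higher scores mean higher priority, and shedding begins once load
    exceeds start_percent. The cutoff rises linearly with load and
    tops out at 100, so top-scored requests are always served.
    """
    if not 0 <= start_percent < 100:
        raise ValueError("start_percent must be between 0 and 99")
    if load_percent <= start_percent:
        return 1  # healthy backend: every score from 1 to 100 passes
    over = min(load_percent, 100) - start_percent
    span = 100 - start_percent
    return 1 + (99 * over) // span


def should_serve(score: int, load_percent: int) -> bool:
    """Decide whether a request with the given score is served or shed."""
    if not 1 <= score <= 100:
        raise ValueError("score must be between 1 and 100")
    return score >= shed_threshold(load_percent)
```

With these assumed numbers, a request scored 10 is served at 50 percent load but shed at 90 percent, while a request scored 100 is served at any load. As the load drops again, the cutoff falls with it and lower-priority traffic resumes.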

As to the system's effectiveness, Netflix points to a 2019 outage that prevented a "sizable percentage" of subscribers from playing content. Earlier this year, just days after implementing the filter, Netflix experienced a similar failure. However, this time Zuul kicked in and started shedding loads until the backend was stable. Users on the frontend experienced no interruptions.

"Unlike then [the 2019 outage], Zuul's progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members' ability to play at all," say engineers. "Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure."

We have provided just a brief overview of how the system functions. For the technical details, Netflix has a full writeup on Zuul. It's a good read if you are interested in the backend workings of online services.

Image credit: Bogdan Glisik

"Non-critical items include logs and background requests, and according to engineers, it makes up a large portion of system throughput."

I found this sentence very intriguing. Given the non-large-portion is delivering say a 2 GB movie, it really makes me wonder what all those logs and background requests are, and why they are handled by the same server.
 
Meanwhile at BSkyB: "We heard you like to pay 50 quid a month to watch 160x90 resolution digital noise, supposed to be HD content, so that's what you get!" :)
 