Netflix Details Its Live Streaming Infrastructure and What It Learned While Building It
It was only a matter of time before Netflix shared more details on its live streaming infrastructure. Today, in the first of a series of tech blog posts, Netflix gives a detailed look at the architecture behind its live events and the lessons learned while building it. As Netflix rightly points out, the company is in a unique position: it builds for a single product and controls the full live lifecycle, from production to screen.
Netflix uses AWS MediaConnect and AWS MediaLive to acquire feeds in the cloud and transcode them into multiple video quality levels, with bitrates tailored to each show. This cloud-based approach enables dynamic scaling, configuration flexibility, and seamless integration with Netflix's DRM, content management, and content delivery services, which already run in the cloud.
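To make the per-show ladder idea concrete, here is a minimal sketch in Python of selecting an encoding ladder by show type; the show types, resolutions, and bitrates are illustrative assumptions, not Netflix's actual configuration or tooling.

```python
# Illustrative sketch: choosing a per-show encoding ladder for cloud transcoding.
# All bitrates, resolutions, and show names are hypothetical examples.

from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # target video bitrate


# Hypothetical ladders: a talk show might cap at 1080p, while live sports
# gets a 4K top rung and higher bitrates at each quality level.
LADDERS = {
    "talk_show": [
        Rendition(432, 900),
        Rendition(720, 2500),
        Rendition(1080, 4500),
    ],
    "live_sports": [
        Rendition(432, 1200),
        Rendition(720, 3500),
        Rendition(1080, 6000),
        Rendition(2160, 16000),
    ],
}


def pick_ladder(show_type: str) -> list[Rendition]:
    """Return the encoding ladder for a show type, falling back to a default."""
    return LADDERS.get(show_type, LADDERS["talk_show"])


if __name__ == "__main__":
    for rung in pick_ladder("live_sports"):
        print(f"{rung.height}p @ {rung.bitrate_kbps} kbps")
```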
Netflix built a custom packager to integrate more tightly with its delivery and playback systems, along with a custom live origin that meets strict read and write SLAs for live segments. Content is delivered via its own CDN, Open Connect, which comprises more than 18,000 servers located near viewers at over 6,000 sites and connected to AWS via a dedicated Open Connect backbone network.
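As a rough sketch of what strict segment SLAs at a live origin could look like, the example below enforces per-request latency budgets on segment writes and reads; the budgets and the in-memory store are assumptions for illustration, not Netflix's design.

```python
# Illustrative sketch of a live origin enforcing per-request latency budgets
# on segment reads and writes. The 500 ms / 100 ms budgets and the in-memory
# store are hypothetical; they are not Netflix's actual SLAs or architecture.

import time


class SlaExceeded(Exception):
    """Raised when an origin operation misses its latency budget."""


class LiveOrigin:
    WRITE_BUDGET_S = 0.5  # hypothetical write SLA
    READ_BUDGET_S = 0.1   # hypothetical read SLA

    def __init__(self) -> None:
        self._segments: dict[str, bytes] = {}

    def put_segment(self, name: str, data: bytes) -> None:
        start = time.monotonic()
        self._segments[name] = data  # a real origin would persist/replicate here
        if time.monotonic() - start > self.WRITE_BUDGET_S:
            raise SlaExceeded(f"write of {name} missed the SLA")

    def get_segment(self, name: str) -> bytes:
        start = time.monotonic()
        data = self._segments[name]
        if time.monotonic() - start > self.READ_BUDGET_S:
            raise SlaExceeded(f"read of {name} missed the SLA")
        return data


if __name__ == "__main__":
    origin = LiveOrigin()
    origin.put_segment("event1/seg_000042.m4s", b"\x00" * 1024)
    print(len(origin.get_segment("event1/seg_000042.m4s")), "bytes served")
```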
Netflix uses HTTPS-based live streaming because of its widespread device support and compatibility with its delivery and encoding systems, forgoing UDP-based protocols even though they would offer ultra-low latency. It encodes with the AVC and HEVC video codecs, produces multiple quality levels ranging from SD to 4K, and uses a 2-second segment duration to balance compression efficiency, infrastructure load, and latency. The manifest is delivered from the cloud rather than the CDN, which lets Netflix personalize the configuration for each device.
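The per-device manifest idea can be sketched as follows: a cloud service filters the full encoding ladder down to what a given device reports it can play. The codecs, renditions, URLs, and field names below are hypothetical, not Netflix's actual manifest format.

```python
# Illustrative sketch of personalizing a live manifest per device: the cloud
# service keeps only the renditions whose codec and resolution the device
# supports. Renditions, URLs, and field names are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    codec: str         # "avc" or "hevc"
    height: int
    bitrate_kbps: int
    url: str


FULL_LADDER = [
    Rendition("avc", 480, 1200, "https://cdn.example/live/avc_480.m3u8"),
    Rendition("avc", 1080, 5000, "https://cdn.example/live/avc_1080.m3u8"),
    Rendition("hevc", 1080, 3500, "https://cdn.example/live/hevc_1080.m3u8"),
    Rendition("hevc", 2160, 12000, "https://cdn.example/live/hevc_2160.m3u8"),
]

SEGMENT_DURATION_S = 2  # 2-second segments, per the post


def personalize_manifest(supported_codecs: set[str], max_height: int) -> dict:
    """Return a manifest-like dict with only the renditions the device can play."""
    renditions = [
        r for r in FULL_LADDER
        if r.codec in supported_codecs and r.height <= max_height
    ]
    return {"segment_duration_s": SEGMENT_DURATION_S, "renditions": renditions}


if __name__ == "__main__":
    # e.g. an older TV that only decodes AVC up to 1080p
    manifest = personalize_manifest({"avc"}, max_height=1080)
    for r in manifest["renditions"]:
        print(r.codec, f"{r.height}p", r.url)
```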
Netflix said its real-time QoE monitoring is built on a mix of internally developed tools, such as Atlas, Mantis, and Lumen, and open-source technologies, such as Kafka and Druid. The pipeline has processed up to 38 million events per second during some of its largest live events while surfacing critical metrics and operational insights within seconds.
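For a sense of what feeding playback telemetry into such a pipeline might look like, here is a minimal sketch that publishes a QoE event to Kafka using the kafka-python client; the broker address, topic name, and event fields are assumptions, and this is not Netflix's actual event schema or tooling.

```python
# Illustrative sketch of publishing a playback-quality (QoE) event to Kafka
# for real-time analysis. The broker address, topic name, and event fields
# are hypothetical, not Netflix's actual pipeline.

import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def emit_qoe_event(session_id: str, rebuffer_ms: int, bitrate_kbps: int) -> None:
    """Send one playback-quality sample to the hypothetical 'live-qoe' topic."""
    event = {
        "session_id": session_id,
        "ts_epoch_ms": int(time.time() * 1000),
        "rebuffer_ms": rebuffer_ms,
        "bitrate_kbps": bitrate_kbps,
    }
    producer.send("live-qoe", value=event)


if __name__ == "__main__":
    emit_qoe_event("sess-123", rebuffer_ms=0, bitrate_kbps=6000)
    producer.flush()  # ensure the event is delivered before exiting
```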
See the full post for more details.