How CDN Switching Blind Spots Lead To Rebuffering

Reducing video rebuffering can be difficult. One solution that many people are talking about these days is moving to a multi-CDN architecture, a topic I’ve written a lot about. But will going multi-CDN magically reduce your rebuffering and drive away all of your streaming ills? The answer, of course, is complicated. Going multi-CDN can provide several benefits for customers, such as better geographic coverage and, possibly, better economics. Adding live switching logic between the CDNs goes a step further and enables load balancing and redundancy in case of problems. But what are the problems that customers are likely to encounter? Let’s examine some common CDN problems and their impact.

Catastrophe and Chaos
Once in a blue moon, a CDN will experience a major outage that affects a large geographic area. These outages are so extreme that they shut down a large portion of the Internet for a non-trivial amount of time.

Detectability: Very easy to detect, as any metric you care to measure will explode. Your alerts will fire or, if you don’t have alerts, you’ll get messages and phone calls from users.
Solution: A good CDN switching engine will attempt to re-route users to a working CDN within the limits of the outage.

PoP Flop
Occasionally, a CDN might experience a local issue in one of its PoPs (Points of Presence), which means all of the users routed through that specific PoP will have problems fetching video segments and will most likely experience rebuffering.

Detectability: Medium or hard depending on the percentage of users that are affected. A large portion of the traffic would skew the metrics enough to create an anomaly, while a small portion might get swallowed within the geographic granularity of the monitoring system.

Solution: A good CDN will re-route the impacted traffic to a different PoP within its network which will cover for the faulty PoP with best-effort performance. A good CDN switching engine will eliminate the faulty CDN altogether from that region and just use a non-faulty one.

Smaller-scale outages are so hard to detect that you might never know they happened. Have your users experienced such an outage? The answer is most likely yes, since all CDNs see dozens of them as part of their daily monitoring efforts. In the industry, we call these events “blind spots”.

The Blind Spots of CDN Switching
There are three main blind spots that server-side CDN switching engines do not address well:

  • #1. The DNS Propagation Problem. “We already know there’s a problem, but we have to wait at least 5-10 minutes for DNS to propagate.” A common CDN switching implementation is based on DNS resolution. The DNS resolver incorporates switching logic that responds with the best CDN at that given moment. If one of the CDNs in the portfolio experiences an outage or degradation, the DNS resolver will start responding with a different (healthy) CDN for the affected region. The blind spot of DNS is its propagation time. From the moment the switching logic decides to change CDNs, it might take several minutes (or longer) until the majority of traffic has actually transitioned. Moreover, while most ISPs will obey the TTL (DNS response lifetime) defined by the DNS resolver, some will not, causing the faulty CDN to remain assigned for the users behind those ISPs. Rebuffering on existing sessions is inevitable, at least until the DNS TTL expires on the user’s browser.
  • #2. The Data Problem. “We select CDNs based on a synthetic test file, but real video delivery is much slower.” Any switching solution must implement a data feed that reflects the performance of the CDNs in different regions and from different ISPs. A common approach for gathering such data is to use test objects that are stored on all of the CDNs in the portfolio. The test objects are downloaded to users’ browsers, which then report back the performance that was observed. Often, for various reasons, the test objects don’t represent the actual performance of the video resources. For example, imagine that the connection between the origin and the edge server is congested – the test object will not be impacted, since it’s already warm in the cache of the edge server and does not need to use the congested middle-mile connection to the origin. This performance gap will cause a CDN to be erroneously selected as the best one even though the reality differs. It’s also possible that the test objects and the actual video resources do not share the same CDN bucket configuration. If the video resources bucket is misconfigured, some users might get unoptimized or even faulty responses, which will never be detected because the test-objects bucket functions properly. This kind of disconnect between synthetic performance measurements and actual delivery performance often generates degraded performance that goes undetected for a very long time.
  • #3. The Granularity Problem. “We select CDNs based on overall performance in each region, but this stream is performing poorly for a subset of the users.” A typical CDN switching flow might be:
    • measure CDN performance across different regions
    • report results to the server
    • server chooses the “best” CDN
    • users are assigned to the best CDN for their region

    Unfortunately, not all regions have fresh performance data all the time, so a fallback logic is usually applied: when there isn’t enough data in a specific region, data from its greater containing region is used instead. It’s possible that a region with a small number of users gets swallowed up by a larger fallback region, in which case an outage might not be detected at all, because the affected users comprise a portion of total traffic too small to “move the needle”.

    This is a data granularity problem. A broadcaster might have 100k users spread across 15 countries, 1,000 unique regions, 5,000 ISPs and a host of other parameters. Taken together, these parameters segment the user base into millions of tiny slices, none of which will have enough data to support meaningful switching decisions, not to mention the load it would create on the switching system. For this reason, server-side switching is inherently limited to a coarser grouping that is technically and mathematically viable. This reality creates a blind spot when it comes to smaller regions that might get hit by a local, undetectable outage.
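To make the fallback behavior concrete, here is a minimal sketch of the kind of server-side logic described above. The region names, sample counts, and threshold are invented purely for illustration; no vendor's actual system is shown.

```python
# Hypothetical sketch of region-fallback logic in a server-side switching
# engine. All names and numbers below are illustrative assumptions.

MIN_SAMPLES = 500  # below this, fall back to the containing region's data

# region -> TTFB samples in ms (child regions named "parent/child")
samples = {
    "us":          [850] * 10_000,  # healthy national traffic
    "us/affected": [6700] * 200,    # small sub-region hit by an outage
}

def pick_ttfb(region: str) -> float:
    """Return the TTFB estimate used for switching decisions,
    falling back to the containing region when data is sparse."""
    data = samples.get(region, [])
    if len(data) < MIN_SAMPLES:
        parent = region.rsplit("/", 1)[0]  # e.g. "us/affected" -> "us"
        data = samples.get(parent, data)
    return sum(data) / len(data)

# The affected sub-region has too few samples, so the decision engine
# sees the healthy parent average and never switches CDNs there.
print(pick_ttfb("us/affected"))  # 850.0 ms, masking the 6700 ms outage
```

The 200 outage-level samples never influence the decision: they are swallowed by the 10,000 healthy samples of the containing region, which is exactly the blind spot described above.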

Real World Example of The Granularity Problem
In the last week of August, an outage occurred in the U.S. that demonstrates the granularity problem. An outage at a CDN I won’t name caused a significant drop in request performance that, in turn, led to rebuffering. Thanks to Peer5 for sharing the screen grabs (below) from their monitoring tool, showing that at 7:20 AM an increase in time-to-first-byte (TTFB) was observed, from an average of 850ms to a peak of 6,700ms. When comparing the 95th percentile TTFB of the affected area to its greater containing region, it’s clear that the affected area didn’t contribute enough data to move the overall metric.

95th Percentile TTFB – Affected area vs Greater Region

The greater containing region (blue) doesn’t show any anomalies throughout the outage. (less is better)

Rebuffer Time as %

Rebuffering spikes render the playback unwatchable. (less is better)

Allocated CDNs Over Time

While this chart might seem dull, it illustrates the fact that throughout the entirety of the outage, no CDN switching took place for the given region.

Enter: Per User CDN Switching
Video playback is a very fragile thing. A user might have just a couple of seconds of content buffered ahead, and any slowdown in fetching segments can easily consume that buffer and freeze the playback. For this reason, vendors in the market are coming up with ways to fix the problem. For instance, Peer5 created a client-side switching feature that constantly monitors the playback experience for each individual user and is able to react to poorly performing CDNs within a split second (literally, milliseconds), preventing rebuffering from ever happening. This means that even an outage that affects only one user will be accounted for. The charts below show the performance during the outage described above, with and without the client-side switching feature.
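The per-user reaction described above can be sketched roughly as follows. This is an illustrative model, not Peer5's actual implementation; the CDN hostnames and the slow-fetch threshold are assumptions.

```python
# Illustrative sketch of per-user, per-request CDN switching: if a
# segment fetch is too slow, the very next request from this user goes
# to a different CDN, with no DNS propagation or server round-trip.
# Hostnames and threshold are hypothetical.

CDNS = ["cdn-a.example.com", "cdn-b.example.com"]
SLOW_TTFB_MS = 1000  # assumed threshold for "poorly performing"

class SegmentFetcher:
    def __init__(self):
        self.current = 0  # index into CDNS for this one user

    def url_for(self, segment: str) -> str:
        return f"https://{CDNS[self.current]}/video/{segment}"

    def report(self, ttfb_ms: float) -> None:
        """Called by the player after each segment download completes."""
        if ttfb_ms > SLOW_TTFB_MS:
            # react within one request, for this user only
            self.current = (self.current + 1) % len(CDNS)

fetcher = SegmentFetcher()
fetcher.report(6700)                 # outage-level TTFB observed
print(fetcher.url_for("seg42.ts"))   # next segment comes from cdn-b
```

Because the decision is made in the client per request, even a single affected user switches away from the degraded CDN immediately, regardless of what the regional server-side data shows.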

95th Percentile TTFB – Affected area vs Greater Region

The TTFB of the client-side switching group (green) was affected as well but much less than the other group. (less is better)

Rebuffer Time as %

The client-side switching group (green) experiences almost no interruption in playback. (less is better)

As seen in the graph above, users that relied solely on server-side switching (red line) were impacted significantly compared to users with client-side switching. Server-side CDN switching was not granular enough to detect the local outage, and the assigned CDN for that region remained the same even though some users experienced terrible performance degradation. Client-side switching, with its per-user granularity, was able to change the mixture of CDNs within the region and avoid the issue in real time. Rebuffering was reduced from 11.2% to 0.2% for users with client-side switching enabled, and overall region rebuffering was reduced by 70%, from 1% to 0.3%.

When CDNs experience outages, users will encounter rebuffering. There are multiple types of outages: some will fly completely under the radar, undetected, while others will make you notice them immediately. Different layers of redundancy and different levels of granularity try to address the various outages an online delivery pipeline might experience. A combination of several such redundancy tools is likely to achieve the best UX. Employing server-side switching alongside client-side switching allows customers to:

  • Reduce rebuffering by monitoring video playback constantly for all users
  • Allow existing sessions to respond to outages very quickly by switching CDNs at the per-request level
  • Improve bitrate and quality by increasing the granularity of CDN selection to a per-user level

There are many ways to solve the video buffering problem, depending on what type of video you are delivering (live vs. VOD), the platform or devices you are delivering it to, and the user experience you are looking to achieve. What’s your take on the best ways to reduce video buffering? Feel free to leave them in the comments section.

Streaming Summit Program and First Set of Speakers Announced: Hear from CBS, Quibi, NBC, YouTube, Twitter, Amazon, WarnerMedia/HBO

I’m pleased to announce the first set of speakers for my Streaming Summit, at NAB Show New York, taking place Oct 16-17. The program schedule has also been added to the website and, when completed, we’ll have over 100 speakers across two days of the show. Newly added speakers include executives from CBS Interactive, Quibi, NBC Sports, YouTube, Twitter, Amazon Fire TV, WarnerMedia/HBO – with lots more on the way! Register before Sept 12th using code “early” for a discount on your ticket. You can see the entire agenda on the schedule page. #streamingsummit #nabshowny

Podcast: Disney+, Quibi, HBO Max. What Happens When Content Owners Go Direct

Thanks to Beamr, Mark Donnigan and Dror Gill for having me on their “Video Insiders” podcast to talk about Disney+, Quibi, HBO Max, Hulu, ViacomCBS, and what the forthcoming D2C launches mean for incumbents, including Netflix and Pay TV operators. Hear my thoughts on content aggregation and ideas for measuring success in OTT, along with the technical plans and platform choices being made by these developing services. This is a frank and honest, real-time and real-world conversation about the groundswell of direct-to-consumer OTT services which will be unleashed over the next few quarters.


You can listen to the podcast here:

The Challenges and Best Practices For Inserting Ads Into OTT Downloadable Videos

When Disney+ launches in November, one of the unique features of the service is that 100% of their video catalog will be available via download, for offline viewing. As OTT services evolve, more consumers are going to expect content to be available offline as part of the user experience. While Disney+ won’t include ads within their videos, many AVOD providers are looking at how the feature presents a huge revenue opportunity, especially for mobile. Yet, given how new the tech is, AVOD providers have many questions about how the technology works and whether or not it can truly integrate into an ad-supported business model.

For the most part, the reason AVOD providers haven’t offered download capabilities in the past (while some SVODs have) is that delivering video advertising offline adds a lot more technical complexity. Most importantly, any ad-based video download feature must include processes to ensure that ads remain timely and monetizable, and measurement and reporting of ad viewership is harder to do. If a viewer is served an ad past the date when it can be monetized, time is wasted and money is lost. So if a viewer downloads a video, providers have to be certain the ads attached will still be current, even if the viewer doesn’t watch the asset for two weeks. This is crucial because if the initial ads downloaded with a video asset are, for example, promoting a Labor Day sale at a retail store, they will no longer be relevant if the viewer watches after the holiday, and the ad creative is no longer profitable.

I recently spent a few hours in person with Penthera to learn how their platform uses dynamic ad insertion, so that after a video is downloaded onto a user’s device, the ads are regularly refreshed in the background (when the user isn’t using the app but is connected to cellular or Wi-Fi) to ensure that the ads served are up to date and monetized. The company says video ads are typically monetized if played within 3 to 4 days, on average. Based on the configured refresh window, Penthera receives a notification that tells their platform to update the ad in the background before the monetization expires, typically about 2 days before. This guarantees that downloaded ads are always timely and monetized when attached to a video stream.
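The refresh-window timing described above can be sketched in a few lines. Only the roughly 3-4 day monetization window and the ~2-day refresh lead come from the article; the function and field names are hypothetical, not Penthera's API.

```python
# Hedged sketch of the ad refresh-window check described above.
# The 2-day lead and ~4-day monetization window come from the article;
# everything else is an illustrative assumption.
from datetime import datetime, timedelta

REFRESH_LEAD = timedelta(days=2)  # refresh ~2 days before monetization expires

def needs_refresh(ad_expiry: datetime, now: datetime) -> bool:
    """True once we are inside the lead window before the ad expires,
    i.e. the background refresh should fetch a fresh ad."""
    return now >= ad_expiry - REFRESH_LEAD

downloaded = datetime(2019, 9, 1)
expiry = downloaded + timedelta(days=4)  # ads monetizable ~3-4 days

print(needs_refresh(expiry, downloaded + timedelta(days=1)))          # False
print(needs_refresh(expiry, downloaded + timedelta(days=2, hours=1))) # True
```

A background job would evaluate this check whenever the device has connectivity, swapping in a fresh ad before the old one stops paying out.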

Both providers and advertisers need accurate information about who is seeing ads, but tracking viewership becomes more complicated when users aren’t connected to a Wi-Fi or cellular network. Penthera calls its solution agnostic, meaning it can plug into existing ad systems such as FreeWheel, Google Ads, and SpotX. Because of this, downloaded ads have the same targeting capabilities and analytics reporting on ad performance as streaming ads. The ad server is still able to provide all the standard insights into ad performance, and Penthera’s SDK provides additional analytics around when the downloaded content starts or stops, download progress, and whether the downloaded video was watched. This means that, by integrating into existing advertising infrastructure, ad-supported video downloads can act as an extension of streaming ad campaigns.

The success of certain digital ad campaigns isn’t only measured in terms of impressions, but also by viewer click-throughs. Naturally, clicking through to an online landing page is impossible when the user isn’t connected to the internet. But Penthera has come up with an interesting workaround for this, so that offline ads can still promote engagement: their SDK is built to record ad clicks even when the device is offline. This means that a viewer can click on an ad while offline and, once they are connected to the internet again, receive a notification reminding them that they were interested in the ad content, with a link directing them to the appropriate landing page, website, or App Store.

Penthera says their data shows that a large majority of users watch downloaded video while their device is online. This is an interesting insight, as it demonstrates that users are downloading content either because they value the experience of offline playback over streaming or because they generally watch content on cellular networks in order to limit data plan usage. What this means for advertisers, however, is that in many instances, playing downloaded video with ads functions much the same as if the user were online. The advertising beacons (the network calls that inform the ad networks that an ad was played) can be reported immediately when the impression occurs, just like existing beacons. However, if the user happens to be offline when they play their video, the technology steps in to support the process. Penthera’s SDK catches the beacons and records the exact moment they were triggered. Later, when the device gets back online (even if the user doesn’t open the video app again), the beacons and the times they were triggered are sent to the advertising platforms to be recorded. Thus, all offline advertising impressions have a chance to be monetized.
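The catch-and-replay pattern described above looks roughly like the following. This is an illustrative sketch, not Penthera's SDK; the class, URL, and method names are invented.

```python
# Sketch of catching ad beacons fired offline and replaying them, with
# their original timestamps, once connectivity returns. Illustrative
# only; names and the _send stand-in are assumptions.
import time

class BeaconQueue:
    def __init__(self, online: bool):
        self.online = online
        self.pending = []  # (url, fired_at) pairs captured while offline

    def fire(self, url: str) -> None:
        """Called at the moment an ad impression occurs in the player."""
        if self.online:
            self._send(url, time.time())
        else:
            self.pending.append((url, time.time()))  # record exact moment

    def go_online(self) -> None:
        """Flush queued beacons when the device regains connectivity."""
        self.online = True
        while self.pending:
            url, fired_at = self.pending.pop(0)
            self._send(url, fired_at)  # replay with original timestamp

    def _send(self, url: str, fired_at: float) -> None:
        # stand-in for the real HTTP call to the ad platform
        print(f"reported {url} @ {fired_at:.0f}")

q = BeaconQueue(online=False)
q.fire("https://ads.example.com/impression/123")  # offline: queued
q.go_online()  # the queued impression is reported with its original time
```

The key detail is that the original firing time, not the upload time, travels with the beacon, so the ad platform records the impression at the moment it actually happened.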

In addition to beacon management, you also have to manage dynamic offline advertising loads. The value of an advertisement typically diminishes rapidly from the point it is initially delivered. The ads with the highest value usually have a short validity window before they don’t pay out. An effective offline advertising platform needs to balance the publisher’s need to deliver high-value ads with the advertiser’s need to have ads display only while they’re valid. Penthera says their solution allows advertisements, delivered via both server-side and client-side insertion methods, to be refreshed and updated over time, without requiring the video to be re-downloaded. They are also considering the ability to download multiple ad loads simultaneously (some with high value but short expiry, and some with lower value but longer expiry) to better ensure that when the customer wants to play a video, there’s always the ability to include a monetizable ad.
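The multiple-ad-load idea can be reduced to a simple selection rule: at play time, serve the highest-value ad that is still within its validity window. The sketch below is hypothetical; the CPM values and expiry windows are made up for illustration.

```python
# Illustrative sketch of selecting among multiple downloaded ad loads:
# prefer the highest-value ad that has not yet expired. All values,
# windows, and field names are invented assumptions.
from datetime import datetime, timedelta

now = datetime(2019, 9, 10)
ad_loads = [
    {"id": "premium",  "cpm": 25.0, "expires": now - timedelta(days=1)},   # expired
    {"id": "backfill", "cpm": 4.0,  "expires": now + timedelta(days=30)},  # long-lived
    {"id": "standard", "cpm": 12.0, "expires": now + timedelta(days=2)},
]

def best_valid_ad(ads, when):
    """Return the highest-CPM ad still valid at `when`, or None."""
    valid = [a for a in ads if a["expires"] > when]
    return max(valid, key=lambda a: a["cpm"]) if valid else None

print(best_valid_ad(ad_loads, now)["id"])  # "standard": best still-valid value
```

With a low-value, long-expiry load always on the device, the fallback case still serves a monetizable ad even if the high-value creative has lapsed.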

I’ve written before about the importance of download as a feature within OTT services and how it may be a big business opportunity for AVOD providers. But this will only hold true if the technology can work seamlessly with existing apps and ad-based revenue models. From what I’ve learned about Penthera, it appears as though they’ve solved some of the biggest challenges of taking AVOD offline. Now we just need more OTT services, both AVOD and SVOD, to start including downloads as an option, to enable additional monetization options.

Job Opening NYC – Solution Architect, Front End Development, Video Applications, $130K-$150K

There is an immediate opening for a Solution Architect, Front End Development, with one of the largest public M&E companies in the world ($200B+ market cap). They have multiple live/VOD OTT offerings and will be coming out with more. I am helping the person you will report to find the right candidate for this job, which is based in NYC (not negotiable) and pays $130k-$150k. This job is not currently listed online. I’ll also add, your boss is someone you would want to work for. I know them on a personal level and they have a very unique background. You would learn a lot from them and be given an opportunity to be amongst some extremely smart individuals. If you are interested in learning who the company is and more about the job, please email me or just give me a call anytime at 917-523-4562. Candidates are being interviewed immediately.

Job Description: Own the process of solving high-impact, highly technical problems that span the purview of multiple organizations and stakeholders, where requirements and direction are often yet to be defined or discovered. This role is one part technical evangelist and one part technical architect. Success in this role requires effectively working with various technical leaders from different organizations to design solutions that work for all parties involved, and to evangelize these solutions and ensure teams can execute effectively.

Preferred Qualifications

  • Able to bridge communication and technical knowledge between multiple engineering and product teams
  • Well organized with good written and verbal communication skills
  • Self-learner, independent, and easily adaptable
  • Architecting resilient applications that handle failure gracefully
  • RESTful web service development
  • Other Tools
    • API testing – PAW and/or Postman
    • Plantuml or other similar sequence diagram tool
    • Jira/Confluence
    • Github
    • Jenkins
  • Scripting Language – node/ruby/python/etc