Selectively Filtering Content in Web Browsers

11/30/2010

Typically the job of a web browser is to download and display content: establishing a network connection, sending HTTP requests, retrieving the web page, and downloading and running all of its content. These operations pose non-trivial challenges, and as such, web browsers are among the most complicated software that most of us routinely use. However, there’s a whole separate (higher level!) challenge around selectively not running (filtering) content.

Today, different browsers offer many different mechanisms for selectively filtering content. This post is a survey of how these mechanisms work, and the subtle and sometimes not so subtle differences between them.

Examples and Motivations

Different users have shown an interest in myriad different types of Content Blocking, and not all users have similar goals.

Certain types of blockers are over a decade old and extremely commonly used (e.g. popup-blockers) while others are less often used or only of interest to a small niche audience. Just reading the comments on this blog, it’s clear that some users want to be able to block cookies, plugins or ActiveX controls, certain types of content (e.g. malware, adult content), privacy-impactful “trackers” (e.g. “web beacons”), advertisements, file downloads, or content they consider “annoying” (e.g. popups, flashing content). Individual consumers may have many different reasons for wanting to block particular content: faster performance, improved security, increased reliability and stability, enhanced privacy, increased battery life, preference about user-experience, legal or supervisory requirements (e.g. parental controls), lower bandwidth charges, as well as many others.

However, on the other end of the internet connection, a website provider may or may not want content blocked, for any number of reasons: revenue (direct or indirect), site analytics and understanding customers and markets, predictability and reliability of the user experience, malicious intent, and many others.

In some scenarios, site publishers and developers are just fine with content blocking and modification. For instance, a site owner whose legitimate site was compromised to serve malware probably wants that malware content blocked to keep his visitors safe until the site can be cleaned. Accessibility tools are crucial for some people to use the web and websites. Some sites and networks may offer users a way to opt out of analytics or other tracking.

With hundreds of millions of unique browser users, billions of webpages, and myriad different stakeholders in any given webpage visit, the complexity of this topic is clear.

This blog post offers an engineering point of view on the technical aspects of content blocking to help inform the broader discussion. There are many other points of view on these issues, and on this topic in general, around the web.

One trend that is clear is that over time we see demand for browsers to integrate functionality that used to be available via add-ons. Back in 2002, I released a moderately successful IE add-on popup-blocker. PopupPopper remained popular for a few years until Windows XP SP2’s IE6 included an integrated popup blocker.

Mechanisms for Blocking Content

Before we dive in, let’s distinguish between two different approaches for identifying content to block. A heuristic blocker relies on “rules” (heuristics) or other computer-generated information to determine what content to block. A curated blocker relies upon a human being (somewhere) to make that determination.

Heuristics can tell a cookie from an image from some web page content, but they can’t determine the intent or use of any of these. Was the “ps” cookie set so that I don’t have to remember my login, or is it a cookie used to track me… or both? Is an image on the webpage a photo of my brother’s new car, or is it an advertisement, or is it a web beacon used to track me? At best, heuristics can take a guess. The advantage of a heuristic blocker is that it can block things it hasn’t seen before as long as it is similar to other content the user is trying to block; the disadvantage is that heuristics can misfire.

In contrast, a curated blocker uses a person’s judgment to determine what to block. The advantage is improved accuracy, with the disadvantage of a lot of work to keep the list up-to-date with the new content available on the web.

With that distinction in place, let’s look at heuristic and curated blockers as they’re implemented today.

There are three major categories of content blockers: Network-level blockers, Browser reconfiguration/filtering, and Browser Content Blocking add-ons.

Each has its strengths and limitations, outlined in the sections below.

Blocking at the Network Level

There are several common ways to block content at the network level; the most common are using the HOSTS file or filtering content with a proxy. There are a number of other, less-common network-level approaches, including using a router to block particular content (most Linksys routers can be configured to block Java, ActiveX installers, and cookies, for example). Large organizations or networks with restricted bandwidth, for instance, may block content at the gateway.

Blocking content at the network has a number of advantages, including the fact that it will work regardless of which browser you use, and does not require browser-specific add-ons. Network blocking also has a number of disadvantages, primarily because download requests do not contain much context about how the content being requested will be used.

The HOSTS File

Blocking via the HOSTS file works by altering how your computer maps web addresses to actual web sites. Specifically, requests for information to particular addresses (hostnames) like “www.example.com” are directed to other addresses, perhaps on your local computer (aka “127.0.0.1”). When your browser dutifully tries to download particular content, the browser attempts to retrieve it from a location that won’t respond. Because the request is effectively to “a wrong number” (to use a telephone analogy), it fails to return any web content and effectively blocks the content.

By definition, the HOSTS file is a curated block list that cannot easily be computer-generated. However, many organizations publish HOSTS files to block particular sites; one of the most popular such files is here.

There are a number of downsides to this approach, including:

It’s non-trivial for most users. Updating the HOSTS file on Vista and above requires elevating to admin and editing a hidden system file in the windows\system32\drivers\etc folder.
It’s non-granular. You cannot block a specific path, you can only block all files from a given hostname.
It doesn’t work if your connection to the internet goes through a proxy server (e.g. at a school or many businesses). When you are behind a proxy, the proxy performs DNS lookups on your behalf, and the local HOSTS file is ignored.
Your machine configuration matters. If you happen to be running a web server locally (or have an unusual firewall configuration), performance may be impaired.
It’s trivially detectable. Because the browser will trigger an error (e.g. the OnError event) when content fails to load, JavaScript can detect the blockage and react.
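
To make the hostname-only granularity concrete, here is a minimal sketch in Python that models a HOSTS-style lookup. It is only an illustration: real name resolution happens in the operating system, and the sample entries (ads.example.com, tracker.example.net) and the function names parse_hosts and is_redirected are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical HOSTS-style entries: "address hostname", one per line.
SAMPLE_HOSTS = """
# comment lines and blank lines are ignored
127.0.0.1   ads.example.com
127.0.0.1   tracker.example.net   # trailing comments are ignored too
"""

def parse_hosts(text):
    """Return a dict mapping each hostname to the address it is redirected to."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping[name.lower()] = address
    return mapping

def is_redirected(url, mapping):
    """True if the URL's hostname has an entry; the path plays no part."""
    host = (urlsplit(url).hostname or "").lower()
    return host in mapping

hosts = parse_hosts(SAMPLE_HOSTS)
print(is_redirected("http://ads.example.com/banner.gif", hosts))  # True
print(is_redirected("http://ads.example.com/wanted.css", hosts))  # True: same host, so also affected
print(is_redirected("http://www.example.com/banner.gif", hosts))  # False: different hostname
```

Note that the second lookup is redirected even though the file might be wanted: the decision is made on the hostname alone, which is the non-granularity described above.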

Filtering with a Proxy

When you configure your browser to use a content-filtering proxy, the proxy can simply decline to return that specific content, or return a replacement file instead of the server’s content. Because the proxy sees the full URL of the content, a proxy can easily block a file named “tracker.js” on every server, or return a blank image for any requests for “/adult/*.gif.” This granularity can be useful when blocking content, because it allows for heuristics that work across multiple hostnames, and it allows blocking subsets of content from a given host in the case that a site serves both wanted and unwanted content.
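
As a rough sketch of that kind of URL-based rule (not the rule format of any particular proxy), the Python below matches glob patterns against the path of each requested URL, regardless of hostname. The RULES list, the action names, and the decide function are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rules: (glob pattern matched against the URL path, action).
RULES = [
    ("*/tracker.js", "block"),        # a file named tracker.js on any server
    ("/adult/*.gif", "blank_image"),  # any GIF under /adult/ on any server
]

def decide(url):
    """Return the action for a URL: 'block', 'blank_image', or 'allow'."""
    path = urlsplit(url).path  # the query string is not part of the path
    for pattern, action in RULES:
        if fnmatchcase(path, pattern):
            return action
    return "allow"

print(decide("http://cdn.example.com/js/tracker.js"))      # block
print(decide("http://photos.example.org/adult/pic.gif"))   # blank_image
print(decide("http://photos.example.org/family/pic.gif"))  # allow
```

Because the rules never mention a hostname, the same two entries apply to every server, and the third request shows a host serving both wanted and unwanted content.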

There are a number of downsides to this approach, including:

Browser configuration. Browsers must be configured to use a proxy; most browsers adopt WinINET/IE’s proxy settings, but not all do.
Performance. Proxying traffic incurs some overhead, particularly because browsers limit the number of simultaneous connections to a proxy.
Most don’t work with HTTPS. Only proxies configured to decrypt HTTPS traffic can block individual files delivered over HTTPS. Other proxies can only entirely prevent HTTPS connections to target servers.

There are a number of content-filtering proxies available. Two of the most popular, the Internet JunkBuster Proxy and Proxomitron, are no longer developed. Privoxy and others are still under development, and do-it-yourselfers can experiment with or build upon the trivial Content-Block extension for Fiddler.

Proxy-based blockers can use either heuristics or a curated block list, or even a combination of both strategies.

Blocking via Browser Configuration

One of the simplest ways to block unwanted content is to use existing features built into the browser. Most browsers offer the ability to disable certain features altogether, and some offer the ability to control certain features on a per-site basis. Internet Explorer offers the following features that allow the user to block unwanted content:

Zones Configuration
Per-Site ActiveX
InPrivate Filtering
Cookie Controls
InPrivate Browsing / Delete Browsing History
The Popup-Blocker

These features offer differing levels of control and granularity, and each blocks only certain types of content. Some of the features are based on heuristics (e.g. the contents of a P3P file or configuration setting) while others require the user or another curator to determine the desired policy.

Zones Configuration

Internet Explorer uses the concept of Security Zones to decide what privileges content from a given site may use. By adjusting the privileges on a per-zone basis, and by selecting which Zone a given site runs in, the user has a powerful level of control over what the site may do.

There are myriad possible configurations that may be used, but the simplest is to place sites from which to block content in the Restricted Sites Zone. Content from the Restricted Sites zone runs with very few permissions, and may not send or store cookies, serve script, run script from other sites, load ActiveX controls, or download files. For instance, if you place *.google-analytics.com in the Restricted Sites zone, script from that server may not run on any other page, and cookies are not sent or stored from that server.
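
The idea can be sketched in a few lines of Python. This is a simplified model, not Internet Explorer’s implementation: the glob matching may not mirror IE’s own pattern rules, every non-restricted host is lumped into a single “internet” zone with full permissions, and the names RESTRICTED_PATTERNS, RESTRICTED_DENIED, zone_for, and is_allowed are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical user configuration: host patterns placed in the Restricted Sites zone.
RESTRICTED_PATTERNS = ["*.google-analytics.com"]

# Capabilities described above as unavailable to Restricted Sites content.
RESTRICTED_DENIED = {"cookies", "script", "activex", "file_download"}

def zone_for(hostname):
    """Map a hostname to a zone; anything unlisted is treated as 'internet'."""
    host = hostname.lower()
    if any(fnmatchcase(host, pattern) for pattern in RESTRICTED_PATTERNS):
        return "restricted"
    return "internet"

def is_allowed(hostname, capability):
    """Deny a capability only for restricted hosts (a simplification)."""
    return not (zone_for(hostname) == "restricted" and capability in RESTRICTED_DENIED)

print(zone_for("www.google-analytics.com"))               # restricted
print(is_allowed("www.google-analytics.com", "script"))   # False
print(is_allowed("www.google-analytics.com", "cookies"))  # False
print(is_allowed("www.example.com", "script"))            # True
```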

The user-interface for adding sites to the Restricted Zone is simple to use, and can even be controlled by Group Policy.

However, using Zones to restrict unwanted content has a number of downsides:

Network Performance. Because content is still downloaded even if it is not used, using the Restricted Zone does not fully recoup the performance impact of unwanted content.
Content is restricted, not necessarily blocked. For instance, script from a site in the Restricted Zone is not run, but images from that zone will still be shown. A sufficiently dedicated tracking technology could associate an HTTP request with the user even though no cookies are sent.
Scale is limited. The Restricted Zone UI and data structures were designed for a personal scale, not internet-scale lists of sites. Listing more than a few hundred sites will begin to slow down browser startup performance. For instance, one product automatically injects 10000 sites in the Restricted Zone, which significantly impacts Internet Explorer’s performance.

We commonly get requests to convert the Restricted Sites zone into a “blackhole zone” that has no permissions, including permission to make network requests. That won’t work, because Security-Restricted IFRAMES, for instance, use the Restricted Zone settings and they must be able to render content (or they become ineffective for what developers expect them to do). Another suggestion is to create a new Zone solely for content-blocking, but unfortunately this would be difficult because many programs and frameworks are hardcoded to the current set of Zones and behave very poorly if a new Zone appears.

Per-Site ActiveX Configuration

Users who wish to control Flash, Silverlight, Java, or other plug-ins in Internet Explorer can use the Per-Site ActiveX feature of IE8 to control which sites may display such content. By removing the * from the list of allowed sites for an add-on, the user will receive a prompt on any site that attempts to use the add-on. Unless the user adds the site to the approved list, the add-on will not run on that site.

There are a number of weaknesses to using Per-Site ActiveX to block content:

UI Annoyance – Because the Information Bar lacks a “Never for this site” option, users will always see this prompt on any site for which the add-on is not approved.
Manageability – The user-interface for the Allow List is buried deep within Manage Add-ons, and also does not offer a “Never” option. The UI does not allow users to remove individual sites from the Allow list; the only option is to clear the entire list.

Nevertheless, this can be a powerful feature for restricting which sites may use a given browser add-on with more granularity than the legacy “Enable / Disable” option that controls whether an ActiveX control could run at all.

InPrivate Filtering

Internet Explorer’s InPrivate Filtering feature detects and optionally blocks 3rd-party content to help consumers exercise control over their browsing information. The feature can operate in either a heuristic or curated mode at the user’s choice.

Third-party content that appears across multiple sites is presented for the user’s review, and the user may choose to allow or block such content. When the user chooses to block that content, Internet Explorer will no longer make requests to the target URL when it is third-party content on the page. Unless the user visits that URL directly, IE will not visit that URL to get content from it. This is an effective mitigation for people concerned about the risk of sharing information with potential tracking sites. By default, InPrivate Filtering is off, and users need to explicitly choose to turn it on each time they run the browser.

InPrivate Filtering offers the ability to import lists of sites to block, which appears to be a popular feature in some circles.

The user may further configure InPrivate Filtering to “Automatically block” content from a 3rd-party context when a certain threshold number of uses is reached. So, for instance, the user can configure IE to block third-party content which is used by 5 or more unrelated sites.
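
A threshold rule of this kind can be sketched as follows. This is my own minimal Python model of the idea, not IE’s implementation; the ThresholdFilter class, its method names, and the example sites are all hypothetical.

```python
from collections import defaultdict

class ThresholdFilter:
    """Counts the distinct first-party sites on which each third-party URL was seen."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen_on = defaultdict(set)  # third-party URL -> set of first-party sites

    def observe(self, first_party_site, third_party_url):
        """Record that a page on first_party_site referenced third_party_url."""
        self.seen_on[third_party_url].add(first_party_site)

    def should_block(self, third_party_url):
        """Block once the URL has appeared on at least `threshold` distinct sites."""
        return len(self.seen_on.get(third_party_url, ())) >= self.threshold

f = ThresholdFilter(threshold=5)
url = "http://widgets.example.net/share.js"
for site in ["a.example", "b.example", "c.example", "d.example"]:
    f.observe(site, url)
print(f.should_block(url))   # False: seen on only 4 sites
f.observe("a.example", url)  # a repeat visit does not count as a new site
print(f.should_block(url))   # False
f.observe("e.example", url)
print(f.should_block(url))   # True: 5 distinct sites
```

Nothing in the counter knows what share.js actually does; a widely shared script library would cross the threshold in exactly the same way.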

The downsides of the heuristic or “Automatically block” mechanism are clear:

No way to determine intent. There’s no way for a browser to know whether or not a given piece of content (e.g. a “share this link!” widget) is being used to track you, or is used for a harmless purpose. Hence, content may be blocked as a “possible tracker” unnecessarily.
Shared script repositories break. Big companies like Microsoft and Google host popular JavaScript libraries like jQuery on fast CDN servers and invite other websites to reuse those libraries. This is great for performance (because the user is likely to have the library stored in their cache) but is indistinguishable from a tracker. If a needed script is automatically blocked, the page which relies upon it will break.

The InPrivate Filtering feature is controlled on a per-Zone basis; when the user has opted in, the filtering is applied to the Internet and Restricted Zones only.

Cookie Controls

As I blogged back in June, Internet Explorer offers an extremely rich set of controls for cookies that allow users to specify simple options like “Block all cookies from example.com” to advanced options like “Discard all 3rd-party cookies at the end of the browser session.” Cookies are also controlled by Zone privileges. By default, cookies are unconditionally permitted in the Local Computer and Intranet zones, subjected to cookie controls in the Internet and Trusted Zones, and blocked entirely in the Restricted Zone.

InPrivate Browsing / Delete Browsing History

The InPrivate Browsing and Delete Browsing History features can be used to prevent storage of unwanted cookies or other information. The new “Delete browsing history on exit” checkbox in IE8 allows the user to delete unwanted content at the end of each browser session. The “Preserve Favorites website data” option allows preservation of desired content while wiping everything else.

The Popup-Blocker

Internet Explorer 6 on Windows XP SP2 introduced the popup blocker. The popup blocker includes a number of configuration options which can be found inside the Tools > Internet Options > Privacy settings dialog. The configuration settings for the Popup Blocker are stored inside the HKCU\Software\Microsoft\Internet Explorer\New Windows\ registry key, including the list of sites that are permitted to launch popup windows. The Popup Blocker is enabled on a per-Zone basis, and applies (by default) to pages in the Internet, Trusted, and Restricted Zones.

Blocking via Add-ons

In some cases, enthusiasts are not satisfied with the options provided by built-in browser controls and have built a variety of different add-ons to block content. Some add-ons automatically block content (using either heuristics or a curated block list) before it is downloaded, while other add-ons simply remove unwanted content after it has been loaded.

Automatic Blocking

Automatic blockers tend to inject themselves into the browser’s download subsystem and watch for requests to unwanted content; such requests are then terminated or a locally-generated placeholder is returned. Less commonly, such add-ons will scan the DOM of the currently loaded document and remove content which matches some pattern (e.g. images within a DIV named “adultcontent”).
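
The DOM-scanning variant can be sketched with Python’s standard html.parser module. This is only an illustration of the pattern-matching idea, not how any of the add-ons below work: it assumes the DIV is identified by its id attribute, it rewrites an HTML string rather than a live DOM, and it drops comments and doctype declarations for brevity. The ImageStripper class and strip_images function are hypothetical names.

```python
from html.parser import HTMLParser

class ImageStripper(HTMLParser):
    """Re-emits HTML, dropping <img> tags inside the <div> with the target id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.out = []
        self.div_depth = 0  # > 0 while inside the target div (counts nested divs)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif dict(attrs).get("id") == self.target_id:
                self.div_depth = 1
        if tag == "img" and self.div_depth > 0:
            return  # drop the image
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> arrive here.
        if tag == "img" and self.div_depth > 0:
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def strip_images(html, target_id="adultcontent"):
    parser = ImageStripper(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

page = ('<p><img src="car.jpg"></p>'
        '<div id="adultcontent"><div><img src="x.gif"></div>text</div>'
        '<img src="after.png">')
print(strip_images(page))
```

Only the image nested inside the target DIV is removed; the images before and after it are left alone.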

Content blocking add-ons for IE include: Simple Ad-Block, IE7Pro, AdBlockIE, Adblock Pro, as well as many others. The downside of these add-ons is the downside of add-ons across all browsers: performance and reliability. Many of these add-ons use mechanisms that do not follow the IE Add-on Guidelines and Requirements and depend upon unsupported and fragile “thunking” of private browser APIs.

Manual Blocking

In contrast to the Automatic Blockers, Manual Blockers allow the user to remove unwanted content from the current page after it has been loaded. These blockers are often simpler to develop although their capabilities are usually limited—often they simply serve as a streamlined user-interface that configures existing browser features.

Toggle Flash and dozens of variants simply toggle the “Enabled/Disabled” setting for the Flash object.
You can configure a script for Ralph Hare’s Mouse Gestures add-on such that waggling the mouse will remove images and ActiveX controls from the current page.
Similar scripts exist as Context-Menu extensions or bookmarklets.

The advantage of Manual Blockers is that they typically only do work when invoked, and thus tend to be faster and more reliable. The disadvantage is that, because they tend to run after the content is loaded, users still pay the penalty of initially downloading the unwanted content.

Evaluation of Blocking Mechanisms

Each of the blocking mechanisms listed above has one or more downsides.

Perhaps the biggest risk is to the user’s experience when interacting with a site whose content is blocked. Commonly, browsing enthusiasts may configure a blocking mechanism and generally enjoy its benefits, but then later waste a great deal of time trying to figure out why some site they care about isn’t working correctly. In some cases, the user experiences a “doh!” moment, guesses that blocked content may be causing a problem, and subsequently adjusts the blocking mechanism to fix the site. In other cases, the user may never suspect that content-blocking has caused a problem, and the resultant breakage may lead the user to abandon using the site or the web browser thinking that one or the other is “buggy.”

Carefully curated blocking mechanisms are somewhat less likely to cause site-compatibility problems than heuristic approaches because software cannot readily determine intent in the same way that a human can. However, curated lists can be burdensome for the author and user to maintain.

As websites evolve, both curated and heuristic blocking mechanisms may become less effective.

Content Blocking and Site Evolution: A Case Study

Let’s look at how sites and one particular mechanism for content blocking (popup blockers) evolved over time in practice on the web. The history of popups and popup-blockers is a great case to study, because while popups are now somewhat rare, they used to be everywhere.

Back in the early days of popup blockers, I once visited a small tech news site. After a popup from the site was blocked, the site displayed an alert.

At the time, I didn’t think much of this warning. Thirty seconds later, the site attempted to show another popup, and upon failure, it embarked on a primitive denial of service attack. An endless stream of alert dialogs was presented, preventing further use of the browser window.

Now, this sort of site response certainly isn’t common today, but nevertheless many sites will detect when content they insist upon delivering is not delivered. The site may have any of a number of reasons – perhaps artistic integrity of their content or a contractual obligation. As with most issues on the web, there are many points of view: consumer, site, security, accessibility, IT, and more.

This gets back to the point above about what software can recognize: data types and patterns, not intent. So, if you as a user configure blocking of content through one of the above mechanisms, understand that sites, historically, have responded, and we are still living with back and forth on something as old as popup blockers. Here’s another example.

Sometimes you want a click on a web page to result in a new window. For example, when you click on the “Reply” button in your web mail application, you may expect that to open a new window to allow you to compose a message. However, some sites use (or co-opt) that click in order to launch a popup, bypassing the popup blocker.

For instance, if you visit the online Dilbert comic today, you will likely see a notification that a popup has been blocked. If you subsequently click anywhere in that page, a JavaScript file delivered from casalemedia.com reacts to that click by spawning an advertisement delivered by a content delivery network. This popup is not blocked because the click is a User-Initiated Action, which temporarily disables the popup blocker by default. While the user may be able to block the advertisement by taking note of the hostname in the popup (optmd.com), blocking that site will only block the content of the popup, and not the popup itself. Only by examining network traffic can a savvy user determine which site to block in order to prevent the popup-blocker-circumvention JavaScript from running. While IE’s popup blocker can be configured to block (actually, not exempt) popups that are a result of a User-Initiated Action (Tools > Internet Options > Privacy > Settings > Blocking Level: High), this setting makes it much more cumbersome to use sites that rely upon this mechanism for popups the user actually solicited.
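
The decision described here boils down to a very small rule, sketched below in Python. This is a deliberately simplified model of the behavior described in this post, not IE’s actual logic: the should_block_popup function and the "default" and "high" level names are hypothetical, and the list of sites permitted to launch popups is left out.

```python
def should_block_popup(user_initiated, blocking_level="default"):
    """Hypothetical popup-blocker rule.

    At the 'high' level no popup is exempt; at any other level a popup
    that results from a user-initiated action (such as a click) is allowed.
    """
    if blocking_level == "high":
        return True
    return not user_initiated

print(should_block_popup(user_initiated=False))                        # True: unsolicited popup
print(should_block_popup(user_initiated=True))                         # False: the click exempts it
print(should_block_popup(user_initiated=True, blocking_level="high"))  # True: no exemption
```

The rule sees only whether a click happened, not whether the user wanted what the click produced, so a co-opted click and a solicited “Reply” window look identical to it.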

Similarly, some sites have responded to content blockers that focus on advertisements. For instance, one of the top five web mail sites will detect if an advertisement has been blocked, and if so, it will simply try loading a different advertisement from a different ad-server, rotating between five or more different advertising providers hoping to find an unblocked host. Similarly, one of the most popular online advertisers will detect when in-page advertisements have been blocked, and rather than taking the user to the next page when reading a multi-page story, the site will instead present a full-page interstitial advertisement with a count-down timer. While this advertisement too can be blocked, the page itself typically is not, leading to a degraded user-experience. Some smaller sites were so incensed by the use of ad-blockers in Firefox that they simply banned all Firefox users, redirecting to a lengthy tirade. In a recent development, one firm now delivers ads as a part of a CAPTCHA test—blocking the ad means you cannot use the site.

The update cycle, back and forth, between browsers and sites can take many years. For instance, only relatively recently did the Outlook Web Access team introduce a version that prevents popup-blockers from breaking their user-experience, and many other web applications have yet to make such updates. This is particularly troublesome because unblocking popups causes a page refresh to occur (any JavaScript that was trying to manipulate the popup would have failed to run when it was blocked, so a refresh is required to ensure the script runs properly). Refreshing the pages of many web applications in this way causes them to lose important state information.

Unfortunately, some sites will likely evolve to circumvent blockers against the user’s preferences, while other web applications may not bother to detect or mitigate content blocking, leading to an impaired user-experience.

-Eric Lawrence