January 2015

Volume 30 Number 1

Big Data - Protect Your Online Services with Big Data and Machine Learning

By Alisson Sol, Don Ankney, Eugene Bobukh

There are currently several methods for protecting online services, from the Security Development Lifecycle to operational processes for rapid incident response. Yet one of the primary assets of online services is usually overlooked: the Big Data created by request logs and operational event monitoring. This article explores usage data processing and machine learning (ML) techniques to improve security, based on the experience of protecting online assets in the Microsoft Applications & Services Group (ASG), including services such as Bing, Bing Ads and MSN.

Most online services create several streams of logged data. While there’s no standard taxonomy for the kinds of measurements you can store about a service, when you’re exploring that data seeking security issues, you can broadly categorize it as usage data or operational data. Usage data includes any logged value regarding use of the service by its target audience. A common example is a log entry for requests made to a Web site:

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2014-10-27 20:46:57 GET /search q=election+results&form=PRUSEN&mkt=en-us 80 - Mozilla/5.0+(Windows+NT+6.4;+WOW64;+Trident/7.0;+Touch;+rv:11.0)+like+Gecko - 200 0 0 5265

This type of log entry contains data not only about the requested resource, but also the client browser, return code and time taken to complete the request. More sophisticated services may enrich usage data with derived information such as geolocation or application-specific information like user identification (for logged-in users). There would be no usage data without actual users, except perhaps for testing and monitoring agents.
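The #Fields header describes the layout of each log entry, and the mapper code later in the article works over those columns. As a rough illustration of how the header maps to the columns (a Python sketch, not the article's code; note that logged user-agent strings escape spaces with "+" so each field remains a single whitespace-separated token):

```python
# Field names taken from the W3C extended log format header shown above.
FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
    "sc-status", "sc-substatus", "sc-win32-status", "time-taken",
]

def parse_log_line(line):
    """Map whitespace-separated columns to their field names."""
    return dict(zip(FIELDS, line.split()))
```

With a record parsed this way, the columns of interest (cs-uri-stem, cs-uri-query, sc-status) can be addressed by name instead of by position.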

Operational data refers to server and service operational measurements. This includes CPU utilization or temperature, disk space, network transfer rate, application exceptions, memory faults, and similar factors logged as soon as a server is turned on and a service started. In modern datacenters, log information typically includes not only computing devices, but also aspects such as air-conditioning measurements, presence of personnel and visitors in zones containing sensitive data, doors being opened and closed, and similar information required by operational security standards.

The code samples in this article will focus on processing usage data. However, you could apply most of the principles outlined and demonstrated here to identify vulnerabilities using operational data. You can also improve your chances of identifying security incidents by correlating usage data with operational data.

Attacking Endpoints

The pace of change in a large online service makes it hard to protect using only typical Security Development Lifecycle practices, such as code reviews and static analysis tools. Thousands of changes are committed every month, and there are often at least a few hundred experiments “in flight” at any given point. These experiments present new features to selected users to gather feedback before widespread release. Besides following good development practices and having penetration test teams constantly trying to pinpoint vulnerabilities, it’s important to automate as much of the vulnerability discovery as possible.

Near the end of 2014, Microsoft Bing services generated hundreds of terabytes of usage data on a daily basis, logging up to hundreds of billions of requests. It’s safe to assume some of these requests were actually attacks trying to identify or exploit vulnerabilities. A typical query to Bing is made by sending the service a URL request:

http://www.bing.com/search?q=election+results&form=PRUSEN&mkt=en-us

In this example, the user is searching for “election results.” There are two other parameters in the URL that identify the Web form that originated the request and the market setting (in this case, indicating that the user’s language is English and the market is the United States). You can think of this as a call to the “search” application within the Bing domain, with parameters q, form and mkt. All such requests are expressed in a canonical format, like so:

search?form=1&mkt=1&q=1&

There are other applications within the Bing domain answering similar requests. A request asking for “election results” in the video format would be:

http://www.bing.com/video?q=election+results&form=PRUSEN&mkt=en-us

As online services grow, new applications and features are added dynamically—some for convenience and others for compatibility. Different formats are often allowed for the same request. Bing videos would also accept this request:


Assuming you could derive from the usage logs a list with all canonical requests to a service, you could then probe for vulnerabilities trying to inject known malicious payloads into the parameter values. For example, an intruder could use the following request to verify if the Bing video application is vulnerable to cross-site scripting (XSS) in the query parameter:

http://www.bing.com/video?q=<script>alert("XSS")</script>&form=PRUSEN&mkt=en-us

An intruder scanning for vulnerabilities will also test the responses for malicious payloads injected into other parameters, in all possible combinations. Once a vulnerability is found, an attack can be launched. Attack URLs are usually included in spam messages, in hopes that a small percentage of users will carelessly click the links. Some users might even grow suspicious of URLs containing JavaScript keywords. However, encoding the requests makes attacks more difficult to identify promptly:

http://www.bing.com/video?q=%3Cscript%3Ealert(%22XSS%22)%3C%2Fscript%3E&form=PRUSEN&mkt=en-us

You can write an application that accepts the list of canonical requests to a service as input, injecting malicious payloads for each kind of possible vulnerability and detecting from the service’s response whether the attack succeeded. You can find the code for individual “detectors” for several kinds of vulnerabilities (XSS, SQL injection and open redirects) online. In this article, we’ll focus on finding the “attack surface” for online services from the usage data logs.
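Such an application can be sketched briefly. The following Python fragment (an illustration with hypothetical names, not the detector code mentioned above) expands one canonical request into probe URLs, injecting a classic reflected-XSS payload into one parameter at a time:

```python
from urllib.parse import quote

# A classic reflected-XSS probe; a real attack service would cycle
# through payloads for many vulnerability classes.
XSS_PAYLOAD = '<script>alert(1)</script>'

def probe_urls(canonical_request, payload=XSS_PAYLOAD):
    """Expand a canonical request such as 'search?form=1&mkt=1&q=1&'
    into one probe URL per parameter, injecting the payload into that
    parameter and a benign value into the rest."""
    path, _, query = canonical_request.partition('?')
    names = [p.split('=')[0] for p in query.split('&') if p]
    urls = []
    for target in names:
        pairs = [(n, payload if n == target else 'test') for n in names]
        qs = '&'.join(f'{n}={quote(v)}' for n, v in pairs)
        urls.append(f'/{path}?{qs}')
    return urls
```

A detector would then issue each probe and inspect the response for the reflected payload; covering combinations of parameters multiplies the probe count accordingly.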

Processing Environment

Weblogs are usually distributed across several machines, and sequential log file reads are extremely efficient (even more so when the files are partitioned across different storage devices in a distributed file system). That makes processing Weblogs a great application for the MapReduce framework.

For this example, we’ll place Weblogs in Microsoft Azure Blobs under the same container called InputContainer. As a processing platform, we’ll use Azure HDInsight Streaming MapReduce jobs. There’s good information already available online on how to set up and configure HDInsight clusters. The code explained in this article will generate binaries you should place in a container accessible to the HDInsight cluster, referred to as ClusterBinariesContainer. As code executes and processes input, it will create output in another container called the ClusterOutputContainer, along with status information saved to the ClusterStatusContainer. A visualization of the Azure HDInsight processing configuration is shown in Figure 1.

The Azure HDInsight Processing Environment
Figure 1 The Azure HDInsight Processing Environment

You need to replace the placeholder names in Figure 1 with values for your specific configuration. You can set these in a configuration file. The Windows PowerShell script that creates and executes the HDInsight job reads the XML configuration file shown in Figure 2. After configuring the file, you’ll most likely execute the script for usage data analysis from an Azure PowerShell prompt, after setting up your Azure account with authorization to access the storage and compute services (see the Get-AzureAccount and Add-AzureAccount cmdlet help).

Figure 2 Windows PowerShell Script XML Configuration File

<?xml version="1.0" encoding="utf-8"?>
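Only the XML declaration of the configuration file survives in this copy of the article. A minimal sketch of the kind of settings the script would read (element names here are illustrative, mirroring the placeholders from Figure 1) might look like:

```xml
<?xml version="1.0" encoding="utf-8"?>
<Configuration>
  <!-- Illustrative element names; substitute your own values -->
  <SubscriptionName>[SubscriptionName]</SubscriptionName>
  <ClusterName>[ClusterName]</ClusterName>
  <StorageAccountName>[StorageAccountName]</StorageAccountName>
  <InputContainer>[InputContainer]</InputContainer>
  <ClusterBinariesContainer>[ClusterBinariesContainer]</ClusterBinariesContainer>
  <ClusterOutputContainer>[ClusterOutputContainer]</ClusterOutputContainer>
  <ClusterStatusContainer>[ClusterStatusContainer]</ClusterStatusContainer>
</Configuration>
```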

Map to Canonical Requests

Getting the attack surface for the online services using the MapReduce processing environment consists of creating a mapper application to extract the URLs from the Weblogs and transform them into canonical form. That value becomes the key for the reducer, which will then eliminate duplicates. That’s the same principle used in the sample word count application available for HDInsight. Removing any comments and validation code, the following code demonstrates the main entry point for the mapper application:

public static void Main(string[] args)
{
  Console.SetIn(new StreamReader(args[0]));
  string inputLogLine;
  while ((inputLogLine = Console.ReadLine()) != null)
  {
    string outputKeyAndValues = ExtractKeyAndValuesFromLogLine(inputLogLine);
    Console.WriteLine(outputKeyAndValues);
  }
}

This code goes through every input line and extracts the unique key, as well as any complementary values relevant to the problem being solved. For example, if you were seeking the most common user queries, the key would be the value passed for the query parameter. Raw log lines appear as follows:

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2014-10-27 20:46:57 GET /search q=election+results&form=PRUSEN&mkt=en-us 80 - Mozilla/5.0+(Windows+NT+6.4;+WOW64;+Trident/7.0;+Touch;+rv:11.0)+like+Gecko - 200 0 0 5265

Columns cs-uri-stem and cs-uri-query contain the relevant information you need to parse to get the canonical form of the request (the sample code doesn’t include multiple-host processing). The function to extract key and values from each log line is outlined in Figure 3.

Figure 3 Function to Extract Key and Values from Log Line

private static string ExtractKeyAndValuesFromLogLine(string inputLogLine)
{
  StringBuilder keyAndValues = new StringBuilder();
  string[] inputColumns = inputLogLine.Split(DataFormat.MapperInputColumnSeparator);
  string uriReference = inputColumns[DataFormat.MapperInputUriReferenceColumn];
  string uriQuery = inputColumns[DataFormat.MapperInputUriQueryColumn];
  string parameterNames = ExtractParameterNamesFromQuery(uriQuery);
  // Key = uriReference + separator + parameterNames
  keyAndValues.Append(uriReference.TrimStart('/'));
  keyAndValues.Append('?');
  keyAndValues.Append(parameterNames);
  // Value: a count of 1 for this occurrence, separated by a tab
  keyAndValues.Append('\t');
  keyAndValues.Append('1');
  return keyAndValues.ToString();
}

The only missing logic relates to extracting the parameter names from the query column. Code to perform the task is shown in Figure 4. The input for that function—the previously provided sample line—would be a string like this:

q=election+results&form=PRUSEN&mkt=en-us

Figure 4 Function to Get Just Parameter Names from Query

private static string ExtractParameterNamesFromQuery(string query)
{
  StringBuilder sb = new StringBuilder();
  // Go through each parameter, adding to the output string
  string[] nameValuePairs = query.Split(DataFormat.ParametersSeparator);
  Array.Sort(nameValuePairs, StringComparer.InvariantCultureIgnoreCase);
  List<string> uniqueParameterNames = new List<string>();
  foreach (string nameValuePair in nameValuePairs)
  {
    int indexOfSeparatorParameterNameFromValue = nameValuePair.IndexOf('=');
    string paramName = nameValuePair;
    if (indexOfSeparatorParameterNameFromValue >= 0)
      paramName = nameValuePair.Substring(0, indexOfSeparatorParameterNameFromValue);
    if (uniqueParameterNames.Contains(paramName))
      continue;
    uniqueParameterNames.Add(paramName);
    sb.Append(paramName);
    sb.Append("=1&"); // Placeholder "1" replaces the parameter value
  }
  return sb.ToString();
}

The canonical form used in the sample code will remove the parameter values, sort the parameter names and transform this into a still valid query string:

form=1&mkt=1&q=1&

Sorting the parameter names helps avoid duplication, because Web requests don’t depend on parameter order. The placeholder used for the parameter value is “1,” instead of “[]” because it’s shorter. It may also be used for other things like counting the number of times parameters appear in all combinations of request parameters, as shown in Figure 4.
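The same canonicalization is easy to express outside the MapReduce pipeline. A Python equivalent of the transformation (an illustration, not the article's C# sample):

```python
def canonicalize_query(query):
    """Reduce a query string such as 'q=election+results&form=PRUSEN&mkt=en-us'
    to its canonical form: values dropped, duplicate names removed,
    names sorted case-insensitively, placeholder '1' substituted."""
    names = sorted(
        {pair.split('=')[0] for pair in query.split('&') if pair},
        key=str.lower)
    return ''.join(f'{name}=1&' for name in names)
```

For the sample query above this yields "form=1&mkt=1&q=1&", matching the mapper's output key.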

Reduce the Attack Surface

The mapping code sequentially reads the Weblog lines, then outputs a key and value for each. MapReduce has a “combine” phase, which assembles all records with the same key for processing by the reducer code. If an input log had several lines with such search queries, by this point they would all produce identical output:

search?form=1&mkt=1&q=1&         1
search?form=1&mkt=1&q=1&         1
search?form=1&mkt=1&q=1&         1

Figure 5 has the outline for the reducer code. It reads input lines and splits those into key and values. It keeps a counter until the key changes, and then outputs the result.

Figure 5 Reducer Main Loop

public static void Main(string[] args)
{
  string currentKey, previousKey = null;
  int count = 0;
  Console.SetIn(new StreamReader(args[0]));
  string inputLine;
  while ((inputLine = Console.ReadLine()) != null)
  {
    string[] keyValuePair = inputLine.Split('\t');
    currentKey = keyValuePair[0];
    if (currentKey != previousKey)
    {
      if (previousKey != null)
        Console.WriteLine("{0}\t{1}", previousKey, count);
      count = 1;
      previousKey = currentKey;
    }
    else count++;
  }
  if (previousKey != null)
    Console.WriteLine("{0}\t{1}", previousKey, count);
}

You could easily modify the code and script provided in this article for other purposes. Change function ExtractKeyAndValuesFromLogLine to have the parameter values as the keys and you’d have a useful distribution of value frequency. In its current form, the output will be a list with the attack surface, showing the normalized application path and frequency of requests:

search?form=1&mkt=1&q=1&         3
video?form=1&mkt=1&q=1&           10
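In miniature, the whole map-combine-reduce flow amounts to counting canonical keys. A Python sketch of the end-to-end idea (an illustration only, not the HDInsight pipeline itself):

```python
from collections import Counter

def attack_surface(requests):
    """Map each request URL to its canonical key and count duplicates,
    mimicking the mapper/combiner/reducer flow described above."""
    counts = Counter()
    for url in requests:
        path, _, query = url.lstrip('/').partition('?')
        names = sorted({p.split('=')[0] for p in query.split('&') if p},
                       key=str.lower)
        counts[path + '?' + ''.join(f'{n}=1&' for n in names)] += 1
    return counts
```

The distributed version exists because the input is far too large for one machine, but the logic per key is exactly this simple.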

Understand Service Traffic

The attack surface will already be valuable in helping you understand what’s happening with your service, even if you don’t perform proactive penetration testing to expose vulnerabilities. Request frequency changes by hour, by day of the week, and with other seasonal factors, but behavior that is normal over longer periods shifts only with feature releases or coordinated attacks. For example, Bing receives hundreds of billions of requests per day. The attack surface list typically has hundreds of thousands of entries per day, and not all of those paths are even expected. Figure 6 summarizes what’s found on the attack surface on a typical day. The first row indicates the top 10 canonical paths account for 89.8 percent of the usual request traffic. The next 10 paths add another 6.3 percent of the request count (96.1 percent for the top 20). Those are really the top applications for the service. Everything else amounts to less than 4 percent. Such a request pattern is very different for a site with syndicated content, like MSN.com.

Figure 6 Typical Distribution of Request Paths for Bing in 2013

Canonical Paths                 Percentage
Top 10                          89.8
Top 20                          96.1
Paths with <= 1,000 requests    99.9
Paths with <= 100 requests      99.6
Paths with <= 10 requests       97.6
Paths with = 1 request          67.5

It’s particularly interesting to note that about two-thirds of the canonical paths receive exactly one request. Some of those requests may come from attackers probing for application parameters that might trigger certain functionality. Yet the very nature of online services generates a lot of such traffic: links to your service stored a few years back may still be activated by humans or automated processes. While seeking attackers, you may instead uncover the need for a compatibility mode that automatically redirects old URLs to new versions of the application. That’s a good business result.

Developing the attack application is a journey you should take with care. Even assuming your detectors are all perfect, you’ll be imposing a load on your service that needs to be properly throttled. You need to avoid creating a denial-of-service attack or affecting performance for real users. It’s also essential to avoid numerous false positives; if they occur, incident reports from the attack service will soon be ignored.

Learn from Service Data

ML lets you automate several processes that would be difficult to implement by directly coding instructions or rules. For example, it would be hard to conceive the code for a computer vision application that detects when a person is in front of a camera. After labeling thousands of images with depth information, however, the Kinect team at Microsoft was able to “train” an ML module to do this with sufficient accuracy. The labeled depth images, indicating not only human presence but also body position, enabled the learning process.

An attack service that generates requests of known categories (XSS, SQL injection and so on) automates an important part of the process of using ML methods to evaluate online traffic: it generates a large body of synthetic ground truth. By checking the usage logs, you can easily identify all the attack requests known to have been made by the attack service at a given time. They’re now mixed with user requests, for which there isn’t yet a known classification (normal or malicious).

To create a stratification of the data that truly matches the traffic hitting the service, you have to understand the nature of that traffic. Generating an experimental sample with 0.1 percent of the usage data for a service receiving 100 billion requests a day still results in 100 million requests. Only synthetic data can help create such an initial ground truth.
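One common way to build such an experimental sample (a general technique, not something specific to the article) is deterministic hash-based sampling, which keeps roughly 0.1 percent of lines and selects the same lines on every run:

```python
import hashlib

def in_sample(request_line, rate_per_thousand=1):
    """Keep ~0.1 percent of log lines by hashing each line into one of
    1,000 buckets; hashing makes the selection repeatable across runs."""
    digest = hashlib.md5(request_line.encode('utf-8')).digest()
    bucket = int.from_bytes(digest[:4], 'big') % 1000
    return bucket < rate_per_thousand
```

Stratification can then be layered on top, for example by sampling each canonical path at a rate proportional to its share of the traffic.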

Assuming you have high-quality ground truth and adequate tools, the iterative cycle for the learning process for an ML solution to evaluate user requests is outlined in Figure 7. Starting with synthetic data in the ground truth, you can make experiments and kick off a training process in the ML module to classify requests (into categories such as normal, XSS, SQL injection and so on) or make a regression (indicating the confidence a request belongs to one or more of the categories). You could then deploy this ML module as part of a solution and start receiving evaluation requests. The output is then subject to a scoring process, which will indicate whether the ML module correctly identified the requests (true positives and true negatives), missed suspicious requests (false negatives) or generated a false alert (false positive).

Learning Cycle to Create a Machine Learning-Based Solution
Figure 7 Learning Cycle to Create a Machine Learning-Based Solution

If the initial experiments produced a good enough ML module based on synthetic data, that module should be fairly accurate with few incorrectly evaluated actual user requests. You can then properly label those that were wrongly evaluated and add them to the ground truth. A few more experiments and training should now generate a new ML module with restored accuracy. As you carefully iterate this process, the initial synthetic data becomes a smaller part of the ground truth used in the training process, and iterations of the ML module get better at accurately evaluating user requests. For additional validation, you can use the ML module for offline applications examining usage logs and identifying malicious requests. After sufficient development, you can deploy the ML module online to evaluate requests in real time and prevent attacks from ever hitting back-end applications.
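The scoring process described above boils down to tallying a confusion matrix over the evaluated requests. A minimal sketch, with illustrative names:

```python
def score(predictions, labels):
    """Tally the four outcomes described above. Both inputs are lists of
    booleans, where True means 'malicious'."""
    tally = {'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            tally['tp'] += 1          # correctly flagged attack
        elif not predicted and not actual:
            tally['tn'] += 1          # correctly passed normal request
        elif predicted:
            tally['fp'] += 1          # false alert
        else:
            tally['fn'] += 1          # missed suspicious request
    return tally
```

Tracking these four counts per iteration shows whether retraining on newly labeled requests is actually restoring accuracy.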

Wrapping Up

While you should continue to follow solid development processes (including the Security Development Lifecycle), you should also assume your online service may be under attack at any point. Usage logs can provide you with insightful information about how such attacks are occurring. Knowing your attack surface will help you proactively attack your service to identify and close vulnerabilities before they’re exploited. Building that attack service then creates synthetic ground truth, enabling the use of ML techniques to train an ML module to evaluate service requests. Building the attack service is not a trivial task, but the immediate and long-term business results more than justify the investment.

Alisson Sol is a principal architect for Microsoft. He has many years of software development experience, with a focus on image processing, computer vision, ERP and business intelligence, Big Data, machine learning and distributed systems. Prior to starting at Microsoft in 2000, he cofounded three software companies, published several technical papers and filed several patent applications. Read his blog at AlissonSol.com/blog.

Don Ankney is a senior security researcher in the Microsoft Information Platform Group where he works on applying Microsoft machine learning investments to the services security space. He was a founding member of Black Lodge Research, an educational non-profit focused on security, and regularly teaches secure development techniques at regional meet-ups, workshops and un-conferences.

Eugene Bobukh is a senior security/data program manager at Microsoft. Focusing on applying scientific approach to security problems, he conducted security testing and performed security reviews for more than 200 Microsoft releases since 2000, including .NET, Silverlight, and Internet Explorer. Some of Bobukh’s work is described at blogs.msdn.com/b/eugene_bobukh.

Thanks to the following Microsoft technical experts for reviewing this article: Barry Markey and Viresh Ramdatmisier