Automated Penetration Testing with White-Box Fuzzing

John Neystadt

Microsoft Corporation

February 2008

Summary: This article covers how to employ a white-box testing approach to fuzzing, leveraging available source or disassembled code of the tested software. (9 printed pages)

Contents

Overview

Introduction

Defect Types Found with Fuzzing

Fuzzing Taxonomy

Fuzzing Targets

Fuzzing Process

Delivering the Fuzzed Buffer to the Tested Software

Product-Testability Design for Fuzzing

Case Study

Length of Test

Measuring Fuzzing Coverage

Reproducing Fuzzing Failures

Integrating Fuzzing into the Software-Release Life Cycle

Questions for Further Study

References

Glossary

Overview

White-box fuzzing or smart fuzzing is a systematic methodology that is used to find buffer overruns (remote code execution); unhandled exceptions, read access violations (AVs), and thread hangs (permanent denial-of-service); leaks and memory spikes (temporary denial-of-service); and so forth.

You can perform fuzzing on any code that parses input that is received across a trust boundary. This includes files, network sockets, pipes, remote procedure call (RPC) interfaces, driver IOCTLs, ActiveX objects, and message queues (including Microsoft Windows messages).

This article presents a case study of fuzzing during development of Microsoft Internet Security and Acceleration (ISA) Server 2006, and discusses efforts, bug density, and ROI. During this release, the internal testing team found over 30 bugs that were either Important or Critical—according to Microsoft Security Response Center (MSRC) ranking—in over 500 KLOC of parsing code.

Introduction

From: Barton Miller
Sent: Wednesday, February 16, 2005 4:39 AM
To: Shawn Hernan
Subject: Re: origins of the term "fuzz"

Thanks for the note. As far as I know, I coined the term.

The original work was inspired by being logged on to a modem during a storm with lots of line noise. And the line noise was generating junk characters that seemingly was causing programs to crash. The noise suggested the term "fuzz".

--bart miller

By monitoring hacker conferences and forums in the years 2005–2006, one can see that security researchers and hackers are increasingly using fuzzing as one of the main techniques for finding vulnerabilities. Hackers typically practice black-box fuzzing—generating various permutations of the data, without actually correlating it with the code that parses the data.

Black-box fuzzing ([1], [2], [7]) focuses on input format—ignoring the tested software target. While being efficient and allowing reuse of the same test tools across different tested targets that share the same data formats, this method misses significant code paths. These code paths depend on configuration options or specific complex conditions that are governed by application logic.

In this article, we discuss how to employ a white-box testing approach to fuzzing—leveraging available source or disassembled code of the tested software.

While white-box fuzzing requires a greater testing effort than black-box fuzzing, it provides better testing coverage—achieving higher tested-software quality and, ultimately, eliminating the classes of defects that fuzzing can find.

Defect Types Found with Fuzzing

Fuzzing is a systematic way to find defects in code that account for the most severe of security bugs. Fuzzing is capable of finding remote code execution (buffer overruns), permanent denial-of-service (unhandled exceptions, read AVs, thread hangs), and temporary denial-of-service (leaks, memory spikes).

Fuzzing finds not only defects in buffer-boundary validation, but also faults in state-machine logic, error handling, and clean-up code.

Fuzzing Taxonomy

The following terms are used throughout this article:

  • Dumb fuzzing—Corruption of data packets randomly, without awareness of the data structure.
  • Smart fuzzing—Corruption of data packets with awareness of the data structure, such as encodings (for example, base-64 encoding) and relations (checksums, bits indicating the presence of some fields, fields indicating offsets or lengths of other fields).
  • Black-box fuzzing—Sending of malformed data without actual verification of which code paths were hit and which were not.
  • White-box fuzzing—Sending of malformed data with verification that all target code paths were hit—modifying software configuration and the fuzzed data to traverse all data validations in the tested code.
  • Generation—Automatic generation of fuzzed data, without basing it on any previous input.
  • Mutation—Corruption of valid data according to defect patterns, to produce fuzzed data.
  • Mutation template—Well-formed buffer that represents an equivalence class of the input. The fuzzer takes the mutation template as an input—producing a fuzzed buffer to be sent to the tested software.
  • Code coverage—Technology (such as that which is bundled in Microsoft Visual Studio 2005) that allows inspection of which code paths were executed during testing. This is useful for verification of test effectiveness and for improvement of test coverage.

A case study that was conducted on a small Web application (450 lines of code) with four planted defects showed how many of those defects each combination of techniques found. The fuzzing was done by using the technology that is described later in this article.

Technique                   Effort    Code coverage    Defects found
Black box + dumb fuzzing    10 min    50%              25%
White box + dumb fuzzing    30 min    80%              50%
Black box + smart fuzzing   2 hr      80%              50%
White box + smart fuzzing   2.5 hr    99%              100%

Fuzzing Targets

Trust Boundaries

Fuzzing is a verification method for the code that processes input that is received across trust boundaries. Typical trust boundaries include:

  • Files that are received from the Internet (for example, downloaded from the Web, or received as mail or news attachments).
  • Network sockets.
  • Pipes.
  • RPC interfaces.
  • Driver IOCTLs.
  • ActiveX objects.

However, fuzzing can be applied to other, less typical trust boundaries—for example:

  • Structured data that is stored in a database as blobs (for example, XML) and is written and read by different users.
  • Configuration files that are written by one user and read by another.
  • Message-queue systems (whether persisted in a database or based on Windows messages).
  • Shared memory that is used to pass structured information between processes.

Primary vs. Secondary Fuzzing Targets

In a typical componentized software design, the code that parses input data places the results in internal data structures or classes and is fairly well separated from the logic that then consumes those structures. The parsing code is the primary target for fuzzing; any unverified assumption that it makes about the data structure can cause unexpected results. However, the code that further consumes the parsed data might also make assumptions about the data that was parsed, and applying fuzzing to this secondary code might also find security defects.

Therefore, secondary code should also be accounted for during template selection and code-coverage analysis.

Should Managed Code Be Fuzzed?

Naturally, most of the defects that fuzzing can find occur in software that is developed in C or C++, languages that leave memory management to the programmer. However, fuzzing can also find defects in software that is written in languages that hide memory management, such as C#, Visual Basic, or Java. Bugs that can be found in managed code include unhandled exceptions, deadlocks, and memory spikes. Typically, these are denial-of-service (DoS) or information-disclosure class bugs.

Fuzzing Process

There are two primary ways to produce a fuzzed data buffer:

  • Automatic generation of fuzzed data
  • Mutation of a sample of valid data (mutation template), obtained from a capture or created by some other test automation

Generation can be split into two steps:

  1. Generation of a valid mutation template
  2. Mutation of the template to produce a fuzzed buffer

Generation vs. Mutation

While it has been said that monkeys typing randomly could eventually produce Hamlet entirely by chance [4], the number of permutations required to accomplish this is astronomical. Brute-force generation of this sort is feasible for simple parsers (see method 1 in the template-creation techniques that follow); for complex parsers, however, fuzzers would have to run for an impractically long time. That much time is needed to get deep inside complex parsing code, because the fuzzer must create data that is sufficiently valid to pass the various validations and checks that otherwise trigger parsing errors and, thus, prevent the data from reaching the inner components of the parser.

A much more efficient technique for creating good fuzzing coverage is to produce fuzzing buffers by mutating well-formed buffers—mutation templates. Each such template is an equivalence class of the input.

For example, consider code that contains the following statement:

if (packet->Referrer && strcmp (packet->Referrer, sMySite) == 0 &&
    packet->UserAgent && packet->UserAgent == uaIE6 &&
    packet->WwwAuthenticate != NULL)
{
    if (packet->AcceptLanguage && strstr (packet->AcceptLanguage, "en-us") == NULL)
    {
        printf ("%s", packet->AcceptCharset);   // assumes AcceptCharset is always present
    }
}

The flaw in this code is the assumption that the AcceptCharset header is always present in the request if all of the other conditions hold. To find such a flaw, an HTTP request that contains all four headers—Referrer, UserAgent, WwwAuthenticate, and AcceptLanguage—but not the AcceptCharset header should be attempted. Typically, this code will crash when the missing AcceptCharset header (a NULL pointer) is dereferenced in the call to printf.

If our fuzzing engine is able to add or delete random headers from the request, it will take a long time (if it is at all possible) to generate an HTTP request that has exactly those four headers present. However, if we provide a template buffer with all of the possible HTTP headers present, the fuzzer will very quickly create the bogus request with all of the headers, but without the AcceptCharset header present.

Selecting or Creating the Optimal Set of Mutation Templates


Figure 1. Fuzzing-process details

Selection of the mutation templates can be performed by using the same methodology by which equivalence classes for test matrices are determined. For simple parsers, this equivalence-class determination can be done manually.

A few automation methods to create templates for fuzzing of complex parsers are proposed:

  1. Automatic template generation, using model-based or pairwise testing. You can define all of the parameters and their relationships that affect the parser, and use a modeling [10] tool or pairwise-testing tool, such as PICT [5], to generate sets of parameters for each template. Then, develop a template-generation tool to create a valid data buffer, according to the parameters that are contained in the model.
  2. Select an optimal set of templates captured from a production environment, based on the code-coverage difference. You can collect a large number of sample templates (that is, capture 100,000 HTTP requests by using a network sniffer, or 100,000 files by using a Web crawler). Run the target code with each template as input, recording code-coverage information. Then, sort the templates according to the marginal coverage that they add, and pick only those templates that add substantial coverage (at least three percent); a minimal sketch of this selection appears after this list.
  3. Leverage existing test automation, which already accounts for test matrices. Then, invoke the fuzzing engine on data that is produced by the test, before delivering it to the tested software.
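
As an illustration of method 2, the following sketch greedily keeps only the templates that add a meaningful amount of new coverage. It assumes a hypothetical runWithCoverage helper that executes the tested parser on one template and returns the set of basic-block identifiers that were hit; the names and the interpretation of the three-percent threshold are illustrative, not part of any specific tool:

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: runs the tested parser on one template and returns
// the set of basic-block (or line) identifiers that were executed.
std::set<int> runWithCoverage(const std::string &templatePath);

// Greedy selection: keep a template only if it adds at least minGainPercent
// of new coverage relative to the coverage of the templates kept so far.
std::vector<std::string> selectTemplates(const std::vector<std::string> &candidates,
                                         double minGainPercent)
{
    std::vector<std::string> kept;
    std::set<int> covered;

    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        std::set<int> blocks = runWithCoverage(candidates[i]);

        std::size_t newBlocks = 0;
        for (std::set<int>::const_iterator it = blocks.begin(); it != blocks.end(); ++it)
        {
            if (covered.find(*it) == covered.end())
                ++newBlocks;
        }

        std::size_t total = covered.size() + newBlocks;
        double gain = (total == 0) ? 0.0 : (100.0 * newBlocks) / total;

        if (gain >= minGainPercent)   // for example, 3 percent
        {
            kept.push_back(candidates[i]);
            covered.insert(blocks.begin(), blocks.end());
        }
    }
    return kept;
}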

Fuzzing Engine

One systematic way to produce fuzzed data is by developing a fuzzing mutation engine. The mutation engine should implement the following requirements:

  1. Analyze the mutation template, and divide it into a memory tree of groups of fields and individual fields.
  2. Randomly select one of the fields or groups, and fuzz it individually in memory.
  3. Serialize the memory tree back into a fuzzed buffer.


Figure 2. Fuzzing mutation engine
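
A minimal sketch of this three-step structure is shown below. The type and function names are illustrative; a real engine drives parsing and serialization from a schema or from format-specific code, and also handles relations and encodings, as described in the following sections:

#include <cstdlib>
#include <string>
#include <vector>

// A node is either an atomic field or a group of child nodes.
struct Node
{
    std::string name;
    std::string value;            // used when the node is an atomic field
    std::vector<Node> children;   // used when the node is a group
    bool isGroup;
};

// Step 1: parse the mutation template into a memory tree (hypothetical).
Node parseTemplate(const std::string &templateBuffer);

// Step 2 helpers: fuzz an individual field or group in place (hypothetical).
void fuzzField(Node &field);   // boundary values, format specifiers, bit flips, ...
void fuzzGroup(Node &group);   // add, remove, or duplicate child fields

// Step 2: randomly select one field or group and fuzz it in memory.
void fuzzRandomNode(Node &root)
{
    Node *current = &root;
    while (current->isGroup && !current->children.empty() && (rand() % 2))
        current = &current->children[rand() % current->children.size()];

    if (current->isGroup)
        fuzzGroup(*current);
    else
        fuzzField(*current);
}

// Step 3: serialize the mutated tree back into a buffer, re-applying
// encodings and fixing up governing fields such as lengths (hypothetical).
std::string serializeTree(const Node &root);

std::string produceFuzzedBuffer(const std::string &templateBuffer)
{
    Node tree = parseTemplate(templateBuffer);
    fuzzRandomNode(tree);
    return serializeTree(tree);
}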

Analyzing Mutation Template

In order to analyze and parse a mutation template, the fuzzing engine must have knowledge about the data that it represents. The data description can be done either programmatically or by using a data-definition language, such as a schema. The schema might include recursive definitions of groups, elements and their relations, and layering (for example, of field encodings, such as URL encoding or base-64 encoding)—providing information on how to parse each field.
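
For illustration only, a hypothetical schema fragment for an HTTP-like message might describe groups, field types, relations, and encodings along these lines (every element and attribute name here is invented for this example; actual schema languages are engine specific):

<group name="HttpRequest">
  <field name="Method"        type="string" />
  <field name="ContentLength" type="integer" relation="length-of:Body" />
  <group name="Headers" minOccurs="0" maxOccurs="unbounded">
    <field name="Name"  type="string" />
    <field name="Value" type="string" encoding="url" />
  </group>
  <field name="Body" type="binary" encoding="base64" />
</group>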

Fuzzing Fields

To fuzz an individual field, the fuzzing engine has to implement a library of fuzzers for each atomic data type. The fuzzers should be implemented to produce values of equivalence classes for each basic data type, which are likely to trigger typical code-defect patterns. For example:

  • For an integer field, a fuzzer can assign boundary values (for example, min_int, max_int, 0, max_int/2, max_int/4, min_int +1, max_int -1).
  • For string values, a fuzzer can inject null characters or format specifiers (for example, “%s”).
  • For a binary field, a fuzzer can flip bits randomly.
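
A few representative field fuzzers, sketched here with illustrative names, might look like the following:

#include <climits>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Integer fuzzer: return one of the boundary values that typically expose
// off-by-one and overflow defects.
int fuzzInteger()
{
    static const int boundaries[] = { INT_MIN, INT_MIN + 1, -1, 0, 1,
                                      INT_MAX / 4, INT_MAX / 2, INT_MAX - 1, INT_MAX };
    return boundaries[rand() % (sizeof(boundaries) / sizeof(boundaries[0]))];
}

// String fuzzer: inject format specifiers or an embedded null character.
std::string fuzzString(std::string value)
{
    if (value.empty() || (rand() % 2))
        value += "%s%n";                       // format-specifier injection
    else
        value[rand() % value.size()] = '\0';   // embedded null character
    return value;
}

// Binary fuzzer: flip a randomly chosen bit in the buffer.
void fuzzBinary(std::vector<unsigned char> &buffer)
{
    if (buffer.empty())
        return;
    std::size_t byteIndex = rand() % buffer.size();
    buffer[byteIndex] ^= static_cast<unsigned char>(1 << (rand() % 8));
}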

Fuzzing Groups

Group fuzzing can be done through manipulation of the fields that a group contains—once again, by following defect patterns. Examples are:

  • Addition of 1,000 fields.
  • Removal of some or all fields.
  • Duplication of existing fields.
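
A corresponding group fuzzer can be sketched over a flat list of serialized fields (for example, the header lines of an HTTP request; the names are illustrative):

#include <cstdlib>
#include <string>
#include <vector>

// Apply one of the group-level defect patterns: flood, remove, or duplicate.
void fuzzGroup(std::vector<std::string> &fields)
{
    if (fields.empty())
        return;

    switch (rand() % 3)
    {
    case 0:  // flood the group with many copies of one field
        fields.insert(fields.end(), 1000, fields[0]);
        break;
    case 1:  // remove a randomly chosen field
        fields.erase(fields.begin() + (rand() % fields.size()));
        break;
    default: // duplicate a randomly chosen existing field
        fields.push_back(fields[rand() % fields.size()]);
        break;
    }
}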

Relations

In many data formats, multiple fields in the data buffer are related. One field often governs one or more other fields. For example, Web browsers use the Content-Length header to determine the length of the HTTP message body that follows. If a fuzzer produces an HTTP message body whose length is different from the mutation template, the Content-Length header will not be consistent, and fuzzed data will not be interpreted by the browser as it should be.

Typical relations that are found in the data include the following:

  • One field governs a length of another field.
  • One field governs an offset of another field.
  • One field governs a count of fields in a group.
  • A bit-field value governs the presence of an optional field.
  • A field is a checksum of some part of the buffer.

The mutation engine must be able to read governing field values during template parsing to understand the rest of the buffer. The mutation engine also must update governing fields after fuzzing governed fields.
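
For example, a serializer for an HTTP-like message has to recompute Content-Length after the body has been fuzzed, so that the governing field stays consistent with the governed one. A minimal sketch, using a hypothetical in-memory message type:

#include <sstream>
#include <string>

// Hypothetical in-memory representation of a fuzzed HTTP-like message.
struct HttpMessage
{
    std::string headers;   // all headers except Content-Length
    std::string body;      // possibly fuzzed, so its length may have changed
};

// Serialize the message, updating the governing Content-Length field so that
// it matches the length of the (fuzzed) body.
std::string serialize(const HttpMessage &msg)
{
    std::ostringstream out;
    out << msg.headers;
    out << "Content-Length: " << msg.body.size() << "\r\n";
    out << "\r\n";
    out << msg.body;
    return out.str();
}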

Encodings

Often, a data buffer contains a recursive structure, in which one field encodes a group of other fields. One example is a mail message that contains an attachment that is encoded with base-64 encoding. To parse the attachment structure, the mutation engine must decode the encapsulated structure, and then continue parsing the inner decoded structure. After fuzzing has been performed on one of the inner fields, the structure should be encoded back (for example, by using base-64, signing, or encryption) and embedded in the fuzzed buffer.
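
The flow can be sketched as follows, assuming hypothetical base64Decode and base64Encode helpers (any base-64 implementation can be used) and a hypothetical fuzzInnerStructure step that parses and fuzzes the decoded content:

#include <string>
#include <vector>

// Hypothetical helpers for the encoding layer.
std::vector<unsigned char> base64Decode(const std::string &encoded);
std::string base64Encode(const std::vector<unsigned char> &raw);

// Hypothetical: parse and fuzz the decoded inner structure (for example,
// a decoded mail attachment).
void fuzzInnerStructure(std::vector<unsigned char> &raw);

// Decode the encapsulated field, fuzz the inner structure, and re-encode it
// so that the outer buffer remains well formed.
std::string fuzzEncodedField(const std::string &encodedField)
{
    std::vector<unsigned char> raw = base64Decode(encodedField);
    fuzzInnerStructure(raw);
    return base64Encode(raw);
}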

Delivering the Fuzzed Buffer to the Tested Software

After the fuzzed buffer has been created, it must be delivered to the tested software. Wherever possible, this should be done in a stress-like fashion, sending data concurrently from multiple threads in a loop, because many fuzzing bugs manifest only under stress. The delivery mechanism itself depends on the tested software. It can be as simple as sending fuzzed data in a single network packet or opening fuzzed data as a file. In more complex cases, special tools are needed—for example, to send IOCTLs or to call APIs with fuzzed data as parameters.
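
A minimal sketch of such stress-style delivery, assuming the produceFuzzedBuffer sketch shown earlier and a hypothetical deliverToTarget function that sends one buffer over the transport under test (socket, pipe, file, IOCTL, and so on):

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical: produce one fuzzed buffer from a mutation template.
std::string produceFuzzedBuffer(const std::string &templateBuffer);

// Hypothetical: deliver one buffer to the tested software.
void deliverToTarget(const std::string &fuzzedBuffer);

void fuzzingWorker(const std::string &templateBuffer, int iterations)
{
    for (int i = 0; i < iterations; ++i)
        deliverToTarget(produceFuzzedBuffer(templateBuffer));
}

// Run the delivery loop concurrently from several threads, because many
// fuzzing bugs manifest only under stress.
void runStressFuzzing(const std::string &templateBuffer, int threadCount, int iterations)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
        workers.push_back(std::thread(fuzzingWorker, templateBuffer, iterations));
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
}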

Product-Testability Design for Fuzzing

It is possible to alter software design to improve testability and increase the chance of finding bugs by using fuzzing. This section discusses some related software-design considerations.

Pageheap, Stack, and Heap Canaries

Often, fuzzing triggers a small underrun or overrun—for example, access to an array[-1] element—that will not cause the process to crash under most circumstances. A carefully constructed exploit, however, can cause information disclosure, remote code execution, or a crash.

On Microsoft Windows systems, use of a heap manager with canaries or other corruption-detection mechanisms can assist in the detection of memory corruptions. You can turn on normal or full page heap to get this functionality from the operating system [6].
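
For example, full page heap can be turned on for a specific process image by using the GFlags tool from the Debugging Tools for Windows (the image name below is a placeholder for the tested process):

gflags /p /enable MyParser.exe /full

Omitting /full enables normal page heap, which has a lower memory overhead but detects corruptions only when the block is freed.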

Detecting Leaks

Fuzzing tests will trigger error-handling paths. A common defect in this scenario is failing to clean up resources that were allocated during parsing, such as heap memory, pooled memory, handles, and other resources.

A leak-detection system should be used during fuzzing tests. You can use systems such as IBM Rational Purify or a custom leak-detection mechanism.

Detecting Deadlocks

Another issue that is related to error-handling paths is that a lock or synchronization object might not be released—potentially causing a deadlock.

A common design pattern for deadlock detection is implementation of a "heartbeat mechanism," in which a heartbeat timestamp is maintained for every active thread. Every thread that participates in the heartbeat mechanism should periodically update ("heartbeat") its timestamp. An additional timer thread periodically iterates over the per-thread timestamps and internally reports a deadlock if a heartbeat is older than some reasonable time period (for example, 60 seconds).
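
A minimal sketch of such a heartbeat monitor (class and member names are illustrative; a product implementation would report through its own logging and handle thread registration and unregistration):

#include <chrono>
#include <cstdio>
#include <map>
#include <mutex>
#include <thread>

class HeartbeatMonitor
{
public:
    // Called periodically by every participating worker thread.
    void Heartbeat(std::thread::id thread)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        m_lastBeat[thread] = std::chrono::steady_clock::now();
    }

    // Called by a dedicated timer thread, for example once per minute.
    void CheckForDeadlocks(std::chrono::seconds maxSilence)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        const auto now = std::chrono::steady_clock::now();
        for (const auto &entry : m_lastBeat)
        {
            if (now - entry.second > maxSilence)
                std::printf("Possible deadlock: a thread has stopped sending heartbeats\n");
        }
    }

private:
    std::mutex m_lock;
    std::map<std::thread::id, std::chrono::steady_clock::time_point> m_lastBeat;
};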

Initializing Buffers with Predefined Patterns

Using or reusing memory across different requests without reinitializing it each time might disclose user-sensitive information that belongs to another user. Under fuzzing tests, tested software can be modified to fill the buffers (newly allocated or reused from a pool) with a predefined word—for example, “FZ”. Fuzzing tests should inspect output that is received from the tested software and, if this pattern is encountered, report a bug (it indicates that the memory was not reinitialized).
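
A minimal sketch of this approach (the "FZ" marker is taken from the example above; the function names are illustrative):

#include <cstddef>
#include <string>

// Test-build only: initialize a newly allocated or pooled buffer with a
// recognizable marker pattern before it is handed to parsing code.
void fillWithMarker(char *buffer, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
        buffer[i] = (i % 2 == 0) ? 'F' : 'Z';
}

// Test-harness side: report a bug if the marker leaks into output that the
// tested software sends back, which indicates memory that was not reinitialized.
bool outputLeaksUninitializedMemory(const std::string &response)
{
    return response.find("FZFZ") != std::string::npos;
}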

Asserting on Input Assumptions

In code that parses inputs, performance concerns tend to make it impractical and inefficient to perform input validation at the entry of every function. Typically, basic parameter validation occurs in the outer functions of the parser, while inner functions assume that these validations were already performed. However, because of programming mistakes or code refactoring, these assumptions might not hold true; the parsing code can then enter an unexpected state, which results in a fuzzing bug. To guard against this, it is recommended that you place ASSERT statements at the entry of these functions.

It is typical for fuzzing to frequently trigger ASSERTs that were never reached previously. A recommended best practice is to investigate which assumption was invalidated, or whether there is a bug in the ASSERT itself. It is often possible to find a bug through code review of the code path that triggered the ASSERT, even when a specific fuzzed packet doesn't cause a crash.

When tested software is instrumented with ASSERTs, as described, a high number of raised ASSERTs might slow down the testing process, because each ASSERT has to be investigated and fixed. It is recommended that you enhance the behavior of ASSERTs to log or trace each occurrence, instead of interrupting the run—enabling aggregation, and allowing ASSERT investigation to proceed in parallel with test execution. A minimal sketch of such a logging ASSERT follows.
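
The macro name and log destination here are illustrative:

#include <cstdio>

// Test-build ASSERT that logs each violated assumption with its location,
// instead of breaking into the debugger, so that the run is not interrupted
// and occurrences can be aggregated and investigated later.
#define FUZZ_ASSERT(expr)                                              \
    do {                                                               \
        if (!(expr))                                                   \
            std::fprintf(stderr, "ASSERT failed: %s at %s:%d\n",       \
                         #expr, __FILE__, __LINE__);                   \
    } while (0)

// Example use at the entry of an inner parsing function:
//     FUZZ_ASSERT(packet != NULL && packet->Referrer != NULL);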

Case Study

Here is a case study of white-box fuzz testing that was performed during development of ISA Server 2006.

The ISA Server 2006 development team rigorously follows the Microsoft Security Development Life Cycle (SDL) [8], including developer training, threat modeling, code reviews, and fixing defects that have been found by static code-analysis tools. In addition, traditional functional, system, and stress testing were performed, as were dogfooding and beta deployments.

After all of these activities, an internal testing team found over 30 Important and Critical bugs (according to MSRC ranking)—suggesting an ROI of 3.7 tester days per bug. Fuzzing demonstrated a defect density of one bug per 17 KLOC.

Of these defects, only 30 percent were found by using black-box fuzzing. The rest were found only when white-box fuzzing was applied.

Length of Test

The minimal length of a fuzz test depends on the following parameters:

  1. Complexity of the data that is being fuzzed
  2. Number of mutation templates
  3. Code complexity of the fuzzing target

During fuzzing cycles, we found that each new defect takes progressively longer to find—ranging from a few minutes until the first bug is found, to hours for the last one.

We suggest that you use an MTBF (mean time between failures) criterion of between 4 hours and 2 weeks of fuzzing the tested software without a crash, under a stress load of about 70 percent CPU—depending on business reliability requirements.

Measuring Fuzzing Coverage

We have not been able to identify a single encompassing metric that measures fuzz-testing coverage; useful fuzzing metrics remain a subject for future research. However, we find that the combination of the following three measures is useful for tracking fuzzing coverage:

  • Identification of all fuzzing targets through design reviews, threat modeling, or differential code coverage [9]
  • Use of code coverage to review missed coverage in identified primary and secondary fuzzing targets, to detect missing templates, configuration parameters, or fuzzers
  • Length of test run (see previous section) for regression cycle runs

Reproducing Fuzzing Failures

As soon as a crash or other fuzzing bug is encountered, you must reproduce the test case that caused the failure. Because of the random stressful nature of a fuzzing test, a single buffer that caused the bug might not reproduce it.

Developers, therefore, are encouraged to run the tests with a debugger attached or with just-in-time (JIT) debugging enabled, so that a crash dump will be produced. This enables you to understand the state of the tested software at the time of the failure and to find the defect from the crash dump.

It is possible to automate repros further by developing a debugger plug-in or script that runs when an exception occurs, captures repro information, and then restarts or continues the run of the tested software.

An alternative method, which will work reliably for failures that are caused by a single request, is to log the buffers that are produced by the fuzzing engine. Another approach is to log the seeds that are used as input for the random functions of the fuzzing engine, if a random function is used for selecting the templates and fuzzers.
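
A minimal sketch of seed logging (the function names are illustrative; the essential point is that each iteration's seed is recorded before it is used, so that a failing iteration can be replayed):

#include <cstdio>
#include <cstdlib>
#include <ctime>

// Hypothetical: one fuzzing iteration whose template and fuzzer choices are
// driven entirely by rand().
void runOneFuzzingIteration();

void runLoggedFuzzing(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        unsigned int seed = static_cast<unsigned int>(time(NULL)) + i;
        std::printf("iteration %d, seed %u\n", i, seed);   // log the seed before use
        srand(seed);   // to reproduce a failure, re-run with the logged seed
        runOneFuzzingIteration();
    }
}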

Integrating Fuzzing into the Software-Release Life Cycle

White-box fuzz testing requires significant efforts from test teams. It is reasonable to start with black-box fuzzing, which is much quicker because it doesn't require deep code inspection. You can then improve the coverage, as needed, according to the methodology that is described in this article.

Fuzzing should be incorporated as a core part of the product-testing process. White-box fuzz testing requires a skill set and investment that are similar to that in stress-test development. Typically, stress is focused mainly on positive paths, whereas fuzz testing extends this systematically to negative paths.

Because network fuzzing is done on the external interfaces of the system, there are no false positives. The ISA Server 2006 Security Push case study, and the fact that most fuzzing bugs are trivial validation bugs that are straightforward to fix, indicate that fuzzing is a relatively low-risk activity. This is a benefit compared to (for example) static code analysis, which can require massive changes to the source code. Because defects that are found by fuzz testing tend to be trivial validation bugs, their fixes are also less prone to regressions than logic-defect bug fixes are.

Questions for Further Study

Here are a number of issues for further study:

  1. Identification of a single encompassing metric for fuzz-testing coverage
  2. Further case studies, demonstrating white-box fuzzing ROI
  3. An industry-wide study of publicly catalogued CVEs (Common Vulnerabilities and Exposures), to identify a common set of fuzzers for detecting common vulnerabilities

References

Glossary

  • Buffer overrun—Type of vulnerability that results in writing outside of intended memory boundaries, due to improper handling of input data. The impact of a buffer overrun can be either an application crash (resulting in denial-of-service, loss of data, or information disclosure) or remote execution of an attacker's code. Buffers are categorized into three buckets: stack, heap, or data segment. Each bucket differs in ease of exploitation, depending on compiler and OS implementation.
  • Crash—Condition in which an application or the OS halts application execution when an unexpected condition (such as buffer overrun) occurs.
  • Denial-of-service—Or DoS. Type of vulnerability in which an application stops providing the service that it was intended to provide (usually because of resource starvation or an unhandled exception).
  • Hang—Deadlock that can be triggered by a hacker and can lead to denial-of-service.
  • Information disclosure—Type of vulnerability in which an application discloses unintended information to an attacker. Information disclosure can happen as a result of a crash (for example, showing a stack dump), sending uninitialized buffer to the network, or a logical flaw in the code.
  • Leak—Or memory leak. Condition that, if triggered repeatedly by a hacker, can lead to denial-of-service.
  • Pen-testing—Or penetration testing. Used to describe verification activities that are intended to find design or implementation defects that result in security vulnerability.
  • Remote execution—Type of vulnerability in which a hacker is able to make the system execute code of their choice. This typically requires (1) that the hacker be able to deliver their code to the target system (for example, as part of a network packet) and (2) that execution control be transferred to the hacker's code by overriding the IP (instruction pointer) register or by some other implicit action.
  • Vulnerability—Defect in the software that enables malicious exploitation with benefit to an attacker, not intended to be allowed by designers of the software.

About the author

John Neystadt is Lead Program Manager of the Microsoft Forefront Edge Security team. For the last four years, he has run the Forefront Edge internal penetration-testing team, ensuring that security is a top priority for developers, testers, and program managers. The result of this work is that the ISA Server 2004 and 2006 products have not shipped a security patch in the last 36 months. You can reach John at jney@microsoft.com or at http://www.neystadt.org/john/.