01/25/2019

March 2017

Volume 32 Number 3

[Visual Studio]

Hashing Source Code Files with Visual Studio to Assure File Integrity

By Mike Lai | March 2017

The transformation of human-readable code to machine-readable code introduces a challenge to software assurance for all compiled software languages: How does a user have confidence that a software program running on his computer was built from the same source code file created by the developer? That’s not necessarily a certainty—even if the source code files are reviewed by subject-matter experts, as they may be in the case of open source software. A critical part of software assurance is trusting that the reviewed source code files are the same source code files that were built into executable files.

During the compilation and linking processes, a set of source code files written in a specific programming language (C#, C++, Objective C, Java and so forth) is transformed into a binary executable file for running on a computer of a specific architecture (x86, x64, ARM, for example). But this transformation may be neither one-to-one nor deterministic. It’s possible that two different sets of source code files could result in two bitwise-identical executable files. Sometimes, this is intentional: adding or removing whitespace or comments inside the source code files shouldn’t affect the binary code emitted by the compiler. On the other hand, it’s also possible that a single set of source code files could result in different executable files from different compilation processes. In either case, the problem is one of certainty: knowing for sure that the file you have is the one you want.

To address this issue, it’s helpful to use a Visual Studio compiler to hash source code files during compilation. Matching hash values from the compiler to hash values generated from examined source code files verifies that the executable code did indeed result from the particular source code files. Clearly, this is good for users (who would, in fact, benefit further if vendors of other compilers also followed a similar approach). This article describes the new Visual Studio switch for choosing a hashing algorithm, scenarios where such hashes might prove useful and how to use Visual Studio to generate source code hashes.

Generating Strong Hashes During Compilation

A program database (PDB) file is a separate data file that stores the information used to debug a binary executable file. Microsoft recently updated its various compiler file-hashing operations (such as source hashes embedded in PDB files) to use strong cryptographic algorithms.

Native Code Compiler. The Visual Studio 2015 native C/C++ compiler, cl.exe, comes with a new switch for choosing the algorithm the compiler uses to hash source code files: /ZH:{MD5|SHA_256}. The default is MD5, which is known to be more collision-prone but remains the default because its hash values are computationally cheaper to generate. Specifying /ZH:SHA_256 selects SHA-256, which is cryptographically stronger than MD5.
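To make the difference between the two options concrete, here is a minimal Python sketch that hashes the same bytes with both algorithms. It uses only the standard hashlib module; the function name and the sample source text are illustrative assumptions, not anything cl.exe exposes.

```python
import hashlib

def source_digests(data: bytes) -> dict:
    """Return the MD5 and SHA-256 hex digests of the same source bytes."""
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA_256": hashlib.sha256(data).hexdigest(),
    }

# Hypothetical source text, used only to have something to hash.
digests = source_digests(b"int main() { return 0; }\n")
for name, value in digests.items():
    # Each hex digit encodes 4 bits: MD5 is 128 bits, SHA-256 is 256 bits.
    print(f"{name}: {len(value) * 4} bits  {value}")
```

The longer SHA-256 digest is what makes collisions so much harder to find, at the cost of somewhat more computation per file.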

If an SHA-256 hash for a source code file matches an SHA-256 hash stored in the PDB file of a binary executable, it’s a strong indication that the same source code file was compiled into the executable, because finding a different file with the same SHA-256 hash is considered computationally infeasible. This allows any stakeholder to have confidence in the binary executable file. Effectively, the set of SHA-256 hash values stored in the PDB file of the binary executable file collectively becomes the identifiers in the “birth certificate” of the binary executable file, as these identifiers are registered by the compiler that “gives birth” to the binary executable file.
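The comparison itself is simple to sketch. The following Python example assumes the compiler hashes the raw bytes of each source file, which is what the certutil comparison later in this article relies on. The function names are hypothetical, and the recorded hash is whatever hex string you read out of the PDB file.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 16):
    """Hash a file's raw bytes in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as source:
        for chunk in iter(lambda: source.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_recorded_hash(path, recorded_hex):
    """Compare a file's SHA-256 hash with a hex string taken from a PDB file."""
    # Tools differ in letter case, so compare in lowercase.
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

A call such as matches_recorded_hash("Win32HelloWorld.cpp", recorded) returns True only when the file on disk is byte-for-byte the file the compiler read.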

Using the Debug Interface Access SDK (bit.ly/2gBqKDo), it’s easy to create a simple tool such as the Debugging Information Dumper, cvdump.exe (which, along with its source code, is now available at bit.ly/2hAUhyy). You can use the -sf switch of cvdump.exe to view the listing of modules (using their full path names in the local build machine) with their MD5 or SHA-256 hashes, as shown in the command window in Figure 1.

Figure 1 Using cvdump.exe to View Modules with Their Hashes

When I used a previous version of cvdump.exe to view the same PDB file, I saw the text “0x3” instead of “SHA_256”. The 0x3 value is the enum value for SHA_256 and the updated cvdump.exe knows how to interpret it. It’s the same enum value that’s returned by the IDiaSourceFile::get_checksumType method of the Debug Interface Access SDK.
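The behavior of the older and newer tool versions can be sketched as a lookup with a fallback. In this Python example, only the 0x3 value for SHA_256 comes from this article; the other values follow my understanding of the CV_SourceChksum_t enumeration in cvinfo.h and should be treated as assumptions. The class and function names are hypothetical.

```python
from enum import IntEnum

class SourceChecksumType(IntEnum):
    """Checksum type codes; only SHA_256 = 3 is confirmed by the article."""
    NONE = 0      # assumption: no checksum available
    MD5 = 1       # assumption
    SHA1 = 2      # assumption
    SHA_256 = 3   # 0x3, as reported by the older cvdump.exe

def describe_checksum_type(raw: int) -> str:
    """Return the algorithm name, or the raw hex code if it is unknown."""
    try:
        return SourceChecksumType(raw).name
    except ValueError:
        # A tool that predates a value can only show the number.
        return hex(raw)

print(describe_checksum_type(3))
```

A tool built before SHA_256 was added to its table would take the fallback path, which is consistent with the “0x3” text described above.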

Managed Code Compiler. By default, the Visual Studio 2015 managed code C# compiler, csc.exe, uses the SHA-1 cryptographic algorithm to calculate the source file checksum hash values to store in the PDB files. However, csc.exe now supports a new, optional “/checksumalgorithm” switch to specify the SHA-256 algorithm. For example, the following command compiles all the C# files in the current directory with SHA-256 source hashing and places the debugging information, including the source file listing and the SHA-256 hash values, in a PDB file:

csc /checksumalgorithm:SHA256 /debug+ *.cs

As part of the .NET Compiler Platform (“Roslyn”) open source project, csc.exe is available at github.com/dotnet/roslyn. You’ll find support for the SHA-256 source file debug checksum algorithm command-line selector in the file at bit.ly/2hd3rF3.

Visual Studio 2015 csc.exe can build executable files only for the Microsoft .NET Framework 4 or higher. The other Visual Studio 2015 .NET Framework compiler, which is used to build executable files for versions prior to 4, doesn’t support the /checksumalgorithm switch.

Managed code PDB files store data differently than native code PDB files. Instead of using the Debug Interface Access SDK, Microsoft DiaSymReader interop interfaces and utilities can be used to read managed code PDB files. Microsoft DiaSymReader is available as a NuGet package from bit.ly/2hrLZJb.

The Roslyn project includes a utility called pdb2xml.exe, which you’ll find with its sources at bit.ly/2h2h596. This utility displays the content of a PDB in the XML format. For example, the segment in Figure 2 shows the listing of C# source code files that have been used to compile a managed code executable.

Figure 2 Displaying a Managed Code PDB in XML Format

The “8829d00f-11b8-4213-878b-770e8597ac16” GUID in the checkSumAlgorithmId field indicates that the value in the checksum field is an SHA-256 hash value for the file referenced in the name field. This GUID is defined in the Portable PDB Format Specification v0.1 (bit.ly/2hVYfEX).
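Reading such a listing programmatically might look like the following Python sketch. The SHA-256 GUID is the one quoted above; the SHA-1 GUID is the one I understand the Portable PDB specification to define. The XML layout (file elements carrying name, checkSumAlgorithmId and checksum attributes) is an assumption based on the field names mentioned here, so adjust it to the actual pdb2xml.exe output.

```python
import xml.etree.ElementTree as ET

# GUID -> algorithm name. The SHA-1 entry is an assumption from the spec.
CHECKSUM_ALGORITHMS = {
    "ff1816ec-aa5e-4d10-87f7-6f4963833460": "SHA-1",
    "8829d00f-11b8-4213-878b-770e8597ac16": "SHA-256",
}

def list_source_hashes(pdb_xml: str):
    """Return (name, algorithm, checksum) for each file element in the XML."""
    root = ET.fromstring(pdb_xml)
    results = []
    for node in root.iter("file"):
        # Accept the GUID with or without braces, in any letter case.
        guid = node.get("checkSumAlgorithmId", "").strip("{}").lower()
        algorithm = CHECKSUM_ALGORITHMS.get(guid, "unknown")
        results.append((node.get("name"), algorithm, node.get("checksum")))
    return results
```

Reporting unrecognized GUIDs as "unknown" rather than failing lets the same code read PDB files produced with algorithms it does not know about.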

Compiler Support for SHA-256

The following Visual Studio 2015 compilers support the option for the SHA-256 hashing of source code files:

cl.exe /ZH:SHA_256
ml.exe /ZH:SHA_256
ml64.exe /ZH:SHA_256
armasm.exe -gh:SHA_256
armasm64.exe -gh:SHA_256
csc.exe /checksumalgorithm:SHA256

These compilers are available from the “Developer Command Prompt for VS2015” command window of Visual Studio 2015.

Compilers that don’t target Windows platforms don’t generally use PDB files for storing their debugging information. These compilers typically produce two executable files simultaneously during a compilation run, one unstripped and the other stripped (bit.ly/2hIfvx6). Full debugging information is stored in the unstripped executable, while the stripped executable doesn’t contain any detailed debugging information. The unstripped executable may be suitable for storing the SHA-256 hashes of the processed source code files for the executable. We’re planning to reach out to creators of these other compilers to find out which approaches work best for their compilers so that non-Windows-based software using these compilers, such as Office for Android, Office for iOS, or Office for Mac, can get benefits similar to those of Windows-based software.

Use-Case Scenarios

Now let’s look at some scenarios where the source file hash values can be useful.

Retrieving Indexed Source Files of a Portable Executable (PE) Binary File. The Ssindex.cmd script (bit.ly/2haI0D6) is a utility that builds the list of (indexed) source files checked into source control, along with each file’s version information, for storing in the PDB files. If a PDB file has this version control information, you can use the srctool utility (bit.ly/2hs3WXY) with its -h option to display the information. Because the indexed source files also have their hash values embedded in the PDB file, these hash values can be used to verify the authenticity of the source files during their retrieval, as explained in KB article 3195907 (bit.ly/2hs8q0u), “How To Retrieve Indexed Source Files of a Portable Executable Binary File.” Specifically, if the hash values don’t match, something may have gone wrong during the generation of the PE/PDB pair or in the source control system. This may warrant further investigation. On the other hand, if the hash values match, it’s a strong indication that the retrieved indexed source files were used to compile the PE/PDB pair.

Matching Hash Values Produced by a Source File Static Analyzer. Today, it’s common to use automatic tools to assess the quality of software, as recommended by the Microsoft Security Development Lifecycle (SDL) for the implementation phase (bit.ly/29qEfVd). Specifically, source file static analyzers are used to scan target source code files to assess many different aspects of software quality. These static analyzers typically produce the corresponding real-time results upon scanning the target source code files. As a static analyzer scans individual source code files, it presents an excellent opportunity to also generate a strong hash (SHA-256) for each of the source code files being scanned. In fact, the Static Analysis Results Interchange Format (SARIF), proposed in the open source project at bit.ly/2ibkbwz, provides specific locations in the static analysis results for a static analyzer to produce the scanned target source code files and their SHA-256 hash values.

Given a PE file, let’s assume that the following are available:

The compiled source file hash listing from the corresponding PDB file as produced by a compiler.
The scanned source file hash listing from the corresponding static analysis result as produced by a static analyzer.

In this scenario, you can review and verify whether the two file hash listings match. If they do, the source files have been scanned for some quality assessment by a static analyzer and you don’t need to rescan the source files. Previously, in the absence of file hash listings, you might have had to rescan just to be sure that the static analyzer did a proper assessment.
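Comparing the two listings can be sketched in a few lines of Python. The example assumes each listing has already been extracted into a dictionary mapping a file path to its SHA-256 hex string. Because the PDB file records full paths from the build machine, the sketch matches on hash values rather than on paths. The function name is hypothetical.

```python
def unverified_sources(compiled_hashes, scanned_hashes):
    """Return compiled files whose hash is missing from the scan results.

    compiled_hashes: path -> hex hash, taken from the PDB file.
    scanned_hashes:  path -> hex hash, taken from the static analysis result.
    """
    # Paths can differ between machines, so compare hash values only.
    scanned = {value.lower() for value in scanned_hashes.values()}
    return [
        path
        for path, value in sorted(compiled_hashes.items())
        if value.lower() not in scanned
    ]
```

An empty result means every compiled source file was also scanned; any path returned identifies a file that would still need a scan.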

A Quicker Sanity Check in the Software Update or Hotfix Development Process. In situations where you need to release a software update to fix a quality issue found by a source file static analyzer for a released product, the static analyzer should report the absence of that quality issue in the source code files of the pending update. At a minimum, this report would confirm that the update is effective in addressing the original quality issue. In other words, it would validate the intended purpose of the software update. If desired, you or a security reviewer can take the following steps to conduct a quick validation:

1. Confirm that the original static analyzer report identifies the quality issue in question.
2. Confirm that the original static analyzer report includes the hash values of the source files that contain the quality issue.
3. Match the file hash values found in the original static analyzer report with the hash values of the source files of the released product version.
4. Obtain the updated static analyzer report produced from a scan of the source code files of the update using the same static analyzer.
5. Confirm that the previously found quality issue is absent in the static analyzer report for the update.
6. Match the file hash values in the updated static analyzer report with the hash values of the source files of the update.

During these validation steps, you don’t need to have access to the actual source code files of either the original released product or the update.
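These checks can be sketched in Python as operations on report data alone. The report structure used here (a dictionary with an "issues" collection and a "file_hashes" collection of hex strings) is purely hypothetical and stands in for whatever format your static analyzer produces. Obtaining the updated report is treated as an input rather than a check.

```python
def validate_update(issue_id, original_report, released_hashes,
                    updated_report, update_hashes):
    """Run the validation checks using reports and hash listings only."""
    original_files = set(original_report["file_hashes"])
    updated_files = set(updated_report["file_hashes"])
    return {
        "issue reported originally":
            issue_id in original_report["issues"],
        "original report lists file hashes":
            bool(original_files),
        "original hashes match released sources":
            original_files <= set(released_hashes),
        "issue absent after the update":
            issue_id not in updated_report["issues"],
        "updated hashes match update sources":
            bool(updated_files) and updated_files <= set(update_hashes),
    }
```

The update passes the sanity check when all(validate_update(...).values()) is True, and no source code file is opened at any point.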

Constructing the Source Code Delta Between Two Versions of Software. Reviewing a full set of source code can take time. However, in some cases, when there are changes to the source code, a full review of the source code isn’t necessary. As a result, you may be asked only for the source code delta. This request is certainly reasonable as there’s no rational basis for a repeat analysis of any portions unchanged since the last review.

Previously, without the cryptographically strong hash values of the source code files, it would be difficult to construct the delta subset with any precision. Even if you had a delta subset to offer, the subject matter experts might have had little confidence in your ability to accurately create the delta subset. But that’s no longer the case. With the strong cryptographic hash values of the source code files, you can use the following steps to create the delta subset:

1. Obtain the pool, Pool X for example, of hash values of all the source code files for the original product version.
2. Make an exact copy of the file directory, Dir A for example, that contains the source code enlistment of the subsequent product version from which the delta subset will be constructed.
3. Prepare a final file folder destination, Dir B for example, to hold only the delta file subset.
4. Sort through all the files in Dir A:
   a. If the hash value of a file matches a hash value in Pool X, do nothing and go to the next file.
   b. If the hash value of a file doesn’t match a hash value in Pool X, copy the file to Dir B before moving to the next file.
5. Validate that all the files in Dir B have hash values that match the corresponding hash values of the source files of the subsequent product version.
6. Make the content of Dir B the delta source file subset of the subsequent product version.
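The steps above can be sketched in Python as follows. This is a minimal illustration, not a production tool: it assumes Pool X is available as a collection of SHA-256 hex strings, that Dir B lies outside Dir A, and that hashing whole files in memory is acceptable. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_delta(dir_a, dir_b, pool_x):
    """Copy every file in dir_a whose hash is not in pool_x into dir_b."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    pool = {value.lower() for value in pool_x}
    copied = []
    for source in sorted(p for p in dir_a.rglob("*") if p.is_file()):
        digest = sha256_of_file(source)
        if digest in pool:
            continue  # unchanged since the original version
        target = dir_b / source.relative_to(dir_a)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Validate that the copy still matches the file it came from.
        if sha256_of_file(target) != digest:
            raise IOError(f"hash mismatch after copying {source}")
        copied.append(target)
    return copied
```

Note that a file that was merely renamed or moved keeps its hash, so it is treated as unchanged and left out of the delta.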

Generating the Hash

Now let’s take a look at how you do file hashing with the Visual Studio compilers. To do this, I’ll use the “Hello, World” application creation example from the online Visual Studio documentation (bit.ly/2haPupF) to:

Show where in the output PDB file you can find the hash values of the compiled source files.
Use the certutil tool (bit.ly/2hIrnPR) to compute the source file hash values to match against the ones found in the PDB file.

To start, I create a new Win32HelloWorld application project in the Visual Studio 2015\Projects folder. In this Win32HelloWorld project, there’s only one C++ source file—Win32HelloWorld.cpp—as shown in Figure 3.

Figure 3 Win32HelloWorld.cpp

As you can see, Win32HelloWorld.cpp includes the main function, which displays the “Hello” text.

After making a build for my Win32HelloWorld project, I end up with the W32HelloWorld.exe and W32HelloWorld.pdb files in the Visual Studio 2015\Projects\W32HelloWorld\x64\Debug folder.

The cvdump tool, used with its -sf option against the W32HelloWorld.pdb file, shows the Win32HelloWorld.cpp file and its MD5 hash value in the output shown in Figure 4.

Figure 4 Cvdump Output Showing Win32HelloWorld.cpp and Its MD5 Hash Value

The hash value is MD5 because MD5 is the default algorithm for the Visual Studio 2015 compiler, cl.exe. To switch the source file hashing algorithm to SHA-256, I need to supply the /ZH:SHA_256 option to cl.exe. I can do this by adding “/ZH:SHA_256” in the Additional Options box on the Property Pages of the Win32HelloWorld project, as shown in Figure 5.

Figure 5 Switching the Source File Hashing Algorithm to SHA-256

After rebuilding in Visual Studio, I have a new PE/PDB pair of W32HelloWorld.exe and W32HelloWorld.pdb in the Visual Studio 2015\Projects\W32HelloWorld\x64\Debug folder. Now, using the cvdump tool with its -sf option against the new W32HelloWorld.pdb file displays the Win32HelloWorld.cpp file and its SHA-256 hash value in the output, as shown in Figure 6.

Figure 6 Cvdump Showing Win32HelloWorld.cpp and Its SHA-256 Hash Value

Now, I can go back to the Win32HelloWorld.cpp file in the Visual Studio 2015\Projects\W32HelloWorld\W32HelloWorld folder to check out its SHA-256 hash value. Using the certutil tool with its -hashfile verb against the Win32HelloWorld.cpp file for SHA-256, I get the SHA-256 hash value shown in Figure 7.

Figure 7 Getting the SHA-256 Hash Value with Certutil

Clearly, it matches the SHA-256 value recorded in the W32HelloWorld.pdb file. This strongly indicates that the Win32HelloWorld.cpp file was indeed used to compile the W32HelloWorld.exe application, as expected.
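When you automate this comparison instead of reading the two values by eye, formatting differences between tools need to be handled. The Python sketch below assumes that one tool may print the hash in uppercase and that some certutil versions separate the bytes with spaces. Both are assumptions about tool output, so check what your versions actually print. The function names are hypothetical.

```python
def normalize_hex(text: str) -> str:
    """Remove all whitespace and lowercase a hex string."""
    return "".join(text.split()).lower()

def same_hash(first: str, second: str) -> bool:
    """Compare two hex hash strings regardless of spacing or letter case."""
    return normalize_hex(first) == normalize_hex(second)

# Hypothetical, shortened values used only to show the normalization.
print(same_hash("BA 78 16 BF", "ba7816bf"))
```

Normalizing both sides keeps the comparison focused on the hash value itself rather than on how a particular tool chose to display it.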

For more details on related public tools to work with native code and managed code PE/PDB file pairs, see KB article 3195907, “How To Retrieve Indexed Source Files of a Portable Executable Binary File” (bit.ly/2hs8q0u).

Wrapping Up

I hope this article has shown some potential benefits of a stronger linkage between source code files and the PE file that was compiled with them. You can create a stronger linkage by having the compiler hash the source code files during compilation with the strongest hashing algorithm available—SHA-256. The actual hash values of the source code files produced by the compiler literally become the unique identifiers of the source code files that are used to compile an executable.

Once you understand the value of these unique identifiers, you can use them in different software development lifecycle schemes for tracking, processing and controlling source code files that have a strong linkage to the specific executable files, resulting in a higher level of end-user confidence in the executable files.

Mike Lai just entered his 20th year of Microsoft employment. He is grateful to Microsoft for the various opportunities to contribute to the functionality and the engineering aspects of many of its products. He would like to thank his current management in Trustworthy Computing for their patience in allowing his ideas to mature and be gradually incorporated into released products, and for their support of his participation in Information and Communications Technology security standards organizations.

Thanks to the following Microsoft technical experts for reviewing this article: Scott Field, Mike Grimm, Sue Hotelling, Ariel Netz, Richard Ward and Roy Williams.
