Custom directory enumeration in .NET Core 2.1

03/09/2018

As promised in my previous post, this post covers the new enumeration extensibility points.

The extensibility API we're providing is meant to allow building high-performance custom enumerators. The performance gains primarily come through utilizing the new Span types and ref structs to allow relatively safe access to native data, avoiding unnecessary allocations. The API has been designed to be as simple as possible while still maintaining the low-allocation, high-performance primary design goals.

FileSystemEntry

File system data is wrapped in the System.IO.Enumeration.FileSystemEntry struct. It provides the same set of data that the FileSystemInfo classes do, along with some enumeration-specific data:
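
As a rough illustration, the sketch below copies a few of those values out of an entry. The class name EntrySnapshotExample, the helper Snapshot, and the tuple shape are made up for this example; FileName, IsDirectory, Length, and LastWriteTimeUtc are properties on FileSystemEntry.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class EntrySnapshotExample
{
    // Hypothetical helper: copies a handful of entry values into a tuple.
    // FileSystemEntry is a ref struct, so values must be copied out in the transform.
    public static IEnumerable<(string Name, bool IsDirectory, long Length, DateTimeOffset LastWriteUtc)>
        Snapshot(string directory)
    {
        return new FileSystemEnumerable<(string Name, bool IsDirectory, long Length, DateTimeOffset LastWriteUtc)>(
            directory,
            (ref FileSystemEntry entry) => (
                entry.FileName.ToString(),   // FileName is a ReadOnlySpan<char>; ToString() allocates
                entry.IsDirectory,
                entry.Length,
                entry.LastWriteTimeUtc));
    }
}
```

Each property read here has a cost on some platforms, so a real transform would normally copy only what it needs.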

FileSystemEnumerable

There are a few ways to get FileSystemEntry data. The simplest way is through the System.IO.Enumeration.FileSystemEnumerable class:
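
To give a feel for the shape of the class, here is a minimal sketch that uses its three main pieces: the transform delegate passed to the constructor, ShouldIncludePredicate, and ShouldRecursePredicate. The class name, the method FilesOutside, and the skipName parameter are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class EnumerableShapeExample
{
    // Hypothetical helper: full paths of all files, without descending into
    // any directory whose name equals skipName (ordinal comparison).
    public static IEnumerable<string> FilesOutside(string directory, string skipName)
    {
        return new FileSystemEnumerable<string>(
            directory,
            // Transform: turns the transient entry into the result type.
            (ref FileSystemEntry entry) => entry.ToFullPath(),
            new EnumerationOptions { RecurseSubdirectories = true })
        {
            // Include: decides whether an entry is returned at all.
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory,
            // Recurse: decides whether a directory entry is descended into.
            ShouldRecursePredicate = (ref FileSystemEntry entry) =>
                !entry.FileName.SequenceEqual(skipName.AsSpan())
        };
    }
}
```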

Basic usage isn't too complicated. Let's suppose you wanted to just get the names (not the full paths) of all the files and directories in a given directory:
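
A minimal sketch of that scenario might look like the following. The class and method names are placeholders chosen for this example.

```csharp
using System.Collections.Generic;
using System.IO.Enumeration;

public static class NameExample
{
    // Returns just the names of the files and directories directly under 'directory'.
    public static IEnumerable<string> GetNames(string directory)
    {
        return new FileSystemEnumerable<string>(
            directory,
            // The only allocation per entry is the string created from the name span.
            (ref FileSystemEntry entry) => entry.FileName.ToString());
    }
}
```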

You can, of course, write the equivalent functionality using the existing APIs that return FileSystemInfo objects (e.g. getting .Name off of each), but that comes at a significant cost. A custom enumerable only takes a single allocation for each MoveNext (the file name). Getting the FileSystemInfo objects via the existing APIs allocates much more (the Info class, the full path, etc.) and is slower, particularly on Unix, where it causes another fstat call to fill out data you won't ever use.
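
For comparison, the FileSystemInfo-based approach described above could be sketched like this. The class and method names are placeholders; DirectoryInfo.EnumerateFileSystemInfos() is the existing API.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LegacyNameExample
{
    // Same result as a name-only custom enumerable, but every entry
    // materializes a FileInfo or DirectoryInfo object before .Name is read.
    public static IEnumerable<string> GetNamesViaInfo(string directory)
    {
        return new DirectoryInfo(directory)
            .EnumerateFileSystemInfos()
            .Select(info => info.Name);
    }
}
```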

To take it another step, let's filter to just files (no directories):
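
One way to sketch this is to set ShouldIncludePredicate on the enumerable. The class and method names below are placeholders for this example.

```csharp
using System.Collections.Generic;
using System.IO.Enumeration;

public static class FileNameExample
{
    // Returns the names of files (not directories) directly under 'directory'.
    public static IEnumerable<string> GetFileNames(string directory)
    {
        return new FileSystemEnumerable<string>(
            directory,
            (ref FileSystemEntry entry) => entry.FileName.ToString())
        {
            // Entries rejected here never reach the transform, so no string is allocated for them.
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory
        };
    }
}
```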

Here is another, potentially more useful, example. Let's say you want to create a helper that allows you to get all files with a set of given extensions:
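
A possible sketch of such a helper follows. The names are placeholders, and it assumes the extensions are passed with their leading period (for example ".txt"). It relies on the span-based Path.GetExtension overload so that no string is allocated while filtering.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class ExtensionFilterExample
{
    // Returns full paths of files whose extension matches one of 'extensions'
    // (case-insensitive). Assumes each extension includes the leading '.'.
    public static IEnumerable<string> GetFilePathsWithExtensions(
        string directory, bool recursive, params string[] extensions)
    {
        return new FileSystemEnumerable<string>(
            directory,
            (ref FileSystemEntry entry) => entry.ToFullPath(),
            new EnumerationOptions { RecurseSubdirectories = recursive })
        {
            ShouldIncludePredicate = (ref FileSystemEntry entry) =>
            {
                if (entry.IsDirectory)
                    return false;

                // Span-based overload: slices the name instead of allocating a string.
                ReadOnlySpan<char> fileExtension = Path.GetExtension(entry.FileName);
                foreach (string extension in extensions)
                {
                    if (MemoryExtensions.Equals(
                            fileExtension, extension.AsSpan(), StringComparison.OrdinalIgnoreCase))
                        return true;
                }
                return false;
            }
        };
    }
}
```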

Let's suppose you want to count the number of files in a directory. With the new APIs you can write a solution that cuts allocations by 200x or more (yes, by a factor of 200).
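
A sketch of such a counter, using LINQ's Count() for brevity, could look like this. The class and method names are placeholders.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;
using System.Linq;

public static class CountExample
{
    // Counts files under 'directory', optionally including subdirectories.
    public static int CountFiles(string directory, bool recursive)
    {
        IEnumerable<int> ones = new FileSystemEnumerable<int>(
            directory,
            // The transform has to return something; the value itself is never used.
            (ref FileSystemEntry entry) => 1,
            new EnumerationOptions { RecurseSubdirectories = recursive })
        {
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory
        };

        return ones.Count();
    }
}
```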

The example is a little strange as we need some sort of output transform. I picked int as the type and returned 1, but it could be anything, including string and string.Empty or null.

Very similar to the above, we could total up file sizes.
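
As a sketch, with placeholder names, the transform can return entry.Length and LINQ's Sum() can add the values up:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;
using System.Linq;

public static class SizeExample
{
    // Totals the length in bytes of all files under 'directory'.
    public static long CountFileBytes(string directory, bool recursive)
    {
        IEnumerable<long> lengths = new FileSystemEnumerable<long>(
            directory,
            (ref FileSystemEntry entry) => entry.Length,
            new EnumerationOptions { RecurseSubdirectories = recursive })
        {
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory
        };

        return lengths.Sum();
    }
}
```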

The performance characteristics of this last example differ quite a bit between Unix and Windows. Allocations are roughly equivalent, but Unix does not provide the length when enumerating directories, so we have to make another call to get it. Because .NET was originally designed for Windows, the contents of FileSystemInfo (and therefore of FileSystemEntry) match what Windows gives back during enumeration. Unix doesn't give much more than the file name. To allow for this, a number of the properties are lazy in the Unix implementation: time stamps, length, and attributes are all in this bucket. Even the file name is lazy in that we don't convert the raw UTF-8 data to char until you access it. These details are important to know for multiple reasons:

Calling properties you don't strictly need can have non-trivial cost
Sitting on the struct without calling properties can give different results depending on how long you wait (notably with time stamps; we go out of our way to keep attributes constant)
Using IsDirectory and IsHidden is better than using Attributes to check those states
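
To illustrate the last point, here is a small sketch of a predicate that relies on IsDirectory and IsHidden instead of testing bits in Attributes. The names are placeholders, and clearing AttributesToSkip is an assumption made for this example so that the predicate, rather than the options, does the filtering.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class VisibleDirectoryExample
{
    // Returns the names of non-hidden directories directly under 'directory'.
    public static IEnumerable<string> GetVisibleDirectoryNames(string directory)
    {
        return new FileSystemEnumerable<string>(
            directory,
            (ref FileSystemEntry entry) => entry.FileName.ToString(),
            // Assumption for this sketch: skip nothing up front, filter in the predicate.
            new EnumerationOptions { AttributesToSkip = 0 })
        {
            // Preferred over checking (entry.Attributes & FileAttributes.Directory) != 0.
            ShouldIncludePredicate = (ref FileSystemEntry entry) =>
                entry.IsDirectory && !entry.IsHidden
        };
    }
}
```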

FileSystemEnumerator

FileSystemEnumerable is a simple IEnumerable wrapper around FileSystemEnumerator. FileSystemEnumerable is meant to be simpler and provide a model that limits the number of types via delegate usage. FileSystemEnumerator has a few more complicated options that can be used for even more advanced scenarios.

Using the enumerator directly you can (if desired) more easily track when directories finish and have more control over errors. As the errors are native error codes, you would need to check the platform in addition to the code to know how to respond. This level of control is not intended to be easy to use or commonly needed.
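
A minimal sketch of a derived enumerator is shown below. The class name DirectoryTrackingEnumerator, its FinishedDirectories and Errors lists, and the Run helper are all hypothetical; the overridden members (ShouldIncludeEntry, TransformEntry, OnDirectoryFinished, ContinueOnError) come from FileSystemEnumerator. Continuing on every error is a choice made only to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public sealed class DirectoryTrackingEnumerator : FileSystemEnumerator<string>
{
    // Hypothetical bookkeeping, for illustration only.
    public List<string> FinishedDirectories { get; } = new List<string>();
    public List<int> Errors { get; } = new List<int>();

    public DirectoryTrackingEnumerator(string directory)
        : base(directory, new EnumerationOptions { RecurseSubdirectories = true })
    {
    }

    protected override bool ShouldIncludeEntry(ref FileSystemEntry entry) => !entry.IsDirectory;

    protected override string TransformEntry(ref FileSystemEntry entry) => entry.ToFullPath();

    // Called as the enumerator finishes with a directory.
    protected override void OnDirectoryFinished(ReadOnlySpan<char> directory)
        => FinishedDirectories.Add(directory.ToString());

    // 'error' is a native error code, so its meaning depends on the platform.
    protected override bool ContinueOnError(int error)
    {
        Errors.Add(error);
        return true; // keep enumerating; a real implementation would inspect the code
    }

    // Hypothetical convenience wrapper that drains the enumerator.
    public static List<string> Run(string directory, out DirectoryTrackingEnumerator enumerator)
    {
        var results = new List<string>();
        enumerator = new DirectoryTrackingEnumerator(directory);
        using (enumerator)
        {
            while (enumerator.MoveNext())
                results.Add(enumerator.Current);
        }
        return results;
    }
}
```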

FileSystemName

This helper class provides filename matching methods.

Both Matches* methods allow escaping with the backslash character '\'. MatchesWin32Expression() is a little complicated. It matches according to [MS-FSA] 2.1.4.4 Algorithm for Determining if a FileName Is in an Expression, which is the algorithm that Windows actually uses under the covers (see RtlIsNameInExpression). If you want to match the way Win32 does, you first have to call TranslateWin32Expression() to get your '*' and '?' translated to the appropriate '>', '<', and '"' characters. MatchesSimpleExpression() is the recommended style of matching, as Win32 rules aren't easy to intuit.
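
The two styles can be sketched side by side as follows. The wrapper class and method names are placeholders; the FileSystemName calls are the ones described above.

```csharp
using System;
using System.IO.Enumeration;

public static class MatchExample
{
    // Simple matching: '*' and '?' behave the way most people expect.
    public static bool SimpleMatch(string expression, string name)
    {
        return FileSystemName.MatchesSimpleExpression(
            expression.AsSpan(), name.AsSpan(), ignoreCase: true);
    }

    // Win32 matching: the expression must be translated first.
    public static bool Win32Match(string expression, string name)
    {
        string translated = FileSystemName.TranslateWin32Expression(expression);
        return FileSystemName.MatchesWin32Expression(
            translated.AsSpan(), name.AsSpan(), ignoreCase: true);
    }
}
```

One place the two differ is the expression "*.*" against a name with no period, such as "readme": the simple matcher requires a literal '.', while the Win32 translation treats "*.*" as "match everything".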

That's the summary of the changes we've introduced in 2.1.

FAQ

Why is this so complicated?

This isn't intended to be a common-use API. The existing APIs will be kept, maintained, and extended based on customer demand. We don't want to:

Have scenarios blocked waiting for new APIs to work their way through the system
Have to write "normal" APIs to address more corner cases

In order to make this a usable extension point we have to sacrifice some usability to get the necessary characteristics. Note that what people build on this will directly impact our future designs of the standard, usability-focused APIs.

Why are you using LINQ in your examples?

For example clarity. Some of the examples above could be optimized further.

Why aren't you providing an `X` matcher?

We want to only provide matchers that have broad applicability. Based on feedback we can and will consider adding new matchers in the future.

How do I get other platform specific data?

This is something we're investigating for future improvements. We might, for example, expose the UTF-8 data via another interface off of the entry data (or some other mechanism).