How It Works (It Just Runs Faster): Non-Volatile Memory SQL Server Tail Of Log Caching on NVDIMM

Article
11/08/2016

Note: We received feedback that there was some confusion on us calling this functionality "tail of the log caching" because our documentation and prior history has referred to the tail of the log as the portion of the hardened log that has not been backed up. This feature is actually officially called Persisted Log Buffer and is an acceleration of the hardening of the log cache as blogged by Kevin Farlee at https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/12/02/transaction-commit-latency-acceleration-using-storage-class-memory-in-windows-server-2016sql-server-2016-sp1/

SQL Server 2016 Service Pack 1 (all SKUs), in combination with Windows Server 2016 (All SKUs) or Windows 10 Client introduces non-volatile memory support for the tail of the log file (LDF) which can significantly increase transaction throughput.

Announcement and Syntax Example: https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/12/02/transaction-commit-latency-acceleration-using-storage-class-memory-in-windows-server-2016sql-server-2016-sp1/

What does it mean to go faster?

To highlight what it means to use NVDIMM look no farther than response times.

Spinning, disk drives are measured in 4ms+ response times
SSD media are measured in less than 4ms response times, usually a 1ms range
PCI NVMe SSD is measured in microseconds
NVDIMM is measured in nanoseconds

There are two ways to take advantage of non-volatile storage (NVDIMM) for SQL Server (Direct access or Block Device storage.)

Block storage is what you think of today as disk access
Direct access (/dax) takes advantage of direct memory access

How much faster you can go using direct access? Just take a look at the following I/O per second rate (IOPS) difference.

This post is targeted at explaining how the tail of the log is optimized to take advantage of direct access.

Create / Alter Database … LOG ON

SQL Server 2016 has been enhanced along with Windows 2016 to detect a direct access volume (/dax.) When you create or alter the SQL Server database you simply add the LOG ON clause, pointing to the DAX enabled volume. SQL Server will detect the request on the DAX based volume and create the non-volatile, tail of log cache of fixed size for the database. From all external aspects (DMVs for example) the tail of log cache looks like another file that is part of the database’s LOG file group.

SQL works with the operating system to detect the non-volatile storage devices and leverage them for the tail of the log. Then you use the standard ALTER LOG or CREATE DATABASE with LOG ON to enable the behavior for the database.

Tail Of Log Caching

Historically I would describe the tail of the log as the portion of the log (LDF) that is still in log cache. That portion of the log that is actively receiving transaction activity and has yet to be flushed to stable media. As the following diagram shows the SQL Server Log Cache is in memory. Various log blocks can be in motion to stable media. This is considered the tail of the log. SQL Server uses techniques such as parity bits during crash recovery to determine those blocks successfully written.

The primary design behind SQL Server’s, NVDIMM, Direct Access capabilities is to use the NVDIMM as a stable cache for the tail of log write activities. Currently the available sizes of non-volatile memory are 8GB or smaller. This is commonly too small to store the entire SQL Server log on so it makes sense to provide that tail of log optimizations. There is also a cost factor compared to SSD storage so leveraging the non-volatile memory for tail of the log provides performance with viable cost alternatives.

Non-volatile storage can be directly mapped into the process address space (looks like a VirtualAlloc you are used to.) This means that SQL Server has the ability to memcpy the log record directly to non-volatile storage at CPU and memory speeds, avoiding the block storage stack (drivers, filter managers, anti-virus, etc.) This also means the behavior of a commit changes and the density of log blocks is better.

SQL Server 2016 is designed to detect non-volatile storage devices. When you alter the database log (add a secondary log file), onto the non-volatile storage, SQL Server ignores the file size specification and creates the tail of log cache on the device for you. To the database administrator the tail of the log looks just like a file in the file list. If you loose the non-volatile storage you lose the tail of the log in the same way you lose it on a disk based subsystem.

What is created is storage for the log cache (NV LCs) and an additional WAL protocol, status array (Status LCs) that parallel the traditional LC (log cache) memory blocks.

As a quick review, without (non-volatile) tail caching,the database contains a set of log cache blocks and no NV or Status blocks. As modifications occur the log cache is shared among the various workers and when a commit occurs SQL Server closes out the targeted log caches. To close out the log cache new activity is prevented on the block, all users of the block must finish their work, the block is formatted and written to disk. Once the write is complete all commits waiting for the flush are allowed to continue.

When non-volatile caching is present the NV and Status blocks are created on the non-volatile device, changing the tail of the log capabilities. The common log cache still exists and is used to store and write the log records to the standard I/O devices and files. The NV LCs hold a mirrored copy of the log records stored in the matching LCs. When a log record is written to the LC is it also written to the NV LC. These are just memcpy operations to the in-memory LC and non-volatile backed NV LCs. As fast at the memcpy can occur to the NV LC is how fast you can achieve stable storage of the log record.

Without the NV LC, the log record is stored in the LC but is not considered stable until the LC is flushed to stable media. With NV LC at play, as soon as the memcpy is completed in the NV LC and noted in the Status LC, the log record can be considered stable. Your transaction no longer has to wait for other log records on the LC to complete, formatted and flushed. Your transaction can move forward as soon as the memcpy’s occur.

The Status LC is used to maintain proper WAL protocol. As you can imagine SQL Server needs to know what records were fully copied in the NV LC. For example, if you are in the middle of a memcpy and the machine encounters a power outage there is a very small possibility of a partial copy of the log record. When the power is restored SQL Server needs to understand the log record in the NV LC is not valid and ignore it properly.

The Status LC is used to signal which log records in the NV LC are valid. Once a memory copy is completed in the NV LC the status is updated to indicate that log record has been fully stored in stable (non-volatile) storage.

With non-volatile, tail of the log storage the pattern is

memcpy to LC
memcpy to NV LC
Set status
Return control to caller (commit is now valid)

You can clearly see that commits at the speed of memcpy will positively impact the application throughput but there are other benefits as well. LC packing is the first that comes to mind. In the traditional commit the LC would have been formatted and flushed to disk. The LC may have only been 40% full but to honor the commit request the log records have to be flushed to stable media. This means 60% of the LC was not used and you engaged the I/O subsystem sooner than you really needed to.

With the NV LC the LC no longer has to be flushed to stable media during the commit request. The memcpy to NV LC is stable media and SQL Server can allow the worker to continue. This means log records can continue to be added to the LC and NV LC until it is truly full. Then the LC is formatted and flushed to disk, meaning fewer I/O requests per commit requests and density increases for the LCs.

Today an administrator often monitors the log I/O wait (WRITELOG) behavior as a key indicator of application throughput. With the NV LC, the goal is to always have them available so the commits never have to wait on I/O. Once the LC is full, formatted and flushed (async activity) the matching LC, NV LC and Status LC can be reused. Reducing the number of LC flushes while increasing the density of the LC allows SQL Server to significantly avoid I/O path stalls that would have been encountered with a disk only based deployment.

Recovery

Recovery is another interesting aspect of tail-of-log-caching. During crash recovery the log cache is rebuilt. The LCs are read from the disk based storage and then the NV LCs are applied to the LCs. Using the Status LCs SQL Server understands which log records are valid in the NV LC and can memcpy the log records into the LC. The fix-ups occur and then recovery is allowed to continue with the latest information. This often means that with NV LCs you have a smaller data window. (I don’t want to say loss because the application can perform retries.) In a traditional installation the entire LC would have been in memory when the power outage occurred, all log records lost but transactional consistency is fully maintained. With NV LC the in-memory version of the LC is already on stable media so only the very ragged, in-motion, memcpy’s to NV LC become the unknown window.

Restore

Restore to and from a traditional or NV setup is seamless. The NV LC is only a cache so backup takes the LCs and restores the LCs just as it is designed to do today. When restoring the backup has to be valid so all LCs are intact and NV LCs are not necessary to recover the database. This also allows you to move between systems with and without non-volatile storage.

AG Secondary Only – Not Primary
For those using AlwaysOn you will find that the primary can’t take advantage of non-volatile, tail-of-log-cache but the secondary can. The current shipping mechanisms used by AlwaysOn databases are based on the LCs. When the log cache is closed out and flushed it is sent to the secondary instance. There is currently no protocol to mirror the NV LCs to a secondary so when a database is enabled for AlwaysOn the primary ignores NV LC activity.

The good news is the secondary can take advantage of the NV LC. As LCs arrive at the secondary, the secondary must respond that the LC has been properly stored on stable media. If you enabled non-volatile capability on the secondary the LC flush occurs directly to NV LC. You can off load the stable media flush (disk based) to the background and respond to the commit request faster. It also allows the secondary to leverage the NV LC activity for recovery purposes.

Block Storage

SQL Server can use storage spaces and block storage as provided by Windows Server 2016. SQL Server can use block storage as a normal I/O path. Leveraging common Win32 APIs, drivers, mini-filters and such to store SQL Server data. Use of Windows Server 2016, block storage, is fully supported. SQL Server won’t leverage the path for the tail of log optimizations but treats the device as it would a normal disk/SSD based device, achieving the increased speed from the device itself.

To use non-volatile memory for block storage reference:

Additional References

What’s Next for Persistent Memory? (Sampling of 16GB NV-DIMMs)
SK Hynix Develops the World's Highest Density 16 GB NVDIMM

Bob Dorr - Principal SQL Server Software Engineer

How It Works (It Just Runs Faster): Non-Volatile Memory SQL Server Tail Of Log Caching on NVDIMM

Additional resources