Performance Tuning for Data Deduplication

Data Deduplication involves finding and removing duplication within data without compromising its fidelity or integrity. The goal is to store more data in less space by segmenting files into small variable-sized chunks (32–128 KB), identifying duplicate chunks, and maintaining a single copy of each chunk. Redundant copies of the chunk are replaced by a reference to the single copy. The chunks are compressed and then organized into special container files in the System Volume Information folder.

In this topic:

  • Types of data on deduplication-enabled volumes

  • Types of job schedules

  • Storage and CPU

  • Memory

  • I/O throttling

  • Garbage collection

Types of data on deduplication-enabled volumes

On general-purpose file servers, the data best suited to Data Deduplication is files that are rarely written to. Hosting apps that frequently write data to files (such as Microsoft SQL Server or Microsoft Exchange Server) on deduplication-enabled volumes is not recommended. Frequently changing files can cause the deduplication job to do unnecessary optimization work, which may affect performance.

For virtual machines hosted on deduplicated volumes, the only officially supported virtual machines are those running client operating systems supported by VDI.

Types of job schedules

There are three types of Data Deduplication jobs:

  • Optimization (daily): Identifies duplicate data and optimizes duplicates away.

  • Garbage Collection (weekly): Removes unreferenced chunks of data that were part of deleted files.

  • Scrubbing (weekly): Identifies and fixes any corruption.
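
To check how these job types are currently scheduled on a particular server, something like the following sketch may help. It assumes the Data Deduplication feature and its Windows PowerShell module are installed; the schedule names and times returned will vary by server.

```powershell
# List every configured deduplication schedule (all job types).
Get-DedupSchedule

# Narrow the list to one job type, here the optimization schedules.
Get-DedupSchedule -Type Optimization

# Show deduplication jobs that are currently running or queued.
Get-DedupJob
```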

Storage and CPU

The Data Deduplication subsystem schedules one single-threaded job per volume, depending on system resources. To achieve optimal throughput, consider configuring multiple deduplication volumes, up to the number of CPU cores on the file server.
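
As a rough way to compare the number of deduplicated volumes with the core count mentioned above, a sketch along these lines could be used. It assumes the Data Deduplication module is installed and that summing NumberOfCores across processor sockets is an acceptable measure of "CPU cores" for this guideline.

```powershell
# Count physical cores across all processor sockets.
$cores = (Get-CimInstance -ClassName Win32_Processor |
    Measure-Object -Property NumberOfCores -Sum).Sum

# Count the volumes that currently have deduplication enabled.
$dedupVolumes = @(Get-DedupVolume).Count

"{0} deduplicated volume(s), {1} CPU core(s)" -f $dedupVolumes, $cores

if ($dedupVolumes -gt $cores) {
    # Only a reminder of the guideline; this check changes no settings.
    "The volume count exceeds the core count suggested as the upper bound."
}
```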

Memory

The amount of memory required by the deduplication optimization job is directly related to the number of optimization jobs that are running. During the optimization process, approximately 1 to 2 GB of RAM is necessary to process 1 TB of data per volume at maximum speed.

For example, a file server running concurrent optimization jobs on 3 volumes of 1 TB, 1 TB, and 2 TB of data respectively would need the following amount of memory, assuming a normal amount of file data changes:

Volume                   Volume size                             Memory used
Volume 1                 1 TB                                    1-2 GB
Volume 2                 1 TB                                    1-2 GB
Volume 3                 2 TB                                    2-4 GB
Total for all volumes    (1 + 1 + 2) TB × 1 GB to 2 GB per TB    4-8 GB RAM

By default, deduplication optimization uses up to 50% of a server’s memory. In this example, the optimization jobs need 4 to 8 GB, so having 8 to 16 GB of memory available on the file server would allow deduplication to allocate the expected amount of RAM during optimization. Allowing optimization to use more memory would increase optimization throughput. The amount of RAM given to a job can be adjusted by using the following Windows PowerShell cmdlet, where -Memory is the percentage of server memory that the job may use:

Start-DedupJob -Volume <volume> -Type Optimization -Memory <50 to 80>

Machines where a very large amount of data is expected to change between optimization jobs may require up to 3 GB of RAM per 1 TB of disk space.
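
The arithmetic in this section can be sketched as a small helper. The function name Get-DedupMemoryEstimate and its parameter names are hypothetical and are not part of the Data Deduplication module; the defaults simply restate the figures above (1 to 2 GB of RAM per 1 TB of data, and a 50% memory limit).

```powershell
# Hypothetical helper: estimates RAM for concurrent optimization jobs.
function Get-DedupMemoryEstimate {
    param(
        [Parameter(Mandatory)] [double[]] $VolumeSizesTB,  # data per volume, in TB
        [double] $MinGBPerTB = 1,       # lower bound from the guidance above
        [double] $MaxGBPerTB = 2,       # use 3 for very high rates of data change
        [double] $MemoryPercent = 50    # share of server RAM the jobs may use
    )

    # Total data processed when all volumes are optimized concurrently.
    $totalTB = ($VolumeSizesTB | Measure-Object -Sum).Sum
    $jobMin  = $totalTB * $MinGBPerTB
    $jobMax  = $totalTB * $MaxGBPerTB

    [pscustomobject]@{
        TotalDataTB       = $totalTB
        JobMemoryMinGB    = $jobMin
        JobMemoryMaxGB    = $jobMax
        # Server RAM needed so that the jobs fit within the percentage limit.
        ServerMemoryMinGB = $jobMin * 100 / $MemoryPercent
        ServerMemoryMaxGB = $jobMax * 100 / $MemoryPercent
    }
}

# The three-volume example from this section: 1 TB, 1 TB, and 2 TB.
Get-DedupMemoryEstimate -VolumeSizesTB 1, 1, 2
```

For the three volumes in the example, this gives 4 to 8 GB for the jobs and 8 to 16 GB of server memory at the 50% default. Passing -MaxGBPerTB 3 models the high-change case described above.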

I/O throttling

All deduplication jobs are I/O intensive and may affect the performance of other apps when running. To alleviate potential problems, the default schedules can be modified by using Server Manager or by using the following Windows PowerShell command:

Set-DedupSchedule
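
As an illustration only, the sketch below moves an existing optimization schedule to a weekend window. The schedule name "ThroughputOptimization", the days, and the times are assumptions; run Get-DedupSchedule first and substitute a name and job type that it returns.

```powershell
# Assumption: an Optimization schedule named "ThroughputOptimization" exists.
# Check with Get-DedupSchedule and replace the name if it does not.
Set-DedupSchedule -Name "ThroughputOptimization" -Type Optimization `
    -Days Saturday, Sunday -Start "01:00" -DurationHours 6

# Confirm the change.
Get-DedupSchedule -Name "ThroughputOptimization"
```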

To further alleviate I/O performance issues introduced by the most I/O-intensive optimization jobs, I/O throttling may be manually enabled for a specific job to balance system performance. Use the following Windows PowerShell command to control I/O throttling:

Start-DedupJob -Volume <volume> -Type Optimization -InputOutputThrottleLevel <Level>
where <Level> can be: {None | Low | Medium | High | Maximum}

With Maximum, deduplication jobs run with maximum throttling and make no progress in the presence of other I/O, which results in very slow optimization. A throttle level of None is the most intrusive: it processes deduplication jobs fastest, at the expense of all other I/O activity on the system. By default, the optimization job runs with a throttle level of Low.
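
A concrete invocation might look like the following sketch. The drive letter E: is an assumption standing in for any deduplication-enabled volume, and Medium is chosen only as an example of a level between the two extremes described above.

```powershell
# Assumption: E: is a deduplication-enabled volume on this server.
Start-DedupJob -Volume "E:" -Type Optimization -InputOutputThrottleLevel Medium

# Check the progress of jobs running or queued for that volume.
Get-DedupJob -Volume "E:"
```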

Garbage collection

For file servers that have large amounts of data that is frequently created and deleted, you might need to set the garbage collection job schedule to run more frequently to keep up with the changes and delete the stale data.

Custom Garbage Collection schedules can be set by using Server Manager or by using this Windows PowerShell command:

New-DedupSchedule
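
For instance, an additional midweek garbage collection run could be added with a sketch like this one. The schedule name "MidweekGC", the day, the start time, and the duration are all hypothetical values chosen for illustration.

```powershell
# Hypothetical schedule: an extra garbage collection run every Wednesday.
New-DedupSchedule -Name "MidweekGC" -Type GarbageCollection `
    -Days Wednesday -Start "23:00" -DurationHours 5 -Priority Normal

# List the garbage collection schedules to confirm the new entry.
Get-DedupSchedule -Type GarbageCollection
```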


 

 
