Understanding Data Deduplication

Applies to: Windows Server (Semi-Annual Channel), Windows Server 2016

This document describes how Data Deduplication works.

How does Data Deduplication work?

Data Deduplication in Windows Server was created with the following two principles:

  1. Optimization should not get in the way of writes to the disk
    Data Deduplication optimizes data by using a post-processing model. All data is written unoptimized to the disk and then optimized later by Data Deduplication.

  2. Optimization should not change access semantics
    Users and applications that access data on an optimized volume are completely unaware that the files they are accessing have been deduplicated.

Once enabled for a volume, Data Deduplication runs in the background to:

  • Identify repeated patterns across files on that volume.
  • Seamlessly move those portions, or chunks, with special pointers called reparse points that point to a unique copy of that chunk.

This occurs in the following five steps:

  1. Scan the file system for files meeting the optimization policy.
  2. Break files into variable-size chunks.
  3. Identify unique chunks.
  4. Place chunks in the chunk store and optionally compress.
  5. Replace the original file stream of now optimized files with a reparse point to the chunk store.

When optimized files are read, the file system sends the files with a reparse point to the Data Deduplication file system filter (Dedup.sys). The filter redirects the read operation to the appropriate chunks that constitute the stream for that file in the chunk store. Modifications to ranges of a deduplicated file get written unoptimized to the disk and are optimized by the Optimization job the next time it runs.
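To see the outcome of these steps on a live system, the deduplication PowerShell cmdlets report per-volume results. A minimal sketch, assuming "E:" is a placeholder for a volume with Data Deduplication enabled:

```powershell
# Report space savings and optimized/in-policy file counts for the volume.
# "E:" is a placeholder drive letter; substitute your own volume.
Get-DedupStatus -Volume "E:" | Format-List

# Inspect chunk store statistics (chunk and container details) for the volume.
Get-DedupMetadata -Volume "E:"
```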

Usage Types

The following Usage Types provide reasonable Data Deduplication configuration for common workloads:

Usage Type: Default
Ideal workloads: General purpose file server:
  • Team shares
  • Work Folders
  • Folder redirection
  • Software development shares
What's different:
  • Background optimization
  • Default optimization policy:
    • Minimum file age = 3 days
    • Optimize in-use files = No
    • Optimize partial files = No

Usage Type: Hyper-V
Ideal workloads: Virtualized Desktop Infrastructure (VDI) servers
What's different:
  • Background optimization
  • Default optimization policy:
    • Minimum file age = 3 days
    • Optimize in-use files = Yes
    • Optimize partial files = Yes
  • "Under-the-hood" tweaks for Hyper-V interop

Usage Type: Backup
Ideal workloads: Virtualized backup applications, such as Microsoft Data Protection Manager (DPM)
What's different:
  • Priority optimization
  • Default optimization policy:
    • Minimum file age = 0 days
    • Optimize in-use files = Yes
    • Optimize partial files = No
  • "Under-the-hood" tweaks for interop with DPM/DPM-like solutions
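You select one of these Usage Types when you turn the feature on with the Enable-DedupVolume cmdlet. A minimal sketch, assuming "E:" and "F:" are placeholder volumes:

```powershell
# Enable Data Deduplication with the Usage Type that matches the workload.
Enable-DedupVolume -Volume "E:" -UsageType Default   # general purpose file server
Enable-DedupVolume -Volume "F:" -UsageType HyperV    # VDI storage

# Verify the policy settings that the Usage Type applied to the volume.
Get-DedupVolume -Volume "E:" |
    Format-List MinimumFileAgeDays, OptimizeInUseFiles, OptimizePartialFiles
```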

Jobs

Data Deduplication uses a post-processing strategy to optimize and maintain a volume's space efficiency.

Job name: Optimization
Description: The Optimization job deduplicates by chunking data on a volume per the volume policy settings, (optionally) compressing those chunks, and storing chunks uniquely in the chunk store. The optimization process that Data Deduplication uses is described in detail in "How does Data Deduplication work?".
Default schedule: Once every hour

Job name: Garbage Collection
Description: The Garbage Collection job reclaims disk space by removing unnecessary chunks that are no longer being referenced by files that have been recently modified or deleted.
Default schedule: Every Saturday at 2:35 AM

Job name: Integrity Scrubbing
Description: The Integrity Scrubbing job identifies corruption in the chunk store due to disk failures or bad sectors. When possible, Data Deduplication can automatically use volume features (such as mirror or parity on a Storage Spaces volume) to reconstruct the corrupted data. Additionally, Data Deduplication keeps backup copies of popular chunks when they are referenced more than 100 times in an area called the hotspot.
Default schedule: Every Saturday at 3:35 AM

Job name: Unoptimization
Description: The Unoptimization job, which is a special job that should only be run manually, undoes the optimization done by deduplication and disables Data Deduplication for that volume.
Default schedule: On-demand only
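These schedules are registered when the feature is installed, and you can inspect both the schedules and any running jobs with the deduplication cmdlets, for example:

```powershell
# List the job schedules that Data Deduplication registers by default.
Get-DedupSchedule

# Show any deduplication jobs that are currently queued or running.
Get-DedupJob
```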

Data Deduplication terminology

  • Chunk: A chunk is a section of a file that has been selected by the Data Deduplication chunking algorithm as likely to occur in other, similar files.
  • Chunk store: The chunk store is an organized series of container files in the System Volume Information folder that Data Deduplication uses to uniquely store chunks.
  • Dedup: An abbreviation for Data Deduplication that's commonly used in PowerShell, Windows Server APIs and components, and the Windows Server community.
  • File metadata: Every file contains metadata that describes interesting properties about the file that are not related to the main content of the file, for instance, Date Created, Last Read Date, Author, etc.
  • File stream: The file stream is the main content of the file. This is the part of the file that Data Deduplication optimizes.
  • File system: The file system is the software and on-disk data structure that the operating system uses to store files on storage media. Data Deduplication is supported on NTFS formatted volumes.
  • File system filter: A file system filter is a plugin that modifies the default behavior of the file system. To preserve access semantics, Data Deduplication uses a file system filter (Dedup.sys) to redirect reads to optimized content completely transparently to the user or application that makes the read request.
  • Optimization: A file is considered optimized (or deduplicated) by Data Deduplication if it has been chunked, and its unique chunks have been stored in the chunk store.
  • Optimization policy: The optimization policy specifies the files that should be considered for Data Deduplication. For example, files may be considered out-of-policy if they are brand new, open, in a certain path on the volume, or of a certain file type.
  • Reparse point: A reparse point is a special tag that notifies the file system to pass off I/O to a specified file system filter. When a file's file stream has been optimized, Data Deduplication replaces the file stream with a reparse point, which enables Data Deduplication to preserve the access semantics for that file.
  • Volume: A volume is a Windows construct for a logical storage drive that may span multiple physical storage devices across one or more servers. Deduplication is enabled on a volume-by-volume basis.
  • Workload: A workload is an application that runs on Windows Server. Example workloads include general purpose file server, Hyper-V, and SQL Server.
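One way to observe the reparse point described above is to query an optimized file directly. A sketch, assuming a hypothetical file path on a deduplication-enabled volume; deduplicated files should report the dedup reparse tag (IO_REPARSE_TAG_DEDUP, 0x80000013):

```powershell
# An optimized file carries the ReparsePoint attribute; the path is a placeholder.
Get-Item "E:\shares\report.docx" | Select-Object Name, Attributes

# fsutil prints the reparse tag itself for the same file (requires elevation).
fsutil reparsepoint query "E:\shares\report.docx"
```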

Warning

Unless instructed by authorized Microsoft Support Personnel, do not attempt to manually modify the chunk store. Doing so may result in data corruption or loss.

Frequently asked questions

How does Data Deduplication differ from other optimization products?
There are several important differences between Data Deduplication and other common storage optimization products:

  • How does Data Deduplication differ from Single Instance Store?
    Single Instance Store, or SIS, is a technology that preceded Data Deduplication and was first introduced in Windows Storage Server 2008 R2. To optimize a volume, Single Instance Store identified files that were completely identical and replaced them with logical links to a single copy of a file that's stored in the SIS common store. Unlike Single Instance Store, Data Deduplication can get space savings from files that are not identical but share many common patterns and from files that themselves contain many repeated patterns. Single Instance Store was deprecated in Windows Server 2012 R2 and removed in Windows Server 2016 in favor of Data Deduplication.

  • How does Data Deduplication differ from NTFS compression?
    NTFS compression is a feature of NTFS that you can optionally enable at the volume level. With NTFS compression, each file is optimized individually via compression at write-time. Unlike NTFS compression, Data Deduplication can get space savings across all the files on a volume. This is better than NTFS compression because files may have both internal duplication (which is addressed by NTFS compression) and similarities with other files on the volume (which are not addressed by NTFS compression). Additionally, Data Deduplication has a post-processing model, which means that new or modified files will be written to disk unoptimized and will be optimized later by Data Deduplication.

  • How does Data Deduplication differ from archive file formats like zip, rar, 7z, cab, etc.?
    Archive file formats, like zip, rar, 7z, cab, etc., perform compression over a specified set of files. Like Data Deduplication, duplicated patterns within files and duplicated patterns across files are optimized. However, you have to choose the files that you want to include in the archive. Access semantics are different, too. To access a specific file within the archive, you have to open the archive, select a specific file, and decompress that file for use. Data Deduplication operates transparently to users and administrators and requires no manual kick-off. Additionally, Data Deduplication preserves access semantics: optimized files appear unchanged after optimization.

Can I change the Data Deduplication settings for my selected Usage Type?
Yes. Although Data Deduplication provides reasonable defaults for Recommended workloads, you might still want to tweak Data Deduplication settings to get the most out of your storage. Additionally, other workloads will require some tweaking to ensure that Data Deduplication does not interfere with the workload.
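For example, per-volume settings can be adjusted with Set-DedupVolume. A sketch with illustrative values, not recommendations; "E:" and the excluded folder are placeholders:

```powershell
# Raise the minimum file age, exclude a scratch folder from optimization,
# and disable compression of stored chunks; all values here are illustrative.
Set-DedupVolume -Volume "E:" `
    -MinimumFileAgeDays 7 `
    -ExcludeFolder "E:\scratch" `
    -NoCompress $true
```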

Can I manually run a Data Deduplication job?
Yes, all Data Deduplication jobs may be run manually. This may be desirable if scheduled jobs did not run due to insufficient system resources or because of an error. Additionally, the Unoptimization job can only be run manually.
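Manual runs go through Start-DedupJob. A minimal sketch, assuming "E:" is the target volume:

```powershell
# Start an Optimization job now and block until it completes;
# -Memory caps the job at roughly 50% of available memory.
Start-DedupJob -Volume "E:" -Type Optimization -Memory 50 -Wait

# The Unoptimization job has no schedule and can only be started this way.
Start-DedupJob -Volume "E:" -Type Unoptimization
```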

Can I monitor the historical outcomes of Data Deduplication jobs?
Yes, all Data Deduplication jobs make entries in the Windows Event Log.
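One way to pull those entries is with Get-WinEvent. A sketch, assuming the Deduplication operational channel on your system matches the name below:

```powershell
# Show the 20 most recent Data Deduplication events, newest first.
Get-WinEvent -LogName "Microsoft-Windows-Deduplication/Operational" -MaxEvents 20 |
    Select-Object TimeCreated, Id, LevelDisplayName, Message
```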

Can I change the default schedules for the Data Deduplication jobs on my system?
Yes, all schedules are configurable. Modifying the default Data Deduplication schedules is particularly desirable to ensure that the Data Deduplication jobs have time to finish and do not compete for resources with the workload.
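Existing schedules are edited with Set-DedupSchedule, and extra windows can be created with New-DedupSchedule. A sketch; the schedule name "WeeklyGarbageCollection" is assumed, so confirm the real names on your system with Get-DedupSchedule first:

```powershell
# Move the weekly Garbage Collection window to a quieter time; the schedule
# name below is an assumption -- check it with Get-DedupSchedule.
Set-DedupSchedule -Name "WeeklyGarbageCollection" -Start "23:00" -DurationHours 5

# Add an extra optimization window on weeknights.
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization `
    -Days Monday,Tuesday,Wednesday,Thursday,Friday -Start "22:00" -DurationHours 6
```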