Snapshot Provider Considerations while backing up a CSV Cluster

DPM traditionally supports software snapshots (VolSnap, the provider that is available in the Windows OS by default). In DPM 2010, for protection of CSV clusters, we strongly recommend using hardware snapshots rather than software snapshots. Note that this recommendation applies to production servers only. On the DPM side, only software snapshots are supported.


Now why do we recommend hardware snapshots for CSV Clusters?

To understand this we first need to understand the basic backup mechanism of CSV clusters.

A common Cluster Shared Volume (CSV) deployment would have the VHDs of the virtual machines on a CSV, with the virtual machines distributed across the nodes of the cluster. Each virtual machine would have direct I/O access to its respective VHD on the CSV, irrespective of its location in the cluster. Now when a scheduled backup is triggered by the backup agent, the CSV on which the VM's VHDs lie needs to be brought local to the node where the VM presently resides. The backup agent triggers this CSV volume movement, takes the snapshot and starts the backup.


This is where the difference between software and hardware providers kicks in. When the backup agent takes a backup using a software snapshot, the CSV volume remains pinned to a single node not only for the entire duration of the snapshot but also for the duration of the actual backup. This behavior has a two-fold impact on performance. Firstly, because VolSnap invokes copy-on-write behavior, direct I/O (which greatly improves performance on CSV) is impacted for the entire duration of the backup. Given that on a fully loaded cluster backups would be running continuously on each node, you will hardly get any direct I/O. Secondly, because the CSV volume is pinned to a node for the entire duration of the backup, VMs on the same CSV but on different nodes can no longer be backed up in parallel, and all backups become serial. This again impacts backup performance.


Hardware snapshots, on the other hand, are ideal for the CSV environment. They allow the CSV to resume direct I/O mode as soon as the hardware snapshot has been taken. This duration is typically very short, about 2 minutes. As a result, more VMs can be backed up in parallel with hardware snapshots than with software snapshots.
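To make the difference concrete, here is a rough back-of-the-envelope sketch in Python. It is only a toy model, not anything DPM itself runs: the function name, the 30-minute backup duration, and the assumption that every VM backup on the CSV takes its own snapshot, one after another, are all hypothetical. Only the roughly 2-minute snapshot figure comes from the discussion above.

```python
def minutes_without_direct_io(provider, vm_count, snapshot_minutes, backup_minutes):
    """Estimate how long a CSV is out of direct I/O mode while its VMs are backed up.

    Toy model (assumption): each VM backup takes its own snapshot, and the
    per-VM windows do not overlap.
    """
    if provider == "software":
        # Software snapshot: the CSV stays pinned for the snapshot AND the backup.
        per_vm = snapshot_minutes + backup_minutes
    elif provider == "hardware":
        # Hardware snapshot: direct I/O resumes as soon as the snapshot is taken.
        per_vm = snapshot_minutes
    else:
        raise ValueError(f"unknown provider: {provider!r}")
    return vm_count * per_vm


# Hypothetical cluster: 4 VMs on one CSV, 2-minute snapshots, 30-minute backups.
for provider in ("software", "hardware"):
    total = minutes_without_direct_io(provider, vm_count=4,
                                      snapshot_minutes=2, backup_minutes=30)
    print(f"{provider}: {total} minutes without direct I/O")
```

With these made-up numbers the CSV spends 128 minutes out of direct I/O mode with software snapshots and 8 minutes with hardware snapshots. The real figures depend on your storage array and on how much data each backup transfers.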


The question that I typically get at this point is this: "Let's assume the VMs on the same CSV are doing something disk intensive. During the snapshot (even if it is hardware based) there will be about 2 minutes during which the disk I/O to the CSV volume from all VMs running on nodes other than the one doing the backup goes through the network (redirected mode) to be written by the coordinator node (the node running the actual backup). I think these VMs can easily generate enough data to overload the coordinator node's NIC and the node itself. What should I do?"


The effect of redirected I/O during backup is one of the factors a customer needs to consider when deciding how many CSVs to have and how large each should be. The redirected I/O will flow over a network dedicated to CSV, and we recommend that this link be at least GigE. As pointed out, the use of hardware snapshots will greatly mitigate this issue. Using 5+ Gig links for CSV traffic would mitigate it further for both system provider (software) and hardware provider snapshots.
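A quick way to sanity-check the sizing is to compare the combined write rate of the VMs whose I/O would be redirected with the raw capacity of the CSV link. The Python sketch below is a rough estimate under stated assumptions: the function name and the per-VM write rates are hypothetical, and it ignores protocol overhead and any other traffic sharing the link.

```python
def redirected_link_utilization(vm_write_mb_per_s, link_gbps):
    """Return the redirected write load as a fraction of raw link capacity."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    # 1 Gbit/s is 1000 Mbit/s, which is 125 MB/s before protocol overhead.
    link_mb_per_s = link_gbps * 1000 / 8
    return sum(vm_write_mb_per_s) / link_mb_per_s


# Hypothetical load: five VMs on non-coordinator nodes, each writing 40 MB/s.
writes = [40, 40, 40, 40, 40]
for gbps in (1, 5):
    utilization = redirected_link_utilization(writes, gbps)
    print(f"{gbps} Gbit/s link: {utilization:.0%} of raw capacity")
```

With these made-up numbers, the redirected writes would need 160% of a GigE link but only 32% of a 5 Gbit/s link. Anything above 100% means the link cannot keep up for the duration of the redirected window, which is a hint to use fewer VMs per CSV, a faster link, or both.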