Health Service faults

08/25/2021

Applies to: Windows Server 2016

What are faults

The Health Service constantly monitors your Storage Spaces Direct cluster to detect problems and generate "faults". One new cmdlet displays any current faults, allowing you to easily verify the health of your deployment without looking at every entity or feature in turn. Faults are designed to be precise, easy to understand, and actionable.

Each fault contains five important fields:

Severity
Description of the problem
Recommended next step(s) to address the problem
Identifying information for the faulting entity
Its physical location (if applicable)

For example, here is a typical fault:

Severity: MINOR
Reason: Connectivity has been lost to the physical disk.
Recommendation: Check that the physical disk is working and properly connected.
Part: Manufacturer Contoso, Model XYZ9000, Serial 123456789
Location: Seattle DC, Rack B07, Node 4, Slot 11

Note

The physical location is derived from your fault domain configuration. For more information about fault domains, see Fault Domains in Windows Server 2016. If you do not provide this information, the location field will be less helpful - for example, it may only show the slot number.

Root cause analysis

The Health Service can assess the potential causality among faulting entities to identify and combine faults that are consequences of the same underlying problem. Recognizing these chains of effect makes reporting less chatty. For example, if a server is down, it is expected that any drives within the server will also be without connectivity. Therefore, only one fault is raised for the root cause - in this case, the server.
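To make the idea concrete, here is a minimal sketch of that kind of collapsing. It is not the Health Service's actual algorithm: the SimpleFault type, its ParentEntity field, and the single-level parent check are all assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified fault model (not a Health Service type).
public class SimpleFault
{
    public string Entity { get; }
    public string ParentEntity { get; } // empty string when there is no parent
    public string Reason { get; }

    public SimpleFault(string entity, string parentEntity, string reason)
    {
        Entity = entity;
        ParentEntity = parentEntity;
        Reason = reason;
    }
}

public static class RootCause
{
    // Keep only faults whose parent entity is not itself faulting,
    // so a downed server hides the faults of the drives inside it.
    public static List<SimpleFault> Collapse(IEnumerable<SimpleFault> faults)
    {
        List<SimpleFault> all = faults.ToList();
        HashSet<string> faulting = new HashSet<string>(all.Select(f => f.Entity));
        return all
            .Where(f => string.IsNullOrEmpty(f.ParentEntity) || !faulting.Contains(f.ParentEntity))
            .ToList();
    }
}
```

Given a fault for "Node 4" and a fault for a drive whose parent is "Node 4", Collapse returns only the server fault. A drive fault on a healthy server is kept.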

Usage in PowerShell

To see any current faults in PowerShell, run this cmdlet:

Get-StorageSubSystem Cluster* | Debug-StorageSubSystem

This returns any faults which affect the overall Storage Spaces Direct cluster. Most often, these faults relate to hardware or configuration. If there are no faults, this cmdlet will return nothing.

Note

In a non-production environment, and at your own risk, you can experiment with this feature by triggering faults yourself - for example, by removing one physical disk or shutting down one node. Once the fault has appeared, re-insert the physical disk or restart the node and the fault will disappear again.

You can also view faults that are affecting only specific volumes or file shares with the following cmdlets:

Get-Volume -FileSystemLabel <Label> | Debug-Volume

Get-FileShare -Name <Name> | Debug-FileShare

This returns any faults that affect only the specific volume or file share. Most often, these faults relate to capacity planning, data resiliency, or features like Storage Quality-of-Service or Storage Replica.

Usage in .NET and C#

Connect

In order to query the Health Service, you will need to establish a CimSession with the cluster. To do so, you will need some things that are only available in full .NET, meaning you cannot readily do this directly from a web or mobile app. These code samples will use C#, the most straightforward choice for this data access layer.

using System.Security;
using Microsoft.Management.Infrastructure;
using Microsoft.Management.Infrastructure.Options;

public CimSession Connect(string Domain = "...", string Computer = "...", string Username = "...", string Password = "...")
{
    // Copy the password into a SecureString, one character at a time
    SecureString PasswordSecureString = new SecureString();
    foreach (char c in Password)
    {
        PasswordSecureString.AppendChar(c);
    }

    // Attach the credentials to a WS-Man session and connect to the target computer
    CimCredential Credentials = new CimCredential(
        PasswordAuthenticationMechanism.Default, Domain, Username, PasswordSecureString);
    WSManSessionOptions SessionOptions = new WSManSessionOptions();
    SessionOptions.AddDestinationCredentials(Credentials);
    CimSession Session = CimSession.Create(Computer, SessionOptions);
    return Session;
}

The provided Username should be a local Administrator of the target Computer.

It is recommended that you construct the Password SecureString directly from user input in real-time, so their password is never stored in memory in cleartext. This helps mitigate a variety of security concerns. But in practice, constructing it as above is common for prototyping purposes.
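As a rough sketch of that recommendation, the helper below appends each keystroke directly to a SecureString, so no managed string copy of the password is created. PasswordPrompt and its injectable readKey parameter are hypothetical names introduced here, and the backspace handling is one reasonable choice rather than a requirement.

```csharp
using System;
using System.Security;

public static class PasswordPrompt
{
    // Reads keys until Enter and appends each character straight into a SecureString.
    // readKey is injectable for testing; by default it reads from the console without echo.
    public static SecureString Read(Func<ConsoleKeyInfo> readKey = null)
    {
        readKey = readKey ?? (() => Console.ReadKey(intercept: true));
        SecureString password = new SecureString();
        while (true)
        {
            ConsoleKeyInfo key = readKey();
            if (key.Key == ConsoleKey.Enter)
            {
                break;
            }
            if (key.Key == ConsoleKey.Backspace)
            {
                // Drop the last character, if any
                if (password.Length > 0)
                {
                    password.RemoveAt(password.Length - 1);
                }
                continue;
            }
            if (!char.IsControl(key.KeyChar))
            {
                password.AppendChar(key.KeyChar);
            }
        }
        password.MakeReadOnly();
        return password;
    }
}
```

The returned SecureString can be passed to the CimCredential constructor in place of the one built from a plain string.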

Discover objects

With the CimSession established, you can query Windows Management Instrumentation (WMI) on the cluster.

Before you can get Faults or Metrics, you will need to get instances of several relevant objects. First, get the MSFT_StorageSubSystem, which represents Storage Spaces Direct on the cluster. Using that, you can get every MSFT_StorageNode in the cluster and every MSFT_Volume (the data volumes). Finally, you will also need MSFT_StorageHealth, which represents the Health Service itself.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Management.Infrastructure;

CimInstance Cluster;
List<CimInstance> Nodes;
List<CimInstance> Volumes;
CimInstance HealthService;

public void DiscoverObjects(CimSession Session)
{
    // Get MSFT_StorageSubSystem for Storage Spaces Direct
    Cluster = Session.QueryInstances(@"root\microsoft\windows\storage", "WQL", "SELECT * FROM MSFT_StorageSubSystem")
        .First(Instance => (Instance.CimInstanceProperties["FriendlyName"].Value.ToString()).Contains("Cluster"));

    // Get MSFT_StorageNode for each cluster node
    Nodes = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToStorageNode", null, "StorageSubSystem", "StorageNode").ToList();

    // Get MSFT_Volumes for each data volume
    Volumes = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToVolume", null, "StorageSubSystem", "Volume").ToList();

    // Get MSFT_StorageHealth itself
    HealthService = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToStorageHealth", null, "StorageSubSystem", "StorageHealth").First();
}

These are the same objects you get in PowerShell using cmdlets like Get-StorageSubSystem, Get-StorageNode, and Get-Volume.

You can access all the same properties, documented at Storage Management API Classes.

using System.Diagnostics;

foreach (CimInstance Node in Nodes)
{
    // For illustration, write each node's Name to the console. You could also write State (up/down), or anything else!
    Debug.WriteLine("Discovered Node " + Node.CimInstanceProperties["Name"].Value.ToString());
}

Query faults

Invoke Diagnose to get any current faults scoped to the target CimInstance, which can be the cluster or any volume.

The complete list of faults available at each scope in Windows Server 2016 is documented below.

public void GetFaults(CimSession Session, CimInstance Target)
{
    // Set Parameters (None)
    CimMethodParametersCollection FaultsParams = new CimMethodParametersCollection();
    // Invoke API
    CimMethodResult Result = Session.InvokeMethod(Target, "Diagnose", FaultsParams);
    IEnumerable<CimInstance> DiagnoseResults = (IEnumerable<CimInstance>)Result.OutParameters["DiagnoseResults"].Value;
    // Unpack
    if (DiagnoseResults != null)
    {
        foreach (CimInstance DiagnoseResult in DiagnoseResults)
        {
            // TODO: Whatever you want!
        }
    }
}

Optional: MyFault class

It may make sense for you to construct and persist your own representation of faults. For example, this MyFault class stores several key properties of faults, including the FaultId, which can be used later to associate update or remove notifications, or to deduplicate in the event that the same fault is detected multiple times, for whatever reason.

public class MyFault {
    public String FaultId { get; set; }
    public String Reason { get; set; }
    public String Severity { get; set; }
    public String Description { get; set; }
    public String Location { get; set; }

    // Constructor
    public MyFault(CimInstance DiagnoseResult)
    {
        CimKeyedCollection<CimProperty> Properties = DiagnoseResult.CimInstanceProperties;
        FaultId     = Properties["FaultId"                  ].Value.ToString();
        Reason      = Properties["Reason"                   ].Value.ToString();
        Severity    = Properties["PerceivedSeverity"        ].Value.ToString();
        Description = Properties["FaultingObjectDescription"].Value.ToString();
        Location    = Properties["FaultingObjectLocation"   ].Value.ToString();
    }
}

List<MyFault> Faults = new List<MyFault>();

// Wrap each result returned by Diagnose in a MyFault
foreach (CimInstance DiagnoseResult in DiagnoseResults)
{
    Faults.Add(new MyFault(DiagnoseResult));
}

The complete list of properties in each fault (DiagnoseResult) is documented below.

Fault events

When Faults are created, removed, or updated, the Health Service generates WMI events. These are essential to keeping your application state in sync without frequent polling, and can help with things like determining when to send email alerts. To subscribe to these events, this sample code uses the Observer design pattern.

First, subscribe to MSFT_StorageFaultEvent events.

public void ListenForFaultEvents(CimSession Session)
{
    IObservable<CimSubscriptionResult> Events = Session.SubscribeAsync(
        @"root\microsoft\windows\storage", "WQL", "SELECT * FROM MSFT_StorageFaultEvent");
    // Subscribe the Observer
    FaultsObserver<CimSubscriptionResult> Observer = new FaultsObserver<CimSubscriptionResult>();
    // Keep the returned IDisposable; dispose it to stop listening
    IDisposable Disposable = Events.Subscribe(Observer);
}

Next, implement an Observer whose OnNext() method will be invoked whenever a new event is generated.

Each event contains ChangeType indicating whether a fault is being created, removed, or updated, and the relevant FaultId.

In addition, they contain all the properties of the fault itself.

class FaultsObserver<T> : IObserver<T>
{
    public void OnNext(T Event)
    {
        // Cast
        CimSubscriptionResult SubscriptionResult = Event as CimSubscriptionResult;

        if (SubscriptionResult != null)
        {
            // Unpack
            CimKeyedCollection<CimProperty> Properties = SubscriptionResult.Instance.CimInstanceProperties;
            String ChangeType = Properties["ChangeType"].Value.ToString();
            String FaultId = Properties["FaultId"].Value.ToString();

            // Create
            if (ChangeType == "0")
            {
                MyFault MyNewFault = new MyFault(SubscriptionResult.Instance);
                // TODO: Whatever you want!
            }
            // Remove
            if (ChangeType == "1")
            {
                // TODO: Use FaultId to find and delete whatever representation you have...
            }
            // Update
            if (ChangeType == "2")
            {
                // TODO: Use FaultId to find and modify whatever representation you have...
            }
        }
    }
    public void OnError(Exception e)
    {
        // Handle Exceptions
    }
    public void OnCompleted()
    {
        // Nothing
    }
}
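One way to fill in the create, remove, and update branches is to keep a local store keyed by FaultId. The sketch below is only an illustration: FaultStore is a hypothetical class, and it stores just the Reason string to stay self-contained, where a real application would more likely store its own fault representation.

```csharp
using System.Collections.Generic;

// Hypothetical local store of fault reasons, keyed by FaultId.
public class FaultStore
{
    private readonly Dictionary<string, string> reasonsById = new Dictionary<string, string>();

    public int Count { get { return reasonsById.Count; } }

    public bool TryGetReason(string faultId, out string reason)
    {
        return reasonsById.TryGetValue(faultId, out reason);
    }

    // ChangeType values: "0" = Create, "1" = Remove, "2" = Update.
    public void Apply(string changeType, string faultId, string reason)
    {
        switch (changeType)
        {
            case "0":
            case "2":
                // Upsert, so a repeated Create for the same FaultId does not add a duplicate
                reasonsById[faultId] = reason;
                break;
            case "1":
                reasonsById.Remove(faultId);
                break;
        }
    }
}
```

Inside OnNext(), a call such as Apply(ChangeType, FaultId, Properties["Reason"].Value.ToString()) would keep the store in step with the events. Unrecognized ChangeType values are ignored in this sketch.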

Understand fault lifecycle

Faults are not intended to be marked "seen" or resolved by the user. They are created when the Health Service observes a problem, and they are removed automatically and only when the Health Service can no longer observe the problem. In general, this reflects that the problem has been fixed.

However, in some cases, faults may be rediscovered by the Health Service (for example, after failover or due to intermittent connectivity). For this reason, it may make sense to persist your own representation of faults, so you can easily deduplicate. This is especially important if you send email alerts or equivalent.
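A minimal sketch of such deduplication for alerting follows. It assumes a rediscovered fault carries the same FaultId as before. AlertDeduplicator and its method names are hypothetical, and the set lives only in memory here, whereas a real application would persist it.

```csharp
using System.Collections.Generic;

// Hypothetical helper that remembers which faults have already triggered an alert.
public class AlertDeduplicator
{
    private readonly HashSet<string> alertedFaultIds = new HashSet<string>();

    // Returns true only the first time a given FaultId is seen.
    public bool ShouldAlert(string faultId)
    {
        return alertedFaultIds.Add(faultId);
    }

    // Call when the Health Service removes the fault,
    // so that a later recurrence of the problem alerts again.
    public void Forget(string faultId)
    {
        alertedFaultIds.Remove(faultId);
    }
}
```

With this in place, an email would be sent only when ShouldAlert returns true, and Forget would be called on a remove event.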

Properties of faults

This table presents several key properties of the fault object. For the full schema, inspect the MSFT_StorageDiagnoseResult class in storagewmi.mof.

Property	Example
FaultId	{12345-12345-12345-12345-12345}
FaultType	Microsoft.Health.FaultType.Volume.Capacity
Reason	"The volume is running out of available space."
PerceivedSeverity	5
FaultingObjectDescription	Contoso XYZ9000 S.N. 123456789
FaultingObjectLocation	Rack A06, RU 25, Slot 11
RecommendedActions	{"Expand the volume.", "Migrate workloads to other volumes."}

FaultId: Unique within the scope of one cluster.

PerceivedSeverity: { 4, 5, 6 } = { "Informational", "Warning", "Error" }, or equivalent colors such as blue, yellow, and red.

FaultingObjectDescription: Part information for hardware, typically blank for software objects.

FaultingObjectLocation: Location information for hardware, typically blank for software objects.

RecommendedActions: List of recommended actions, which are independent and in no particular order. Today, this list is often of length 1.
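For display purposes, the numeric PerceivedSeverity can be translated into the labels listed above. The helper below is a small sketch. SeverityLabels is a hypothetical name, and the fallback text for other values is an assumption, since only 4, 5, and 6 are described here.

```csharp
public static class SeverityLabels
{
    // Maps the PerceivedSeverity values described above to display labels.
    public static string Label(int perceivedSeverity)
    {
        switch (perceivedSeverity)
        {
            case 4: return "Informational";
            case 5: return "Warning";
            case 6: return "Error";
            // Values other than 4, 5, and 6 are not covered here; surface them rather than guess
            default: return "Unknown (" + perceivedSeverity + ")";
        }
    }
}
```

For example, SeverityLabels.Label(5) returns "Warning".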

Properties of fault events

This table presents several key properties of the fault event. For the full schema, inspect the MSFT_StorageFaultEvent class in storagewmi.mof.

Note the ChangeType, which indicates whether a fault is being created, removed, or updated, and the FaultId. An event also contains all the properties of the affected fault.

Property	Example
ChangeType	0
FaultId	{12345-12345-12345-12345-12345}
FaultType	Microsoft.Health.FaultType.Volume.Capacity
Reason	"The volume is running out of available space."
PerceivedSeverity	5
FaultingObjectDescription	Contoso XYZ9000 S.N. 123456789
FaultingObjectLocation	Rack A06, RU 25, Slot 11
RecommendedActions	{"Expand the volume.", "Migrate workloads to other volumes."}

ChangeType: { 0, 1, 2 } = { "Create", "Remove", "Update" }.

Coverage

In Windows Server 2016, the Health Service provides the following fault coverage:

PhysicalDisk (8)

FaultType: Microsoft.Health.FaultType.PhysicalDisk.FailedMedia

Severity: Warning
Reason: "The physical disk has failed."
RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.LostCommunication

Severity: Warning
Reason: "Connectivity has been lost to the physical disk."
RecommendedAction: "Check that the physical disk is working and properly connected."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.Unresponsive

Severity: Warning
Reason: "The physical disk is exhibiting recurring unresponsiveness."
RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.PredictiveFailure

Severity: Warning
Reason: "A failure of the physical disk is predicted to occur soon."
RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnsupportedHardware

Severity: Warning
Reason: "The physical disk is quarantined because it is not supported by your solution vendor."
RecommendedAction: "Replace the physical disk with supported hardware."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnsupportedFirmware

Severity: Warning
Reason: "The physical disk is in quarantine because its firmware version is not supported by your solution vendor."
RecommendedAction: "Update the firmware on the physical disk to the target version."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnrecognizedMetadata

Severity: Warning
Reason: "The physical disk has unrecognized meta data."
RecommendedAction: "This disk may contain data from an unknown storage pool. First make sure there is no useful data on this disk, then reset the disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.FailedFirmwareUpdate

Severity: Warning
Reason: "Failed attempt to update firmware on the physical disk."
RecommendedAction: "Try using a different firmware binary."

Virtual Disk (2)

FaultType: Microsoft.Health.FaultType.VirtualDisks.NeedsRepair

Severity: Informational
Reason: "Some data on this volume is not fully resilient. It remains accessible."
RecommendedAction: "Restoring resiliency of the data."

FaultType: Microsoft.Health.FaultType.VirtualDisks.Detached

Severity: Critical
Reason: "The volume is inaccessible. Some data may be lost."
RecommendedAction: "Check the physical and/or network connectivity of all storage devices. You may need to restore from backup."

Pool Capacity (1)

FaultType: Microsoft.Health.FaultType.StoragePool.InsufficientReserveCapacityFault

Severity: Warning
Reason: "The storage pool does not have the minimum recommended reserve capacity. This may limit your ability to restore data resiliency in the event of drive failure(s)."
RecommendedAction: "Add additional capacity to the storage pool, or free up capacity. The minimum recommended reserve varies by deployment, but is approximately 2 drives' worth of capacity."

Volume Capacity (2)¹

FaultType: Microsoft.Health.FaultType.Volume.Capacity

Severity: Warning
Reason: "The volume is running out of available space."
RecommendedAction: "Expand the volume or migrate workloads to other volumes."

FaultType: Microsoft.Health.FaultType.Volume.Capacity

Severity: Critical
Reason: "The volume is running out of available space."
RecommendedAction: "Expand the volume or migrate workloads to other volumes."

Server (3)

FaultType: Microsoft.Health.FaultType.Server.Down

Severity: Critical
Reason: "The server cannot be reached."
RecommendedAction: "Start or replace server."

FaultType: Microsoft.Health.FaultType.Server.Isolated

Severity: Critical
Reason: "The server is isolated from the cluster due to connectivity issues."
RecommendedAction: "If isolation persists, check the network(s) or migrate workloads to other nodes."

FaultType: Microsoft.Health.FaultType.Server.Quarantined

Severity: Critical
Reason: "The server is quarantined by the cluster due to recurring failures."
RecommendedAction: "Replace the server or fix the network."

Cluster (1)

FaultType: Microsoft.Health.FaultType.ClusterQuorumWitness.Error

Severity: Critical
Reason: "The cluster is one server failure away from going down."
RecommendedAction: "Check the witness resource, and restart as needed. Start or replace failed servers."

Network Adapter/Interface (4)

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Disconnected

Severity: Warning
Reason: "The network interface has become disconnected."
RecommendedAction: "Reconnect the network cable."

FaultType: Microsoft.Health.FaultType.NetworkInterface.Missing

Severity: Warning
Reason: "The server {server} has missing network adapter(s) connected to cluster network {cluster network}."
RecommendedAction: "Connect the server to the missing cluster network."

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Hardware

Severity: Warning
Reason: "The network interface has had a hardware failure."
RecommendedAction: "Replace the network interface adapter."

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Disabled

Severity: Warning
Reason: "The network interface {network interface} is not enabled and is not being used."
RecommendedAction: "Enable the network interface."

Enclosure (6)

FaultType: Microsoft.Health.FaultType.StorageEnclosure.LostCommunication

Severity: Warning
Reason: "Communication has been lost to the storage enclosure."
RecommendedAction: "Start or replace the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.FanError

Severity: Warning
Reason: "The fan at position {position} of the storage enclosure has failed."
RecommendedAction: "Replace the fan in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.CurrentSensorError

Severity: Warning
Reason: "The current sensor at position {position} of the storage enclosure has failed."
RecommendedAction: "Replace a current sensor in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.VoltageSensorError

Severity: Warning
Reason: "The voltage sensor at position {position} of the storage enclosure has failed."
RecommendedAction: "Replace a voltage sensor in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.IoControllerError

Severity: Warning
Reason: "The IO controller at position {position} of the storage enclosure has failed."
RecommendedAction: "Replace an IO controller in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.TemperatureSensorError

Severity: Warning
Reason: "The temperature sensor at position {position} of the storage enclosure has failed."
RecommendedAction: "Replace a temperature sensor in the storage enclosure."

Firmware Rollout (3)

FaultType: Microsoft.Health.FaultType.FaultDomain.FailedMaintenanceMode

Severity: Warning
Reason: "Currently unable to make progress while performing firmware roll out."
RecommendedAction: "Verify all storage spaces are healthy, and that no fault domain is currently in maintenance mode."

FaultType: Microsoft.Health.FaultType.FaultDomain.FirmwareVerifyVersionFaile

Severity: Warning
Reason: "Firmware roll out was canceled due to unreadable or unexpected firmware version information after applying a firmware update."
RecommendedAction: "Restart firmware roll out once the firmware issue has been resolved."

FaultType: Microsoft.Health.FaultType.FaultDomain.TooManyFailedUpdates

Severity: Warning
Reason: "Firmware roll out was canceled due to too many physical disks failing a firmware update attempt."
RecommendedAction: "Restart firmware roll out once the firmware issue has been resolved."

Storage QoS (3)²

FaultType: Microsoft.Health.FaultType.StorQos.InsufficientThroughput

Severity: Warning
Reason: "Storage throughput is insufficient to satisfy reserves."
RecommendedAction: "Reconfigure Storage QoS policies."

FaultType: Microsoft.Health.FaultType.StorQos.LostCommunication

Severity: Warning
Reason: "The Storage QoS policy manager has lost communication with the volume."
RecommendedAction: "Please reboot nodes {nodes}"

FaultType: Microsoft.Health.FaultType.StorQos.MisconfiguredFlow

Severity: Warning
Reason: "One or more storage consumers (usually Virtual Machines) are using a non-existent policy with id {id}."
RecommendedAction: "Recreate any missing Storage QoS policies."

¹ Indicates the volume has reached 80% full (minor severity) or 90% full (major severity).

² Indicates some .vhd(s) on the volume have not met their Minimum IOPS for over 10% (minor), 30% (major), or 50% (critical) of a rolling 24-hour window.
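If you want to anticipate the volume capacity fault in your own tooling, the thresholds in the first footnote can be expressed as a simple check. This is only a sketch of the arithmetic: VolumeCapacity and its return strings are hypothetical, and the Health Service's own evaluation may differ in detail.

```csharp
using System;

public static class VolumeCapacity
{
    // Classifies how full a volume is, using the 80% and 90% thresholds from the footnote.
    public static string Classify(long usedBytes, long totalBytes)
    {
        if (totalBytes <= 0)
        {
            throw new ArgumentOutOfRangeException("totalBytes", "Total size must be positive.");
        }

        double fractionUsed = (double)usedBytes / totalBytes;
        if (fractionUsed >= 0.90)
        {
            return "Major";
        }
        if (fractionUsed >= 0.80)
        {
            return "Minor";
        }
        return "None";
    }
}
```

For example, a volume with 85 of 100 units used is classified as "Minor", and one with 95 of 100 as "Major".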

Note

The health of storage enclosure components such as fans, power supplies, and sensors is derived from SCSI Enclosure Services (SES). If your vendor does not provide this information, the Health Service cannot display it.

Additional References

Health Service in Windows Server 2016
