Health Service faults

Applies to: Windows Server 2019, Windows Server 2016

What are faults

The Health Service constantly monitors your Storage Spaces Direct cluster to detect problems and generate "faults". One new cmdlet displays any current faults, allowing you to easily verify the health of your deployment without looking at every entity or feature in turn. Faults are designed to be precise, easy to understand, and actionable.

Each fault contains five important fields:

  • Severity
  • Description of the problem
  • Recommended next step(s) to address the problem
  • Identifying information for the faulting entity
  • Its physical location (if applicable)

For example, here is a typical fault:

Severity: MINOR                                         
Reason: Connectivity has been lost to the physical disk.                           
Recommendation: Check that the physical disk is working and properly connected.    
Part: Manufacturer Contoso, Model XYZ9000, Serial 123456789                        
Location: Seattle DC, Rack B07, Node 4, Slot 11

Note

The physical location is derived from your fault domain configuration. For more information about fault domains, see Fault Domains in Windows Server 2016. If you do not provide this information, the location field will be less helpful; for example, it may only show the slot number.

Root cause analysis

The Health Service can assess the potential causality among faulting entities to identify and combine faults that are consequences of the same underlying problem. Recognizing chains of effect makes for less chatty reporting. For example, if a server is down, it is expected that any drives within the server will also be without connectivity. Therefore, only one fault will be raised for the root cause: in this case, the server.

Usage in PowerShell

To see any current faults in PowerShell, run this cmdlet:

Get-StorageSubSystem Cluster* | Debug-StorageSubSystem  

This returns any faults that affect the overall Storage Spaces Direct cluster. Most often, these faults relate to hardware or configuration. If there are no faults, this cmdlet returns nothing.

Note

In a non-production environment, and at your own risk, you can experiment with this feature by triggering faults yourself, for example by removing one physical disk or shutting down one node. Once the fault has appeared, re-insert the physical disk or restart the node and the fault will disappear again.

You can also view faults that affect only specific volumes or file shares with the following cmdlets:

Get-Volume -FileSystemLabel <Label> | Debug-Volume  

Get-FileShare -Name <Name> | Debug-FileShare  

This returns any faults that affect only the specific volume or file share. Most often, these faults relate to capacity planning, data resiliency, or features like Storage Quality-of-Service or Storage Replica.

Usage in .NET and C#

Connect

To query the Health Service, you need to establish a CimSession with the cluster. Doing so requires some things that are only available in full .NET, meaning you cannot readily do this directly from a web or mobile app. These code samples use C#, the most straightforward choice for this data access layer.

...
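using System.Security;
using Microsoft.Management.Infrastructure;
// CimCredential, PasswordAuthenticationMechanism, and WSManSessionOptions live in the Options namespace
using Microsoft.Management.Infrastructure.Options;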

public CimSession Connect(string Domain = "...", string Computer = "...", string Username = "...", string Password = "...")
{
    SecureString PasswordSecureString = new SecureString();
    foreach (char c in Password)
    {
        PasswordSecureString.AppendChar(c);
    }

    CimCredential Credentials = new CimCredential(
        PasswordAuthenticationMechanism.Default, Domain, Username, PasswordSecureString);
    WSManSessionOptions SessionOptions = new WSManSessionOptions();
    SessionOptions.AddDestinationCredentials(Credentials);
    Session = CimSession.Create(Computer, SessionOptions);
    return Session;
}

The provided Username should be a local Administrator of the target Computer.

It is recommended that you construct the Password SecureString directly from user input in real time, so the password is never stored in memory in cleartext. This helps mitigate a variety of security concerns. In practice, however, constructing it as above is common for prototyping purposes.
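As a rough illustration of the recommendation above, the following sketch builds a SecureString one keystroke at a time instead of converting an existing string. The PasswordPrompt class and its Read method are hypothetical names introduced here, not part of any Microsoft API; the key source is passed in as a delegate so that it can be Console.ReadKey in a console app or a scripted sequence elsewhere.

```csharp
using System;
using System.Security;

// Hypothetical helper (not part of the Health Service or CIM APIs).
public static class PasswordPrompt
{
    // Reads keys until Enter and appends each character straight into a SecureString,
    // so the password is never materialized as a managed string.
    public static SecureString Read(Func<ConsoleKeyInfo> readKey)
    {
        SecureString password = new SecureString();
        while (true)
        {
            ConsoleKeyInfo key = readKey();
            if (key.Key == ConsoleKey.Enter)
            {
                break;
            }
            if (key.Key == ConsoleKey.Backspace)
            {
                // Remove the last character, if there is one
                if (password.Length > 0)
                {
                    password.RemoveAt(password.Length - 1);
                }
                continue;
            }
            if (!char.IsControl(key.KeyChar))
            {
                password.AppendChar(key.KeyChar);
            }
        }
        // Prevent further modification before handing the value to a credential object
        password.MakeReadOnly();
        return password;
    }
}
```

In a console application, this could be called as PasswordPrompt.Read(() => Console.ReadKey(true)), and the result passed to the CimCredential constructor in place of PasswordSecureString.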

Discover objects

With the CimSession established, you can query Windows Management Instrumentation (WMI) on the cluster.

Before you can get Faults or Metrics, you need to get instances of several relevant objects. First, the MSFT_StorageSubSystem, which represents Storage Spaces Direct on the cluster. Using that, you can get every MSFT_StorageNode in the cluster, and every MSFT_Volume (the data volumes). Finally, you also need the MSFT_StorageHealth, which is the Health Service itself.

CimInstance Cluster;
List<CimInstance> Nodes;
List<CimInstance> Volumes;
CimInstance HealthService;

public void DiscoverObjects(CimSession Session)
{
    // Get MSFT_StorageSubSystem for Storage Spaces Direct
    Cluster = Session.QueryInstances(@"root\microsoft\windows\storage", "WQL", "SELECT * FROM MSFT_StorageSubSystem")
        .First(Instance => (Instance.CimInstanceProperties["FriendlyName"].Value.ToString()).Contains("Cluster"));

    // Get MSFT_StorageNode for each cluster node
    Nodes = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToStorageNode", null, "StorageSubSystem", "StorageNode").ToList();

    // Get MSFT_Volumes for each data volume
    Volumes = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToVolume", null, "StorageSubSystem", "Volume").ToList();

    // Get MSFT_StorageHealth itself
    HealthService = Session.EnumerateAssociatedInstances(Cluster.CimSystemProperties.Namespace,
        Cluster, "MSFT_StorageSubSystemToStorageHealth", null, "StorageSubSystem", "StorageHealth").First();
}

These are the same objects you get in PowerShell using cmdlets like Get-StorageSubSystem, Get-StorageNode, and Get-Volume.

You can access all the same properties, documented at Storage Management API Classes.

...
using System.Diagnostics;

foreach (CimInstance Node in Nodes)
{
    // For illustration, write each node's Name to the console. You could also write State (up/down), or anything else!
    Debug.WriteLine("Discovered Node " + Node.CimInstanceProperties["Name"].Value.ToString());
}

Query faults

Invoke Diagnose to get any current faults scoped to the target CimInstance, which may be the cluster or any volume.

The complete list of faults available at each scope in Windows Server 2016 is documented below.

public void GetFaults(CimSession Session, CimInstance Target)
{
    // Set Parameters (None)
    CimMethodParametersCollection FaultsParams = new CimMethodParametersCollection();
    // Invoke API
    CimMethodResult Result = Session.InvokeMethod(Target, "Diagnose", FaultsParams);
    IEnumerable<CimInstance> DiagnoseResults = (IEnumerable<CimInstance>)Result.OutParameters["DiagnoseResults"].Value;
    // Unpack
    if (DiagnoseResults != null)
    {
        foreach (CimInstance DiagnoseResult in DiagnoseResults)
        {
            // TODO: Whatever you want!
        }
    }
}

Optional: MyFault class

It may make sense for you to construct and persist your own representation of faults. For example, this MyFault class stores several key properties of faults, including the FaultId, which can be used later to associate update or remove notifications, or to deduplicate if the same fault is detected multiple times for whatever reason.

public class MyFault {
    public String FaultId { get; set; }
    public String Reason { get; set; }
    public String Severity { get; set; }
    public String Description { get; set; }
    public String Location { get; set; }

    // Constructor
    public MyFault(CimInstance DiagnoseResult)
    {
        CimKeyedCollection<CimProperty> Properties = DiagnoseResult.CimInstanceProperties;
        FaultId     = Properties["FaultId"                  ].Value.ToString();
        Reason      = Properties["Reason"                   ].Value.ToString();
        Severity    = Properties["PerceivedSeverity"        ].Value.ToString();
        Description = Properties["FaultingObjectDescription"].Value.ToString();
        Location    = Properties["FaultingObjectLocation"   ].Value.ToString();
    }
}
List<MyFault> Faults = new List<MyFault>();

foreach (CimInstance DiagnoseResult in DiagnoseResults)
{
    Faults.Add(new MyFault(DiagnoseResult));
}

The complete list of properties in each fault (DiagnoseResult) is documented below.

Fault events

When faults are created, removed, or updated, the Health Service generates WMI events. These are essential to keeping your application state in sync without frequent polling, and can help with things like determining when to send email alerts. To subscribe to these events, this sample code uses the Observer Design Pattern again.

First, subscribe to MSFT_StorageFaultEvent events.

public void ListenForFaultEvents()
{
    IObservable<CimSubscriptionResult> Events = Session.SubscribeAsync(
        @"root\microsoft\windows\storage", "WQL", "SELECT * FROM MSFT_StorageFaultEvent");
    // Subscribe the Observer
    FaultsObserver<CimSubscriptionResult> Observer = new FaultsObserver<CimSubscriptionResult>(this);
    IDisposable Disposeable = Events.Subscribe(Observer);
}   

Next, implement an Observer whose OnNext() method will be invoked whenever a new event is generated.

Each event contains a ChangeType indicating whether a fault is being created, removed, or updated, and the relevant FaultId.

In addition, events contain all the properties of the fault itself.

class FaultsObserver<T> : IObserver<T>
{
    public void OnNext(T Event)
    {
        // Cast
        CimSubscriptionResult SubscriptionResult = Event as CimSubscriptionResult;

        if (SubscriptionResult != null)
        {
            // Unpack            
            CimKeyedCollection<CimProperty> Properties = SubscriptionResult.Instance.CimInstanceProperties;
            String ChangeType = Properties["ChangeType"].Value.ToString();
            String FaultId = Properties["FaultId"].Value.ToString();

            // Create
            if (ChangeType == "0")
            {
                MyFault MyNewFault = new MyFault(SubscriptionResult.Instance);
                // TODO: Whatever you want!
            }
            // Remove
            if (ChangeType == "1")
            {
                // TODO: Use FaultId to find and delete whatever representation you have...
            }
            // Update
            if (ChangeType == "2")
            {
                // TODO: Use FaultId to find and modify whatever representation you have...
            }
        }
    }
    public void OnError(Exception e)
    {
        // Handle Exceptions
    }
    public void OnCompleted()
    {
        // Nothing
    }
}

Understand fault lifecycle

Faults are not intended to be marked "seen" or resolved by the user. They are created when the Health Service observes a problem, and they are removed automatically, and only, when the Health Service can no longer observe the problem. In general, this reflects that the problem has been fixed.

However, in some cases, faults may be rediscovered by the Health Service (for example, after failover or due to intermittent connectivity). For this reason, it may make sense to persist your own representation of faults, so you can easily deduplicate. This is especially important if you send email alerts or equivalent.
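One possible way to persist and deduplicate faults, sketched under assumptions: a small in-memory store keyed by FaultId, where adding a fault reports whether it is new (and so worth alerting on) or a repeat. TrackedFault and FaultStore are hypothetical names introduced for illustration, and the sketch assumes that a rediscovered fault carries the same FaultId as before.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical local record; property names mirror the fault properties in this article.
public sealed class TrackedFault
{
    public string FaultId { get; set; }
    public string Reason { get; set; }
}

// Hypothetical in-memory store keyed by FaultId.
public sealed class FaultStore
{
    private readonly Dictionary<string, TrackedFault> faults =
        new Dictionary<string, TrackedFault>();

    public int Count
    {
        get { return faults.Count; }
    }

    // Returns true only if the fault was not already tracked.
    // A caller could send an alert on true and stay silent on false.
    public bool AddOrUpdate(TrackedFault fault)
    {
        bool isNew = !faults.ContainsKey(fault.FaultId);
        faults[fault.FaultId] = fault;
        return isNew;
    }

    // Returns true if a tracked fault with this FaultId was removed.
    public bool Remove(string faultId)
    {
        return faults.Remove(faultId);
    }
}
```

With a store like this, the Create and Update branches of an event handler could both call AddOrUpdate, and the Remove branch could call Remove with the event's FaultId.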

Properties of faults

This table presents several key properties of the fault object. For the full schema, inspect the MSFT_StorageDiagnoseResult class in storagewmi.mof.

Property                     Example
FaultId                      {12345-12345-12345-12345-12345}
FaultType                    Microsoft.Health.FaultType.Volume.Capacity
Reason                       "The volume is running out of available space."
PerceivedSeverity            5
FaultingObjectDescription    Contoso XYZ9000 S.N. 123456789
FaultingObjectLocation       Rack A06, RU 25, Slot 11
RecommendedActions           {"Expand the volume.", "Migrate workloads to other volumes."}

FaultId: Unique within the scope of one cluster.

PerceivedSeverity: PerceivedSeverity = { 4, 5, 6 } = { "Informational", "Warning", "Error" }, or equivalent colors such as blue, yellow, and red.

FaultingObjectDescription: Part information for hardware, typically blank for software objects.

FaultingObjectLocation: Location information for hardware, typically blank for software objects.

RecommendedActions: List of recommended actions, which are independent and in no particular order. Today, this list is often of length 1.
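The PerceivedSeverity mapping above can be expressed as a small lookup. This is a minimal sketch: FaultSeverity and ToLabel are hypothetical names, and the fallback label for values outside 4, 5, and 6 is an assumption, since this article documents only those three values.

```csharp
using System;

// Hypothetical helper for displaying PerceivedSeverity values.
public static class FaultSeverity
{
    public static string ToLabel(int perceivedSeverity)
    {
        switch (perceivedSeverity)
        {
            case 4: return "Informational";
            case 5: return "Warning";
            case 6: return "Error";
            // Assumption: anything else is surfaced as-is rather than guessed at
            default: return "Unknown (" + perceivedSeverity + ")";
        }
    }
}
```

Because the samples above read properties with Value.ToString(), a caller would parse the string to an integer first, for example with int.Parse.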

Properties of fault events

This table presents several key properties of the fault event. For the full schema, inspect the MSFT_StorageFaultEvent class in storagewmi.mof.

Note the ChangeType, which indicates whether a fault is being created, removed, or updated, and the FaultId. An event also contains all the properties of the affected fault.

Property                     Example
ChangeType                   0
FaultId                      {12345-12345-12345-12345-12345}
FaultType                    Microsoft.Health.FaultType.Volume.Capacity
Reason                       "The volume is running out of available space."
PerceivedSeverity            5
FaultingObjectDescription    Contoso XYZ9000 S.N. 123456789
FaultingObjectLocation       Rack A06, RU 25, Slot 11
RecommendedActions           {"Expand the volume.", "Migrate workloads to other volumes."}

ChangeType: ChangeType = { 0, 1, 2 } = { "Create", "Remove", "Update" }.
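Rather than comparing ChangeType against the strings "0", "1", and "2", the values can be given names. The sketch below is one way to do that; FaultChangeType and FaultChangeTypeParser are hypothetical names introduced here, and only the three numeric values come from this article.

```csharp
using System;
using System.Globalization;

// Hypothetical enum naming the documented ChangeType values.
public enum FaultChangeType
{
    Create = 0,
    Remove = 1,
    Update = 2
}

public static class FaultChangeTypeParser
{
    // Parses the string form of ChangeType (as obtained via Value.ToString()).
    // Returns false for non-numeric input or for values other than 0, 1, or 2.
    public static bool TryParse(string raw, out FaultChangeType changeType)
    {
        changeType = default(FaultChangeType);
        int value;
        if (!int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out value))
        {
            return false;
        }
        if (!Enum.IsDefined(typeof(FaultChangeType), value))
        {
            return false;
        }
        changeType = (FaultChangeType)value;
        return true;
    }
}
```

An OnNext handler could then switch on the parsed FaultChangeType, and treat a false return as an unexpected event to log.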

Coverage

In Windows Server 2016, the Health Service provides the following fault coverage:

PhysicalDisk (8)

FaultType: Microsoft.Health.FaultType.PhysicalDisk.FailedMedia

  • Severity: Warning
  • Reason: "The physical disk has failed."
  • RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.LostCommunication

  • Severity: Warning
  • Reason: "Connectivity has been lost to the physical disk."
  • RecommendedAction: "Check that the physical disk is working and properly connected."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.Unresponsive

  • Severity: Warning
  • Reason: "The physical disk is exhibiting recurring unresponsiveness."
  • RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.PredictiveFailure

  • Severity: Warning
  • Reason: "A failure of the physical disk is predicted to occur soon."
  • RecommendedAction: "Replace the physical disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnsupportedHardware

  • Severity: Warning
  • Reason: "The physical disk is quarantined because it is not supported by your solution vendor."
  • RecommendedAction: "Replace the physical disk with supported hardware."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnsupportedFirmware

  • Severity: Warning
  • Reason: "The physical disk is in quarantine because its firmware version is not supported by your solution vendor."
  • RecommendedAction: "Update the firmware on the physical disk to the target version."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.UnrecognizedMetadata

  • Severity: Warning
  • Reason: "The physical disk has unrecognised meta data."
  • RecommendedAction: "This disk may contain data from an unknown storage pool. First make sure there is no useful data on this disk, then reset the disk."

FaultType: Microsoft.Health.FaultType.PhysicalDisk.FailedFirmwareUpdate

  • Severity: Warning
  • Reason: "Failed attempt to update firmware on the physical disk."
  • RecommendedAction: "Try using a different firmware binary."

Virtual Disk (2)

FaultType: Microsoft.Health.FaultType.VirtualDisks.NeedsRepair

  • Severity: Informational
  • Reason: "Some data on this volume is not fully resilient. It remains accessible."
  • RecommendedAction: "Restoring resiliency of the data."

FaultType: Microsoft.Health.FaultType.VirtualDisks.Detached

  • Severity: Critical
  • Reason: "The volume is inaccessible. Some data may be lost."
  • RecommendedAction: "Check the physical and/or network connectivity of all storage devices. You may need to restore from backup."

Pool Capacity (1)

FaultType: Microsoft.Health.FaultType.StoragePool.InsufficientReserveCapacityFault

  • Severity: Warning
  • Reason: "The storage pool does not have the minimum recommended reserve capacity. This may limit your ability to restore data resiliency in the event of drive failure(s)."
  • RecommendedAction: "Add additional capacity to the storage pool, or free up capacity. The minimum recommended reserve varies by deployment, but is approximately 2 drives' worth of capacity."

Volume Capacity (2) [1]

FaultType: Microsoft.Health.FaultType.Volume.Capacity

  • Severity: Warning
  • Reason: "The volume is running out of available space."
  • RecommendedAction: "Expand the volume or migrate workloads to other volumes."

FaultType: Microsoft.Health.FaultType.Volume.Capacity

  • Severity: Critical
  • Reason: "The volume is running out of available space."
  • RecommendedAction: "Expand the volume or migrate workloads to other volumes."

Server (3)

FaultType: Microsoft.Health.FaultType.Server.Down

  • Severity: Critical
  • Reason: "The server cannot be reached."
  • RecommendedAction: "Start or replace server."

FaultType: Microsoft.Health.FaultType.Server.Isolated

  • Severity: Critical
  • Reason: "The server is isolated from the cluster due to connectivity issues."
  • RecommendedAction: "If isolation persists, check the network(s) or migrate workloads to other nodes."

FaultType: Microsoft.Health.FaultType.Server.Quarantined

  • Severity: Critical
  • Reason: "The server is quarantined by the cluster due to recurring failures."
  • RecommendedAction: "Replace the server or fix the network."

Cluster (1)

FaultType: Microsoft.Health.FaultType.ClusterQuorumWitness.Error

  • Severity: Critical
  • Reason: "The cluster is one server failure away from going down."
  • RecommendedAction: "Check the witness resource, and restart as needed. Start or replace failed servers."

Network Adapter/Interface (4)

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Disconnected

  • Severity: Warning
  • Reason: "The network interface has become disconnected."
  • RecommendedAction: "Reconnect the network cable."

FaultType: Microsoft.Health.FaultType.NetworkInterface.Missing

  • Severity: Warning
  • Reason: "The server {server} has missing network adapter(s) connected to cluster network {cluster network}."
  • RecommendedAction: "Connect the server to the missing cluster network."

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Hardware

  • Severity: Warning
  • Reason: "The network interface has had a hardware failure."
  • RecommendedAction: "Replace the network interface adapter."

FaultType: Microsoft.Health.FaultType.NetworkAdapter.Disabled

  • Severity: Warning
  • Reason: "The network interface {network interface} is not enabled and is not being used."
  • RecommendedAction: "Enable the network interface."

Enclosure (6)

FaultType: Microsoft.Health.FaultType.StorageEnclosure.LostCommunication

  • Severity: Warning
  • Reason: "Communication has been lost to the storage enclosure."
  • RecommendedAction: "Start or replace the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.FanError

  • Severity: Warning
  • Reason: "The fan at position {position} of the storage enclosure has failed."
  • RecommendedAction: "Replace the fan in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.CurrentSensorError

  • Severity: Warning
  • Reason: "The current sensor at position {position} of the storage enclosure has failed."
  • RecommendedAction: "Replace a current sensor in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.VoltageSensorError

  • Severity: Warning
  • Reason: "The voltage sensor at position {position} of the storage enclosure has failed."
  • RecommendedAction: "Replace a voltage sensor in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.IoControllerError

  • Severity: Warning
  • Reason: "The IO controller at position {position} of the storage enclosure has failed."
  • RecommendedAction: "Replace an IO controller in the storage enclosure."

FaultType: Microsoft.Health.FaultType.StorageEnclosure.TemperatureSensorError

  • Severity: Warning
  • Reason: "The temperature sensor at position {position} of the storage enclosure has failed."
  • RecommendedAction: "Replace a temperature sensor in the storage enclosure."

Firmware Rollout (3)

FaultType: Microsoft.Health.FaultType.FaultDomain.FailedMaintenanceMode

  • Severity: Warning
  • Reason: "Currently unable to make progress while performing firmware roll out."
  • RecommendedAction: "Verify all storage spaces are healthy, and that no fault domain is currently in maintenance mode."

FaultType: Microsoft.Health.FaultType.FaultDomain.FirmwareVerifyVersionFaile

  • Severity: Warning
  • Reason: "Firmware roll out was cancelled due to unreadable or unexpected firmware version information after applying a firmware update."
  • RecommendedAction: "Restart firmware roll out once the firmware issue has been resolved."

FaultType: Microsoft.Health.FaultType.FaultDomain.TooManyFailedUpdates

  • Severity: Warning
  • Reason: "Firmware roll out was cancelled due to too many physical disks failing a firmware update attempt."
  • RecommendedAction: "Restart firmware roll out once the firmware issue has been resolved."

Storage QoS (3) [2]

FaultType: Microsoft.Health.FaultType.StorQos.InsufficientThroughput

  • Severity: Warning
  • Reason: "Storage throughput is insufficient to satisfy reserves."
  • RecommendedAction: "Reconfigure Storage QoS policies."

FaultType: Microsoft.Health.FaultType.StorQos.LostCommunication

  • Severity: Warning
  • Reason: "The Storage QoS policy manager has lost communication with the volume."
  • RecommendedAction: "Please reboot nodes {nodes}"

FaultType: Microsoft.Health.FaultType.StorQos.MisconfiguredFlow

  • Severity: Warning
  • Reason: "One or more storage consumers (usually Virtual Machines) are using a non-existent policy with id {id}."
  • RecommendedAction: "Recreate any missing Storage QoS policies."

[1] Indicates the volume has reached 80% full (minor severity) or 90% full (major severity).
[2] Indicates some .vhd(s) on the volume have not met their Minimum IOPS for over 10% (minor), 30% (major), or 50% (critical) of a rolling 24-hour window.
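To make the volume capacity thresholds in the first footnote concrete, here is a minimal sketch of how a monitoring tool might apply the same 80% and 90% cut-offs to its own size and free-space readings. VolumeCapacityCheck and Classify are hypothetical names; treating the thresholds as inclusive is an assumption, and this is not how the Health Service itself is implemented.

```csharp
using System;

// Hypothetical helper that mirrors the documented 80% / 90% thresholds.
public static class VolumeCapacityCheck
{
    // Returns "Major" at 90% full or more, "Minor" at 80% full or more, otherwise null.
    public static string Classify(ulong sizeBytes, ulong freeBytes)
    {
        if (sizeBytes == 0)
        {
            throw new ArgumentException("Volume size must be greater than zero.", "sizeBytes");
        }
        if (freeBytes > sizeBytes)
        {
            throw new ArgumentException("Free space cannot exceed volume size.", "freeBytes");
        }

        // decimal avoids binary floating-point rounding exactly at the thresholds
        decimal usedFraction = (decimal)(sizeBytes - freeBytes) / sizeBytes;

        if (usedFraction >= 0.90m)
        {
            return "Major";
        }
        if (usedFraction >= 0.80m)
        {
            return "Minor";
        }
        return null;
    }
}
```

For example, a 1,000-byte volume with 150 bytes free is 85% full, which this sketch classifies as "Minor".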

Note

The health of storage enclosure components such as fans, power supplies, and sensors is derived from SCSI Enclosure Services (SES). If your vendor does not provide this information, the Health Service cannot display it.

See also