Troubleshoot the process server

The Site Recovery process server is used when you set up disaster recovery to Azure for on-premises VMware VMs and physical servers. This article describes how to troubleshoot issues with the process server, including replication and connectivity issues.

Learn more about the process server.

Before you start

Before you start troubleshooting:

  1. Make sure you understand how to monitor process servers.
  2. Review the best practices below.
  3. Make sure you follow capacity considerations, and use sizing guidance for the configuration server or standalone process servers.

Best practices for process server deployment

For optimum performance of process servers, we've summarized a number of general best practices.

Best practice | Details
Usage | Make sure the configuration server/standalone process server is used only for its intended purpose. Don't run anything else on the machine.
IP address | Make sure that the process server has a static IPv4 address, and doesn't have NAT configured.
Control memory/CPU usage | Keep CPU and memory usage under 70%.
Ensure free space | Free space refers to the cache disk space on the process server. Replication data is stored in the cache before it's uploaded to Azure. Keep free space above 25%. If it goes below 20%, replication is throttled for replicated machines that are associated with the process server.
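
As a rough illustration of the free-space guidance, the following Python sketch measures the free space on a volume and compares it with the 25% and 20% figures from the table. The function names and status labels are hypothetical and aren't part of Site Recovery; pass the path of the disk that hosts the process server cache.

```python
import shutil


def free_space_percent(path):
    """Return the free space on the volume that contains `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def cache_space_status(free_percent):
    """Compare a free-space percentage with the best-practice figures."""
    if free_percent < 20:
        return "throttled"  # below 20%, replication is throttled
    if free_percent <= 25:
        return "low"        # the guidance is to stay above 25%
    return "ok"
```

For example, `cache_space_status(free_space_percent("D:\\"))` would report on drive D:, assuming that's where the cache lives.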

Check process server health

The first step in troubleshooting is to check the health and status of the process server. To do this, review all alerts, check that required services are running, and verify that there's a heartbeat from the process server. These steps are summarized in the following graphic, followed by procedures to help you perform the steps.

[Figure: Troubleshoot process server health]

Step 1: Troubleshoot process server health alerts

The process server generates a number of health alerts. These alerts, and recommended actions, are summarized in the following table.

Alert type | Error | Troubleshoot
Healthy | None | Process server is connected and healthy.
Warning | Specified services aren't running. | 1. Check that services are running. 2. If services are running as expected, follow the instructions below to troubleshoot connectivity and replication issues.
Warning | CPU utilization > 80% for the last 15 minutes. | 1. Don't add new machines. 2. Check that the number of VMs using the process server aligns to defined limits, and consider setting up an additional process server. 3. Follow the instructions below to troubleshoot connectivity and replication issues.
Critical | CPU utilization > 95% for the last 15 minutes. | 1. Don't add new machines. 2. Check that the number of VMs using the process server aligns to defined limits, and consider setting up an additional process server. 3. Follow the instructions below to troubleshoot connectivity and replication issues. 4. If the issue persists, run the Deployment Planner for VMware/physical server replication.
Warning | Memory usage > 80% for the last 15 minutes. | 1. Don't add new machines. 2. Check that the number of VMs using the process server aligns to defined limits, and consider setting up an additional process server. 3. Follow any instructions associated with the warning. 4. If the issue persists, follow the instructions below to troubleshoot connectivity and replication issues.
Critical | Memory usage > 95% for the last 15 minutes. | 1. Don't add new machines, and consider setting up an additional process server. 2. Follow any instructions associated with the warning. 3. If the issue continues, follow the instructions below to troubleshoot connectivity and replication issues. 4. If the issue persists, run the Deployment Planner for VMware/physical server replication issues.
Warning | Cache folder free space < 30% for the last 15 minutes. | 1. Don't add new machines, and consider setting up an additional process server. 2. Check that the number of VMs using the process server aligns to guidelines. 3. Follow the instructions below to troubleshoot connectivity and replication issues.
Critical | Free space < 25% for the last 15 minutes. | 1. Follow the instructions associated with the warning for this issue. 2. Follow the instructions below to troubleshoot connectivity and replication issues. 3. If the issue persists, run the Deployment Planner for VMware/physical server replication.
Critical | No heartbeat from the process server for 15 minutes or more. The tmansvc service isn't communicating with the configuration server. | 1. Check that the process server is up and running. 2. Check that tmansvc is running on the process server. 3. Follow the instructions below to troubleshoot connectivity and replication issues.
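
The thresholds in the alerts table can be expressed as a small classifier, which may help if you track these metrics in your own monitoring. This is a minimal sketch in Python: the function names are hypothetical, and the inputs are assumed to be 15-minute figures that you collect yourself.

```python
def utilization_alert(percent_used):
    """Classify CPU or memory utilization using the table's thresholds."""
    if percent_used > 95:
        return "Critical"
    if percent_used > 80:
        return "Warning"
    return "Healthy"


def cache_free_space_alert(percent_free):
    """Classify cache folder free space using the table's thresholds."""
    if percent_free < 25:
        return "Critical"
    if percent_free < 30:
        return "Warning"
    return "Healthy"
```

The comparisons are strict, matching the ">" and "<" signs in the table, so a value of exactly 95% utilization is still reported as a warning.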

[Figure: Table key]

Step 2: Check process server services

Services that should be running on the process server are summarized in the following table. There are slight differences in services, depending on how the process server is deployed.

For all services except the Microsoft Azure Recovery Services Agent (obengine), check that the StartType is set to Automatic or Automatic (Delayed Start).

Deployment | Running services
Process server on the configuration server | ProcessServer; ProcessServerMonitor; cxprocessserver; InMage PushInstall; Log Upload Service (LogUpload); InMage Scout Application Service; Microsoft Azure Recovery Services Agent (obengine); InMage Scout VX Agent-Sentinel/Outpost (svagents); tmansvc; World Wide Web Publishing Service (W3SVC); MySQL; Microsoft Azure Site Recovery Service (dra)
Process server running as a standalone server | ProcessServer; ProcessServerMonitor; cxprocessserver; InMage PushInstall; Log Upload Service (LogUpload); InMage Scout Application Service; Microsoft Azure Recovery Services Agent (obengine); InMage Scout VX Agent-Sentinel/Outpost (svagents); tmansvc
Process server deployed in Azure for failback | ProcessServer; ProcessServerMonitor; cxprocessserver; InMage PushInstall; Log Upload Service (LogUpload)
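
To compare a machine against this table, you could list the services that are currently running and look for gaps. The Python sketch below only does the comparison. It assumes you've already collected the running service names by some other means, and the names it uses are copied from the table, using the short name in parentheses where the table gives one. Whether each entry matches the Windows service name or the display name isn't confirmed here, so treat the lists as a starting point.

```python
# Services common to every deployment type in the table.
BASE = ["ProcessServer", "ProcessServerMonitor", "cxprocessserver",
        "InMage PushInstall", "LogUpload"]
# Additional services for standalone and configuration server deployments.
STANDALONE_EXTRA = ["InMage Scout Application Service", "obengine",
                    "svagents", "tmansvc"]
CONFIG_SERVER_EXTRA = ["W3SVC", "MySQL", "dra"]

REQUIRED_SERVICES = {
    "azure_failback": BASE,
    "standalone": BASE + STANDALONE_EXTRA,
    "configuration_server": BASE + STANDALONE_EXTRA + CONFIG_SERVER_EXTRA,
}


def missing_services(deployment, running):
    """Return the required services that aren't in `running` (case-insensitive)."""
    running_lower = {name.lower() for name in running}
    return [name for name in REQUIRED_SERVICES[deployment]
            if name.lower() not in running_lower]
```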

Step 3: Check the process server heartbeat

If there's no heartbeat from the process server (error code 806), do the following:

  1. Verify that the process server VM is up and running.

  2. Check these logs for errors.

    C:\ProgramData\ASR\home\svsystems\eventmanager*.log
    C:\ProgramData\ASR\home\svsystems\monitor_protection*.log
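
If the logs are large, a short script can pull out the lines worth reading first. The following Python sketch scans every file that matches a wildcard pattern and returns the lines that contain a keyword. The log format isn't documented in this article, so the default keyword "error" is an assumption, and the function name is hypothetical.

```python
import glob


def find_matching_lines(pattern, keyword="error"):
    """Return (path, line number, text) for each line that contains `keyword`."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" avoids failing on bytes that aren't valid UTF-8.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if keyword.lower() in line.lower():
                    hits.append((path, number, line.rstrip()))
    return hits
```

You could call it with a raw string such as `r"C:\ProgramData\ASR\home\svsystems\eventmanager*.log"` as the pattern.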

Check connectivity and replication

Initial and ongoing replication failures are often caused by connectivity issues between source machines and the process server, or between the process server and Azure. These steps are summarized in the following graphic, followed by procedures to help you perform the steps.

[Figure: Troubleshoot connectivity and replication]

Step 4: Verify time sync on source machine

Ensure that the system date/time for the replicated machine is in sync. Learn more.

Step 5: Check anti-virus software on source machine

Check that no anti-virus software on the replicated machine is blocking Site Recovery. If you need to exclude Site Recovery from anti-virus programs, review this article.

Step 6: Check connectivity from source machine

  1. Install the Telnet client on the source machine if you need to. Don't use Ping.

  2. From the source machine, use Telnet to test the connection to the process server on the HTTPS port. By default, 9443 is the HTTPS port for replication traffic.

    telnet <process server IP address> <port>

  3. Verify whether the connection is successful.

Connectivity | Details | Action
Successful | Telnet shows a blank screen, and the process server is reachable. | No further action required.
Unsuccessful | You can't connect. | Make sure that inbound port 9443 is allowed on the process server, for example, if you have a perimeter network or a screened subnet. Check connectivity again.
Partially successful | You can connect, but the source machine reports that the process server can't be reached. | Continue with the next troubleshooting procedure.
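
If the Telnet client isn't available, a plain TCP connection attempt gives the same yes/no answer as the Telnet check. The Python sketch below uses only the standard library; the function name and the example address are hypothetical. It tests TCP reachability only and says nothing about whether the service behind the port is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("10.0.0.5", 9443)` would test the default replication port on a process server at the placeholder address 10.0.0.5.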

Step 7: Troubleshoot an unreachable process server

If the process server isn't reachable from the source machine, error 78186 is displayed. If not addressed, this issue prevents both app-consistent and crash-consistent recovery points from being generated as expected.

Troubleshoot by checking whether the source machine can reach the IP address of the process server, and by running the cxpsclient tool on the source machine to check the end-to-end connection.

Check the IP connection on the process server

If Telnet is successful but the source machine reports that the process server can't be reached, check whether you can reach the IP address of the process server.

  1. In a web browser, try to reach IP address https://<PS_IP>:<PS_Data_Port>/.
  2. If this check shows an HTTPS certificate error, that's normal. If you ignore the error, you should see a 400 - Bad Request. This means that the server can't serve the browser request, but that the standard HTTPS connection is fine.
  3. If this check doesn't work, note the browser error message. For example, a 407 error indicates an issue with proxy authentication.
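
The same check can be approximated without a browser. The Python sketch below requests the root URL over HTTPS with certificate verification turned off, which mirrors ignoring the certificate error, and then interprets the status code as described in the steps above. Both function names are hypothetical. Note one limitation: the request connects directly and doesn't go through a browser's proxy configuration, so it can confirm the 400 response but won't necessarily reproduce a proxy error such as 407.

```python
import http.client
import ssl


def fetch_status(host, port, timeout=10.0):
    """Request https://host:port/ and return the HTTP status code."""
    context = ssl.create_default_context()
    context.check_hostname = False       # the certificate error is expected
    context.verify_mode = ssl.CERT_NONE
    connection = http.client.HTTPSConnection(
        host, port, timeout=timeout, context=context)
    try:
        connection.request("GET", "/")
        return connection.getresponse().status
    finally:
        connection.close()


def interpret_status(status):
    """Translate a status code into the outcomes described in this procedure."""
    if status == 400:
        return "HTTPS connection is fine"
    if status == 407:
        return "Proxy authentication issue"
    return f"Unexpected status {status}; note the error message"
```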

Check the connection with cxpsclient

Additionally, you can run the cxpsclient tool to check the end-to-end connection.

  1. Run the tool as follows:

    <install folder>\cxpsclient.exe -i <PS_IP> -l <PS_Data_Port> -y <timeout_in_secs:recommended 300>
    
  2. On the process server, check these generated logs:

    C:\ProgramData\ASR\home\svsystems\transport\log\cxps.err
    C:\ProgramData\ASR\home\svsystems\transport\log\cxps.xfer

Check source VM logs for upload failures (error 78028)

If data uploads from source machines to the process server are blocked, neither crash-consistent nor app-consistent recovery points are generated.

  1. To troubleshoot network upload failures, look for errors in this log:

    C:\Program Files (x86)\Microsoft Azure Site Recovery\agent\svagents*.log

  2. Use the remaining procedures in this article to resolve data upload issues.

Step 8: Check whether the process server is pushing data

Check whether the process server is actively pushing data to Azure.

  1. On the process server, open Task Manager (press Ctrl+Shift+Esc).

  2. Select the Performance tab > Open Resource Monitor.

  3. On the Resource Monitor page, select the Network tab. Under Processes with Network Activity, check whether cbengine.exe is actively sending a large volume of data.

    [Figure: Volumes under Processes with Network Activity]

If cbengine.exe isn't sending a large volume of data, complete the steps in the following sections.

Step 9: Check the process server connection to Azure blob storage

  1. In Resource Monitor, select cbengine.exe.
  2. Under TCP Connections, check whether there is connectivity from the process server to Azure storage.

[Figure: Connectivity between cbengine.exe and the Azure Blob storage URL]

Check services

If there's no connectivity from the process server to the Azure blob storage URL, check that services are running.

  1. In the Control Panel, select Services.

  2. Verify that the following services are running:

    • cxprocessserver
    • InMage Scout VX Agent – Sentinel/Outpost
    • Microsoft Azure Recovery Services Agent
    • Microsoft Azure Site Recovery Service
    • tmansvc
  3. Start or restart any service that isn't running.

  4. Verify that the process server is connected and reachable.

Step 10: Check the process server connection to the Azure public IP address

  1. On the process server, in %programfiles%\Microsoft Azure Recovery Services Agent\Temp, open the latest CBEngineCurr.errlog file.
  2. In the file, search for 443, or for the string connection attempt failed.

[Figure: Error logs in the Temp folder]

  3. If you see issues, locate your Azure public IP address in the CBEngineCurr.errlog file.
  4. At the command line on the process server, use Telnet to test the connection to that Azure public IP address on port 443:

    telnet <your Azure Public IP address as seen in CBEngineCurr.errlog> 443

  5. If you can't connect, follow the next procedure.
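
The search in steps 2 and 3 can be scripted. The Python sketch below filters log lines for the two search terms named above and then extracts anything that looks like an IPv4 address from the matching lines. The layout of CBEngineCurr.errlog isn't described in this article, so the IPv4 pattern is a generic assumption and the function names are hypothetical. Check any extracted address against the log before you use it.

```python
import re

# Generic dotted-quad pattern; it doesn't validate that each octet is <= 255.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def suspicious_lines(lines):
    """Keep lines that mention 443 or the string 'connection attempt failed'."""
    return [line.rstrip("\n") for line in lines
            if "443" in line or "connection attempt failed" in line.lower()]


def ip_addresses(lines):
    """Return the distinct IPv4-like strings found in `lines`, in order."""
    found = []
    for line in lines:
        for address in IPV4.findall(line):
            if address not in found:
                found.append(address)
    return found
```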

Step 11: Check process server firewall settings

Check whether the IP address-based firewall on the process server is blocking access.

  1. For IP address-based firewall rules:

    a) Download the complete list of Microsoft Azure datacenter IP ranges.

    b) Add the IP address ranges to your firewall configuration, to ensure that the firewall allows communication to Azure (and to the default HTTPS port, 443).

    c) Allow IP address ranges for the Azure region of your subscription, and for the Azure West US region (used for access control and identity management).

  2. For URL-based firewalls, add the URLs listed in the following table to the firewall configuration.

    Name | Commercial URL | Government URL | Description
    Azure Active Directory | login.microsoftonline.com | login.microsoftonline.us | Used for access control and identity management by using Azure Active Directory.
    Backup | *.backup.windowsazure.com | *.backup.windowsazure.us | Used for replication data transfer and coordination.
    Replication | *.hypervrecoverymanager.windowsazure.com | *.hypervrecoverymanager.windowsazure.us | Used for replication management operations and coordination.
    Storage | *.blob.core.windows.net | *.blob.core.usgovcloudapi.net | Used for access to the storage account that stores replicated data.
    Telemetry (optional) | dc.services.visualstudio.com | dc.services.visualstudio.com | Used for telemetry.
    Time synchronization | time.windows.com | time.nist.gov | Used to check time synchronization between system and global time in all deployments.
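
When you review a URL-based firewall rule set, it can help to test a hostname against the wildcard entries in the table. The Python sketch below does that for the commercial URLs. The function name is hypothetical, and the matching uses shell-style wildcards, which may differ from how a particular firewall interprets the same entries.

```python
from fnmatch import fnmatchcase

# Commercial URLs copied from the table above.
COMMERCIAL_URLS = [
    "login.microsoftonline.com",
    "*.backup.windowsazure.com",
    "*.hypervrecoverymanager.windowsazure.com",
    "*.blob.core.windows.net",
    "dc.services.visualstudio.com",
    "time.windows.com",
]


def is_listed(hostname, patterns=COMMERCIAL_URLS):
    """Return True if `hostname` matches one of the listed URL patterns."""
    host = hostname.lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

For example, `is_listed("contoso.blob.core.windows.net")` returns True for the hypothetical storage account name contoso, while `is_listed("example.com")` returns False.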

Step 12: Verify process server proxy settings

  1. If you use a proxy server, ensure that the proxy server name is resolved by the DNS server. Check the value that you provided when you set up the configuration server in registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Azure Site Recovery\ProxySettings.

  2. Ensure that the same settings are used by the Azure Site Recovery agent to send data.

    a) Search for Microsoft Azure Backup.

    b) Open Microsoft Azure Backup, and select Action > Change Properties.

    c) On the Proxy Configuration tab, the proxy address should be the same as the proxy address that's shown in the registry settings. If not, change it to the same address.

Step 13: Check bandwidth

Increase the bandwidth between the process server and Azure, and then check whether the problem still occurs.

Next steps

If you need more help, post your question in the Azure Site Recovery forum.