
Performance tuning scenario: Distributed business transactions

This article describes how a development team used metrics to find bottlenecks and improve the performance of a distributed system. The article is based on actual load testing that we did for a sample application. The application code is available on GitHub.

This article is part of a series. Read the first part here.

Scenario: A client application initiates a business transaction that involves multiple steps.

This scenario involves a drone delivery application that runs on Azure Kubernetes Service (AKS). Customers use a web app to schedule deliveries by drone. Each transaction requires multiple steps that are performed by separate microservices on the back end:

  • The Delivery service manages deliveries.
  • The Drone Scheduler service schedules drones for pickup.
  • The Package service manages packages.

There are two other services: an Ingestion service that accepts client requests and puts them on a queue for processing, and a Workflow service that coordinates the steps in the workflow.


For more information about this scenario, see Designing a microservices architecture.

Test 1: Baseline

For the first load test, the team created a 6-node AKS cluster and deployed three replicas of each microservice. The load test was a step-load test, starting at two simulated users and ramping up to 40 simulated users.

Setting          Value
Cluster nodes    6
Pods             3 per service

The following graph shows the results of the load test, as shown in Visual Studio. The purple line plots user load, and the orange line plots total requests.

Graph of Visual Studio load test results

The first thing to realize about this scenario is that client requests per second is not a useful metric of performance. That's because the application processes requests asynchronously, so the client gets a response right away. The response code is always HTTP 202 (Accepted), meaning the request was accepted but processing is not complete.
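The accept-then-process pattern described here can be sketched with a few lines of code. This is a minimal stand-alone illustration using Python's standard library, not the sample application's actual implementation; the queue and handler names are hypothetical:

```python
import queue
import threading
import time

work_queue = queue.Queue()  # stands in for the Service Bus queue

def ingest(request):
    """Ingestion service: enqueue the request and acknowledge immediately."""
    work_queue.put(request)
    return 202  # HTTP 202 Accepted: queued, but processing is not complete

def workflow_worker():
    """Workflow service: drain the queue in the background."""
    while True:
        request = work_queue.get()
        time.sleep(0.01)  # simulate calling the backend services
        work_queue.task_done()

threading.Thread(target=workflow_worker, daemon=True).start()

status = ingest({"deliveryId": "d-001"})
print(status)      # the client sees 202 right away
work_queue.join()  # the actual work completes later, asynchronously
```

Because the client gets its 202 before any real work happens, the client-side request rate says nothing about whether the back end is keeping up.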

What we really want to know is whether the back end is keeping up with the request rate. The Service Bus queue can absorb spikes, but if the back end cannot handle a sustained load, processing will fall further and further behind.

Here's a more informative graph. It plots the number of incoming and outgoing messages on the Service Bus queue. Incoming messages are shown in light blue, and outgoing messages are shown in dark blue:


This chart shows that the rate of incoming messages increases, reaching a peak and then dropping back to zero at the end of the load test. But the number of outgoing messages peaks early in the test and then actually drops. That means the Workflow service, which handles the requests, isn't keeping up. Even after the load test ends (around 9:22 on the graph), messages are still being processed as the Workflow service continues to drain the queue.
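The gap between the two curves is what matters: whenever the incoming rate exceeds the outgoing rate, the queue backlog grows, and the back end keeps draining it after the test ends. A small sketch of that bookkeeping, using made-up per-interval rates rather than the test's real measurements:

```python
# Per-interval message rates (messages/sec); illustrative values only.
incoming = [10, 30, 50, 50, 50, 0, 0, 0]     # ramps up, then the test ends
outgoing = [10, 16, 16, 16, 16, 16, 16, 16]  # back end caps out at ~16/sec

backlog = 0
backlogs = []
for i, o in zip(incoming, outgoing):
    backlog = max(0, backlog + i - o)  # queue depth can't go negative
    backlogs.append(backlog)

print(backlogs)  # backlog keeps growing while incoming > outgoing
```

Note that the backlog is still nonzero after the incoming rate drops to zero, which is exactly the post-test drain visible on the graph.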

What's slowing down the processing? The first thing to look for is errors or exceptions that might indicate a systemic issue. The Application Map in Azure Monitor shows the graph of calls between components, and is a quick way to spot issues and then click through to get more details.

Sure enough, the Application Map shows that the Workflow service is getting errors from the Delivery service:


To see more details, you can select a node in the graph and click through to an end-to-end transaction view. In this case, it shows that the Delivery service is returning HTTP 500 errors. The error messages indicate that an exception is being thrown due to memory limits in Azure Cache for Redis.


You may notice that these calls to Redis don't appear in the Application Map. That's because the .NET library for Application Insights doesn't have built-in support for tracking Redis as a dependency. (For a list of what's supported out of the box, see Dependency auto-collection.) As a fallback, you can use the TrackDependency API to track any dependency. Load testing often reveals these kinds of gaps in the telemetry, which can be remediated.

Test 2: Increased cache size

For the second load test, the development team increased the cache size in Azure Cache for Redis. (See How to Scale Azure Cache for Redis.) This change resolved the out-of-memory exceptions, and now the Application Map shows zero errors:


However, there is still a dramatic lag in processing messages. At the peak of the load test, the incoming message rate is more than 5× the outgoing rate:


The following graph measures throughput in terms of message completion, that is, the rate at which the Workflow service marks the Service Bus messages as completed. Each point on the graph represents 5 seconds of data, showing a maximum throughput of about 16/sec.


This graph was generated by running a query in the Log Analytics workspace, using the Kusto query language:

let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
dependencies
| where cloud_RoleName == 'fabrikam-workflow'
| where timestamp > start and timestamp < end
| where type == 'Azure Service Bus'
| where target has 'https://dev-i-iuosnlbwkzkau.servicebus.windows.net'
| where client_Type == "PC"
| where name == "Complete"
| summarize succeeded=sumif(itemCount, success == true), failed=sumif(itemCount, success == false) by bin(timestamp, 5s)
| render timechart
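The summarize step of this query (count successes and failures per 5-second bin) is easy to reproduce outside Log Analytics if you export the raw telemetry. Here is a rough Python equivalent, operating on hypothetical completion records rather than real Application Insights data:

```python
from collections import defaultdict

# (timestamp_seconds, item_count, success) — hypothetical telemetry records
records = [
    (0.4, 1, True), (1.2, 2, True), (3.9, 1, False),
    (5.1, 3, True), (6.8, 1, True), (9.5, 2, False),
]

# Equivalent of: summarize succeeded=sumif(...), failed=sumif(...)
#                by bin(timestamp, 5s)
bins = defaultdict(lambda: {"succeeded": 0, "failed": 0})
for ts, item_count, success in records:
    bucket = int(ts // 5) * 5  # floor each timestamp to its 5-second bin
    key = "succeeded" if success else "failed"
    bins[bucket][key] += item_count

for bucket in sorted(bins):
    print(bucket, bins[bucket])
```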

Test 3: Scale out the backend services

It appears the back end is the bottleneck. An easy next step is to scale out the business services (Package, Delivery, and Drone Scheduler), and see if throughput improves. For the next load test, the team scaled these services out from three replicas to six replicas.

Setting                                       Value
Cluster nodes                                 6
Ingestion service                             3 replicas
Workflow service                              3 replicas
Package, Delivery, Drone Scheduler services   6 replicas each

Unfortunately, this load test shows only modest improvement. Outgoing messages are still not keeping up with incoming messages:


Throughput is more consistent, but the maximum achieved is about the same as in the previous test:


Moreover, looking at Azure Monitor for containers, it appears the problem is not caused by resource exhaustion within the cluster. First, the node-level metrics show that CPU utilization remains under 40% even at the 95th percentile, and memory utilization is about 20%.

Graph of AKS node utilization

In a Kubernetes environment, it's possible for individual pods to be resource-constrained even when the nodes aren't. But the pod-level view shows that all pods are healthy.

Graph of AKS pod utilization

From this test, it seems that just adding more pods to the back end won't help. The next step is to look more closely at the Workflow service to understand what's happening when it processes messages. Application Insights shows that the average duration of the Workflow service's Process operation is 246 ms.

Screenshot of Application Insights

We can also run a query to get metrics on the individual operations within each transaction:

target                                                                     percentile_duration_50   percentile_duration_95
https://dev-i-iuosnlbwkzkau.servicebus.windows.net/ | dev-i-iuosnlbwkzkau  86.66950203              283.4255578
delivery                                                                   37                       57
package                                                                    12                       17
dronescheduler                                                             21                       41

The first row in this table represents the Service Bus queue. The other rows are the calls to the backend services. For reference, here's the Log Analytics query for this table:

let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
let dataset=dependencies
| where timestamp > start and timestamp < end
| where (cloud_RoleName == 'fabrikam-workflow')
| where name == 'Complete' or target in ('package', 'delivery', 'dronescheduler');
dataset
| summarize percentiles(duration, 50, 95) by target

Screenshot of Log Analytics query results

These latencies look reasonable. But here is the key insight: if the total operation time is ~250 ms, that puts a strict upper bound on how fast messages can be processed in serial. The key to improving throughput, therefore, is greater parallelism.

That should be possible in this scenario, for two reasons:

  • These are network calls, so most of the time is spent waiting for I/O completion.
  • The messages are independent, and don't need to be processed in order.
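The arithmetic behind that upper bound is worth making explicit. With ~250 ms per message, a single serial worker can complete at most 1/0.25 = 4 messages per second; because the time is mostly I/O wait, N concurrent in-flight messages raise the bound to roughly N/0.25. This is a back-of-the-envelope model, not a measurement:

```python
def max_throughput(latency_sec: float, concurrency: int) -> float:
    """Upper bound on messages/sec when each message takes latency_sec
    (mostly I/O wait) and `concurrency` messages are in flight."""
    return concurrency / latency_sec

latency = 0.250  # ~250 ms per Workflow "Process" operation

serial = max_throughput(latency, concurrency=1)     # one message at a time
parallel = max_throughput(latency, concurrency=20)  # 20 in flight

print(serial, parallel)
```

With 20 concurrent calls the bound is 80 messages/sec, which is consistent with the ~72 operations/sec peak observed later in Test 5.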

Test 4: Increase parallelism

For this test, the team focused on increasing parallelism. To do so, they adjusted two settings on the Service Bus client used by the Workflow service:

Setting             Description                                                                   Default   New value
MaxConcurrentCalls  The maximum number of messages to process concurrently.                      1         20
PrefetchCount       How many messages the client will fetch ahead of time into its local cache.  0         3000
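The effect of MaxConcurrentCalls can be mimicked with a worker pool: it corresponds to the number of messages being processed at once. This is a rough stand-alone analogy in Python, not the actual .NET Service Bus SDK; PrefetchCount (a local read-ahead buffer) is not modeled here:

```python
import concurrent.futures
import time

MAX_CONCURRENT_CALLS = 20  # analogous to the Service Bus client setting
messages = list(range(200))

def process(msg):
    time.sleep(0.01)  # simulate the mostly I/O-bound backend calls
    return msg

start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT_CALLS) as pool:
    results = list(pool.map(process, messages))
elapsed = time.monotonic() - start

# Serially this would take 200 * 0.01 = 2 s; with 20 workers, roughly 0.1 s.
print(f"processed {len(results)} messages in {elapsed:.2f}s")
```

Because the per-message work is I/O wait rather than CPU, raising the concurrency multiplies throughput almost linearly, which is exactly what the team was counting on.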

For more information about these settings, see Best Practices for performance improvements using Service Bus Messaging. Running the test with these settings produced the following graph:


Recall that incoming messages are shown in light blue, and outgoing messages are shown in dark blue.

At first glance, this is a very weird graph. For a while, the outgoing message rate exactly tracks the incoming rate. But then, at about the 2:03 mark, the rate of incoming messages levels off, while the number of outgoing messages continues to rise, actually exceeding the total number of incoming messages. That seems impossible.

The clue to this mystery can be found in the Dependencies view in Application Insights. This chart summarizes all of the calls that the Workflow service made to Service Bus:


Notice the entry for DeadLetter. That call indicates that messages are going into the Service Bus dead-letter queue.

To understand what's happening, you need to understand Peek-Lock semantics in Service Bus. When a client uses Peek-Lock, Service Bus atomically retrieves and locks a message. While the lock is held, the message is guaranteed not to be delivered to other receivers. If the lock expires, the message becomes available to other receivers. After a maximum number of delivery attempts (which is configurable), Service Bus puts the message onto a dead-letter queue, where it can be examined later.
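These semantics can be captured in a toy model: a message whose processing repeatedly exceeds the lock duration is redelivered until it exhausts its delivery attempts and lands in the dead-letter queue. The max delivery count and durations below are illustrative; both are configurable in Service Bus:

```python
MAX_DELIVERY_COUNT = 3   # illustrative; configurable in Service Bus
LOCK_DURATION = 30       # seconds; illustrative

def deliver(message, processing_time):
    """Simulate Peek-Lock delivery attempts for one message."""
    for attempt in range(1, MAX_DELIVERY_COUNT + 1):
        if processing_time <= LOCK_DURATION:
            return ("completed", attempt)
        # Lock expired before the receiver finished: the message goes
        # back on the queue and its delivery count is incremented.
    return ("dead-lettered", MAX_DELIVERY_COUNT)

print(deliver("fast", processing_time=5))   # completes on the first attempt
print(deliver("slow", processing_time=45))  # dead-lettered after max attempts
```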

Remember that the Workflow service is prefetching large batches of messages (3000 messages at a time). That means the total time to process each message is longer, which results in messages timing out, going back onto the queue, and eventually going into the dead-letter queue.
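The numbers make this failure mode concrete. If prefetched messages are locked on fetch, the last of 3000 prefetched messages waits for the whole buffer to drain before it is processed. At the ~16/sec completion rate observed in Test 2 that is far longer than a 30-second lock duration (assumed here as the default; the actual value is configurable), though it would fit within the 5-minute lock used in Test 5:

```python
PREFETCH_COUNT = 3000
COMPLETION_RATE = 16.0  # messages/sec, observed peak in Test 2
LOCK_DURATION = 30.0    # seconds; assumed default, configurable

# Time for the last prefetched message to be reached, if the whole
# prefetch buffer drains at the completion rate.
drain_time = PREFETCH_COUNT / COMPLETION_RATE

print(f"drain time: {drain_time:.1f}s vs lock duration: {LOCK_DURATION}s")
```

The 187.5-second drain time dwarfs a 30-second lock, so locks expire and messages are redelivered, but it is under the 5-minute (300-second) lock duration that resolves the problem in the next test.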

You can also see this behavior in the exceptions, where numerous MessageLostLockException exceptions are recorded:

Screenshot of Application Insights exceptions showing many MessageLostLockException exceptions.

Test 5: Increase lock duration

For this load test, the message lock duration was set to 5 minutes, to prevent lock timeouts. The graph of incoming and outgoing messages now shows that the system is keeping up with the rate of incoming messages:


Over the total duration of the 8-minute load test, the application completed 25K operations, with a peak throughput of 72 operations/sec, representing a 400% increase in maximum throughput.

Graph of message throughput showing a 400% increase in maximum throughput.

However, running the same test with a longer duration showed that the application could not sustain this rate:


The container metrics show that maximum CPU utilization was close to 100%. At this point, the application appears to be CPU-bound. Scaling out the cluster might now improve performance, unlike the previous scale-out attempt.

Graph of AKS node utilization showing maximum CPU utilization close to 100%.

Test 6: Scale out the backend services (again)

For the final load test in the series, the team scaled out the Kubernetes cluster and pods as follows:

Setting                                       Value
Cluster nodes                                 12
Ingestion service                             3 replicas
Workflow service                              6 replicas
Package, Delivery, Drone Scheduler services   9 replicas each

This test resulted in a higher sustained throughput, with no significant lags in processing messages. Moreover, node CPU utilization stayed below 80%.



For this scenario, the following bottlenecks were identified:

  • Out-of-memory exceptions in Azure Cache for Redis.
  • Lack of parallelism in message processing.
  • Insufficient message lock duration, leading to lock timeouts and messages being placed in the dead-letter queue.
  • CPU exhaustion.

To diagnose these issues, the development team relied on the following metrics:

  • The rate of incoming and outgoing Service Bus messages.
  • The Application Map in Application Insights.
  • Errors and exceptions.
  • Custom Log Analytics queries.
  • CPU and memory utilization in Azure Monitor for containers.

Next steps

For more information about the design of this scenario, see Designing a microservices architecture.