Performance tuning a distributed application

In this series, we walk through several cloud application scenarios, showing how a development team used load tests and metrics to diagnose performance issues. These articles are based on actual load testing that we performed when developing example applications. The code for each scenario is available on GitHub.

Scenarios:

What is performance?

Performance is frequently measured in terms of throughput, response time, and availability. Performance targets should be based on business operations. Customer-facing tasks may have more stringent requirements than operational tasks such as generating reports.

Define a service level objective (SLO) that sets performance targets for each workload. You typically do this by breaking a performance target into a set of key performance indicators (KPIs), such as:

  • Latency or response time of specific requests
  • The number of requests performed per second
  • The rate at which the system generates exceptions

Performance targets should explicitly include a target load. Also, not all users will receive exactly the same level of performance, even when they access the system simultaneously and perform the same work. For that reason, an SLO should be framed in terms of percentiles.

An example SLO might be: "Client requests will have a response within 500 ms @ P90, at loads up to 25K requests/second."
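
As a rough illustration of how such an SLO could be checked against measured data, the sketch below computes a nearest-rank percentile over one window of latency samples and compares it with the 500 ms @ P90 target at loads up to 25K requests/second. The function names, the sample values, and the choice of the nearest-rank method are assumptions made for this example; a monitoring system may calculate percentiles differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_slo(latencies_ms, requests_per_second, *,
              threshold_ms=500, pct=90, max_load_rps=25_000):
    """Return True or False when the load is within the target, else None.

    None means the window was measured above the target load, so the
    SLO as stated does not apply to it.
    """
    if requests_per_second > max_load_rps:
        return None
    return percentile(latencies_ms, pct) <= threshold_ms

# Hypothetical latency samples (ms) from one measurement window.
window = [120, 180, 210, 250, 300, 320, 410, 450, 480, 1900]

print(percentile(window, 90))                         # 480
print(meets_slo(window, requests_per_second=18_000))  # True
```

Note that the single 1900 ms request does not violate this SLO, because the target is stated at P90 rather than as a maximum.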

Challenges of performance tuning a distributed system

It can be especially challenging to diagnose performance issues in a distributed application. Some of the challenges are:

  • A single business transaction or operation typically involves multiple components of the system. It can be hard to get a holistic end-to-end view of a single operation.

  • Resource consumption is distributed across multiple nodes. To get a consistent view, you need to aggregate logs and metrics in one place.

  • The cloud offers elastic scale. Autoscaling is an important technique for handling spikes in load, but it can also mask underlying issues. Also, it can be hard to know which components need to scale and when.

  • Cascading failures can cause failures upstream of the root problem. As a result, the first signal of the problem may appear in a different component than the root cause.
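
One way to picture the end-to-end view described above is to group aggregated log records by a correlation ID and order them in time. The sketch below is only an illustration: the record fields (`ts`, `node`, `correlation_id`, `level`, `msg`), the node names, and the sample data are all hypothetical, and it assumes the node clocks are synchronized well enough for timestamps to be compared.

```python
from collections import defaultdict

# Hypothetical log records, already aggregated from several nodes into one place.
records = [
    {"ts": 105, "node": "gateway", "correlation_id": "op-42", "level": "error", "msg": "upstream timeout"},
    {"ts": 100, "node": "gateway", "correlation_id": "op-42", "level": "info", "msg": "request received"},
    {"ts": 103, "node": "db-1", "correlation_id": "op-42", "level": "error", "msg": "connection pool exhausted"},
    {"ts": 101, "node": "orders", "correlation_id": "op-42", "level": "info", "msg": "calling database"},
    {"ts": 102, "node": "gateway", "correlation_id": "op-7", "level": "info", "msg": "request received"},
]

def trace_by_operation(records):
    """Group records by correlation ID and sort each group by timestamp."""
    traces = defaultdict(list)
    for rec in records:
        traces[rec["correlation_id"]].append(rec)
    for steps in traces.values():
        steps.sort(key=lambda r: r["ts"])
    return dict(traces)

def first_error(steps):
    """Return the earliest error record in a time-ordered trace, or None."""
    return next((s for s in steps if s.get("level") == "error"), None)

traces = trace_by_operation(records)
for step in traces["op-42"]:
    print(step["ts"], step["node"], step["msg"])

# The gateway reports a timeout, but the earliest error is on the database node.
print(first_error(traces["op-42"])["node"])  # db-1
```

In this made-up trace, the failure would be reported by the gateway, while the ordered view shows that the database node failed first.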

General best practices

Performance tuning is both an art and a science, but a systematic approach brings it closer to science. Here are some best practices:

  • Enable telemetry to collect metrics. Instrument your code. Follow best practices for monitoring. Use correlated tracing so that you can view all the steps in a transaction.

  • Monitor the 90th, 95th, and 99th percentiles, not just the average. The average can mask outliers. The sampling rate for metrics also matters: if it is too low, it can hide spikes or outliers that might indicate problems.

  • Attack one bottleneck at a time. Form a hypothesis and test it by changing one variable at a time. Removing one bottleneck often uncovers another bottleneck further upstream or downstream.

  • Errors and retries can have a large impact on performance. If you see that you are being throttled by backend services, scale out or try to optimize usage (for example, by tuning database queries).

  • Look for common performance anti-patterns.

  • Look for opportunities to parallelize. Two common sources of bottlenecks are message queues and databases. In both cases, sharding can help. For more information, see Horizontal, vertical, and functional data partitioning. Look for hot partitions that might indicate imbalanced read or write loads.
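
To make the hot-partition check in the last item more concrete, the following sketch counts requests per partition key and flags any partition that receives more than a chosen multiple of an even share. The function name, the `factor` threshold of 2.0, and the sample keys are assumptions for illustration only; a real system would read these counts from its own metrics.

```python
from collections import Counter

def hot_partitions(partition_keys, factor=2.0):
    """Return partitions whose request count exceeds `factor` times an even share.

    Only partitions that appear in the input are counted, so a partition
    with no traffic at all does not lower the even share.
    """
    counts = Counter(partition_keys)
    if not counts:
        return {}
    fair_share = sum(counts.values()) / len(counts)
    return {p: n for p, n in counts.items() if n > factor * fair_share}

# Hypothetical partition key observed for each of 100 requests.
requests = ["p0"] * 70 + ["p1"] * 10 + ["p2"] * 12 + ["p3"] * 8

print(hot_partitions(requests))  # {'p0': 70}
```

Here an even share is 25 requests per partition, so only `p0`, with 70 requests, exceeds twice that share.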

Next steps

Read the performance tuning scenarios