Cloud monitoring guide: Collect the right data

This article describes some considerations for collecting monitoring data in a cloud application.

To observe the health and availability of your cloud solution, you must configure the monitoring tools to collect a set of signals that is based on predictable failure states. These signals are the symptoms of a failure, not its cause. The monitoring tools use metrics and, for advanced diagnostics and root cause analysis, logs.

Plan for monitoring and migration carefully. Start by including the monitoring service owner, the manager of operations, and other related personnel during the planning phase, and continue engaging them throughout the development and release cycle. Their focus will be to develop a monitoring configuration that's based on the following criteria:

  • What is the service composed of? Are its dependencies monitored today? If so, are multiple tools involved? Is there an opportunity to consolidate without introducing risk?
  • What is the SLA of the service, and how will I measure and report it?
  • What should the service dashboard look like when an incident is raised? What should the dashboard look like for the service owner, and for the team that supports the service?
  • What metrics does the resource produce that I need to monitor?
  • How will the service owner, support teams, and other personnel search the logs?
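
The SLA question above can be made concrete with a small calculation. The following is a minimal sketch, assuming availability is measured as the share of successful synthetic probes in a reporting window. The `ProbeResult` record, the function names, and the sample targets are hypothetical and are not part of Azure Monitor or any other monitoring product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    """One synthetic availability check (hypothetical record shape)."""
    timestamp: int  # seconds since the start of the reporting window
    success: bool


def availability_percent(results: Sequence[ProbeResult]) -> float:
    """Return the share of successful probes as a percentage."""
    if not results:
        raise ValueError("no probe results in the reporting window")
    successes = sum(1 for r in results if r.success)
    return 100.0 * successes / len(results)


def meets_sla(results: Sequence[ProbeResult], target_percent: float) -> bool:
    """Compare measured availability with the SLA target."""
    return availability_percent(results) >= target_percent


# Illustrative data: 1,000 probes, one per minute, two of which failed.
probes = [ProbeResult(timestamp=i * 60, success=i not in (10, 500))
          for i in range(1000)]
print(availability_percent(probes))
print(meets_sla(probes, target_percent=99.9))
```

Counting probes is only one way to measure an SLA. Request-based or time-based definitions are equally valid, so agree on the definition with the service owner before you build reports on it.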

How you answer those questions, and the criteria for alerting, determine how you'll use the monitoring platform. If you're migrating from an existing monitoring platform or set of monitoring tools, use the migration as an opportunity to reevaluate the signals you collect. This is especially important because there are several cost factors to consider when you migrate to or integrate with a cloud-based monitoring platform like Azure Monitor. Remember, monitoring data needs to be actionable. You need to collect optimized data that gives you a "10,000-foot view" of the overall health of the service. The instrumentation that's defined to identify real incidents should be as simple, predictable, and reliable as possible.

Develop a monitoring configuration

The monitoring service owner and team typically follow a common set of activities to develop a monitoring configuration. These activities start at the initial planning stages, continue through testing and validating in a nonproduction environment, and extend to deploying into production. Monitoring configurations are derived from known failure modes, test results of simulated failures, and the experience of several people in the organization (the service desk, operations, engineers, and developers). Such configurations assume that the service already exists, that it's being migrated to the cloud, and that it hasn't been rearchitected.
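
One way to record where each signal in a configuration comes from is to keep the failure mode and its source next to the rule itself. The following is a minimal sketch under that assumption. The `SignalRule` type, the metric names, and the thresholds are hypothetical examples and do not come from any real platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SignalRule:
    """A symptom to watch for, tied to the failure it indicates."""
    failure_mode: str  # the predictable failure state
    metric: str        # hypothetical metric name
    threshold: float   # the rule is breached when the value exceeds this
    source: str        # how the team learned about this failure mode


# Hypothetical rules, one per source named in the text above.
RULES = [
    SignalRule("database connection pool exhausted",
               "db_active_connections", 95.0, "known failure mode"),
    SignalRule("web tier overloaded",
               "http_p95_latency_ms", 800.0, "simulated failure test"),
    SignalRule("disk full on worker",
               "disk_used_percent", 90.0, "service desk experience"),
]


def breached_rules(sample: Dict[str, float],
                   rules: List[SignalRule] = RULES) -> List[SignalRule]:
    """Return the rules whose metric is present in the sample and over threshold."""
    return [r for r in rules
            if r.metric in sample and sample[r.metric] > r.threshold]


# One metrics sample taken during a nonproduction test run (illustrative values).
sample = {"db_active_connections": 97.0, "http_p95_latency_ms": 300.0}
for rule in breached_rules(sample):
    print(f"{rule.failure_mode} (from {rule.source})")
```

Keeping the source with each rule makes it easier to review the configuration later and to retire signals that no longer correspond to a real failure mode.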

For service-level quality results, monitor the health and availability of these services early in the development process. If you treat monitoring as an afterthought in the design of the service or application, your results won't be as successful.

To drive quicker resolution of an incident, consider the following recommendations:

  • Define a dashboard for each service component.
  • Use metrics to help guide further diagnosis and to identify a resolution or workaround for the issue if a root cause can't be uncovered.
  • Use dashboard drill-down capabilities, or support customizing the view to refine it.
  • If you need verbose logs, metrics should have helped target the search criteria. If the metrics didn't help, improve them for the next incident.
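
The last recommendation, using metrics to target a log search, can be sketched as follows. This is an illustrative example only: it assumes metric points and log entries are available as `(timestamp, value)` pairs, and the function names, threshold, and padding are hypothetical.

```python
from typing import List, Optional, Tuple


def breach_window(points: List[Tuple[int, float]],
                  threshold: float,
                  padding: int = 60) -> Optional[Tuple[int, int]]:
    """Return a padded (start, end) window covering every point above the threshold."""
    breached = [t for t, value in points if value > threshold]
    if not breached:
        return None
    return (min(breached) - padding, max(breached) + padding)


def logs_in_window(logs: List[Tuple[int, str]],
                   window: Tuple[int, int]) -> List[str]:
    """Keep only the log lines whose timestamp falls inside the window."""
    start, end = window
    return [line for t, line in logs if start <= t <= end]


# Latency in milliseconds, sampled once a minute (illustrative values).
latency = [(0, 120.0), (60, 130.0), (120, 950.0), (180, 990.0), (240, 140.0)]
logs = [
    (30, "startup complete"),
    (110, "pool warning: 48/50 connections"),
    (125, "timeout contacting database"),
    (300, "health check ok"),
]

window = breach_window(latency, threshold=800.0)
if window is not None:
    for line in logs_in_window(logs, window):
        print(line)
```

If the window returned by the metric doesn't contain the relevant log lines, that's a sign the metric or its threshold needs to be improved before the next incident.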

Embracing this guiding set of principles can help give you near-real-time insights, as well as better management of your service.

Next steps