Management and monitoring

Plan platform management and monitoring

This section explores how to operationally maintain an Azure enterprise estate with centralized management and monitoring at the platform level. More specifically, it presents key recommendations that help central teams maintain operational visibility within a large-scale Azure platform.

Figure 1: Platform management and monitoring.

Design considerations:

  • Use an Azure Monitor Log Analytics workspace as an administrative boundary.

  • Application-centric platform monitoring, encompassing both hot and cold telemetry paths for metrics and logs, respectively:

    • Operating system metrics; for example, performance counters and custom metrics
    • Operating system logs; for example, Internet Information Services, Event Tracing for Windows, and syslogs
    • Resource health events
  • Security audit logging and achieving a horizontal security lens across your organization's entire Azure estate:

    • Potential integration with on-premises security information and event management (SIEM) systems such as ServiceNow, ArcSight, or the Onapsis security platform
    • Azure activity logs
    • Azure Active Directory (Azure AD) audit reports
    • Azure diagnostic services, logs, and metrics; Azure Key Vault audit events; network security group (NSG) flow logs; and event logs
    • Azure Monitor, Azure Network Watcher, Azure Security Center, and Azure Sentinel
  • Azure data retention thresholds and archiving requirements:

    • The default retention period for Azure Monitor Logs is 30 days, with a maximum of two years.
    • The default retention period for Azure AD reports (premium) is 30 days.
    • The default retention period for the Azure diagnostic service is 90 days.
  • Operational requirements:

    • Operational dashboards with native tools such as Azure Monitor Logs or third-party tooling
    • Controlling privileged activities with centralized roles
    • Managed identities for Azure resources for access to Azure services
    • Resource locks to protect resources from being edited or deleted
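
The retention thresholds listed above lend themselves to a small planning check. The following is a minimal sketch in Python, not an Azure API: the function name `plan_retention` and the dictionary keys are hypothetical, "two years" is assumed to mean 730 days, and the figures are simply the ones quoted in this section.

```python
# Retention figures quoted in the text above; confirm them against current
# Azure documentation before relying on them.
DEFAULT_RETENTION_DAYS = {
    "azure_monitor_logs": 30,
    "azure_ad_reports_premium": 30,
    "azure_diagnostics": 90,
}
# Assumption: "two years" is treated as 2 * 365 = 730 days.
MAX_MONITOR_LOGS_RETENTION_DAYS = 730


def plan_retention(required_days: int) -> dict:
    """Hypothetical helper: decide how a retention requirement maps onto
    the workspace and whether an export to Azure Storage is needed."""
    if required_days <= 0:
        raise ValueError("required_days must be positive")
    return {
        # The workspace cannot keep data longer than the stated maximum.
        "workspace_retention_days": min(
            required_days, MAX_MONITOR_LOGS_RETENTION_DAYS
        ),
        # True when the 30-day default would have to be raised.
        "exceeds_default": required_days
        > DEFAULT_RETENTION_DAYS["azure_monitor_logs"],
        # True when the requirement goes beyond what the workspace can hold.
        "export_to_storage": required_days > MAX_MONITOR_LOGS_RETENTION_DAYS,
    }
```

For a 90-day requirement the sketch reports that the default must be raised but no export is needed; for a ten-year requirement it caps workspace retention at 730 days and flags an export.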

Design recommendations:

  • Use a single Azure Monitor Logs workspace to manage platforms centrally, except where Azure role-based access control (Azure RBAC), data sovereignty requirements, or data retention policies mandate separate workspaces. Centralized logging is critical to the visibility required by operations management teams. Logging centralization drives reports about change management, service health, configuration, and most other aspects of IT operations. Converging on a centralized workspace model reduces administrative effort and the chances for gaps in observability.

    In the context of the enterprise-scale architecture, centralized logging is primarily concerned with platform operations. This emphasis doesn't prevent the use of the same workspace for VM-based application logging. With a workspace configured in resource-centric access control mode, granular Azure RBAC is enforced to ensure application teams only have access to the logs from their own resources. In this model, application teams benefit from the existing platform infrastructure because it reduces their management overhead. For any non-compute resources such as web apps or Azure Cosmos DB databases, application teams can use their own Log Analytics workspaces and configure diagnostics and metrics to be routed to those workspaces.

  • Export logs to Azure Storage if log retention requirements exceed two years. Use immutable storage with a write-once, read-many policy to make data non-erasable and non-modifiable for a user-specified interval.
  • Use Azure Policy for access control and compliance reporting. Azure Policy provides the ability to enforce organization-wide settings to ensure consistent policy adherence and fast violation detection. For more information, see Understand Azure Policy effects.
  • Monitor in-guest virtual machine (VM) configuration drift using Azure Policy. Enabling guest configuration audit capabilities through policy helps application team workloads to immediately consume feature capabilities with little effort.
  • Use Update Management in Azure Automation as a long-term patching mechanism for both Windows and Linux VMs. Enforcing Update Management configurations via Azure Policy ensures that all VMs are included in the patch management regimen and provides application teams with the ability to manage patch deployment for their VMs. It also provides visibility and enforcement capabilities to the central IT team across all VMs.
  • Use Network Watcher to proactively monitor traffic flows via Network Watcher NSG flow logs v2. Traffic Analytics analyzes NSG flow logs to gather deep insights about IP traffic within a virtual network and provides critical information for effective management and monitoring. Traffic Analytics provides information such as most communicating hosts and application protocols, most conversing host pairs, allowed or blocked traffic, inbound and outbound traffic, open internet ports, most blocking rules, traffic distribution per Azure datacenter, virtual network, subnets, or rogue networks.
  • Use resource locks to prevent accidental deletion of critical shared services.
  • Use deny policies to supplement Azure role assignments. Deny policies are used to prevent deploying and configuring resources that don't match defined standards by preventing the request from being sent to the resource provider. The combination of deny policies and Azure role assignments ensures the appropriate guardrails are in place to enforce who can deploy and configure resources and what resources they can deploy and configure.
  • Include service and resource health events as part of the overall platform monitoring solution. Tracking service and resource health from the platform perspective is an important component of resource management in Azure.
  • Don't send raw log entries back to on-premises monitoring systems. Instead, adopt a principle that data born in Azure stays in Azure. If on-premises SIEM integration is required, then send critical alerts instead of logs.
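
The recommendation above to pair deny policies with Azure role assignments can be pictured as two independent checks: one on who is asking, one on what is being asked for. The following is a toy model in Python, not the Azure Resource Manager or Azure Policy evaluation engine. The principals, the allowed-location rule, and the names `DeploymentRequest` and `evaluate` are all hypothetical; only the built-in role names Owner, Contributor, and Reader are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRequest:
    principal: str      # who is asking
    resource_type: str  # what they want to deploy
    location: str       # where they want to deploy it


# Hypothetical role assignments: they decide WHO may deploy at all.
ROLE_ASSIGNMENTS = {
    "app-team-a": {"Contributor"},
    "auditor": {"Reader"},
}
ROLES_THAT_CAN_DEPLOY = {"Owner", "Contributor"}

# Hypothetical deny rule: it decides WHAT may be deployed.
ALLOWED_LOCATIONS = {"westeurope", "northeurope"}


def evaluate(request: DeploymentRequest) -> str:
    """Return 'allowed' only if both guardrails accept the request."""
    roles = ROLE_ASSIGNMENTS.get(request.principal, set())
    if not roles & ROLES_THAT_CAN_DEPLOY:
        return "denied: no role assignment permits deployment"
    if request.location not in ALLOWED_LOCATIONS:
        return "denied: policy blocks this location"
    return "allowed"
```

A Contributor deploying to an allowed location passes both checks, the same Contributor deploying elsewhere is stopped by the policy rule, and a Reader is stopped by the role check regardless of location. Neither guardrail alone covers both cases.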

Plan for application management and monitoring

Expanding on the previous section, this section considers a federated model and explains how application teams can operationally maintain their workloads.

Design considerations:

  • Application monitoring can use dedicated Log Analytics workspaces.
  • For applications that are deployed to virtual machines, logs should be stored centrally to the dedicated Log Analytics workspace from a platform perspective. Application teams can access the logs subject to the Azure RBAC they have on their applications or virtual machines.
  • Application performance and health monitoring for both infrastructure as a service (IaaS) and platform as a service (PaaS) resources.
  • Data aggregation across all application components.
  • Health modeling and operationalization:
    • How to measure the health of the workload and its subsystems
    • A traffic-light model to represent health
    • How to respond to failures across application components
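
One way to picture the traffic-light model mentioned above is to score each subsystem and let the workload take the colour of its worst component. The sketch below is illustrative only: the error-rate thresholds (1% and 5%), the component names, and the "worst component wins" roll-up rule are assumptions, not guidance from this article.

```python
from enum import Enum
from typing import Mapping


class Health(Enum):
    GREEN = 0
    AMBER = 1
    RED = 2


def component_health(
    error_rate: float, amber_at: float = 0.01, red_at: float = 0.05
) -> Health:
    """Map a component's error rate to a colour (thresholds are assumed)."""
    if error_rate >= red_at:
        return Health.RED
    if error_rate >= amber_at:
        return Health.AMBER
    return Health.GREEN


def workload_health(components: Mapping[str, Health]) -> Health:
    """Roll up: the workload is as unhealthy as its worst component."""
    return max(
        components.values(), key=lambda h: h.value, default=Health.GREEN
    )


# Hypothetical subsystems of one workload.
status = {
    "web": component_health(0.002),
    "api": component_health(0.03),
    "database": component_health(0.0),
}
overall = workload_health(status)
```

Other roll-up rules are possible, for example weighting components by criticality; the worst-component rule is used here only because it is the simplest to reason about.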

Design recommendations:

  • Use a centralized Azure Monitor Log Analytics workspace to collect logs and metrics from IaaS and PaaS application resources and control log access with Azure RBAC.
  • Use Azure Monitor metrics for time-sensitive analysis. Metrics in Azure Monitor are stored in a time-series database optimized to analyze time-stamped data. These metrics are well suited for alerts and detecting issues quickly. They can also tell you how your system is performing. They typically need to be combined with logs to identify the root cause of issues.
  • Use Azure Monitor Logs for insights and reporting. Logs contain different types of data organized into records with different sets of properties. They're useful for analyzing complex data from a range of sources, such as performance data, events, and traces.
  • When necessary, use shared storage accounts within the landing zone for Azure diagnostic extension log storage.
  • Use Azure Monitor alerts for the generation of operational alerts. Azure Monitor alerts unify alerts for metrics and logs and use features such as action groups and smart groups for advanced management and remediation purposes.