Management and monitoring

Plan platform management and monitoring

This section explores how to operationally maintain an Azure enterprise estate with centralized management and monitoring at a platform level. More specifically, it presents key recommendations for central teams to maintain operational visibility within a large-scale Azure platform.

Figure 1: Platform management and monitoring.

Design considerations:

  • Use an Azure Monitor Log Analytics workspace as an administrative boundary.

  • Application-centric platform monitoring, encompassing both hot and cold telemetry paths for metrics and logs, respectively:

    • Operating system metrics; for example, performance counters and custom metrics
    • Operating system logs; for example, Internet Information Services, Event Tracing for Windows, and syslogs
    • Resource health events
  • Security audit logging and achieving a horizontal security lens across your organization's entire Azure estate:

    • Potential integration with on-premises security information and event management (SIEM) systems such as ServiceNow, ArcSight, or the Onapsis security platform
    • Azure activity logs
    • Azure Active Directory (Azure AD) audit reports
    • Azure diagnostic services, logs, and metrics; Azure Key Vault audit events; network security group (NSG) flow logs; and event logs
    • Azure Monitor, Azure Network Watcher, Azure Security Center, and Azure Sentinel
  • Azure data retention thresholds and archiving requirements:

    • The default retention period for Azure Monitor Logs is 30 days, with a maximum of two years.
    • The default retention period for Azure AD reports (premium) is 30 days.
    • The default retention period for the Azure diagnostic service is 90 days.
  • Operational requirements:

    • Operational dashboards with native tools such as Azure Monitor Logs or third-party tooling
    • Controlling privileged activities with centralized roles
    • Managed identities for Azure resources for access to Azure services
    • Resource locks to protect resources from being edited or deleted
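
The retention thresholds above lend themselves to a small planning check. The following is a minimal sketch, not an Azure API: `RetentionPlan` and `plan_retention` are hypothetical names, and "two years" is approximated as 730 days. It only encodes the limits quoted in the list to decide whether a requirement fits inside an Azure Monitor Logs workspace or needs archiving elsewhere.

```python
from dataclasses import dataclass

# Limits quoted in the considerations above, expressed in days.
AZURE_MONITOR_LOGS_DEFAULT_DAYS = 30
AZURE_MONITOR_LOGS_MAX_DAYS = 2 * 365  # assumption: "two years" taken as 730 days


@dataclass(frozen=True)
class RetentionPlan:
    """Hypothetical result type: workspace retention plus an archive flag."""
    workspace_retention_days: int
    needs_archive: bool


def plan_retention(required_days: int) -> RetentionPlan:
    """Cap workspace retention at the maximum and flag anything beyond it."""
    if required_days <= 0:
        raise ValueError("required_days must be positive")
    if required_days > AZURE_MONITOR_LOGS_MAX_DAYS:
        # The requirement exceeds what the workspace can hold; archive the rest.
        return RetentionPlan(AZURE_MONITOR_LOGS_MAX_DAYS, needs_archive=True)
    return RetentionPlan(required_days, needs_archive=False)
```

For example, `plan_retention(3650)` caps workspace retention at 730 days and sets `needs_archive`, while `plan_retention(90)` stays entirely within the workspace.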

Design recommendations:

  • Use a single Azure Monitor Logs workspace to manage platforms centrally, except where Azure role-based access control (Azure RBAC), data sovereignty requirements, or data retention policies mandate separate workspaces. Centralized logging is critical to the visibility required by operations management teams. Logging centralization drives reports about change management, service health, configuration, and most other aspects of IT operations. Converging on a centralized workspace model reduces administrative effort and the chances for gaps in observability.

    In the context of the enterprise-scale architecture, centralized logging is primarily concerned with platform operations. This emphasis doesn't prevent the use of the same workspace for VM-based application logging. With a workspace configured in resource-centric access control mode, granular Azure RBAC is enforced to ensure application teams only have access to the logs from their resources. In this model, application teams benefit from the existing platform infrastructure because it reduces their management overhead. For any non-compute resources such as web apps or Azure Cosmos DB databases, application teams can use their own Log Analytics workspaces and configure diagnostics and metrics to be routed there.

  • Export logs to Azure Storage if log retention requirements exceed two years. Use immutable storage with a write-once, read-many policy to make data non-erasable and non-modifiable for a user-specified interval.
  • Use Azure Policy for access control and compliance reporting. Azure Policy provides the ability to enforce organization-wide settings to ensure consistent policy adherence and fast violation detection. For more information, see Understand Azure Policy effects.
  • Monitor in-guest virtual machine (VM) configuration drift using Azure Policy. Enabling guest configuration audit capabilities through policy helps application team workloads to immediately consume feature capabilities with little effort.
  • Use Update Management in Azure Automation as a long-term patching mechanism for both Windows and Linux VMs. Enforcing Update Management configurations via Azure Policy ensures that all VMs are included in the patch management regimen and provides application teams with the ability to manage patch deployment for their VMs. It also provides visibility and enforcement capabilities to the central IT team across all VMs.
  • Use Network Watcher to proactively monitor traffic flows via Network Watcher NSG flow logs v2. Traffic Analytics analyzes NSG flow logs to gather deep insights about IP traffic within a virtual network and provides critical information for effective management and monitoring. Traffic Analytics provides information such as most communicating hosts and application protocols, most conversing host pairs, allowed or blocked traffic, inbound and outbound traffic, open internet ports, most blocking rules, traffic distribution per Azure datacenter, virtual network, subnets, or rogue networks.
  • Use resource locks to prevent accidental deletion of critical shared services.
  • Use deny policies to supplement Azure role assignments. Deny policies are used to prevent deploying and configuring resources that don't match defined standards by preventing the request from being sent to the resource provider. The combination of deny policies and Azure role assignments ensures the appropriate guardrails are in place to enforce who can deploy and configure resources and what resources they can deploy and configure.
  • Include service and resource health events as part of the overall platform monitoring solution. Tracking service and resource health from the platform perspective is an important component of resource management in Azure.
  • Don't send raw log entries back to on-premises monitoring systems. Instead, adopt a principle that data born in Azure stays in Azure. If on-premises SIEM integration is required, then send critical alerts instead of logs.
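
The interplay between deny policies and role assignments described above can be sketched as follows. This is an illustrative assumption, not the Azure Policy engine: the dictionary follows the general shape of an Azure Policy rule (an `if` condition with a `deny` effect) for a location restriction, while `would_deny` and `is_request_allowed` are hypothetical local helpers that only mimic the outcome. The parameter name `allowedLocations` is also chosen for illustration.

```python
from typing import Mapping, Sequence

# Shape of a deny rule: block resources whose location is not in an allowed list.
deny_policy = {
    "mode": "Indexed",
    "parameters": {"allowedLocations": {"type": "Array"}},
    "policyRule": {
        "if": {"not": {"field": "location", "in": "[parameters('allowedLocations')]"}},
        "then": {"effect": "deny"},
    },
}


def would_deny(resource: Mapping[str, str], allowed_locations: Sequence[str]) -> bool:
    """Local stand-in for the rule above: deny when the location is not allowed."""
    return resource.get("location") not in allowed_locations


def is_request_allowed(
    has_write_role: bool,
    resource: Mapping[str, str],
    allowed_locations: Sequence[str],
) -> bool:
    """Role assignment decides who may deploy; the deny rule decides what."""
    return has_write_role and not would_deny(resource, allowed_locations)
```

In this sketch a request succeeds only when both guardrails agree: a user with a write role is still blocked from a disallowed location, and a compliant resource is still blocked for a user without the role.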

Plan for application management and monitoring

To expand on the previous section, this section considers a federated model and explains how application teams can operationally maintain these workloads.

Design considerations:

  • Application monitoring can use dedicated Log Analytics workspaces.
  • For applications that are deployed to virtual machines, logs should be stored centrally in the dedicated Log Analytics workspace from a platform perspective. Application teams can access the logs subject to the Azure RBAC they have on their applications or virtual machines.
  • Application performance and health monitoring for both infrastructure as a service (IaaS) and platform as a service (PaaS) resources.
  • Data aggregation across all application components.
  • Health modeling and operationalization:
    • How to measure the health of the workload and its subsystems
    • A traffic-light model to represent health
    • How to respond to failures across application components
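
A traffic-light health model like the one mentioned above can be sketched in a few lines. The example assumes a "worst subsystem wins" rollup, which is one common choice rather than something the guidance prescribes; `Health` and `workload_health` are hypothetical names.

```python
from enum import IntEnum
from typing import Mapping


class Health(IntEnum):
    """Traffic-light states, ordered so that a higher value means worse health."""
    GREEN = 0
    AMBER = 1
    RED = 2


def workload_health(subsystems: Mapping[str, Health]) -> Health:
    """Roll subsystem states up to one workload state (assumed rule: worst wins)."""
    # An empty workload is reported as GREEN; adjust if "unknown" should be worse.
    return max(subsystems.values(), default=Health.GREEN)
```

With this rule, `workload_health({"web": Health.GREEN, "db": Health.AMBER})` reports `Health.AMBER`, so a single degraded component is visible at the workload level.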

Design recommendations:

  • Use a centralized Azure Monitor Log Analytics workspace to collect logs and metrics from IaaS and PaaS application resources and control log access with Azure RBAC.
  • Use Azure Monitor metrics for time-sensitive analysis. Metrics in Azure Monitor are stored in a time-series database optimized to analyze time-stamped data. These metrics are well suited for alerts and detecting issues quickly. They can also tell you how your system is performing. They typically need to be combined with logs to identify the root cause of issues.
  • Use Azure Monitor Logs for insights and reporting. Logs contain different types of data that's organized into records with different sets of properties. They're useful for analyzing complex data from a range of sources, such as performance data, events, and traces.
  • When necessary, use shared storage accounts within the landing zone for Azure diagnostic extension log storage.
  • Use Azure Monitor alerts for the generation of operational alerts. Azure Monitor alerts unify alerts for metrics and logs and use features such as action groups and smart groups for advanced management and remediation purposes.
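
To illustrate why time-series metrics suit fast detection, here is a minimal sketch of a static-threshold check over a recent window of samples. It does not call Azure Monitor; `should_alert`, its parameters, and the averaging rule are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Fire when the average of the last `window` samples exceeds `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = samples[-window:]
    if len(recent) < window:
        # Not enough data points yet to evaluate a full window.
        return False
    return mean(recent) > threshold
```

For instance, CPU samples of `[10, 20, 95, 97, 99]` with a threshold of 90 over a window of 3 would trigger, because the last three samples average 97. Finding out why the spike happened would still require the corresponding logs.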