Pillars of software quality

A successful cloud application will focus on these five pillars of software quality: scalability, availability, resiliency, management, and security.

Pillar         Description
Scalability    The ability of a system to handle increased load.
Availability   The proportion of time that a system is functional and working.
Resiliency     The ability of a system to recover from failures and continue to function.
Management     Operations processes that keep a system running in production.
Security       Protecting applications and data from threats.

Scalability

Scalability is the ability of a system to handle increased load. There are two main ways that an application can scale. Vertical scaling (scaling up) means increasing the capacity of a resource, for example by using a larger VM size. Horizontal scaling (scaling out) means adding new instances of a resource, such as VMs or database replicas.

Horizontal scaling has significant advantages over vertical scaling:

  • True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching scales that are not possible on a single node.
  • Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter periods.
  • Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
  • Scaling out may be cheaper than scaling up. Running several small VMs can cost less than running a single large VM.
  • Horizontal scaling can also improve resiliency by adding redundancy. If an instance goes down, the application keeps running.
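
As a rough illustration of load-triggered scale-out, the sketch below picks an instance count that would bring average CPU back toward a target. The function name, the 60% target, and the instance bounds are assumptions made for this example; they are not settings of any Azure autoscale feature.

```python
import math


def desired_instances(current: int, avg_cpu_percent: float,
                      target_cpu_percent: float = 60.0,
                      min_instances: int = 2, max_instances: int = 20) -> int:
    """Return an instance count that would bring average CPU near the target.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    # Total load is roughly instances * utilization; divide it by the target
    # utilization to estimate how many instances are needed.
    needed = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    # Keep a floor for redundancy and a ceiling for cost control.
    return max(min_instances, min(max_instances, needed))
```

With four instances at 90% CPU this suggests scaling out to six; at 15% CPU it suggests scaling in, but never below the two-instance floor that preserves redundancy.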

An advantage of vertical scaling is that you can do it without making any changes to the application. But at some point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be horizontal.

Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a load balancer. But each VM in the pool must be able to handle any client request, so the application must be stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling and autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
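
The requirement that any VM can handle any request can be sketched as follows. This is a minimal illustration, not a production design: `InMemoryStore` is a hypothetical stand-in for a distributed cache, and `CartHandler` is an invented request handler that keeps no per-user state of its own.

```python
from typing import Dict, Protocol


class StateStore(Protocol):
    """Interface that an external state store is assumed to provide."""

    def get(self, key: str) -> int: ...

    def put(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Stand-in for a distributed cache; a real one would be a networked service."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        return self._data.get(key, 0)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class CartHandler:
    """Holds no per-user state itself, so any instance can serve any request."""

    def __init__(self, store: StateStore) -> None:
        self._store = store

    def add_item(self, user_id: str) -> int:
        # Read-modify-write against the shared store (not atomic in this sketch).
        count = self._store.get(user_id) + 1
        self._store.put(user_id, count)
        return count


shared = InMemoryStore()
vm_a, vm_b = CartHandler(shared), CartHandler(shared)
vm_a.add_item("alice")
total = vm_b.add_item("alice")  # a different instance continues the same session
```

Because both handlers read and write the same store, the load balancer is free to route each request to either instance.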

Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger lock contention in the database. You would then need to consider additional measures, such as optimistic concurrency or data partitioning, to enable more throughput to the database.
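
Optimistic concurrency, mentioned above, avoids holding locks by checking a version number at write time and retrying on conflict. The sketch below shows the idea with an in-process table; `Table`, `Row`, and `ConflictError` are hypothetical names for this example, and a real database would perform the version check and the update as one atomic operation.

```python
from dataclasses import dataclass
from typing import Dict


class ConflictError(Exception):
    """Raised when a row changed after it was read."""


@dataclass
class Row:
    value: int
    version: int = 0


class Table:
    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def read(self, key: str) -> Row:
        row = self._rows.setdefault(key, Row(0))
        return Row(row.value, row.version)  # return a snapshot, not the live row

    def write(self, key: str, new_value: int, expected_version: int) -> None:
        row = self._rows.setdefault(key, Row(0))
        if row.version != expected_version:
            raise ConflictError(f"{key} was modified by another writer")
        row.value = new_value
        row.version += 1


def increment(table: Table, key: str, retries: int = 3) -> int:
    """Read, compute, and write back; on conflict, re-read and try again."""
    for _ in range(retries):
        snapshot = table.read(key)
        try:
            table.write(key, snapshot.value + 1, snapshot.version)
            return snapshot.value + 1
        except ConflictError:
            continue
    raise ConflictError(f"gave up on {key} after {retries} attempts")
```

No writer ever blocks another; a writer that loses the race simply re-reads and retries, which tends to work well when conflicts are rare.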

Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system, such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally. Resolving one bottleneck may reveal other bottlenecks elsewhere.

Use the Scalability checklist to review your design from a scalability standpoint.

Scalability guidance

Availability

Availability is the proportion of time that the system is functional and working. It is usually measured as a percentage of uptime. Application errors, infrastructure problems, and system load can all reduce availability.

A cloud application should have a service level objective (SLO) that clearly defines the expected availability, and how the availability is measured. When defining availability, look at the critical path. The web front end might be able to service client requests, but if every transaction fails because it can't connect to the database, the application is not available to users.

Availability is often described in terms of "9s". For example, "four 9s" means 99.99% uptime. The following table shows the potential cumulative downtime at different availability levels.

% Uptime    Downtime per week    Downtime per month    Downtime per year
99%         1.68 hours           7.2 hours             3.65 days
99.9%       10 minutes           43.2 minutes          8.76 hours
99.95%      5 minutes            21.6 minutes          4.38 hours
99.99%      1 minute             4.32 minutes          52.56 minutes
99.999%     6 seconds            26 seconds            5.26 minutes

Notice that 99% uptime could translate to an almost 2-hour service outage per week. For many applications, especially consumer-facing applications, that is not an acceptable SLO. On the other hand, five 9s (99.999%) means only about five minutes of downtime in a year. It's challenging enough just detecting an outage that quickly, let alone resolving the issue. To get very high availability (99.99% or higher), you can't rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing, which is where resiliency becomes crucial.
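
The downtime figures in the table above follow from simple arithmetic: allowed downtime is the length of the period multiplied by the unavailable fraction. The sketch below reproduces them, assuming a 30-day month and a 365-day year (the values that match the table); the function and constant names are chosen for this example.

```python
WEEK_SECONDS = 7 * 24 * 3600
MONTH_SECONDS = 30 * 24 * 3600   # assumes a 30-day month
YEAR_SECONDS = 365 * 24 * 3600   # assumes a 365-day year


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime budget for a period at a given availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (100 - availability_percent) / 100


weekly_hours_at_99 = downtime_seconds(99, WEEK_SECONDS) / 3600           # about 1.68
yearly_minutes_at_five_9s = downtime_seconds(99.999, YEAR_SECONDS) / 60  # about 5.26
```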

In Azure, the Service Level Agreement (SLA) describes Microsoft's commitments for uptime and connectivity. If the SLA for a particular service is 99.95%, it means you should expect the service to be available 99.95% of the time.

Applications often depend on multiple services. In general, the probability of each service having downtime is independent of the others. For example, suppose your application depends on two services, each with a 99.9% SLA, and fails if either one is down. The composite SLA for both services is 99.9% × 99.9% ≈ 99.8%, or slightly less than each service by itself.
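
The composite calculation generalizes to any number of services. The sketch below multiplies the individual availabilities, under the same assumptions as the text: failures are independent, and the application needs every service to be up. The function name is chosen for this example.

```python
from math import prod


def composite_sla(*slas_percent: float) -> float:
    """Composite availability (in percent) of services that must all be up.

    Assumes independent failures, as in the two-service example.
    """
    return 100 * prod(s / 100 for s in slas_percent)


two_services = composite_sla(99.9, 99.9)           # about 99.8
three_services = composite_sla(99.9, 99.9, 99.95)  # lower again with each dependency
```

Each additional dependency lowers the composite figure, which is one reason to look at the whole critical path when setting an SLO.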

Availability guidance

Resiliency

Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Resiliency is closely related to availability.

In traditional application development, there has been a focus on increasing mean time between failures (MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is required, due to several factors:

  • Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
  • Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware failures must be expected.
  • Applications often depend on external services, which may become temporarily unavailable or throttle high-volume users.
  • Today's users expect an application to be available 24/7 without ever going offline.

All of these factors mean that cloud applications must be designed to expect occasional failures and recover from them. Azure has many resiliency features already built into the platform. For example:

  • Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and across regions.
  • Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware failures.
  • VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a common power source and network switch. Spreading VMs across fault domains limits the impact of physical hardware failures, network outages, or power interruptions.

That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of the architecture. Some mitigations are more tactical in nature, for example, retrying a remote call after a transient network failure. Other mitigations are more strategic, such as failing over the entire application to a secondary region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience a disruption, transient problems such as network congestion are more common, so target these first. Having the right monitoring and diagnostics is also important, both to detect failures when they happen, and to find the root causes.
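
A tactical mitigation such as retrying a remote call can be sketched as below. This is an illustration only: `TransientError`, `call_with_retry`, and the default attempt count and delay are assumptions for the example, and the technique shown (exponential backoff with random jitter) is one common retry strategy rather than something the text prescribes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation: Callable[[], T], max_attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Only errors known to be transient are retried; anything else propagates immediately, so a permanent failure is not hidden behind repeated attempts. The `sleep` parameter exists so the waiting can be replaced in tests.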

When designing an application to be resilient, you must understand your availability requirements. How much downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business? How much should you invest in making the application highly available?

Resiliency guidance

Management and DevOps

This pillar covers the operations processes that keep an application running in production.

Deployments must be reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast and routine process, so they don't slow down the release of new features or bug fixes. Equally important, you must be able to quickly roll back or roll forward if an update has problems.

Monitoring and diagnostics are crucial. Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log in to VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated VM to log in to. Monitoring and diagnostics give insight into the system, so that you know when and where failures occur. All systems must be observable. Use a common and consistent logging schema that lets you correlate events across systems.
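
One way to picture a common logging schema is a fixed set of fields that every service emits, including an identifier shared by all events belonging to one request. The sketch below is a hypothetical schema for illustration; the field names (`timestamp`, `level`, `service`, `correlation_id`, `message`) are assumptions, not a format defined by Azure.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def make_event(service: str, message: str, correlation_id: str,
               level: str = "INFO", **fields: Any) -> str:
    """Serialize one log event using the shared (hypothetical) schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
        **fields,  # optional extra fields, such as a duration
    }
    return json.dumps(record, sort_keys=True)


def correlate(lines: Iterable[str], correlation_id: str) -> List[Dict[str, Any]]:
    """Return the events for one request, across services, in time order."""
    events = [json.loads(line) for line in lines]
    matching = [e for e in events if e["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda e: e["timestamp"])
```

Because every service writes the same fields, events collected from the web tier and the data tier can be merged and filtered by `correlation_id` without per-service parsing rules.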

The monitoring and diagnostics process has several distinct phases:

  • Instrumentation. Generating the raw data, from application logs, web server logs, diagnostics built into the Azure platform, and other sources.
  • Collection and storage. Consolidating the data into one place.
  • Analysis and diagnosis. To troubleshoot issues and see the overall health.
  • Visualization and alerts. Using telemetry data to spot trends or alert the operations team.

Use the DevOps checklist to review your design from a management and DevOps standpoint.

Management and DevOps guidance

Security

You must think about security throughout the entire lifecycle of an application, from design and implementation to deployment and operations. The Azure platform provides protections against a variety of threats, such as network intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps processes.

Here are some broad security areas to consider.

Identity management

Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully managed identity and access management service. You can use it to create domains that exist purely on Azure, or integrate with your on-premises Active Directory identities. Azure AD also integrates with Office 365, Dynamics CRM Online, and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a new user account that is managed by Azure AD.

If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches are possible, depending on your requirements. For more information, see our Identity Management reference architectures.

Protecting your infrastructure

Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an Azure AD tenant. Use role-based access control (RBAC) to grant users within your organization the correct permissions to Azure resources. Grant access by assigning an RBAC role to users or groups at a certain scope. The scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.
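
The idea of assigning a role at a scope can be illustrated with a toy model in which a role granted at a subscription or resource group also applies to everything beneath it. The sketch below is not the Azure API: `RoleAssignment`, `covers`, and `roles_for` are hypothetical names, and the resource IDs are made-up examples that only follow the general `/subscriptions/.../resourceGroups/...` path shape.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # user or group name
    role: str       # for example "Reader" or "Contributor"
    scope: str      # subscription, resource group, or single resource path


def covers(assignment_scope: str, resource_id: str) -> bool:
    """True if the resource is the scope itself or sits underneath it."""
    scope = assignment_scope.rstrip("/").lower()
    resource = resource_id.rstrip("/").lower()
    # Require a path separator so that "rg1" does not match "rg10".
    return resource == scope or resource.startswith(scope + "/")


def roles_for(principal: str, resource_id: str,
              assignments: Iterable[RoleAssignment]) -> Set[str]:
    """Collect every role the principal holds on the given resource."""
    return {a.role for a in assignments
            if a.principal == principal and covers(a.scope, resource_id)}


assignments = [
    RoleAssignment("alice", "Reader", "/subscriptions/sub1"),
    RoleAssignment("alice", "Contributor", "/subscriptions/sub1/resourceGroups/rg1"),
    RoleAssignment("bob", "Owner", "/subscriptions/sub1/resourceGroups/rg2"),
]
storage = ("/subscriptions/sub1/resourceGroups/rg1"
           "/providers/Microsoft.Storage/storageAccounts/acct")
alice_roles = roles_for("alice", storage, assignments)
```

Here `alice` ends up with both roles on the storage account: "Reader" from the subscription scope and "Contributor" from the resource group scope. Granting at the narrowest scope that works keeps the effective permissions small.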

Application security

In general, the security best practices for application development still apply in the cloud. These include things like using SSL/TLS everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.

Cloud applications often use managed services that have access keys. Never check these into source control. Consider storing application secrets in Azure Key Vault.
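
Whatever secret store is used, the application code should look the secret up at run time rather than embed it. The sketch below reads a key from an environment variable, which here stands in for a secret store such as Key Vault; `load_secret` and the variable name `STORAGE_ACCESS_KEY` are hypothetical names for this example.

```python
import os


def load_secret(name: str) -> str:
    """Fetch a secret supplied by the deployment environment at run time.

    The environment variable is an assumed stand-in for a real secret store.
    """
    value = os.environ.get(name)
    if not value:
        # Fail fast instead of falling back to a hard-coded default.
        raise RuntimeError(f"secret {name!r} is not configured")
    return value
```

A call such as `load_secret("STORAGE_ACCESS_KEY")` keeps the key itself out of the repository; only the name of the setting appears in code.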

Data sovereignty and encryption

Make sure that your data remains in the correct geopolitical zone when using Azure's high-availability features. Azure's geo-replicated storage uses the concept of a paired region in the same geopolitical region.

Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services support data encryption at rest, including Azure Storage, Azure SQL Database, Azure SQL Data Warehouse, and Cosmos DB.

Security resources