Pillars of software quality

A successful cloud application focuses on these five pillars of software quality: scalability, availability, resiliency, management, and security.

Pillar        | Description
Scalability   | The ability of a system to handle increased load.
Availability  | The proportion of time that a system is functional and working.
Resiliency    | The ability of a system to recover from failures and continue to function.
Management    | Operations processes that keep a system running in production.
Security      | Protecting applications and data from threats.

Scalability

Scalability is the ability of a system to handle increased load. There are two main ways that an application can scale. Vertical scaling (scaling up) means increasing the capacity of a resource, for example by using a larger VM size. Horizontal scaling (scaling out) means adding new instances of a resource, such as VMs or database replicas.

Horizontal scaling has significant advantages over vertical scaling:

  • True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching scales that are not possible on a single node.
  • Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter periods.
  • Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
  • Scaling out may be cheaper than scaling up. Running several small VMs can cost less than running a single large VM.
  • Horizontal scaling can also improve resiliency by adding redundancy. If an instance goes down, the application keeps running.

An advantage of vertical scaling is that you can do it without making any changes to the application. But at some point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be horizontal.

Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a load balancer. But each VM in the pool must be able to handle any client request, so the application must be stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling and autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
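To make the stateless requirement concrete, here is a minimal Python sketch. The `ExternalStateStore` class is a hypothetical in-memory stand-in for a distributed cache, and `WebInstance` and `round_robin` are illustrative names, not part of any Azure API. The point is only that no instance holds per-client data, so the load balancer can send any request to any instance.

```python
from typing import Dict, List


class ExternalStateStore:
    """Hypothetical stand-in for a distributed cache shared by all instances."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def increment(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class WebInstance:
    """A stateless instance: it keeps no per-client data of its own."""

    def __init__(self, name: str, store: ExternalStateStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str) -> int:
        # All session state lives in the shared store, so any instance
        # in the pool can serve any request.
        return self.store.increment(f"visits:{session_id}")


def round_robin(instances: List[WebInstance], requests: List[str]) -> List[int]:
    """Naive load balancer: request i goes to instance i mod N."""
    return [instances[i % len(instances)].handle(r) for i, r in enumerate(requests)]


store = ExternalStateStore()
pool = [WebInstance("vm-0", store), WebInstance("vm-1", store)]
counts = round_robin(pool, ["alice", "alice", "alice", "bob"])
print(counts)
```

The visit counter for a session stays consistent even though consecutive requests land on different instances. If each instance kept its own counter, adding or removing instances would change the results.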

Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck somewhere else. For example, if you scale a web front-end to handle more client requests, that might trigger lock contention in the database. You would then need to consider additional measures, such as optimistic concurrency or data partitioning, to enable more throughput to the database.
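As an illustration of optimistic concurrency, the sketch below uses a hypothetical in-memory `Table` with a per-row version number. It is not the API of any particular database. Writers do not hold locks; an update succeeds only if the row version is still the one the writer read, and a writer that loses the race re-reads and retries.

```python
class VersionConflict(Exception):
    """Raised when a row changed between the read and the write."""


class Table:
    """Hypothetical in-memory table with a version number per row."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, value):
        self._rows[key] = {"value": value, "version": 0}

    def read(self, key):
        row = self._rows[key]
        return row["value"], row["version"]

    def update(self, key, new_value, expected_version):
        row = self._rows[key]
        # No lock is held between read and update; the version check
        # detects a concurrent write instead.
        if row["version"] != expected_version:
            raise VersionConflict(key)
        row["value"] = new_value
        row["version"] += 1


def add_with_retry(table, key, amount, max_attempts=3):
    """Read, modify, and write back, retrying if another writer got in first."""
    for _ in range(max_attempts):
        value, version = table.read(key)
        try:
            table.update(key, value + amount, version)
            return value + amount
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up on {key} after {max_attempts} attempts")


table = Table()
table.insert("stock", 10)

# Two writers read the same version, then both try to write.
value_a, version_a = table.read("stock")
value_b, version_b = table.read("stock")
table.update("stock", value_a + 2, version_a)      # first writer wins
try:
    table.update("stock", value_b + 5, version_b)  # stale version is rejected
    conflict_seen = False
except VersionConflict:
    conflict_seen = True

# The losing writer retries against the current version.
final_value = add_with_retry(table, "stock", 5)
print(conflict_seen, table.read("stock"))
```

This approach works best when conflicts are rare. Under heavy contention on the same rows, retries pile up, and data partitioning may be the better measure.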

Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system, such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally. Resolving one bottleneck may reveal other bottlenecks elsewhere.

Use the Scalability checklist to review your design from a scalability standpoint.

Scalability guidance

Availability

Availability is the proportion of time that the system is functional and working. It is usually measured as a percentage of uptime. Application errors, infrastructure problems, and system load can all reduce availability.

A cloud application should have a service level objective (SLO) that clearly defines the expected availability and how the availability is measured. When defining availability, look at the critical path. The web front-end might be able to service client requests, but if every transaction fails because it can't connect to the database, the application is not available to users.

Availability is often described in terms of "9s". For example, "four 9s" means 99.99% uptime. The following table shows the potential cumulative downtime at different availability levels.

% Uptime  | Downtime per week | Downtime per month | Downtime per year
99%       | 1.68 hours        | 7.2 hours          | 3.65 days
99.9%     | 10 minutes        | 43.2 minutes       | 8.76 hours
99.95%    | 5 minutes         | 21.6 minutes       | 4.38 hours
99.99%    | 1 minute          | 4.32 minutes       | 52.56 minutes
99.999%   | 6 seconds         | 26 seconds         | 5.26 minutes

Notice that 99% uptime could translate to an almost 2-hour service outage per week. For many applications, especially consumer-facing applications, that is not an acceptable SLO. On the other hand, five 9s (99.999%) means no more than about five minutes of downtime in a year. It's challenging enough just detecting an outage that quickly, let alone resolving the issue. To get very high availability (99.99% or higher), you can't rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing, which is where resiliency becomes crucial.
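The downtime figures for each availability level follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. The Python sketch below computes them, assuming a 7-day week, a 30-day month, and a 365-day year. The function and constant names are illustrative.

```python
WEEK_HOURS = 7 * 24
MONTH_HOURS = 30 * 24   # assumes a 30-day month
YEAR_HOURS = 365 * 24   # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float, period_hours: float) -> float:
    """Return the downtime budget, in seconds, for a period of the given length."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_hours * 3600.0


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    week = allowed_downtime_seconds(slo, WEEK_HOURS) / 60.0
    month = allowed_downtime_seconds(slo, MONTH_HOURS) / 60.0
    year = allowed_downtime_seconds(slo, YEAR_HOURS) / 60.0
    print(f"{slo}%: {week:.2f} min/week, {month:.2f} min/month, {year:.2f} min/year")
```

The same function can be used to turn an SLO into an error budget for any reporting period, for example a 90-day quarter.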

In Azure, the Service Level Agreement (SLA) describes Microsoft's commitments for uptime and connectivity. If the SLA for a particular service is 99.95%, it means you should expect the service to be available 99.95% of the time.

Applications often depend on multiple services. In general, the probability of each service having downtime is independent of the others. For example, suppose your application depends on two services, each with a 99.9% SLA. The composite SLA for both services is 99.9% × 99.9% ≈ 99.8%, or slightly less than each service by itself.
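The composite figure generalizes to any number of services: under the independence assumption stated above, and assuming the application needs every service to be up, the individual availabilities multiply. A minimal Python sketch (the function name is illustrative):

```python
def composite_sla(*slas_percent: float) -> float:
    """Composite availability, in percent, of services that must all be up.

    Assumes failures are independent, so the probabilities multiply.
    """
    product = 1.0
    for sla in slas_percent:
        product *= sla / 100.0
    return product * 100.0


two_services = composite_sla(99.9, 99.9)
three_services = composite_sla(99.95, 99.9, 99.99)
print(f"{two_services:.4f}% {three_services:.4f}%")
```

Each added dependency on the critical path lowers the composite figure, which is one reason to keep that path short.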

Availability guidance

Resiliency

Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Resiliency is closely related to availability.

In traditional application development, there has been a focus on increasing mean time between failures (MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is required, due to several factors:

  • Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
  • Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware failures must be expected.
  • Applications often depend on external services, which may become temporarily unavailable or throttle high-volume users.
  • Today's users expect an application to be available 24/7 without ever going offline.

All of these factors mean that cloud applications must be designed to expect occasional failures and recover from them. Azure has many resiliency features already built into the platform. For example:

  • Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and across regions.
  • Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware failures.
  • VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a common power source and network switch. Spreading VMs across fault domains limits the impact of physical hardware failures, network outages, or power interruptions.

That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of the architecture. Some mitigations are more tactical in nature, for example, retrying a remote call after a transient network failure. Other mitigations are more strategic, such as failing over the entire application to a secondary region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience a disruption, transient problems such as network congestion are more common, so target these first. Having the right monitoring and diagnostics is also important, both to detect failures when they happen and to find the root causes.
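A tactical mitigation such as retrying a remote call can be sketched in a few lines of Python. This is one common approach (exponential backoff with jitter), not a prescribed implementation. `TransientError`, `call_with_retry`, and `make_flaky` are hypothetical names, and the `sleep` parameter exists only so the demo can record delays instead of waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures):
    """Build an operation that fails the given number of times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("simulated network failure")
        return "ok"

    return operation


delays = []
result = call_with_retry(make_flaky(2), sleep=delays.append)
print(result, len(delays))
```

Only failures known to be transient should be retried. Retrying a request that fails for a permanent reason, such as invalid input, adds load without any chance of success.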

When designing an application to be resilient, you must understand your availability requirements. How much downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business? How much should you invest in making the application highly available?

Resiliency guidance

Management and DevOps

This pillar covers the operations processes that keep an application running in production.

Deployments must be reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast and routine process, so they don't slow down the release of new features or bug fixes. Equally important, you must be able to quickly roll back or roll forward if an update has problems.

Monitoring and diagnostics are crucial. Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log into VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated VM to log into. Monitoring and diagnostics give insight into the system, so that you know when and where failures occur. All systems must be observable. Use a common and consistent logging schema that lets you correlate events across systems.
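One way to picture a common logging schema is the Python sketch below. The field names (`timestamp`, `level`, `service`, `correlation_id`, `message`) and the service names are assumptions chosen for illustration, and the list `sink` stands in for a central log store. Every service emits the same JSON shape and passes along the same correlation ID, so events from one request can be pulled together afterwards.

```python
import json
import uuid
from datetime import datetime, timezone


def log_event(sink, service, correlation_id, message, level="INFO"):
    """Append one JSON log line that follows the shared schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
    }
    sink.append(json.dumps(record))


def events_for(sink, correlation_id):
    """Return every event, from any service, that belongs to one request."""
    parsed = (json.loads(line) for line in sink)
    return [event for event in parsed if event["correlation_id"] == correlation_id]


sink = []  # stand-in for a central log store
request_id = str(uuid.uuid4())

log_event(sink, "web-frontend", request_id, "request received")
log_event(sink, "order-service", request_id, "database timeout", level="ERROR")
log_event(sink, "web-frontend", str(uuid.uuid4()), "unrelated request")

trace = events_for(sink, request_id)
print([(event["service"], event["level"]) for event in trace])
```

Because every record has the same fields, the query in `events_for` works across services without per-service parsing rules.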

The monitoring and diagnostics process has several distinct phases:

  • Instrumentation. Generating the raw data from application logs, web server logs, diagnostics built into the Azure platform, and other sources.
  • Collection and storage. Consolidating the data into one place.
  • Analysis and diagnosis. Examining the data to troubleshoot issues and see the overall health.
  • Visualization and alerts. Using telemetry data to spot trends or alert the operations team.

Use the DevOps checklist to review your design from a management and DevOps standpoint.

Management and DevOps guidance

Security

You must think about security throughout the entire lifecycle of an application, from design and implementation to deployment and operations. The Azure platform provides protections against a variety of threats, such as network intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps processes.

Here are some broad security areas to consider.

Identity management

Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully managed identity and access management service. You can use it to create domains that exist purely on Azure, or integrate with your on-premises Active Directory identities. Azure AD also integrates with Office 365, Dynamics CRM Online, and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a new user account that is managed by Azure AD.

If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches are possible, depending on your requirements. For more information, see our Identity Management reference architectures.

Protecting your infrastructure

Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an Azure AD tenant. Use role-based access control (RBAC) to grant users within your organization the correct permissions to Azure resources. Grant access by assigning an RBAC role to users or groups at a certain scope. The scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.
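The idea of assigning a role at a scope can be modeled with a small Python sketch. This is a simplified illustration and not how Azure evaluates access. The `RoleAssignment` type, the helper functions, and all subscription, resource group, and VM names are hypothetical. It assumes that scopes are hierarchical resource paths and that an assignment at a broader scope also applies to the resources beneath it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # user or group
    role: str
    scope: str      # subscription, resource group, or single resource path


def scope_covers(assigned_scope: str, resource_id: str) -> bool:
    """True if resource_id is the assigned scope itself or nested beneath it."""
    assigned = assigned_scope.rstrip("/").lower()
    resource = resource_id.rstrip("/").lower()
    # The trailing "/" stops "rg-web" from matching "rg-web2".
    return resource == assigned or resource.startswith(assigned + "/")


def has_role(assignments, principal, role, resource_id) -> bool:
    return any(
        a.principal == principal and a.role == role and scope_covers(a.scope, resource_id)
        for a in assignments
    )


# Hypothetical identifiers, for illustration only.
SUB = "/subscriptions/sub-example"
RG_WEB = SUB + "/resourceGroups/rg-web"
RG_DATA = SUB + "/resourceGroups/rg-data"
VM = RG_WEB + "/providers/Microsoft.Compute/virtualMachines/vm-0"

assignments = [
    RoleAssignment("alice", "Reader", SUB),        # subscription scope
    RoleAssignment("bob", "Contributor", RG_WEB),  # resource group scope
]

print(has_role(assignments, "alice", "Reader", VM))
print(has_role(assignments, "bob", "Contributor", VM))
print(has_role(assignments, "bob", "Contributor", RG_DATA))
```

Granting roles at the narrowest scope that still lets people do their work keeps the effect of a compromised account small.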

Application security

In general, the security best practices for application development still apply in the cloud. These include things like using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.

Cloud applications often use managed services that have access keys. Never check these into source control. Consider storing application secrets in Azure Key Vault.
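A minimal Python sketch of keeping an access key out of source code: the application reads the key from its environment at startup and fails fast if it is missing. The variable name `STORAGE_ACCESS_KEY` and the helper `get_secret` are hypothetical. How the value reaches the environment (for example, from Azure Key Vault at deployment time) is an assumption outside this sketch, which deliberately does not call any Key Vault API.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not supplied to the process."""


def get_secret(name, environ=os.environ):
    """Read a secret supplied at deploy time; never hard-code it in source."""
    value = environ.get(name)
    if not value:
        # Fail fast with the secret's name, but never log a secret's value.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


# A fake environment for the demo; a real process would rely on os.environ.
fake_environment = {"STORAGE_ACCESS_KEY": "example-key-not-real"}
key = get_secret("STORAGE_ACCESS_KEY", fake_environment)

try:
    get_secret("DATABASE_PASSWORD", fake_environment)
    missing_detected = False
except MissingSecretError:
    missing_detected = True

print(len(key), missing_detected)
```

The repository then contains only the name of the secret, so rotating the key does not require a code change.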

Data sovereignty and encryption

Make sure that your data remains in the correct geopolitical zone when using Azure's high availability features. Azure's geo-replicated storage uses the concept of a paired region in the same geopolitical region.

Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services support data encryption at rest, including Azure Storage, Azure SQL Database, Azure SQL Data Warehouse, and Cosmos DB.

Security resources