Horizontal, vertical, and functional data partitioning

In many large-scale solutions, data is divided into partitions that can be managed and accessed separately. Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.

However, the partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse effects.

In this article, the term partitioning means the process of physically dividing data into separate data stores. It is not the same as SQL Server table partitioning.

Why partition data?

  • Improve scalability. When you scale up a single database system, it will eventually reach a physical hardware limit. If you divide data across multiple partitions, each hosted on a separate server, you can scale out the system almost indefinitely.

  • Improve performance. Data access operations on each partition take place over a smaller volume of data. Correctly done, partitioning can make your system more efficient. Operations that affect more than one partition can run in parallel.

  • Improve security. In some cases, you can separate sensitive and nonsensitive data into different partitions and apply different security controls to the sensitive data.

  • Provide operational flexibility. Partitioning offers many opportunities for fine-tuning operations, maximizing administrative efficiency, and minimizing cost. For example, you can define different strategies for management, monitoring, backup and restore, and other administrative tasks based on the importance of the data in each partition.

  • Match the data store to the pattern of use. Partitioning allows each partition to be deployed on a different type of data store, based on cost and the built-in features that data store offers. For example, large binary data can be stored in blob storage, while more structured data can be held in a document database. See Choose the right data store.

  • Improve availability. Separating data across multiple servers avoids a single point of failure. If one instance fails, only the data in that partition is unavailable. Operations on other partitions can continue. For managed PaaS data stores, this consideration is less relevant, because these services are designed with built-in redundancy.

Designing partitions

There are three typical strategies for partitioning data:

  • Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Each partition is known as a shard and holds a specific subset of the data, such as all the orders for a specific set of customers.

  • Vertical partitioning. In this strategy, each partition holds a subset of the fields for items in the data store. The fields are divided according to their pattern of use. For example, frequently accessed fields might be placed in one vertical partition and less frequently accessed fields in another.

  • Functional partitioning. In this strategy, data is aggregated according to how it is used by each bounded context in the system. For example, an e-commerce system might store invoice data in one partition and product inventory data in another.

These strategies can be combined, and we recommend that you consider them all when you design a partitioning scheme. For example, you might divide data into shards and then use vertical partitioning to further subdivide the data in each shard.

Horizontal partitioning (sharding)

Figure 1 shows horizontal partitioning or sharding. In this example, product inventory data is divided into shards based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G and H-Z), organized alphabetically. Sharding spreads the load over more computers, which reduces contention and improves performance.

Figure 1 - Horizontally partitioning (sharding) data based on a partition key.
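
As a rough illustration of the range-based layout in Figure 1, the following sketch routes a product key to a shard by its first letter. The shard names and the `shard_for_product` helper are hypothetical and exist only for this example; a real data store would supply its own routing mechanism.

```python
from bisect import bisect_left

# Hypothetical layout mirroring Figure 1: one shard for keys A-G, one for H-Z.
# Each entry is the inclusive upper bound (first letter) of a shard's key range.
UPPER_BOUNDS = ["G", "Z"]
SHARD_NAMES = ["shard-A-G", "shard-H-Z"]


def shard_for_product(product_key: str) -> str:
    """Return the shard whose contiguous key range contains product_key."""
    if not product_key:
        raise ValueError("product key must not be empty")
    first = product_key[0].upper()
    if not "A" <= first <= "Z":
        raise ValueError(f"unsupported product key: {product_key!r}")
    # bisect_left finds the first range whose upper bound is >= the letter.
    return SHARD_NAMES[bisect_left(UPPER_BOUNDS, first)]
```

Because the ranges are contiguous, adding a shard in this sketch only means inserting a new upper bound and name at the right position, but the data that falls into the new range still has to be moved.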

The most important factor is the choice of a sharding key. It can be difficult to change the key after the system is in operation. The key must ensure that data is partitioned to spread the workload as evenly as possible across the shards.

The shards don't have to be the same size. It's more important to balance the number of requests. Some shards might be very large, but each item has a low number of access operations. Other shards might be smaller, but each item is accessed much more frequently. It's also important to ensure that a single shard does not exceed the scale limits (in terms of capacity and processing resources) of the data store.

Avoid creating "hot" partitions that can affect performance and availability. For example, using the first letter of a customer's name causes an unbalanced distribution, because some letters are more common. Instead, use a hash of a customer identifier to distribute data more evenly across partitions.
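
A minimal sketch of hash-based shard selection follows, assuming string customer identifiers and a fixed number of shards. The `shard_index` function is a hypothetical helper, not part of any particular data store. It uses SHA-256 rather than Python's built-in `hash()`, because the built-in hash of a string is randomized per process by default and so would not give a stable mapping.

```python
import hashlib


def shard_index(customer_id: str, shard_count: int) -> int:
    """Map a customer identifier to a shard number in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Note that with a simple modulo scheme like this one, changing `shard_count` remaps most keys, which is one reason the choice of sharding key and shard count is hard to revisit later.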

Choose a sharding key that minimizes any future requirements to split large shards, coalesce small shards into larger partitions, or change the schema. These operations can be very time consuming, and might require taking one or more shards offline while they are performed.

If shards are replicated, it might be possible to keep some of the replicas online while others are split, merged, or reconfigured. However, the system might need to limit the operations that can be performed during the reconfiguration. For example, the data in the replicas might be marked as read-only to prevent data inconsistencies.

For more information about horizontal partitioning, see the sharding pattern.

Vertical partitioning

The most common use for vertical partitioning is to reduce the I/O and performance costs associated with fetching items that are frequently accessed. Figure 2 shows an example of vertical partitioning. In this example, different properties of an item are stored in different partitions. One partition holds data that is accessed more frequently, including product name, description, and price. Another partition holds inventory data: the stock count and last-ordered date.

Figure 2 - Vertically partitioning data by its pattern of use.

In this example, the application regularly queries the product name, description, and price when displaying the product details to customers. Stock count and last-ordered date are held in a separate partition because these two items are commonly used together.
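
The split described for Figure 2 can be sketched as follows. The field names and the `split_product` helper are assumptions made for illustration; the point is only that both partitions keep the product key so that the two halves can be matched up again.

```python
# Hypothetical field groupings based on the usage pattern in Figure 2.
FREQUENT_FIELDS = ("name", "description", "price")
INVENTORY_FIELDS = ("stock_count", "last_ordered")


def split_product(product: dict) -> tuple[dict, dict]:
    """Split one wide product record into two narrow records."""
    key = product["product_id"]
    # Both records carry the key so the item can be reassembled when needed.
    details = {"product_id": key, **{f: product[f] for f in FREQUENT_FIELDS}}
    inventory = {"product_id": key, **{f: product[f] for f in INVENTORY_FIELDS}}
    return details, inventory
```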

Other advantages of vertical partitioning:

  • Relatively slow-moving data (product name, description, and price) can be separated from the more dynamic data (stock level and last-ordered date). Slow-moving data is a good candidate for an application to cache in memory.

  • Sensitive data can be stored in a separate partition with additional security controls.

  • Vertical partitioning can reduce the amount of concurrent access that's needed.

Vertical partitioning operates at the entity level within a data store, partially normalizing an entity to break it down from a wide item to a set of narrow items. It is ideally suited for column-oriented data stores such as HBase and Cassandra. If the data in a collection of columns is unlikely to change, you can also consider using column stores in SQL Server.

Functional partitioning

When it's possible to identify a bounded context for each distinct business area in an application, functional partitioning is a way to improve isolation and data access performance. Another common use for functional partitioning is to separate read-write data from read-only data. Figure 3 shows an overview of functional partitioning where inventory data is separated from customer data.

Figure 3 - Functionally partitioning data by bounded context or subdomain.

This partitioning strategy can help reduce data access contention across different parts of a system.

Designing partitions for scalability

It's vital to consider size and workload for each partition and balance them so that data is distributed to achieve maximum scalability. However, you must also partition the data so that it does not exceed the scaling limits of a single partition store.

Follow these steps when designing partitions for scalability:

  1. Analyze the application to understand the data access patterns, such as the size of the result set returned by each query, the frequency of access, the inherent latency, and the server-side compute processing requirements. In many cases, a few major entities will demand most of the processing resources.
  2. Use this analysis to determine the current and future scalability targets, such as data size and workload. Then distribute the data across the partitions to meet the scalability target. For horizontal partitioning, choosing the right shard key is important to make sure distribution is even. For more information, see the sharding pattern.
  3. Make sure each partition has enough resources to handle the scalability requirements, in terms of data size and throughput. Depending on the data store, there might be a limit on the amount of storage space, processing power, or network bandwidth per partition. If the requirements are likely to exceed these limits, you may need to refine your partitioning strategy or split data out further, possibly combining two or more strategies.
  4. Monitor the system to verify that data is distributed as expected and that the partitions can handle the load. Actual usage does not always match what an analysis predicts. If so, it might be possible to rebalance the partitions, or else redesign some parts of the system to gain the required balance.
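
The monitoring step can be approximated with a simple skew check over per-partition request counts. This is a sketch under stated assumptions: the `hot_partitions` function, the shape of its input, and the 1.5x tolerance are all hypothetical, and a real system would take these numbers from its own metrics.

```python
def hot_partitions(request_counts: dict[str, int], tolerance: float = 1.5) -> list[str]:
    """Return the partitions whose request count exceeds tolerance x the mean."""
    if not request_counts:
        return []
    mean = sum(request_counts.values()) / len(request_counts)
    # A partition is flagged when it handles far more than its even share.
    return sorted(p for p, n in request_counts.items() if n > tolerance * mean)
```

For example, `hot_partitions({"p0": 100, "p1": 110, "p2": 400})` flags only `p2`, which is a candidate for splitting or rebalancing.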

Some cloud environments allocate resources in terms of infrastructure boundaries. Ensure that the limits of your selected boundary provide enough room for any anticipated growth in the volume of data, in terms of data storage, processing power, and bandwidth.

For example, if you use Azure Table storage, there is a limit to the volume of requests that can be handled by a single partition in a particular period of time. (For more information, see Azure storage scalability and performance targets.) A busy shard might require more resources than a single partition can handle. If so, the shard might need to be repartitioned to spread the load. If the total size or throughput of these tables exceeds the capacity of a storage account, you might need to create additional storage accounts and spread the tables across these accounts.

Designing partitions for query performance

Query performance can often be boosted by using smaller data sets and by running parallel queries. Each partition should contain a small proportion of the entire data set. This reduction in volume can improve the performance of queries. However, partitioning is not an alternative for designing and configuring a database appropriately. For example, make sure that you have the necessary indexes in place.

Follow these steps when designing partitions for query performance:

  1. Examine the application requirements and performance:

    • Use business requirements to determine the critical queries that must always perform quickly.
    • Monitor the system to identify any queries that perform slowly.
    • Find which queries are performed most frequently. Even if a single query has a minimal cost, the cumulative resource consumption could be significant.
  2. Partition the data that is causing slow performance:

    • Limit the size of each partition so that the query response time is within target.
    • If you use horizontal partitioning, design the shard key so that the application can easily select the right partition. This prevents the query from having to scan through every partition.
    • Consider the location of a partition. If possible, try to keep data in partitions that are geographically close to the applications and users that access it.
  3. If an entity has throughput and query performance requirements, use functional partitioning based on that entity. If this still doesn't satisfy the requirements, apply horizontal partitioning as well. In most cases, a single partitioning strategy will suffice, but in some cases it is more efficient to combine both strategies.

  4. Consider running queries in parallel across partitions to improve performance.
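
The last step, running a query against every partition in parallel and merging the results in the application, might look like the sketch below. Partitions are modeled as in-memory lists of rows, and `query_partitions` is a hypothetical helper; against a real data store each `scan` call would be a network request to one partition.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partitions(partitions: dict, predicate) -> list:
    """Run the same filter against every partition in parallel and merge the results."""

    def scan(rows):
        # Stand-in for a query sent to a single partition.
        return [row for row in rows if predicate(row)]

    if not partitions:
        return []
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # map() preserves the order of the partitions it was given.
        per_partition = list(pool.map(scan, partitions.values()))
    return [row for rows in per_partition for row in rows]
```

Threads suit this sketch because the per-partition work in a real system is I/O bound; the merge step is where any ordering or aggregation across partitions would happen.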

Designing partitions for availability

Partitioning data can improve the availability of applications by ensuring that the entire dataset does not constitute a single point of failure and that individual subsets of the dataset can be managed independently.

Consider the following factors that affect availability:

How critical the data is to business operations. Identify which data is critical business information, such as transactions, and which data is less critical operational data, such as log files.

  • Consider storing critical data in highly available partitions with an appropriate backup plan.

  • Establish separate management and monitoring procedures for the different datasets.

  • Place data that has the same level of criticality in the same partition so that it can be backed up together at an appropriate frequency. For example, partitions that hold transaction data might need to be backed up more frequently than partitions that hold logging or trace information.

How individual partitions can be managed. Designing partitions to support independent management and maintenance provides several advantages. For example:

  • If a partition fails, it can be recovered independently without affecting applications that access data in other partitions.

  • Partitioning data by geographical area allows scheduled maintenance tasks to occur at off-peak hours for each location. Ensure that partitions are not so large that planned maintenance cannot be completed during this period.

Whether to replicate critical data across partitions. This strategy can improve availability and performance, but can also introduce consistency issues. It takes time to synchronize changes with every replica. During this period, different partitions will contain different data values.

Application design considerations

Partitioning adds complexity to the design and development of your system. Consider partitioning as a fundamental part of system design even if the system initially only contains a single partition. If you address partitioning as an afterthought, it will be more challenging because you already have a live system to maintain:

  • Data access logic will need to be modified.
  • Large quantities of existing data may need to be migrated, to distribute it across partitions.
  • Users expect to be able to continue using the system during the migration.

In some cases, partitioning is not considered important because the initial dataset is small and can be easily handled by a single server. This might be true for some workloads, but many commercial systems need to expand as the number of users increases.

Moreover, it's not only large data stores that benefit from partitioning. For example, a small data store might be heavily accessed by hundreds of concurrent clients. Partitioning the data in this situation can help to reduce contention and improve throughput.

Consider the following points when you design a data partitioning scheme:

Minimize cross-partition data access operations. Where possible, keep data for the most common database operations together in each partition to minimize cross-partition data access operations. Querying across partitions can be more time-consuming than querying within a single partition, but optimizing partitions for one set of queries might adversely affect other sets of queries. If you must query across partitions, minimize query time by running parallel queries and aggregating the results within the application. (This approach might not be possible in some cases, such as when the result from one query is used in the next query.)

Consider replicating static reference data. If queries use relatively static reference data, such as postal code tables or product lists, consider replicating this data in all of the partitions to reduce separate lookup operations in different partitions. This approach can also reduce the likelihood of the reference data becoming a "hot" dataset, with heavy traffic from across the entire system. However, there is an additional cost associated with synchronizing any changes to the reference data.

Minimize cross-partition joins. Where possible, minimize requirements for referential integrity across vertical and functional partitions. In these schemes, the application is responsible for maintaining referential integrity across partitions. Queries that join data across multiple partitions are inefficient because the application typically needs to perform consecutive queries based on a key and then a foreign key. Instead, consider replicating or de-normalizing the relevant data. If cross-partition joins are necessary, run parallel queries over the partitions and join the data within the application.
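
When a cross-partition join cannot be avoided, the join itself has to happen in application code. The sketch below assumes the rows have already been fetched from an orders partition and a customers partition; `join_orders_with_customers` and the field names are hypothetical.

```python
def join_orders_with_customers(orders: list, customers: list) -> list:
    """Join order rows to customer rows fetched from a different partition."""
    # Build a lookup on the foreign key once, instead of scanning per order.
    customers_by_id = {c["customer_id"]: c for c in customers}
    joined = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        if customer is None:
            continue  # Dangling reference: no customer row in the other partition.
        joined.append({**order, "customer_name": customer["name"]})
    return joined
```

Orders that reference a missing customer are skipped here, which is exactly the kind of referential-integrity gap the application becomes responsible for once the data store no longer enforces it.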

Embrace eventual consistency. Evaluate whether strong consistency is actually a requirement. A common approach in distributed systems is to implement eventual consistency. The data in each partition is updated separately, and the application logic ensures that the updates are all completed successfully. It also handles the inconsistencies that can arise from querying data while an eventually consistent operation is running.

Consider how queries locate the correct partition. If a query must scan all partitions to locate the required data, there is a significant impact on performance, even when multiple parallel queries are running. With vertical and functional partitioning, queries can naturally specify the partition. Horizontal partitioning, on the other hand, can make locating an item difficult, because every shard has the same schema. A typical solution is to maintain a map that is used to look up the shard location for specific items. This map can be implemented in the sharding logic of the application, or maintained by the data store if it supports transparent sharding.
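
An application-side shard map can be as simple as a directory from item key to shard name. The `ShardMap` class below is a hypothetical, in-memory sketch; in practice the map would itself need to be stored durably and kept consistent with the data it describes.

```python
class ShardMap:
    """Directory that records which shard holds each item key."""

    def __init__(self):
        self._shard_by_key = {}

    def assign(self, key, shard):
        """Record (or update) the shard that holds the given key."""
        self._shard_by_key[key] = shard

    def locate(self, key):
        """Return the shard for key, so a query can go to one shard only."""
        try:
            return self._shard_by_key[key]
        except KeyError:
            raise LookupError(f"no shard recorded for {key!r}") from None
```

A directory like this also makes rebalancing easier to express, because moving an item only requires updating its entry after the data has been copied.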

Consider periodically rebalancing shards. With horizontal partitioning, rebalancing shards can help distribute the data evenly by size and by workload to minimize hotspots, maximize query performance, and work around physical storage limitations. However, this is a complex task that often requires the use of a custom tool or process.

Replicate partitions. If you replicate each partition, it provides additional protection against failure. If a single replica fails, queries can be directed toward a working copy.
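
Directing a query toward a working copy can be sketched as trying each replica in turn. Here each replica is modeled as a named callable that raises `ConnectionError` when it is unavailable; `read_with_failover` and that failure convention are assumptions for this example only.

```python
def read_with_failover(replicas, key):
    """Try each (name, read_function) replica in order and return the first result."""
    errors = []
    for name, read in replicas:
        try:
            return name, read(key)
        except ConnectionError as exc:
            # Remember why this replica was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))
```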

If you reach the physical limits of a partitioning strategy, you might need to extend the scalability to a different level. For example, if partitioning is at the database level, you might need to locate or replicate partitions in multiple databases. If partitioning is already at the database level, and physical limitations are an issue, it might mean that you need to locate or replicate partitions in multiple hosting accounts.

Avoid transactions that access data in multiple partitions. Some data stores implement transactional consistency and integrity for operations that modify data, but only when the data is located in a single partition. If you need transactional support across multiple partitions, you will probably need to implement this as part of your application logic because most partitioning systems do not provide native support.

All data stores require some operational management and monitoring activity. Tasks range from loading data, backing up and restoring data, and reorganizing data to ensuring that the system is performing correctly and efficiently.

Consider the following factors that affect operational management:

  • How to implement appropriate management and operational tasks when the data is partitioned. These tasks might include backup and restore, archiving data, monitoring the system, and other administrative tasks. For example, maintaining logical consistency during backup and restore operations can be a challenge.

  • How to load the data into multiple partitions and add new data that's arriving from other sources. Some tools and utilities might not support sharded data operations such as loading data into the correct partition.

  • How to archive and delete the data on a regular basis. To prevent the excessive growth of partitions, you need to archive and delete data on a regular basis (such as monthly). It might be necessary to transform the data to match a different archive schema.

  • How to locate data integrity issues. Consider running a periodic process to locate any data integrity issues, such as data in one partition that references missing information in another. The process can either attempt to fix these issues automatically or generate a report for manual review.
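
A periodic integrity check of the kind described in the last point could start from something like this sketch, which reports orders in one partition that reference customers missing from another. The function name and field names are hypothetical.

```python
def find_dangling_references(orders: list, customers: list) -> list:
    """Return the IDs of orders whose customer is missing from the customer partition."""
    known_customers = {c["customer_id"] for c in customers}
    return [o["order_id"] for o in orders if o["customer_id"] not in known_customers]
```

The returned list is the raw material for either option mentioned above: an automatic repair step or a report for manual review.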

Rebalancing partitions

As a system matures, you might have to adjust the partitioning scheme. For example, individual partitions might start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or you might have underestimated the volume of data in some partitions, causing some partitions to approach capacity limits.

Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, rebalancing is an administrative task that consists of two stages:

  1. Determine a new partitioning strategy.

    • Which partitions need to be split (or possibly combined)?
    • What is the new partition key?
  2. Migrate data from the old partitioning scheme to the new set of partitions.

Depending on the data store, you might be able to migrate data between partitions while they are in use. This is called online migration. If that's not possible, you might need to make partitions unavailable while the data is relocated (offline migration).

Offline migration

Offline migration is typically simpler because it reduces the chances of contention occurring. Conceptually, offline migration works as follows:

  1. Mark the partition offline.
  2. Split-merge and move the data to the new partitions.
  3. Verify the data.
  4. Bring the new partitions online.
  5. Remove the old partition.

Optionally, you can mark a partition as read-only in step 1, so that applications can still read the data while it is being moved.
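
The five steps above can be sketched against in-memory partitions, each modeled as a dictionary with a `status` and an `items` mapping. Everything here (`migrate_offline`, the dictionary shape, the `route` callback) is a hypothetical stand-in for whatever tooling the data store provides.

```python
def migrate_offline(old: dict, targets: dict, route) -> None:
    """Move every item in `old` to the target partition chosen by route(key)."""
    old["status"] = "offline"  # 1. Mark the partition offline.

    for key, value in old["items"].items():  # 2. Split and move the data.
        targets[route(key)]["items"][key] = value

    for key, value in old["items"].items():  # 3. Verify the data.
        copied = targets[route(key)]["items"]
        if key not in copied or copied[key] != value:
            raise RuntimeError(f"verification failed for {key!r}")

    for target in targets.values():  # 4. Bring the new partitions online.
        target["status"] = "online"

    old["items"].clear()  # 5. Remove the old partition.
    old["status"] = "removed"
```

For the read-only variant, step 1 would set the status to a read-only state instead, and readers would keep using the old partition until step 4 completes.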

Online migration

Online migration is more complex to perform but less disruptive. The process is similar to offline migration, except the original partition is not marked offline. Depending on the granularity of the migration process (for example, item by item versus shard by shard), the data access code in the client applications might have to handle reading and writing data that's held in two locations, the original partition and the new partition.
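
One way to picture the two-location handling during an item-by-item migration is the sketch below. It assumes reads prefer the new partition and fall back to the original, and that writes always go to the new partition. `MigratingStore` is hypothetical, uses plain dictionaries, and ignores the concurrency control a real migration would need.

```python
class MigratingStore:
    """Data access wrapper used while items move from an old partition to a new one."""

    def __init__(self, old: dict, new: dict):
        self.old = old
        self.new = new

    def read(self, key):
        # Prefer the new location; fall back to items not yet migrated.
        if key in self.new:
            return self.new[key]
        return self.old[key]

    def write(self, key, value):
        # New writes land in the new partition and retire any stale old copy.
        self.new[key] = value
        self.old.pop(key, None)

    def migrate_one(self, key):
        # Copy the item unless a newer value was already written to the new partition.
        if key in self.old:
            self.new.setdefault(key, self.old[key])
            del self.old[key]
```

Once the original partition is empty, the wrapper can be replaced by direct access to the new partition.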

The following design patterns might be relevant to your scenario:

  • The sharding pattern describes some common strategies for sharding data.

  • The index table pattern shows how to create secondary indexes over data. An application can quickly retrieve data with this approach, by using queries that do not reference the primary key of a collection.

  • The materialized view pattern describes how to generate prepopulated views that summarize data to support fast query operations. This approach can be useful in a partitioned data store if the partitions that contain the data being summarized are distributed across multiple sites.

Next steps