Choose the right data store

Modern business systems manage increasingly large volumes of data. Data may be ingested from external services, generated by the system itself, or created by users. These data sets can have widely varying characteristics and processing requirements. Businesses use data to assess trends, trigger business processes, audit their operations, analyze customer behavior, and much more.

This heterogeneity means that a single data store is usually not the best approach. Instead, it's often better to store different types of data in different data stores, each geared toward a specific workload or usage pattern. The term polyglot persistence describes solutions that use a mix of data store technologies.

Selecting the right data store for your requirements is a key design decision. There are hundreds of implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they structure data and the types of operations they support. This article describes several of the most common storage models. Note that a particular data store technology may support multiple storage models. For example, a relational database management system (RDBMS) may also support key/value or graph storage. In fact, there is a general trend toward so-called multi-model support, where a single database system supports several models. Even so, it's useful to understand the different models at a high level.

Not all data stores in a given category provide the same feature set. Most data stores provide server-side functionality to query and process data. Sometimes this functionality is built into the data storage engine. In other cases, the data storage and processing capabilities are separated, and there may be several options for processing and analysis. Data stores also support different programmatic and management interfaces.

Generally, you should start by considering which storage model is best suited to your requirements. Then consider a particular data store within that category, based on factors such as feature set, cost, and ease of management.

Relational database management systems

Relational databases organize data as a series of two-dimensional tables with rows and columns. Each table has its own columns, and every row in a table has the same set of columns. This model is mathematically based, and most vendors provide a dialect of the Structured Query Language (SQL) for retrieving and managing data. An RDBMS typically implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent, Isolated, Durable) model for updating information.

An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and all read or write operations must use the schema. This is in contrast to most NoSQL data stores, particularly key/value types, where the schema-on-read model assumes that the client imposes its own interpretive schema on data coming out of the database, and the store is agnostic to the format of the data being written.
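The contrast between the two models can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for an RDBMS and a plain dictionary as a stand-in for a key/value store, and the `customers` table, the `customer:2` key, and the `read_customer` helper are all hypothetical names chosen for the example.

```python
import json
import sqlite3

# Schema-on-write: the table definition exists before any data arrives,
# and the database rejects rows that do not fit it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO customers (id, name, email) VALUES (1, 'Ana', 'ana@example.com')"
)

try:
    # No value for the required email column, so the write is refused.
    conn.execute("INSERT INTO customers (id, name) VALUES (2, 'Ben')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Schema-on-read: the store (a dict standing in for a key/value store)
# accepts any bytes; the client imposes structure only when it reads.
kv_store = {}
kv_store["customer:2"] = json.dumps({"name": "Ben"}).encode("utf-8")


def read_customer(key):
    """Interpret the opaque value as a customer, defaulting a missing email."""
    raw = json.loads(kv_store[key].decode("utf-8"))
    return {"name": raw["name"], "email": raw.get("email", "unknown")}


print(rejected)
print(read_customer("customer:2"))
```

In the first half the database enforces the structure at write time; in the second half the same incomplete record is stored without complaint, and dealing with the missing field becomes the reader's responsibility.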

An RDBMS is very useful when strong consistency guarantees are important, where all changes are atomic and transactions always leave the data in a consistent state. However, the underlying structures do not lend themselves to scaling out by distributing storage and processing across machines. Also, information stored in an RDBMS must be put into a relational structure by following the normalization process. While this process is well understood, it can lead to inefficiencies, because logical entities must be disassembled into rows in separate tables and then reassembled when running queries.
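As a rough illustration of the disassemble-and-reassemble cost, the following sketch stores one logical order across three normalized tables and then joins them back together. It assumes Python's built-in sqlite3 module as the RDBMS; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL REFERENCES customers(id));
    CREATE TABLE order_items (order_id INTEGER NOT NULL REFERENCES orders(id),
                              product TEXT NOT NULL,
                              quantity INTEGER NOT NULL);
""")

# Writing one logical "order" touches three tables. The transaction makes the
# change atomic: either every row is stored or none is.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
    conn.execute("INSERT INTO orders VALUES (100, 1)")
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(100, "keyboard", 1), (100, "mouse", 2)],
    )

# Reading the order back means reassembling it with joins.
rows = conn.execute("""
    SELECT c.name, i.product, i.quantity
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    JOIN order_items AS i ON i.order_id = o.id
    WHERE o.id = ?
    ORDER BY i.product
""", (100,)).fetchall()

print(rows)
```

The joins are cheap at this scale, but the same pattern repeated over many tables and large row counts is where the inefficiency described above tends to appear.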

Relevant Azure services:

Key/value stores

A key/value store is essentially a large hash table. You associate each data value with a unique key, and the key/value store uses this key to store the data by applying an appropriate hashing function. The hashing function is selected to provide an even distribution of hashed keys across the data storage.

Most key/value stores support only simple query, insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation. If the value is large, writing may take some time.

An application can store arbitrary data as a set of values, although some key/value stores impose limits on the maximum size of values. The stored values are opaque to the storage system software. Any schema information must be provided and interpreted by the application. Essentially, values are blobs, and the key/value store simply retrieves or stores the value by key.
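The behavior described above can be sketched with a toy in-memory store. This is not how any particular product is implemented: the `ShardedKeyValueStore` class, its shard count, and the choice of SHA-256 as the hashing function are assumptions made for illustration.

```python
import hashlib
import json


class ShardedKeyValueStore:
    """Toy key/value store: opaque byte values spread across shards by key hash."""

    def __init__(self, shard_count=4):
        # Each dict stands in for a storage node.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key so that keys are spread evenly across the shards.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self._shards[int.from_bytes(digest[:8], "big") % len(self._shards)]

    def put(self, key, value):
        # The store never inspects the value; it is just bytes.
        self._shard_for(key)[key] = bytes(value)

    def get(self, key):
        return self._shard_for(key)[key]

    def delete(self, key):
        del self._shard_for(key)[key]


store = ShardedKeyValueStore()
store.put("cart:42", json.dumps({"items": ["pen"]}).encode("utf-8"))

# A "partial" change is still a full overwrite: read the value, modify it in
# the application, and write the entire value back.
cart = json.loads(store.get("cart:42"))
cart["items"].append("ink")
store.put("cart:42", json.dumps(cart).encode("utf-8"))

print(json.loads(store.get("cart:42")))
```

Only the application knows that the bytes are JSON. The store offers lookup, insert, and delete by key and nothing more.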

Diagram of a key/value store

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable for systems that need to query data across different key/value stores. Key/value stores are also not optimized for scenarios where querying by value is important, rather than performing lookups based only on keys. For example, with a relational database, you can find a record by using a WHERE clause, but key/value stores usually do not have this type of lookup capability for values.

A single key/value store can be extremely scalable, because the data store can easily distribute data across multiple nodes on separate machines.

Relevant Azure services:

Document databases

A document database is conceptually similar to a key/value store, except that it stores a collection of named fields and data (known as documents), each of which could be simple scalar items or compound elements such as lists and child collections. The data in the fields of a document can be encoded in a variety of ways, including XML, YAML, JSON, and BSON, or even stored as plain text. Unlike key/value stores, the fields in documents are exposed to the storage management system, enabling an application to query and filter data by using the values in these fields.

Typically, a document contains the entire data for an entity. Which items constitute an entity is application-specific. For example, an entity could contain the details of a customer, an order, or a combination of both. A single document may contain information that would be spread across several relational tables in an RDBMS.

A document store does not require that all documents have the same structure. This free-form approach provides a great deal of flexibility. Applications can store different data in documents as business requirements change.

Diagram of a document store

The application can retrieve documents by using the document key. This is a unique identifier for the document, which is often hashed to help distribute data evenly. Some document databases create the document key automatically. Others enable you to specify an attribute of the document to use as the key. The application can also query documents based on the value of one or more fields. Some document databases support indexing to facilitate fast lookup of documents based on one or more indexed fields.

Many document databases support in-place updates, enabling an application to modify the values of specific fields in a document without rewriting the entire document. Read and write operations over multiple fields in a single document are usually atomic.
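A minimal in-memory sketch can make these document operations concrete: retrieval by key, querying by field value, and in-place updates of individual fields. The `DocumentCollection` class and its method names are hypothetical and do not correspond to the API of any real document database.

```python
import copy


class DocumentCollection:
    """Toy document collection keyed by document key."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = copy.deepcopy(document)

    def get(self, key):
        return copy.deepcopy(self._docs[key])

    def find(self, **criteria):
        """Return the keys of documents whose top-level fields match all criteria."""
        return sorted(
            key
            for key, doc in self._docs.items()
            if all(
                field in doc and doc[field] == value
                for field, value in criteria.items()
            )
        )

    def update_fields(self, key, **changes):
        """Change only the named fields, leaving the rest of the document as is."""
        self._docs[key].update(changes)


customers = DocumentCollection()
# Documents in one collection do not need to share a structure.
customers.insert("c1", {"name": "Ana", "city": "Lisbon",
                        "orders": [{"sku": "pen", "qty": 2}]})
customers.insert("c2", {"name": "Ben", "city": "Lisbon"})
customers.insert("c3", {"name": "Cy", "city": "Oslo", "loyalty_tier": "gold"})

print(customers.find(city="Lisbon"))
customers.update_fields("c2", city="Oslo")
print(customers.find(city="Oslo"))
```

Unlike the key/value case, the store can see inside each document, which is what makes the field-based `find` and the partial `update_fields` possible. A real database would typically back `find` with an index rather than a scan.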

Relevant Azure service: Cosmos DB

Graph databases

A graph database stores two types of information: nodes and edges. You can think of nodes as entities. Edges specify the relationships between nodes. Both nodes and edges can have properties that provide information about that node or edge, similar to columns in a table. Edges can also have a direction indicating the nature of the relationship.

The purpose of a graph database is to allow an application to efficiently perform queries that traverse the network of nodes and edges, and to analyze the relationships between entities. The following diagram shows an organization's personnel database structured as a graph. The entities are employees and departments, and the edges indicate reporting relationships and the department in which employees work. In this graph, the arrows on the edges show the direction of the relationships.

Diagram of a graph database

This structure makes it straightforward to perform queries such as "Find all employees who report directly or indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and relationships, you can perform very complex analyses very quickly. Many graph databases provide a query language that you can use to traverse a network of relationships efficiently.
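The two example queries can be sketched as traversals over a small edge list. Sarah and John come from the queries above; the other employees, the department names, and the relationship labels `REPORTS_TO` and `WORKS_IN` are invented for this illustration, and a real graph database would use its own storage format and query language rather than Python lists.

```python
from collections import deque

# Directed edges as (source, relationship, target). Hypothetical sample data.
edges = [
    ("John", "REPORTS_TO", "Sarah"),
    ("Alice", "REPORTS_TO", "John"),
    ("Bob", "REPORTS_TO", "John"),
    ("Sarah", "WORKS_IN", "Head Office"),
    ("John", "WORKS_IN", "Sales"),
    ("Alice", "WORKS_IN", "Sales"),
    ("Bob", "WORKS_IN", "Marketing"),
]


def sources(relationship, target):
    """Nodes with an edge of the given type pointing at target."""
    return [s for s, r, t in edges if r == relationship and t == target]


def targets(source, relationship):
    """Nodes that source points at through an edge of the given type."""
    return [t for s, r, t in edges if s == source and r == relationship]


def all_reports(manager):
    """Everyone who reports to manager directly or indirectly (breadth-first)."""
    found, seen, queue = [], {manager}, deque([manager])
    while queue:
        current = queue.popleft()
        for employee in sources("REPORTS_TO", current):
            if employee not in seen:
                seen.add(employee)
                found.append(employee)
                queue.append(employee)
    return found


def colleagues(employee):
    """Other people who work in the same department as employee."""
    result = []
    for department in targets(employee, "WORKS_IN"):
        result.extend(p for p in sources("WORKS_IN", department) if p != employee)
    return sorted(result)


print(all_reports("Sarah"))
print(colleagues("John"))
```

The indirect-reports query follows `REPORTS_TO` edges backward for as many hops as needed, which is awkward to express with a fixed number of relational joins but natural as a graph traversal.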

Relevant Azure service: Cosmos DB

Column-family databases

A column-family database organizes data into rows and columns. In its simplest form, a column-family database can appear very similar to a relational database, at least conceptually. The real power of a column-family database lies in its denormalized approach to structuring sparse data.

You can think of a column-family database as holding tabular data with rows and columns, but the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse (that is, a row doesn't need to have a value for every column).

The following diagram shows an example with two column families, Identity and Contact Info. The data for a single entity has the same row key in each column family. This structure, where the rows for any given object in a column family can vary dynamically, is an important benefit of the column-family approach, making this form of data store highly suited for storing structured, volatile data.

Diagram of a column-family database

Unlike a key/value store or a document database, most column-family databases store data in key order, rather than by computing a hash. Many implementations allow you to create indexes over specific columns in a column family. Indexes let you retrieve data by column value, rather than by row key.

Read and write operations for a row are usually atomic within a single column family, although some implementations provide atomicity across the entire row, spanning multiple column families.
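The following toy model pulls the main ideas together: column families that share a row key, sparse rows, and storage in key order so that range scans are cheap. The Identity and Contact Info family names follow the example above; the `ColumnFamilyTable` class, its methods, and the sample rows are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right


class ColumnFamilyTable:
    """Toy column-family table with sparse rows and row keys kept in sorted order."""

    def __init__(self, families):
        # family name -> row key -> {column name: value}
        self._families = {name: {} for name in families}
        self._row_keys = []  # kept sorted to support range scans

    def put(self, row_key, family, columns):
        rows = self._families[family]
        if row_key not in rows:
            rows[row_key] = {}
        index = bisect_left(self._row_keys, row_key)
        if index == len(self._row_keys) or self._row_keys[index] != row_key:
            self._row_keys.insert(index, row_key)
        # New columns can appear at any time; nothing is declared up front.
        rows[row_key].update(columns)

    def get(self, row_key, family):
        """Fetch one family for one row; missing data is simply absent."""
        return dict(self._families[family].get(row_key, {}))

    def scan(self, start_key, end_key):
        """Row keys in the inclusive range, found without hashing or a full scan."""
        lo = bisect_left(self._row_keys, start_key)
        hi = bisect_right(self._row_keys, end_key)
        return self._row_keys[lo:hi]


table = ColumnFamilyTable(["Identity", "Contact Info"])
table.put("003", "Identity", {"First name": "Lena", "Last name": "Adam"})
table.put("001", "Identity", {"First name": "Mu", "Last name": "Bae"})
table.put("001", "Contact Info", {"Phone": "555-0100", "Email": "mu@example.com"})
table.put("002", "Identity", {"First name": "Francisco", "Last name": "Vila"})
table.put("002", "Contact Info", {"Email": "francisco@example.com"})  # no phone

print(table.scan("001", "002"))
print(table.get("002", "Contact Info"))
print(table.get("003", "Contact Info"))
```

Row 002 has no phone column and row 003 has no contact information at all, yet neither needs a placeholder value. Reading one family for a row does not touch the other family's data.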

Relevant Azure service: HBase in HDInsight

Data analytics

Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. This data is distributed across multiple servers using a shared-nothing architecture to maximize scalability and minimize dependencies. The data is unlikely to be static, so these stores must be able to handle large quantities of information, arriving in a variety of formats from multiple streams, while continuing to process new queries.

Relevant Azure services:

Search engine databases

A search engine database supports the ability to search for information held in external data stores and services. A search engine database can be used to index massive volumes of data and provide near real-time access to these indexes. Although search engine databases are commonly thought of as being synonymous with the web, many large-scale systems use them to provide structured and ad-hoc search capabilities on top of their own databases.

The key characteristics of a search engine database are the ability to store and index information very quickly, and to provide fast response times for search requests. Indexes can be multi-dimensional and may support free-text searches across large volumes of text data. Indexing can be performed using a pull model, triggered by the search engine database, or using a push model, initiated by external application code.

Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how closely they match. Some search engines also support linguistic analysis that can return matches based on synonyms, genre expansions (for example, matching dogs to pets), and stemming (matching words with the same root).
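A very small inverted index shows the shape of these ideas: text is tokenized and stemmed at indexing time, and a search scores each document by the fraction of query terms it contains. This is a deliberately crude sketch. The `stem` function is a naive suffix stripper, not a real stemming algorithm, the scoring is simple term overlap rather than the relevance ranking a production engine would use, and all names are hypothetical.

```python
import re
from collections import defaultdict


def stem(word):
    """Crude suffix stripping; real engines use proper linguistic stemmers."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    return [stem(word) for word in re.findall(r"[a-z]+", text.lower())]


class InvertedIndex:
    """Maps each stemmed term to the set of documents that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Push model: the application hands documents to the index.
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return (doc_id, score) pairs; score is the share of query terms matched."""
        terms = set(tokenize(query))
        hits = defaultdict(int)
        for term in terms:
            for doc_id in self._postings.get(term, ()):
                hits[doc_id] += 1
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        return [(doc_id, count / len(terms)) for doc_id, count in ranked]


index = InvertedIndex()
index.add("d1", "Training dogs at home")
index.add("d2", "Dog walking services")
index.add("d3", "Cat grooming tips")

print(index.search("dog training"))
```

Because both the documents and the query are stemmed, "dogs" and "dog" map to the same term, so the query matches d1 on both terms and d2 on one, while d3 is not returned at all.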

Relevant Azure service: Azure Search

Time series databases

Time series data is a set of values organized by time, and a time series database is a database that is optimized for this type of data. Time series databases must support a very high number of writes, because they typically collect large amounts of data in real time from a large number of sources. Updates are rare, and deletes are often done as bulk operations. Although the records written to a time series database are generally small, there are often a large number of records, and total data size can grow rapidly.
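This access pattern of append-heavy writes, time-range reads, and bulk deletes can be sketched with two parallel lists kept in timestamp order. The `TimeSeries` class is a hypothetical illustration; it assumes writes arrive in time order and ignores concerns such as compression, downsampling, and multiple sources that a real time series database handles.

```python
from bisect import bisect_left, bisect_right


class TimeSeries:
    """Toy append-only series of (timestamp, value) points in time order."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Assumption: points arrive in time order, so a write is a cheap append.
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("out-of-order write")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def query(self, start, end):
        """Points with start <= timestamp <= end, located by binary search."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def drop_before(self, cutoff):
        """Bulk delete: discard every point older than cutoff in one operation."""
        count = bisect_left(self._timestamps, cutoff)
        del self._timestamps[:count]
        del self._values[:count]
        return count


sensor = TimeSeries()
for second in range(0, 50, 10):
    sensor.append(second, 20.0 + second / 10)

print(sensor.query(10, 30))
print(sensor.drop_before(20))
print(len(sensor.query(0, 100)))
```

There is no update method at all, which reflects how rare updates are. Old data leaves the store through `drop_before`, a retention-style bulk delete, rather than through row-by-row deletion.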

Time series databases are good for storing telemetry data. Scenarios include IoT sensors or application/system counters.

Relevant Azure service: Time Series Insights

Object storage

Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams, large application data objects and documents, virtual machine disk images). Objects in these store types are composed of the stored data, some metadata, and a unique ID for accessing the object. Object stores enable the management of extremely large amounts of unstructured data.
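The three parts of a stored object (data, metadata, and a unique ID) can be modeled in a few lines. The `ObjectStore` and `StoredObject` names are hypothetical, the use of a random UUID as the ID is an assumption, and a real object store would persist data durably rather than hold it in memory.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    object_id: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Toy object store: each object is data plus metadata, addressed by ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        # Assumption: the store assigns a random UUID as the unique ID.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = StoredObject(object_id, bytes(data), dict(metadata))
        return object_id

    def get(self, object_id):
        return self._objects[object_id]


store = ObjectStore()
image_id = store.put(b"\x89PNG example bytes", content_type="image/png")

stored = store.get(image_id)
print(stored.metadata)
print(len(stored.data))
```

The store treats the payload as unstructured bytes. Anything it knows about the object, such as the content type here, lives in the metadata.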

Relevant Azure service: Blob Storage

Shared files

Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using file shares enables files to be accessed across a network. Given appropriate security and concurrent access control mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for performing basic, low-level operations such as simple read and write requests.

Relevant Azure service: File Storage