Understand data store models

Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means that a single data store is usually not the best approach. Instead, it's often better to store different types of data in different data stores, each geared toward a specific workload or usage pattern. The term polyglot persistence describes solutions that use a mix of data store technologies. Therefore, it's important to understand the main storage models and their tradeoffs.

Selecting the right data store for your requirements is a key design decision. There are literally hundreds of implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they structure data and the types of operations they support. This article describes several of the most common storage models. Note that a particular data store technology may support multiple storage models. For example, a relational database management system (RDBMS) may also support key/value or graph storage. In fact, there is a general trend toward so-called multi-model support, where a single database system supports several models. But it's still useful to understand the different models at a high level.

Not all data stores in a given category provide the same feature set. Most data stores provide server-side functionality to query and process data. Sometimes this functionality is built into the data storage engine. In other cases, the data storage and processing capabilities are separated, and there may be several options for processing and analysis. Data stores also support different programmatic and management interfaces.

Generally, you should start by considering which storage model is best suited for your requirements. Then consider a particular data store within that category, based on factors such as feature set, cost, and ease of management.

Relational database management systems

Relational databases organize data as a series of two-dimensional tables with rows and columns. Most vendors provide a dialect of the Structured Query Language (SQL) for retrieving and managing data. An RDBMS typically implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent, Isolated, Durable) model for updating information.

An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and all read or write operations must use the schema.

This model is very useful when strong consistency guarantees are important, that is, where all changes are atomic and transactions always leave the data in a consistent state. However, an RDBMS generally can't scale out horizontally without sharding the data in some way. Also, the data in an RDBMS must be normalized, which isn't appropriate for every data set.
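
As a rough illustration of schema-on-write and atomic transactions, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and the place_orders helper are assumptions made up for this example; they aren't taken from any particular product.

```python
import sqlite3

# In-memory database; table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Schema-on-write: structure and constraints are declared before any data is stored.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL CHECK (total >= 0))"
)
conn.commit()


def place_orders(rows):
    """Insert several orders in one transaction; return True if all were committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO orders (customer_id, total) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Sarah')")
conn.commit()

ok = place_orders([(1, 25.0), (1, 40.0)])      # both rows satisfy the constraints
failed = place_orders([(1, 10.0), (99, 5.0)])  # customer 99 does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In the second call, the first row is valid but the second violates the foreign key, so the whole transaction is rolled back and neither row is stored.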

Azure services

Workload

  • Records are frequently created and updated.
  • Multiple operations have to be completed in a single transaction.
  • Relationships are enforced using database constraints.
  • Indexes are used to optimize query performance.

Data type

  • Data is highly normalized.
  • Database schemas are required and enforced.
  • Data entities in the database have many-to-many relationships.
  • Constraints are defined in the schema and imposed on any data in the database.
  • Data requires high integrity. Indexes and relationships need to be maintained accurately.
  • Data requires strong consistency. Transactions operate in a way that ensures all data is 100% consistent for all users and processes.
  • Individual data entries are small to medium-sized.

Examples

  • Inventory management
  • Order management
  • Reporting database
  • Accounting

Key/value stores

A key/value store associates each data value with a unique key. Most key/value stores only support simple query, insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation.

An application can store arbitrary data as a set of values. Any schema information must be provided by the application. The key/value store simply retrieves or stores the value by key.
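
The behavior described above can be sketched with a toy in-memory store. The KeyValueStore class and the "user:42" key are hypothetical; the point is only that the store treats values as opaque bytes, so a "partial" update is really a read, modify, and full overwrite performed by the application.

```python
import json


class KeyValueStore:
    """Toy in-memory key/value store; values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value  # always replaces the whole value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# The application decides the value is JSON; the store neither knows nor checks.
store.put("user:42", json.dumps({"name": "John", "theme": "dark"}).encode())

# "Partial" update: read the whole value, change it in the application, write it all back.
profile = json.loads(store.get("user:42"))
profile["theme"] = "light"
store.put("user:42", json.dumps(profile).encode())

updated = json.loads(store.get("user:42"))
```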

[Diagram of a key/value store]

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you need to query data across different key/value stores. Key/value stores are also not optimized for querying by value.

A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes on separate machines.
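
One common way to spread keys across nodes is to hash the key and use the result to pick a node. The sketch below assumes a fixed number of nodes and a simple modulo placement, both chosen for illustration; the node_for, put, and get names are hypothetical, and real systems use more elaborate placement schemes so that adding a node moves less data.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Map a key to a node index using a stable hash (illustrative placement scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Three dictionaries stand in for three separate machines.
nodes = [dict() for _ in range(3)]


def put(key, value):
    nodes[node_for(key, len(nodes))][key] = value


def get(key):
    # The same hash always leads back to the node that holds the key.
    return nodes[node_for(key, len(nodes))].get(key)


for i in range(100):
    put(f"session:{i}", f"payload-{i}")
```

Because a lookup needs only the key, each request touches exactly one node, which is what makes this model easy to scale out.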

Azure services

Workload

  • Data is accessed using a single key, like a dictionary.
  • No joins, locks, or unions are required.
  • No aggregation mechanisms are used.
  • Secondary indexes are generally not used.

Data type

  • Each key is associated with a single value.
  • There is no schema enforcement.
  • There are no relationships between entities.

Examples

  • Data caching
  • Session management
  • User preference and profile management
  • Product recommendation and ad serving

Document databases

A document database stores a collection of documents, where each document consists of named fields and data. The data can be simple values or complex elements such as lists and child collections. Documents are retrieved by unique keys.

Typically, a document contains the data for a single entity, such as a customer or an order. A document may contain information that would be spread across several relational tables in an RDBMS. Documents don't need to have the same structure. Applications can store different data in documents as business requirements change.
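
To make this concrete, the following sketch models a small collection of order documents as Python dictionaries keyed by document ID. The field names and the order_total and find helpers are invented for the example. Each document embeds customer and line-item data that an RDBMS would likely split across tables, and the second document carries a field the first one lacks.

```python
# Collection keyed by document ID; all field names are invented for illustration.
orders = {
    "1001": {
        "customer": {"name": "Sarah", "email": "sarah@example.com"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.50},
            {"sku": "B-7", "qty": 1, "price": 20.00},
        ],
    },
    "1002": {
        "customer": {"name": "John"},  # no email: documents need not share a structure
        "items": [{"sku": "A-1", "qty": 1, "price": 9.50}],
        "gift_note": "Happy birthday!",  # optional field present only here
    },
}


def order_total(doc):
    """Sum the embedded line items of one order document."""
    return sum(item["qty"] * item["price"] for item in doc["items"])


def find(collection, field, value):
    """Return IDs of documents whose top-level field equals value; absent fields never match."""
    return [doc_id for doc_id, doc in collection.items() if doc.get(field) == value]


total_1001 = order_total(orders["1001"])
with_note = find(orders, "gift_note", "Happy birthday!")
```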

[Diagram of a document store]

Azure service

Workload

  • Insert and update operations are common.
  • No object-relational impedance mismatch. Documents can better match the object structures used in application code.
  • Individual documents are retrieved and written as a single block.
  • Data requires indexes on multiple fields.

Data type

  • Data can be managed in a denormalized way.
  • Size of individual document data is relatively small.
  • Each document type can use its own schema.
  • Documents can include optional fields.
  • Document data is semi-structured, meaning that data types of each field are not strictly defined.

Examples

  • Product catalog
  • Content management
  • Inventory management

Graph databases

A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes. Nodes and edges can have properties that provide information about that node or edge, similar to columns in a table. Edges can also have a direction indicating the nature of the relationship.

Graph databases can efficiently perform queries across the network of nodes and edges and analyze the relationships between entities. The following diagram shows an organization's personnel database structured as a graph. The entities are employees and departments, and the edges indicate reporting relationships and the departments in which employees work.

[Diagram of a graph database]

This structure makes it straightforward to perform queries such as "Find all employees who report directly or indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and relationships, you can perform very complex analyses very quickly. Many graph databases provide a query language that you can use to traverse a network of relationships efficiently.
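
The two example queries can be sketched as traversals over a small edge list. The employees, departments, and relationship labels below are made-up sample data, and the plain Python functions stand in for what a graph query language would express more directly.

```python
from collections import deque

# Directed edges as (source, relationship, target); all names are sample data.
edges = [
    ("Alice", "REPORTS_TO", "Sarah"),
    ("Bob", "REPORTS_TO", "Sarah"),
    ("Carol", "REPORTS_TO", "Alice"),
    ("Dave", "REPORTS_TO", "Carol"),
    ("John", "WORKS_IN", "Sales"),
    ("Alice", "WORKS_IN", "Sales"),
    ("Bob", "WORKS_IN", "Engineering"),
]


def reports_of(manager):
    """Everyone who reports to manager directly or indirectly (breadth-first traversal)."""
    found, queue = set(), deque([manager])
    while queue:
        current = queue.popleft()
        for src, rel, dst in edges:
            if rel == "REPORTS_TO" and dst == current and src not in found:
                found.add(src)
                queue.append(src)
    return found


def colleagues_of(person):
    """Other people who work in any department that person works in."""
    depts = {dst for src, rel, dst in edges if rel == "WORKS_IN" and src == person}
    return {
        src
        for src, rel, dst in edges
        if rel == "WORKS_IN" and dst in depts and src != person
    }
```

Scanning the full edge list at every hop is fine for a sketch; a graph database typically stores adjacency information so that each hop is a direct lookup.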

Azure services

Workload

  • Complex relationships between data items involving many hops between related data items.
  • Relationships between data items are dynamic and change over time.
  • Relationships between objects are first-class citizens, without requiring foreign keys and joins to traverse.

Data type

  • Nodes and relationships.
  • Nodes are similar to table rows or JSON documents.
  • Relationships are just as important as nodes, and are exposed directly in the query language.
  • Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller nodes, combined with traversable relationships.

Examples

  • Organization charts
  • Social graphs
  • Fraud detection
  • Recommendation engines

Data analytics

Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. The data is distributed across multiple servers to maximize scalability. Large data file formats such as delimited files (CSV), Parquet, and ORC are widely used in data analytics. Historical data is typically stored in data stores such as blob storage or Azure Data Lake Storage Gen2, which are then accessed by Azure Synapse, Databricks, or HDInsight as external tables. A typical scenario that uses data stored as Parquet files for performance is described in the article Use external tables with Synapse SQL.

Azure services

Workload

  • Data analytics
  • Enterprise BI

Data type

  • Historical data from multiple sources.
  • Usually denormalized in a "star" or "snowflake" schema, consisting of fact and dimension tables.
  • Usually loaded with new data on a scheduled basis.
  • Dimension tables often include multiple historic versions of an entity, referred to as a slowly changing dimension.
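
As a small sketch of a fact table joined to a slowly changing dimension, the example below keeps two versions of one customer, each with its own surrogate key and validity range. The column names, the use of None to mark the current version, and the sample values are assumptions for illustration, not a prescribed layout.

```python
from datetime import date

# Dimension rows: each version of a customer has its own surrogate key (sk)
# and validity range. Column names are illustrative.
dim_customer = [
    {"sk": 1, "customer_id": "C1", "city": "Seattle",
     "valid_from": date(2020, 1, 1), "valid_to": date(2022, 6, 30)},
    {"sk": 2, "customer_id": "C1", "city": "Austin",
     "valid_from": date(2022, 7, 1), "valid_to": None},  # None marks the current version
]

# Fact rows reference the dimension version that was valid when the sale happened.
fact_sales = [
    {"customer_sk": 1, "amount": 100.0},
    {"customer_sk": 1, "amount": 50.0},
    {"customer_sk": 2, "amount": 70.0},
]


def sales_by_city():
    """Join facts to the dimension by surrogate key and total the amounts per city."""
    city_of = {row["sk"]: row["city"] for row in dim_customer}
    totals = {}
    for fact in fact_sales:
        city = city_of[fact["customer_sk"]]
        totals[city] = totals.get(city, 0.0) + fact["amount"]
    return totals


def version_on(customer_id, day):
    """Return the dimension row that was valid for customer_id on the given day."""
    for row in dim_customer:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= day
                and (row["valid_to"] is None or day <= row["valid_to"])):
            return row
    return None
```

Because older sales still point at the older dimension row, historical reports keep attributing them to the city the customer lived in at the time.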

Examples

  • Enterprise data warehouse

Column-family databases

A column-family database organizes data into rows and columns. In its simplest form, a column-family database can appear very similar to a relational database, at least conceptually. The real power of a column-family database lies in its denormalized approach to structuring sparse data.

You can think of a column-family database as holding tabular data with rows and columns, but the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse (that is, a row doesn't need to have a value for every column).

The following diagram shows an example with two column families, Identity and Contact Info. The data for a single entity has the same row key in each column family. This structure, where the rows for any given object in a column family can vary dynamically, is an important benefit of the column-family approach, making this form of data store highly suited for storing structured, volatile data.

[Diagram of a column-family database]

Unlike a key/value store or a document database, most column-family databases store data in key order, rather than by computing a hash. Many implementations allow you to create indexes over specific columns in a column family. Indexes let you retrieve data by column value, rather than by row key.

Read and write operations for a row are usually atomic within a single column family, although some implementations provide atomicity across the entire row, spanning multiple column families.
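
The layout described in this section can be approximated with a toy table that groups columns into families, allows sparse rows, and keeps row keys sorted so that range scans are cheap. The ColumnFamilyTable class, its method names, and the sample rows are hypothetical and don't mirror any specific product's API.

```python
import bisect


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column -> value, with row keys kept sorted."""

    def __init__(self, families):
        self.families = set(families)
        self._keys = []  # row keys in sorted order
        self._rows = {}  # row key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # keep key order on insert
            self._rows[row_key] = {}
        # Columns are created on demand, so rows can be sparse.
        self._rows[row_key].setdefault(family, {})[column] = value

    def get(self, row_key, family):
        return self._rows.get(row_key, {}).get(family, {})

    def scan(self, start, stop):
        """Yield (row_key, row) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ColumnFamilyTable(["Identity", "ContactInfo"])
table.put("003", "Identity", "first_name", "Sarah")
table.put("001", "Identity", "first_name", "John")
table.put("001", "ContactInfo", "phone", "555-0100")
table.put("002", "Identity", "first_name", "Alice")
table.put("002", "ContactInfo", "email", "alice@example.com")

scanned = [key for key, _ in table.scan("001", "003")]
```

Row "003" has no ContactInfo columns at all, and rows "001" and "002" hold different columns within the same family, which is the sparseness the text describes.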

Azure services

Workload

  • Most column-family databases perform write operations extremely quickly.
  • Update and delete operations are rare.
  • Designed to provide high throughput and low-latency access.
  • Supports easy query access to a particular set of fields within a much larger record.
  • Massively scalable.

Data type

  • Data is stored in tables consisting of a key column and one or more column families.
  • Specific columns can vary by individual rows.
  • Individual cells are accessed via get and put commands.
  • Multiple rows are returned using a scan command.

Examples

  • Recommendations
  • Personalization
  • Sensor data
  • Telemetry
  • Messaging
  • Social media analytics
  • Web analytics
  • Activity monitoring
  • Weather and other time-series data

Search engine databases

A search engine database allows applications to search for information held in external data stores. A search engine database can index massive volumes of data and provide near real-time access to these indexes.

Indexes can be multi-dimensional and may support free-text searches across large volumes of text data. Indexing can be performed using a pull model, triggered by the search engine database, or using a push model, initiated by external application code.

Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how closely they match. Some search engines also support linguistic analysis that can return matches based on synonyms, genre expansions (for example, matching dogs to pets), and stemming (matching words with the same root).
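
The difference between exact and fuzzy search can be sketched with a tiny inverted index. The sample documents, the exact_search and fuzzy_search helpers, and the 0.75 similarity cutoff are assumptions for this example; the similarity score comes from Python's standard difflib module, which is only a stand-in for the scoring a real search engine would use.

```python
import difflib
from collections import defaultdict

# Sample documents, keyed by ID (illustrative data only).
docs = {
    1: "Running shoes for trail runners",
    2: "Waterproof hiking boots",
    3: "Trail map and compass",
}

# Inverted index: term -> set of IDs of the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def exact_search(term):
    """Return IDs of documents containing exactly this term."""
    return set(index.get(term.lower(), set()))


def fuzzy_search(term, cutoff=0.75):
    """Return {doc_id: best similarity score} for indexed terms that are close to term."""
    term = term.lower()
    scores = {}
    for candidate in difflib.get_close_matches(term, list(index), n=5, cutoff=cutoff):
        score = difflib.SequenceMatcher(None, term, candidate).ratio()
        for doc_id in index[candidate]:
            scores[doc_id] = max(scores.get(doc_id, 0.0), score)
    return scores
```

A misspelled query such as "hikking" finds nothing with exact_search, while fuzzy_search still reaches the document containing "hiking" and reports how close the match was.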

Azure service

Workload

  • Data indexes from multiple sources and services.
  • Queries are ad hoc and can be complex.
  • Full text search is required.
  • Ad hoc self-service query is required.

Data type

  • Semi-structured or unstructured text
  • Text with reference to structured data

Examples

  • Product catalogs
  • Site search
  • Logging

Time series databases

Time series data is a set of values organized by time. Time series databases typically collect large amounts of data in real time from a large number of sources. Updates are rare, and deletes are often done as bulk operations. Although the records written to a time-series database are generally small, there are often a large number of records, and total data size can grow rapidly.

Azure service

Workload

  • Records are generally appended sequentially in time order.
  • An overwhelming proportion of operations (95-99%) are writes.
  • Updates are rare.
  • Deletes occur in bulk, and are made to contiguous blocks or records.
  • Data is read sequentially in either ascending or descending time order, often in parallel.
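
The access pattern in this list (append in time order, read a time range sequentially, delete old data in bulk) can be sketched with two parallel lists and binary search. The TimeSeries class and its method names are hypothetical, and the integer timestamps are placeholders for real time values.

```python
import bisect


class TimeSeries:
    """Toy append-mostly series of (timestamp, value) points kept in time order."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, ts, value):
        if self._times and ts < self._times[-1]:
            raise ValueError("this sketch only accepts points in time order")
        self._times.append(ts)
        self._values.append(value)

    def read(self, start, end, descending=False):
        """Return points with start <= ts < end, in ascending or descending time order."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        points = list(zip(self._times[lo:hi], self._values[lo:hi]))
        return points[::-1] if descending else points

    def delete_before(self, cutoff):
        """Bulk-delete the contiguous block of points older than cutoff; return the count removed."""
        hi = bisect.bisect_left(self._times, cutoff)
        del self._times[:hi]
        del self._values[:hi]
        return hi


series = TimeSeries()
for ts in range(0, 100, 10):  # timestamps 0, 10, ..., 90
    series.append(ts, ts * 0.5)
```

Keeping points in time order means a range read is one contiguous slice and a retention cleanup is one contiguous delete, which matches the workload described above.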

Data type

  • A timestamp is used as the primary key and sorting mechanism.
  • Tags may carry additional information about the entry, such as its type and origin.

Examples

  • Monitoring and event telemetry.
  • Sensor or other IoT data.

Object storage

Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams, large application data objects and documents, virtual machine disk images). Large data files, such as delimited files (CSV), Parquet, and ORC, are also commonly stored in this model. Object stores can manage extremely large amounts of unstructured data.

Azure service

Workload

  • Identified by key.
  • Content is typically an asset such as a delimited file, image, or video file.
  • Content must be durable and external to any application tier.

Data type

  • Data size is large.
  • Value is opaque.

Examples

  • Images, videos, office documents, PDFs
  • Static HTML, JSON, CSS
  • Log and audit files
  • Database backups

Shared files

Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using file shares enables files to be accessed across a network. Given appropriate security and concurrent access control mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for performing basic, low-level operations such as simple read and write requests.

Azure service

Workload

  • Migration from existing apps that interact with the file system.
  • Requires SMB interface.

Data type

  • Files in a hierarchical set of folders.
  • Accessible with standard I/O libraries.

Examples

  • Legacy files
  • Shared content accessible among a number of VMs or app instances

With this understanding of the different data storage models, the next step is to evaluate your workload and application and decide which data store will meet your specific needs. Use the data storage decision tree to help with this process.