
Data definition languages for schema migration

This article describes design considerations and performance options for data definition languages (DDLs) when you're migrating schemas to Azure Synapse Analytics.

Design considerations

Review these design considerations as you plan your schema migration.

Preparation for migration

When you're preparing to migrate existing data to Azure Synapse Analytics, it's important to clearly define the scope of the exercise (especially for an initial migration project). Time spent up front to understand how database objects and related processes will migrate can reduce both effort and risk later in the project.

Create an inventory of database objects to be migrated. Depending on the source platform, this inventory will include some or all of the following objects:

  • Tables
  • Views
  • Indexes
  • Functions
  • Stored procedures
  • Data distribution and partitioning

The basic information for these objects should include metrics such as row counts, physical size, data compression ratios, and object dependencies. This information should be available via queries against system catalog tables in the source system. The system metadata is the best source for this information; external documentation might be stale and not in sync with changes that have been applied to the data structure since the initial implementation.
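For a SQL Server-family source, for example, a query along these lines can produce a first-cut inventory. The catalog views shown are SQL Server conventions and the size arithmetic is illustrative; other source platforms expose equivalent system tables.

```sql
-- Approximate per-table row counts and sizes from the system catalog
-- (SQL Server-style catalog views; adapt for other source platforms).
SELECT
    s.name AS schema_name,
    t.name AS table_name,
    SUM(p.rows)                    AS row_count,
    SUM(a.total_pages) * 8 / 1024  AS total_mb,  -- 8 KB pages
    SUM(a.used_pages)  * 8 / 1024  AS used_mb
FROM sys.tables t
JOIN sys.schemas s           ON t.schema_id = s.schema_id
JOIN sys.partitions p        ON t.object_id = p.object_id AND p.index_id IN (0, 1)
JOIN sys.allocation_units a  ON p.partition_id = a.container_id
GROUP BY s.name, t.name
ORDER BY total_mb DESC;
```

Sorting by size first helps focus the migration plan on the tables that dominate storage and load time.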

You might also be able to analyze actual object usage from query logs, or use tooling from Microsoft partners like Attunity Visibility, to help. It's possible that some tables don't need to be migrated because they're no longer used in production queries.

Data size and workload information is important for Azure Synapse Analytics because it helps to define appropriate configurations, such as the required levels of concurrency. Understanding the expected growth of data and workloads might affect the recommended target configuration, so it's a good practice to harness this information too.

When you're using data volumes to estimate the storage required for the new target platform, it's important to understand the data compression ratio, if any, on the source database. Simply taking the amount of storage used on the source system is likely to be a false basis for sizing. Monitoring and metadata information can help you determine the uncompressed raw data size and the overheads for indexing, data replication, logging, and other processes in the current system.

The uncompressed raw data size of the tables to be migrated is a good starting point for estimating the storage required in the new target Azure Synapse Analytics environment.

The new target platform will also include a compression factor and indexing overhead, but these will probably differ from the source system. Azure Synapse Analytics storage pricing also includes seven days of snapshot backups. When compared to the existing environment, this can affect the overall cost of the storage required.

You can delay performance tuning of the data model until late in the migration process, after real data volumes are in the data warehouse. However, we recommend that you implement some performance tuning options earlier on.

For example, in Azure Synapse Analytics, it makes sense to define small dimension tables as replicated tables and to define large fact tables as clustered columnstore indexes. Similarly, indexes defined in the source environment provide a good indication of which columns might benefit from indexing in the new environment. Using this information when you're initially defining the tables before loading will save time later in the process.
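A minimal sketch of this pattern in dedicated SQL pool DDL, with hypothetical table and column names:

```sql
-- Small dimension: replicate a full copy to every compute node.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT          NOT NULL,
    ProductName VARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);

-- Large fact table: spread rows by a frequently joined key.
CREATE TABLE dbo.FactSales
(
    ProductKey INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX   -- the default, shown explicitly here
);
```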

It's good practice to measure the compression ratio and index overhead for your own data in Azure Synapse Analytics as the migration project progresses. This measurement enables future capacity planning.

It might be possible to simplify your existing data warehouse before migration by reducing complexity. This effort might include:

  • Removing or archiving unused tables before migrating, to avoid migrating data that's not used. Archiving to Azure Blob Storage and defining the data as an external table might keep the data available at a lower cost.
  • Converting physical data marts to virtual data marts by using data virtualization software, to reduce what you have to migrate. This conversion also improves agility and reduces total cost of ownership, so you might consider it as modernization during migration.
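As a sketch of the archiving option, data exported to Blob Storage can be surfaced as an external table. The data source, file format, and path here are hypothetical, and the referenced external objects must already exist:

```sql
-- Assumes an external data source and file format were created beforehand.
CREATE EXTERNAL TABLE dbo.SalesArchive
(
    SaleId   BIGINT,
    SaleDate DATE,
    Amount   DECIMAL(18,2)
)
WITH
(
    LOCATION    = '/archive/sales/',  -- folder in the Blob Storage container
    DATA_SOURCE = AzureBlobArchive,   -- hypothetical external data source
    FILE_FORMAT = ParquetFileFormat   -- hypothetical file format object
);
```

Queries can then join the archived data on demand without it occupying warehouse storage.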

One objective of the migration exercise might also be to modernize the warehouse by changing the underlying data model. One example is moving from an Inmon-style data model to a data vault approach. You should decide this as part of the preparation phase and incorporate a strategy for the transition into the migration plan.

The recommended approach in this scenario is to first migrate the data model as is to the new platform, and then transition to the new model in Azure Synapse Analytics. Use the platform's scalability and performance characteristics to execute the transformation without affecting the source system.

Data model migration

Depending on the platform and the origins of the source system, the data model of some or all parts might already be in a star or snowflake schema form. If so, you can directly migrate it to Azure Synapse Analytics as is. This scenario is the easiest and lowest-risk migration to achieve. An as-is migration can also be the first stage of a more complex migration that includes a transition to a new underlying data model, such as a data vault, as described earlier.

Any set of relational tables and views can be migrated to Azure Synapse Analytics. For analytical query workloads against a large data set, a star or snowflake data model generally gives the best overall performance. If the source data model is not already in this form, it might be worth using the migration process to reengineer the model.

If the migration project includes any changes to the data model, the best practice is to perform these changes in the new target environment. That is, migrate the existing model first, and then use the power and flexibility of Azure Synapse Analytics to transform the data to the new model. This approach minimizes the impact on the existing system and uses the performance and scalability of Azure Synapse Analytics to make any changes quickly and cost-effectively.

You can migrate the existing system as several layers; for example, the data ingest/staging layer, the data warehouse layer, and the reporting or data mart layer. Each layer consists of relational tables and views. Although you can migrate these to Azure Synapse Analytics as they are, it might be more cost-effective and reliable to use some of the features and capabilities of the Azure ecosystem. For example:

  • Data ingest and staging: Instead of using relational tables, you can use Azure Blob Storage in conjunction with PolyBase for fast parallel data loading during part of the ETL (extract, transform, load) or ELT (extract, load, transform) process.

  • Reporting layer and data marts: The performance characteristics of Azure Synapse Analytics might eliminate the need to physically instantiate aggregated tables for reporting purposes or data marts. It might be possible to implement these as views onto the core data warehouse or via a third-party data virtualization layer. At the basic level, you can achieve data migration of historical data, and possibly also incremental updates, as shown in this diagram:

    Figure 1: A modern data warehouse.

If you can use these or similar approaches, the number of tables to be migrated is reduced. Some processes might be simplified or eliminated, again reducing the migration workload. The applicability of these approaches depends on the individual use case. But the general principle is to consider using the features and facilities of the Azure ecosystem, where possible, to reduce the migration workload and build a cost-effective target environment. This also holds true for other functions, such as backup/restore and workflow management and monitoring.

Products and services available from Microsoft partners can help with data warehouse migrations and, in some cases, can automate parts of the process. If the existing system incorporates a third-party ETL product, it might already support Azure Synapse Analytics as a target environment, and the existing ETL workflows can be redirected to the new target data warehouse.

Data marts: Physical or virtual

It's a common practice for organizations with older data warehouse environments to create data marts that provide their departments or business functions with good ad hoc self-service query and report performance. A data mart typically consists of a subset of the data warehouse that contains aggregated versions of the original data. Its form, typically a dimensional data model, lets users easily query the data and receive fast response times from user-friendly tools like Tableau, MicroStrategy, or Microsoft Power BI.

One use of data marts is to expose the data in a usable form, even if the underlying warehouse data model is something different (such as a data vault). This approach is also known as a three-tier model.

You can use separate data marts for individual business units within an organization to implement robust data security regimes. For example, you can allow user access to specific data marts relevant to them and eliminate, obfuscate, or anonymize sensitive data.

If these data marts are implemented as physical tables, they require additional storage resources to house them and additional processing to build and refresh them regularly. With physical tables, the data in the mart is only as current as the last refresh operation, so they might not be suitable for highly volatile data dashboards.

With the advent of low-cost, scalable massively parallel processing (MPP) architectures, you might be able to provide data mart functionality without needing to instantiate the mart as a set of physical tables. You can achieve this with Azure Synapse Analytics by using one of these methods to virtualize the data marts:

  • SQL views on the main data warehouse.
  • A virtualization layer that uses features such as views in Azure Synapse Analytics, or third-party virtualization products such as Denodo.

This approach simplifies or eliminates the need for additional storage and aggregation processing. It reduces the overall number of database objects to be migrated.

Another benefit of this approach is the capacity of the data warehouse to run operations such as joins and aggregations on large data volumes. For example, implementing the aggregation and join logic within a virtualization layer, and exposing external reporting through virtualized views, pushes the heavy processing required to create these views down into the data warehouse.

The primary drivers for choosing a physical or virtual data mart implementation are:

  • More agility. A virtual data mart is easier to change than physical tables and the associated ETL processes.
  • Lower total cost of ownership, because of fewer data stores and copies of data in a virtualized implementation.
  • Elimination of ETL jobs to migrate, and a simplified data warehouse architecture, in a virtualized environment.
  • Performance. Historically, physical data marts have performed more reliably. Virtualization products are now implementing intelligent caching techniques to mitigate this.

You can also use data virtualization to display data to users consistently during a migration project.

Data mapping

Key and integrity constraints in Azure Synapse Analytics

Primary key and foreign key constraints aren't currently enforced within Azure Synapse Analytics. However, you can include the definition for PRIMARY KEY in the CREATE TABLE statement with the NOT ENFORCED clause. This means that third-party reporting products can use the table's metadata to understand the keys within the data model and therefore generate the most efficient queries.
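A short sketch of a metadata-only key definition in a dedicated SQL pool; the table and constraint names are hypothetical:

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT          NOT NULL,
    CustomerName VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Declares the key for metadata consumers; the engine does not enforce it.
ALTER TABLE dbo.DimCustomer
ADD CONSTRAINT PK_DimCustomer
    PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;
```

Because the constraint is not enforced, the loading process is responsible for guaranteeing that key values are actually unique.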

Data type support in Azure Synapse Analytics

Some older database systems include support for data types that aren't directly supported within Azure Synapse Analytics. You can handle these data types by using a supported data type to store the data as is, or by transforming the data to a supported data type.

Here's an alphabetical list of supported data types:

  • bigint
  • binary [ (n) ]
  • bit
  • char [ (n) ]
  • date
  • datetime
  • datetime2 [ (n) ]
  • datetimeoffset [ (n) ]
  • decimal [ (precision [, scale ]) ]
  • float [ (n) ]
  • int
  • money
  • nchar [ (n) ]
  • numeric [ (precision [ , scale ]) ]
  • nvarchar [ (n | MAX) ]
  • real [ (n) ]
  • smalldatetime
  • smallint
  • smallmoney
  • time [ (n) ]
  • tinyint
  • uniqueidentifier
  • varbinary [ (n | MAX) ]
  • varchar [ (n | MAX) ]

The following table lists common data types that aren't currently supported, together with the recommended approach for storing them in Azure Synapse Analytics.

| Unsupported data type | Workaround |
| --- | --- |
| geometry | varbinary |
| geography | varbinary |
| hierarchyid | nvarchar(4000) |
| image | varbinary |
| text | varchar |
| ntext | nvarchar |
| sql_variant | Split column into several strongly typed columns |
| table | Convert to temporary tables |
| timestamp | Rework code to use datetime2 and the CURRENT_TIMESTAMP function |
| xml | varchar |
| User-defined type | Convert back to the native data type when possible |
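For example, a spatial column could be carried over as varbinary along these lines. The table and column names are hypothetical, and the conversion runs on the source system before export:

```sql
-- On a SQL Server source: serialize a geometry column to varbinary for transfer.
SELECT
    SiteId,
    CAST(Boundary AS VARBINARY(MAX)) AS BoundaryBinary  -- geometry -> varbinary
FROM dbo.Sites;
```

The binary form preserves the data for storage in the target; any spatial processing then has to happen in an application layer or another service that understands the format.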

Potential data issues

Depending on the source environment, some issues can cause problems when you're migrating data:

  • There can be subtle differences in the way that NULL data is handled in different database products. Examples include collation sequence and the handling of empty character strings.
  • DATE, TIME, INTERVAL, and TIME ZONE data and the associated functions can vary widely from product to product.

Test these thoroughly to determine whether the desired results are achieved in the target environment. The migration exercise can uncover bugs or incorrect results that are currently part of the existing source system, and the migration process is a good opportunity to correct anomalies.

Best practices for defining columns in Azure Synapse Analytics

It's common for older systems to contain columns with inefficient data types. For example, you might find a field defined as VARCHAR(20) when the actual data values would fit into a CHAR(5) field. Or, you might find INTEGER fields used when all values would fit within a SMALLINT field. Inefficient data types can lead to inefficiencies in both storage and query performance, especially in large fact tables.

A migration exercise is a good time to check and rationalize current data definitions. You can automate these tasks by using SQL queries to find the maximum numeric value or character length within a data field and comparing the result to the data type.
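A minimal profiling query of this kind might look as follows; the table and column names are hypothetical:

```sql
-- Profile actual data widths to right-size the target column definitions.
SELECT
    MAX(LEN(CustomerName)) AS max_name_length,  -- compare to the declared VARCHAR size
    MAX(OrderQuantity)     AS max_quantity      -- compare to the declared integer type
FROM dbo.StagingOrders;
```

If the maximum name length comes back as, say, 30 against a declared VARCHAR(200), the target definition can be tightened accordingly (with some headroom for future growth).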

In general, it's a good practice to minimize the total defined row length for a table. For the best query performance, use the smallest data type for each column, as described earlier. The recommended approach for loading data from external tables in Azure Synapse Analytics is to use the PolyBase utility, which supports a maximum defined row length of 1 megabyte (MB). PolyBase won't load tables with rows longer than 1 MB; you must use the bcp utility instead.

For the most efficient join execution, define the columns on both sides of the join as the same data type. If the key of a dimension table is defined as SMALLINT, then the corresponding reference columns in fact tables that use that dimension should also be defined as SMALLINT.

Avoid defining character fields with a large default size. If the maximum size of data within a field is 50 characters, use VARCHAR(50). Similarly, don't use NVARCHAR if VARCHAR will suffice. NVARCHAR stores Unicode data to allow for different language character sets; VARCHAR stores ASCII data and takes less space.

Summary of design recommendations

Don't migrate unnecessary objects or processes. Use built-in features and functions in the target Azure environment where appropriate to reduce the actual number of objects and processes to migrate. Consider using a virtualization layer to reduce or eliminate the number of physical data marts that you'll migrate and to push down processing into the data warehouse.

Automate wherever possible, and use metadata from the system catalogs in the source system to generate DDL for the target environment. If possible, also automate the generation of documentation. Microsoft partners such as WhereScape can provide specialized tools and services to assist with automation.

Perform any required data model changes or data-mapping optimizations on the target platform. You can make these changes more efficiently in Azure Synapse Analytics. This approach reduces the impact on source systems that might already be running close to full capacity.

Performance options

This section describes the features available within Azure Synapse Analytics that you can use to improve performance for a data model.

General approach

The database to be migrated will probably already have had performance tuning applied through the source platform's features. Indexes, data partitioning, and data distribution are examples of such performance tuning. When you're preparing for migration, documenting this tuning can capture and reveal optimizations that you can apply in the Azure Synapse Analytics target environment.

For example, the presence of a non-unique index on a table can indicate that the fields used in the index are frequently used for filtering, grouping, or joining. This will probably still be the case in the new environment, so keep it in mind when you're choosing which fields to index there. For more detailed information about Teradata or Netezza environments, see the Azure Synapse Analytics articles about solutions and migration for Teradata and solutions and migration for Netezza.

Use the performance and scalability of the target Azure Synapse Analytics environment to experiment with different performance options, like data distribution, and to determine the best choice among alternative approaches (for example, replicated versus hash-distributed for a large dimension table). This doesn't mean that data must be reloaded from external sources. It's relatively quick and easy to test alternative approaches in Azure Synapse Analytics by creating copies of any table with different partitioning or distribution options via a CREATE TABLE AS SELECT statement.
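Such an experiment can be sketched with a single CTAS statement; the table and column names are illustrative:

```sql
-- Copy an existing table with an alternative distribution to compare performance.
CREATE TABLE dbo.FactSales_HashTest
WITH
(
    DISTRIBUTION = HASH(CustomerId),  -- alternative to the original distribution
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.FactSales;
```

You can then run representative queries against both versions, compare elapsed times and data movement, and drop the copy that loses.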

Use the monitoring tools provided by the Azure environment to understand how queries are executed and where bottlenecks might be occurring. Tools are also available from third-party Microsoft partners to provide monitoring dashboards and automated resource management and alerting.

Each SQL operation in Azure Synapse Analytics, along with the resources used by that query (such as memory or CPU), is logged into system tables. A series of dynamic management views simplifies access to this information.
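For example, a quick look at the slowest recent requests can be taken from the dedicated SQL pool DMVs along these lines:

```sql
-- Recently tracked requests and their elapsed time, from the DMVs.
SELECT TOP 10
    request_id,
    status,
    total_elapsed_time,  -- milliseconds
    command
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;
```

The request_id returned here can then be used with the other sys.dm_pdw_* views to drill into per-step data movement for a specific query.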

The following sections explain the key options within Azure Synapse Analytics (formerly Azure SQL Data Warehouse) for tuning query performance. Existing environments will contain information about potential optimizations in the target environment.

Temporary tables

Azure Synapse Analytics supports temporary tables that are visible only to the session in which they were created. They exist for the duration of a user session and are automatically dropped at the end of the session.

To create a temporary table, prefix the table name with the hash character (#). You can use all the usual indexing and distribution options with temporary tables, as described in the next section.
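A brief sketch, using hypothetical names, of a temporary table created via CTAS with explicit options:

```sql
-- Session-scoped temporary table; dropped automatically at the end of the session.
CREATE TABLE #StageCustomers
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- fast load, no obvious hash key
    HEAP                         -- no index overhead for transient data
)
AS
SELECT CustomerId, CustomerName
FROM dbo.RawCustomers;
```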

Temporary tables have some restrictions:

  • Renaming them isn't allowed.
  • Creating views on them or partitioning them isn't allowed.
  • Changing permissions isn't allowed.

Temporary tables are commonly used within ETL/ELT processing, where transient intermediate results are used as part of a transformation process.

Table distribution options

Azure Synapse Analytics is an MPP database system that achieves performance and scalability by running in parallel across multiple processing nodes.

The ideal processing scenario for running a SQL query in a multinode environment is to balance the workload and give all nodes an equal amount of data to process. This approach also lets you minimize or eliminate the amount of data that has to be moved between nodes to satisfy the query.

It can be challenging to achieve the ideal scenario, because typical analytics queries often contain aggregations and multiple joins between several tables, such as between fact and dimension tables.

One way to influence how queries are processed is to use the distribution options within Azure Synapse Analytics to specify where each table's individual data rows are stored. For example, assume that two large tables are joined on the data column CUSTOMER_ID. By distributing both tables on the CUSTOMER_ID column, you can ensure that the data from each side of the join will already be co-located on the same processing node whenever that join is performed. This method eliminates the need to move data between nodes. The distribution specification for a table is defined in the CREATE TABLE statement.
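The co-location pattern above can be sketched as follows; the table names and columns are hypothetical:

```sql
-- Both large tables hashed on the join key, so matching rows are co-located.
CREATE TABLE dbo.Orders
( CUSTOMER_ID INT NOT NULL, OrderDate DATE NOT NULL )
WITH ( DISTRIBUTION = HASH(CUSTOMER_ID), CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dbo.Payments
( CUSTOMER_ID INT NOT NULL, PaidAmount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CUSTOMER_ID), CLUSTERED COLUMNSTORE INDEX );

-- This join can execute without inter-node data movement.
SELECT o.CUSTOMER_ID, COUNT(*) AS order_count, SUM(p.PaidAmount) AS paid
FROM dbo.Orders o
JOIN dbo.Payments p ON o.CUSTOMER_ID = p.CUSTOMER_ID
GROUP BY o.CUSTOMER_ID;
```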

The following sections describe the available distribution options and recommendations for when to use them. It's possible to change the distribution of a table after the initial load, if necessary: re-create the table with the new distribution by using the CREATE TABLE AS SELECT statement.

Round robin

Round-robin table distribution is the default option and spreads the data evenly across the nodes in the system. This method is good for fast data loading and for data that's relatively low in volume and doesn't have an obvious candidate key for hashing. It's frequently used for staging tables as part of an ETL or ELT process.

Hashed

The system assigns each row to a hash bucket based on a hashing algorithm applied to a user-defined key, like CUSTOMER_ID in the preceding example. The bucket is then assigned to a specific node, and all data rows hashed on the same value end up on the same processing node.

This method is useful for large tables that are frequently joined or aggregated on a key. Other large tables to be joined should be hashed on the same key if possible. If there are multiple candidates for the hash key, choose the most frequently joined one.

The hash column shouldn't contain nulls and isn't typically a date, because many queries filter on date. Hashing is typically more efficient if the key to hash is an integer value instead of CHAR or VARCHAR. Avoid choosing keys with a highly skewed range of values, such as when a small number of key values represent a large percentage of the data rows.

Replicated

选择 "复制为表的分发选项" 将导致表的完整副本在每个计算节点上进行复制,以便进行查询处理。Choosing replicated as the distribution option for a table will cause a complete copy of that table to be replicated on each compute node for query processing purposes.

This approach is useful for relatively small tables (typically less than 2 GB compressed) that are relatively static and frequently joined to larger tables via an equi-join. These tables are often dimension tables in a star schema.

Indexing

Azure Synapse Analytics includes options for indexing data in large tables to reduce the resources and time required to retrieve records:

  • Clustered columnstore index
  • Clustered index
  • Non-clustered index

A non-indexed option, HEAP, exists for tables that don't benefit from any of the index options. Using indexes is a trade-off between improved query times on one hand and longer load times plus greater use of storage space on the other. Indexes often speed up SELECT, UPDATE, DELETE, and MERGE operations on large tables when only a small percentage of the data rows is affected, and they can minimize full table scans.
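A few of these choices expressed as DDL, with hypothetical table names:

```sql
-- Transient staging data: HEAP avoids all index maintenance at load time.
CREATE TABLE dbo.StagingEvents
( EventId BIGINT NOT NULL, Payload VARCHAR(8000) )
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN );

-- Selective single-row lookups: a clustered rowstore index on the filter column.
CREATE TABLE dbo.CustomerLookup
( CustomerId INT NOT NULL, Email VARCHAR(200) )
WITH ( CLUSTERED INDEX (CustomerId), DISTRIBUTION = HASH(CustomerId) );

-- Secondary non-clustered index to support filtering on an additional column.
CREATE INDEX IX_CustomerLookup_Email ON dbo.CustomerLookup (Email);
```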

Indexes are automatically created when UNIQUE or PRIMARY KEY constraints are defined on columns.

Clustered columnstore index

The clustered columnstore index is the default indexing option within Azure Synapse Analytics. It provides the best compression and query performance for large tables. For smaller tables of fewer than 60 million rows, these indexes aren't efficient, so you should use the HEAP option. Similarly, a heap or a temporary table might be more efficient if the data in a table is transient and part of an ETL/ELT process.

Clustered index

If there's a requirement to regularly retrieve a single row or a small number of rows from a large table based on a strong filter condition, a clustered index might be more efficient than a clustered columnstore index. Only one clustered index is allowed per table.
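For a lookup-style table that serves highly selective reads, a clustered rowstore index can be declared instead of the default columnstore. A minimal sketch, with hypothetical names:

```sql
-- Rowstore table ordered by CustomerKey; efficient for strongly
-- filtered lookups such as WHERE CustomerKey = @key.
CREATE TABLE dbo.CustomerLookup
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED INDEX (CustomerKey)
);
```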

Non-clustered index

Non-clustered indexes are similar to clustered indexes in that they can speed up retrieval of single rows or a small number of rows based on a filter condition. Internally, non-clustered indexes are stored separately from the data, and multiple non-clustered indexes can be defined on a table. However, each additional index requires more storage and reduces the throughput of data inserts and loads.
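As a sketch, a secondary non-clustered index on an additional filter column of a hypothetical rowstore table might look like this (the table and column names are assumptions for illustration):

```sql
-- Secondary index to speed up filters on CustomerName; each extra index
-- costs storage and slows inserts and loads.
CREATE INDEX IX_CustomerLookup_CustomerName
    ON dbo.CustomerLookup (CustomerName);
```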

Heap

Heap tables incur none of the overhead associated with the creation and maintenance of indexes at data load time. They can help to quickly load transient data during processes, including ELT processes. Caching can also assist when the data is read immediately afterward. Because clustered columnstore indexes are inefficient below 60 million rows, heap tables are also a good choice for tables with fewer rows than that.
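A hypothetical ELT staging table illustrates the pattern; HEAP skips index builds at load time, and ROUND_ROBIN spreads incoming rows evenly without hashing:

```sql
-- Transient staging table for fast loads; no index maintenance overhead.
CREATE TABLE dbo.StageSales
(
    SaleDateKey INT           NOT NULL,
    ProductKey  INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```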

Data partitioning

In an enterprise data warehouse, fact tables can contain many billions of rows. Partitioning is a way to optimize the maintenance and querying of these tables by splitting them into separate parts, reducing the amount of data processed when running queries. The partitioning specification for a table is defined in the CREATE TABLE statement.

You can use only one field per table for partitioning. It's frequently a date field, because many queries are filtered by a date or date range. You can change the partitioning of a table after the initial load, if necessary, by re-creating the table with the new partitioning scheme through the CREATE TABLE AS SELECT statement.
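A partition specification in CREATE TABLE might be sketched as follows; the names and boundary values are illustrative assumptions:

```sql
-- Fact table partitioned by an integer date key. RANGE RIGHT places each
-- boundary value into the partition on its right (one partition per year here).
CREATE TABLE dbo.FactSalesPartitioned
(
    SaleDateKey INT           NOT NULL,
    ProductKey  INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( SaleDateKey RANGE RIGHT FOR VALUES (20230101, 20240101, 20250101) )
);
```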

Partitioning for query optimization: If queries against a large fact table are frequently filtered by a certain data column, then partitioning on that column can significantly reduce the amount of data that needs to be processed to perform the queries. A common example is to use a date field to split the table into smaller groups, each containing data for a single day. When a query contains a WHERE clause that filters on the date, only the partitions that match the date filter need to be accessed.

Partitioning for optimization of table maintenance: It's common for data warehouse environments to maintain a rolling window of detailed fact data; an example is sales transactions that go back five years. By partitioning on the sales date, the removal of old data beyond the rolling window becomes much more efficient. Dropping the oldest partition is quicker and uses fewer resources than deleting each individual row.
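Removing an old partition is typically done with a partition switch into an empty table of identical structure, which is a metadata-only operation. A sketch, assuming a hypothetical pre-created dbo.FactSales_Archive table with the same schema and partition boundaries as the source:

```sql
-- Move the oldest partition out as a metadata-only operation, then discard it.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1
    TO dbo.FactSales_Archive PARTITION 1;

TRUNCATE TABLE dbo.FactSales_Archive;
```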

Statistics

When a query is submitted to Azure Synapse Analytics, it's first processed by the query optimizer. The optimizer determines the best internal methods to execute the query efficiently.

The optimizer compares the available query-execution plans based on a cost-based algorithm. The accuracy of the cost estimates depends on the statistics available, so it's a good practice to ensure that statistics are up to date.

In Azure Synapse Analytics, if the AUTO_CREATE_STATISTICS option is turned on, statistics are created automatically on columns used in queries. You can also create or update statistics manually via the CREATE STATISTICS command.
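As a sketch, statistics can be created on a commonly joined column and then refreshed after a large load; the table and column names here are hypothetical:

```sql
-- Create single-column statistics for the optimizer's cost estimates,
-- then refresh all statistics on the table after a substantial data change.
CREATE STATISTICS stat_FactSales_ProductKey
    ON dbo.FactSales (ProductKey);

UPDATE STATISTICS dbo.FactSales;
```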

Refresh statistics when the contents of a table have changed substantially, such as after a daily update. This refresh can be incorporated into an ETL process.

All tables in the database should have statistics collected on at least one column. This ensures that basic information, such as row count and table size, is available to the optimizer. Other columns that should have statistics collected are those specified in JOIN, DISTINCT, ORDER BY, and GROUP BY processing.

Workload management

Azure Synapse Analytics incorporates comprehensive features for managing resource utilization across mixed workloads. Creating resource classes for different workload types, such as queries versus data loads, helps you manage your workload. Resource classes set limits on the number of queries that run concurrently and on the compute resources assigned to each query. There's a trade-off between memory and concurrency:

  • Smaller resource classes reduce the maximum memory per query but increase concurrency.
  • Larger resource classes increase the maximum memory per query but reduce concurrency.
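In a dedicated SQL pool, a resource class is assigned by adding a user to the corresponding database role. A minimal sketch, with a hypothetical user name:

```sql
-- Give a loading user the static resource class 'largerc' so its queries
-- get more memory, at the cost of lower overall concurrency.
EXEC sp_addrolemember 'largerc', 'load_user';
```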

Performance recommendations

Use existing performance-improvement methods, such as indexes or data distribution, as indicators of candidates for similar methods in the new target environment, but benchmark to confirm whether they're necessary in Azure Synapse Analytics. Build COLLECT STATISTICS steps into ETL/ELT processes to ensure that statistics are up to date, or enable automatic statistics creation.

Understand the tuning options available in Azure Synapse Analytics and the performance characteristics of associated utilities, such as PolyBase for fast parallel data loading. Use these options to build an efficient end-to-end implementation.

Use the flexibility, scalability, and performance of the Azure environment to implement any data model changes or performance tuning options in place. This effort will reduce the impact on existing source systems.

Understand the dynamic management views available in Azure Synapse Analytics. These views provide both system-wide resource utilization information and detailed execution information for individual queries.

Understand Azure resource classes and allocate them appropriately to ensure efficient management of mixed workloads and concurrency.

Consider using a virtualization layer as part of the Azure Synapse Analytics environment. It can shield business users and reporting tools from changes in the underlying warehouse implementation.

Research partner-provided migration tools and services, such as Qlik Replicate for Microsoft migrations, WhereScape, and Datometry. These services can automate parts of the migration process and reduce the elapsed time and risk involved in a migration project.