Get started with Columnstore for real time operational analytics

APPLIES TO: SQL Server (yes), Azure SQL Database (yes), Azure SQL Data Warehouse (no), Parallel Data Warehouse (no)

SQL Server 2016 introduces real-time operational analytics, the ability to run both analytics and OLTP workloads on the same database tables at the same time. Besides running analytics in real-time, you can also eliminate the need for ETL and a data warehouse.

Real-Time Operational Analytics Explained

Traditionally, businesses have had separate systems for operational (i.e. OLTP) and analytics workloads. For such systems, Extract, Transform, and Load (ETL) jobs regularly move the data from the operational store to an analytics store. The analytics data is usually stored in a data warehouse or data mart dedicated to running analytics queries. While this solution has been the standard, it has these three key challenges:

  • Complexity. Implementing ETL can require considerable coding, especially to load only the modified rows. It can be complex to identify which rows have been modified.

  • Cost. Implementing ETL requires the cost of purchasing additional hardware and software licenses.

  • Data Latency. Implementing ETL adds a time delay for running the analytics. For example, if the ETL job runs at the end of each business day, the analytics queries will run on data that is at least a day old. For many businesses this delay is unacceptable because the business depends on analyzing data in real-time. For example, fraud detection requires real-time analytics on operational data.

Real-time operational analytics overview

Real-time operational analytics offers a solution to these challenges.
There is no time delay when analytics and OLTP workloads run on the same underlying table. For scenarios that can use real-time analytics, the costs and complexity are greatly reduced by eliminating the need for ETL and the need to purchase and maintain a separate data warehouse.

NOTE

Real-time operational analytics targets the scenario of a single data source such as an enterprise resource planning (ERP) application on which you can run both the operational and the analytics workload. This does not replace the need for a separate data warehouse when you need to integrate data from multiple sources before running the analytics workload or when you require extreme analytics performance using pre-aggregated data such as cubes.

Real-time analytics uses an updateable columnstore index on a rowstore table. The columnstore index maintains a copy of the data, so the OLTP and analytics workloads run against separate copies of the data. This minimizes the performance impact of both workloads running at the same time. SQL Server automatically maintains index changes so OLTP changes are always up-to-date for analytics. With this design, it is possible and practical to run analytics in real-time on up-to-date data. This works for both disk-based and memory-optimized tables.

Get Started Example

To get started with real-time analytics:

  1. Identify the tables in your operational schema that contain data required for analytics.

  2. For each table, drop all btree indexes that are primarily designed to speed up existing analytics on your OLTP workload. Replace them with a single columnstore index. This can improve the overall performance of your OLTP workload since there will be fewer indexes to maintain.

    --This example creates a nonclustered columnstore index on an existing OLTP table.  
    --Create the table  
    CREATE TABLE t_account (  
        accountkey int PRIMARY KEY,  
        accountdescription nvarchar (50),  
        accounttype nvarchar(50),  
        unitsold int  
    );  
    
    --Create the nonclustered columnstore index  
    CREATE NONCLUSTERED COLUMNSTORE INDEX account_NCCI   
    ON t_account (accountkey, accountdescription, unitsold)   
    ;  
    
    

    The columnstore index on an in-memory table allows operational analytics by integrating in-memory OLTP and in-memory columnstore technologies to deliver high performance for both OLTP and analytics workloads. The columnstore index on an in-memory table must include all the columns.

    -- This example creates a memory-optimized table with a columnstore index.  
    CREATE TABLE t_account (  
        accountkey int NOT NULL PRIMARY KEY NONCLUSTERED,  
        Accountdescription nvarchar (50),  
        accounttype nvarchar(50),  
        unitsold int,  
        INDEX t_account_cci CLUSTERED COLUMNSTORE  
        )  
        WITH (MEMORY_OPTIMIZED = ON );  
    GO  
    
    
  3. This is all you need to do!


You are now ready to run real-time operational analytics without making any changes to your application. Analytics queries will run against the columnstore index and OLTP operations will keep running against your OLTP btree indexes. The OLTP workloads will continue to perform, but will incur some additional overhead to maintain the columnstore index. See the performance optimizations in the next section.
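For example, once the nonclustered columnstore index from step 2 is in place, an analytics query such as the following can run directly against the operational table while OLTP statements keep using the btree indexes. This is a minimal sketch against the t_account table and account_NCCI index created above; the aggregate and the sample update are illustrative only.

-- Analytics query over the operational table. The optimizer can satisfy this
-- scan and aggregate from account_NCCI (accountdescription and unitsold are
-- both columns in the columnstore index).
SELECT accountdescription, SUM(unitsold) AS total_units
FROM t_account
GROUP BY accountdescription;

-- A typical OLTP statement continues to run against the PRIMARY KEY btree index.
UPDATE t_account
SET unitsold = unitsold + 1
WHERE accountkey = 42;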

Blog Posts

Read Sunil Agarwal's blog posts to learn more about real-time operational analytics. It might be easier to understand the performance tips sections if you look at the blog posts first.

Performance tip #1: Use filtered indexes to improve query performance

Running real-time operational analytics can impact the performance of the OLTP workload. This impact should be minimal. The example below shows how to use filtered indexes to minimize the impact of a nonclustered columnstore index on the transactional workload while still delivering analytics in real-time.

To minimize the overhead of maintaining a nonclustered columnstore index on an operational workload, you can use a filtered condition to create a nonclustered columnstore index only on the warm or slowly changing data. For example, in an order management application, you can create a nonclustered columnstore index on the orders that have already been shipped. Once the order has shipped, it rarely changes and therefore can be considered warm data. With a filtered index, the data in the nonclustered columnstore index requires fewer updates, thereby lowering the impact on the transactional workload.

Analytics queries transparently access both warm and hot data as needed to provide real-time analytics. If a significant part of the operational workload is touching the 'hot' data, those operations will not require additional maintenance of the columnstore index. A best practice is to have a rowstore clustered index on the column(s) used in the filtered index definition. SQL Server uses the clustered index to quickly scan the rows that did not meet the filtered condition. Without this clustered index, a full table scan of the rowstore table is required to find these rows, which can significantly degrade the performance of the analytics query. In the absence of a clustered index, you could create a complementary filtered nonclustered btree index to identify such rows, but this is not recommended because accessing a large range of rows through nonclustered btree indexes is expensive.

NOTE

A filtered nonclustered columnstore index is only supported on disk-based tables. It is not supported on memory-optimized tables.

Example A: Access hot data from btree index, warm data from columnstore index

This example uses a filtered condition (orderstatus = 5, that is, orders that have shipped) to establish which rows will be in the columnstore index. The goal is to design the filtered condition and subsequent queries to access frequently changing "hot" data from the btree index, and to access the more stable "warm" data from the columnstore index.

Combined indexes for warm and hot data

NOTE

The query optimizer will consider, but not always choose, the columnstore index for the query plan. When the query optimizer chooses the filtered columnstore index, it transparently combines the rows from the columnstore index with the rows that do not meet the filtered condition to allow real-time analytics. This is different from a regular nonclustered filtered index, which can be used only in queries that restrict themselves to the rows present in the index.

--Use a filtered condition to separate hot data in a rowstore table  
-- from "warm" data in a columnstore index.  
  
-- create the table  
CREATE TABLE  orders (  
         AccountKey         int not null,  
         Customername       nvarchar (50),  
        OrderNumber         bigint,  
        PurchasePrice       decimal (9,2),  
        OrderStatus         smallint not null,  
        OrderStatusDesc     nvarchar (50))  
  
-- OrderStatusDesc  
-- 0 => 'Order Started'  
-- 1 => 'Order Closed'  
-- 2 => 'Order Paid'  
-- 3 => 'Order Fulfillment Wait'  
-- 4 => 'Order Shipped'  
-- 5 => 'Order Received'  
  
CREATE CLUSTERED INDEX  orders_ci ON orders(OrderStatus)  
  
--Create the columnstore index with a filtered condition  
CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci ON orders  (accountkey, customername, purchaseprice, orderstatus)  
where orderstatus = 5  
;  
  
-- The following query returns the total purchase done by customers for items > $100.00  
-- This query will pick rows both from the NCCI and from 'hot' rows that are not part of the NCCI  
SELECT TOP 5 customername, SUM(PurchasePrice)  
FROM orders  
WHERE purchaseprice > 100.00   
GROUP BY customername  

The analytics query will execute with the following query plan. You can see that the rows not meeting the filtered condition are accessed through the clustered btree index.

Query plan

Please refer to the blog for details on the filtered nonclustered columnstore index.

Performance tip #2: Offload analytics to Always On readable secondary

Even though you can minimize the columnstore index maintenance by using a filtered columnstore index, the analytics queries can still require significant computing resources (CPU, IO, memory), which impacts the operational workload performance. For most mission-critical workloads, our recommendation is to use an Always On configuration. In this configuration, you can eliminate the impact of running analytics by offloading it to a readable secondary.
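For example, read-intent sessions can be routed to a readable secondary while the primary keeps serving the OLTP workload. The sketch below assumes an availability group, replica names, and listener (AG_operational, SQLNODE1, SQLNODE2, AG_listener) that are hypothetical placeholders for your own environment, with read-only routing already supported by your topology:

-- Allow read-intent connections on the secondary replica and define routing.
-- Run on the primary; all names are placeholders.
ALTER AVAILABILITY GROUP [AG_operational]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

ALTER AVAILABILITY GROUP [AG_operational]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQLNODE2.contoso.com:1433'));

ALTER AVAILABILITY GROUP [AG_operational]
    MODIFY REPLICA ON N'SQLNODE1'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQLNODE2')));

-- Analytics clients then connect through the listener with read intent, for example:
-- Server=tcp:AG_listener,1433;Database=OperationalDB;ApplicationIntent=ReadOnly;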

Performance tip #3: Reduce index fragmentation by keeping hot data in delta rowgroups

Tables with a columnstore index may get significantly fragmented (that is, accumulate deleted rows) if the workload updates or deletes rows that have already been compressed. A fragmented columnstore index leads to inefficient utilization of memory and storage. Besides the inefficient use of resources, it also negatively impacts analytics query performance because of the extra IO and the need to filter the deleted rows from the result set.

The deleted rows are not physically removed until you run index defragmentation with the REORGANIZE command or rebuild the columnstore index on the entire table or the affected partition(s); see the sketch after the COMPRESSION_DELAY example below. Both REORGANIZE and index REBUILD are expensive operations that take resources away which could otherwise be used for the workload. Additionally, if rows are compressed too early, they may need to be re-compressed multiple times due to updates, leading to wasted compression overhead.
You can minimize index fragmentation using the COMPRESSION_DELAY option.

  
-- Create a sample table  
create table t_colstor (  
               accountkey                      int not null,  
               accountdescription              nvarchar (50) not null,  
               accounttype                     nvarchar(50),  
               accountCodeAlternatekey         int)  
  
-- Creating nonclustered columnstore index with COMPRESSION_DELAY. The columnstore index will keep the rows in closed delta rowgroup for 100 minutes   
-- after it has been marked closed  
CREATE NONCLUSTERED COLUMNSTORE index t_colstor_cci on t_colstor (accountkey, accountdescription, accounttype)   
                       WITH (DATA_COMPRESSION= COLUMNSTORE, COMPRESSION_DELAY = 100);  
  

Please refer to the blog for details on compression delay.
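When deleted rows do accumulate, they can be physically removed with the REORGANIZE or REBUILD operations mentioned earlier. A minimal sketch against the t_colstor_cci index created in the example above; schedule these during low-activity periods, since they consume resources:

-- Remove deleted rows and merge rowgroups in place; REORGANIZE is an online operation.
ALTER INDEX t_colstor_cci ON t_colstor REORGANIZE;

-- Or rebuild the columnstore index entirely (more expensive).
ALTER INDEX t_colstor_cci ON t_colstor REBUILD;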

Here are the recommended best practices:

  • Insert/Query workload: If your workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option. The newly inserted rows will get compressed once 1 million rows have been inserted into a single delta rowgroup.
    Some examples of such workloads are (a) a traditional DW workload and (b) click-stream analysis when you need to analyze the click pattern in a web application.

  • OLTP workload: If the workload is DML heavy (that is, a heavy mix of updates, deletes, and inserts), you may see columnstore index fragmentation by examining the DMV sys.dm_db_column_store_row_group_physical_stats. If you see that more than 10% of the rows are marked deleted in recently compressed rowgroups, you can use the COMPRESSION_DELAY option to add a time delay before rows become eligible for compression. For example, if for your workload the newly inserted rows stay 'hot' (that is, get updated multiple times) for, say, 60 minutes, you should choose a COMPRESSION_DELAY of 60.

We expect most customers will not need to do anything; the default value of the COMPRESSION_DELAY option should work for them.
For advanced users, we recommend running the query below and collecting the percentage of deleted rows over the last 7 days.

SELECT row_group_id,cast(deleted_rows as float)/cast(total_rows as float)*100 as [% fragmented], created_time  
FROM sys.dm_db_column_store_row_group_physical_stats  
WHERE object_id = object_id('FactOnlineSales2')   
             AND  state_desc='COMPRESSED'   
             AND deleted_rows>0   
             AND created_time > GETDATE() - 7  
ORDER BY created_time DESC  

If the number of deleted rows in compressed rowgroups is more than 20%, plateauing in older rowgroups with less than 5% variation (referred to as cold rowgroups), set COMPRESSION_DELAY = (youngest_rowgroup_created_time - current_time). Note that this approach works best with a stable and relatively homogeneous workload.
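If that calculation suggests a longer delay, COMPRESSION_DELAY can be changed on an existing columnstore index without dropping it. A minimal sketch, assuming a columnstore index named FactOnlineSales2_cci on the FactOnlineSales2 table referenced in the query above (the index name is hypothetical):

-- Change the compression delay on the existing columnstore index to 60 minutes.
ALTER INDEX FactOnlineSales2_cci ON FactOnlineSales2
    SET (COMPRESSION_DELAY = 60 MINUTES);

-- Verify the current setting.
SELECT name, type_desc, compression_delay
FROM sys.indexes
WHERE object_id = object_id('FactOnlineSales2');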

See Also

Columnstore Indexes Guide
Columnstore Indexes Data Loading
Columnstore Indexes Query Performance
Columnstore Indexes for Data Warehousing
Columnstore Indexes Defragmentation