Azure SQL DW Performance: CTAS/Partition Switching VS UPDATE/DELETE

DML operations can be resource intensive and harmful to CCI indexes in Azure SQL Data Warehouse. The main performance benefit comes from UPDATE & DELETE operations being fully logged and CTAS is minimally logged. There is also advantages to recreating the CCI index VS modifying it with a DML statement. The preferred method is to utilize a methodology of CTAS and partition switching in lieu of UPDATE and DELETE operations wherever possible. The below example shows the performance benefit of doing so. We will be using TPCH 1TB dat set.

Example:

  • The DW in use is a DW500 Gen 1
  • All statements were executed in xlrgRC
  • The business case here is we want to upgrade the shipping priority for all orders placed last week. For this example the week in question will be 1998-07-24 - 1998-07-31
  • We will be using the [orders] table with CCI, hash distributed on o_orderkey, and partitioned on o_orderdate by month.
  • DDL:
  • CREATE TABLE [dbo].[orders_part] ( [o_orderkey] bigint NULL, [o_custkey] bigint NULL, [o_orderstatus] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [o_totalprice] decimal(15, 2) NULL, [o_orderdate] date NULL, [o_orderpriority] char(15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [o_clerk] char(15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [o_shippriority] int NULL, [o_comment] varchar(79) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ) WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([o_orderkey]), PARTITION ([o_orderdate] RANGE RIGHT FOR VALUES ('Jan 1 1992', 'Feb 1 1992', 'Mar 1 1992', 'Apr 1 1992', 'May 1 1992', 'Jun 1 1992', 'Jul 1 1992', 'Aug 1 1992', 'Sep 1 1992', 'Oct 1 1992', 'Nov 1 1992', 'Dec 1 1992', 'Jan 1 1993', 'Feb 1 1993', 'Mar 1 1993', 'Apr 1 1993', 'May 1 1993', 'Jun 1 1993', 'Jul 1 1993', 'Aug 1 1993', 'Sep 1 1993', 'Oct 1 1993', 'Nov 1 1993', 'Dec 1 1993', 'Jan 1 1994', 'Feb 1 1994', 'Mar 1 1994', 'Apr 1 1994', 'May 1 1994', 'Jun 1 1994', 'Jul 1 1994', 'Aug 1 1994', 'Sep 1 1994', 'Oct 1 1994', 'Nov 1 1994', 'Dec 1 1994', 'Jan 1 1995', 'Feb 1 1995', 'Mar 1 1995', 'Apr 1 1995', 'May 1 1995', 'Jun 1 1995', 'Jul 1 1995', 'Aug 1 1995', 'Sep 1 1995', 'Oct 1 1995', 'Nov 1 1995', 'Dec 1 1995', 'Jan 1 1996', 'Feb 1 1996', 'Mar 1 1996', 'Apr 1 1996', 'May 1 1996', 'Jun 1 1996', 'Jul 1 1996', 'Aug 1 1996', 'Sep 1 1996', 'Oct 1 1996', 'Nov 1 1996', 'Dec 1 1996', 'Jan 1 1997', 'Feb 1 1997', 'Mar 1 1997', 'Apr 1 1997', 'May 1 1997', 'Jun 1 1997', 'Jul 1 1997', 'Aug 1 1997', 'Sep 1 1997', 'Oct 1 1997', 'Nov 1 1997', 'Dec 1 1997', 'Jan 1 1998', 'Feb 1 1998', 'Mar 1 1998', 'Apr 1 1998', 'May 1 1998', 'Jun 1 1998', 'Jul 1 1998', 'Aug 1 1998')));
  • To executed the update we simply run the following command:
  • UPDATE orders_part SET o_shippriority = 1 WHERE o_orderdate BETWEEN '1998-07-24' AND '1998-07-31'
  • This statements completes in about 25 seconds and updates 4,987,237 rows
  • A CTAS + Partition switch process will look like the following
  • Step 1: Create a copy of the table with the data form the affected partition, in this case partition 79 with a right boundary value of July 1 1998
  • --create a temp table with matching paritioning --perform transformation here to avoid additional step CREATE TABLE orders_part_switch_temp WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([o_orderkey]), PARTITION ([o_orderdate] RANGE RIGHT FOR VALUES ('Jan 1 1992', 'Feb 1 1992', 'Mar 1 1992', 'Apr 1 1992', 'May 1 1992', 'Jun 1 1992', 'Jul 1 1992', 'Aug 1 1992', 'Sep 1 1992', 'Oct 1 1992', 'Nov 1 1992', 'Dec 1 1992', 'Jan 1 1993', 'Feb 1 1993', 'Mar 1 1993', 'Apr 1 1993', 'May 1 1993', 'Jun 1 1993', 'Jul 1 1993', 'Aug 1 1993', 'Sep 1 1993', 'Oct 1 1993', 'Nov 1 1993', 'Dec 1 1993', 'Jan 1 1994', 'Feb 1 1994', 'Mar 1 1994', 'Apr 1 1994', 'May 1 1994', 'Jun 1 1994', 'Jul 1 1994', 'Aug 1 1994', 'Sep 1 1994', 'Oct 1 1994', 'Nov 1 1994', 'Dec 1 1994', 'Jan 1 1995', 'Feb 1 1995', 'Mar 1 1995', 'Apr 1 1995', 'May 1 1995', 'Jun 1 1995', 'Jul 1 1995', 'Aug 1 1995', 'Sep 1 1995', 'Oct 1 1995', 'Nov 1 1995', 'Dec 1 1995', 'Jan 1 1996', 'Feb 1 1996', 'Mar 1 1996', 'Apr 1 1996', 'May 1 1996', 'Jun 1 1996', 'Jul 1 1996', 'Aug 1 1996', 'Sep 1 1996', 'Oct 1 1996', 'Nov 1 1996', 'Dec 1 1996', 'Jan 1 1997', 'Feb 1 1997', 'Mar 1 1997', 'Apr 1 1997', 'May 1 1997', 'Jun 1 1997', 'Jul 1 1997', 'Aug 1 1997', 'Sep 1 1997', 'Oct 1 1997', 'Nov 1 1997', 'Dec 1 1997', 'Jan 1 1998', 'Feb 1 1998', 'Mar 1 1998', 'Apr 1 1998', 'May 1 1998', 'Jun 1 1998', 'Jul 1 1998', 'Aug 1 1998'))) AS SELECT [o_orderkey] , [o_custkey] , [o_orderstatus] , [o_totalprice] , [o_orderdate] , [o_orderpriority] , [o_clerk] , CASE WHEN o_orderdate BETWEEN '1998-07-24' AND '1998-07-31' THEN 1 ELSE 0 END AS CHAR(15) AS [o_shippriority], [o_comment] FROM dbo.orders_part_switch WHERE o_orderdate BETWEEN '1998-07-01' AND '1998-07-31'
  • This statement writes 19M rows to the stage table.
  • Next we swap our updated partition with the original partition which is a metadata operation and no data is actually moved.
  • ALTER TABLE orders_part_switch_temp SWITCH PARTITION 80 TO orders_part_switch partition 80 WITH (TRUNCATE_TARGET = ON)
  • Then finally drop our stage table
  • DROP TABLE orders_part_switch_temp
  • This batch runs in about 11 seconds, half of the time as the UPDATE statement.

    • Summary:
  • Due to CCI overhead and transaction log usage, CTAS w/partition switching may be more performant than using an UPDATE statement. Since the CCI is per partition you also will not decrement the CCI health like an UPDATE statement would.
  • UPDATE Statement: 24 Seconds
  • CTAS & Partition Switch: 12 seconds

    • Gotchas
  • The schema needs to be exactly the same between the source and staging table
  • You must utilize partitioning and the updates be limited to one or few partitions. The more partitions that are updated the less advantageous this approach will be.
  • More Info