Overview of tables in SQL Data Warehouse
Getting started with creating tables in SQL Data Warehouse is simple. The basic CREATE TABLE syntax follows the common syntax you are most likely already familiar with from working with other databases. To create a table, you simply need to name your table, name your columns and define data types for each column. If you've create tables in other databases, this should look very familiar to you.
CREATE TABLE Customers (FirstName VARCHAR(25), LastName VARCHAR(25))
The above example creates a table named Customers with two columns, FirstName and LastName. Each column is defined with a data type of VARCHAR(25), which limits the data to 25 characters. These fundamental attributes of a table, as well as others, are mostly the same as other databases. Data types are defined for each column and ensure the integrity of your data. Indexes can be added to improve performance by reducing I/O. Partitioning can be added to improve performance when you need to modify data.
Renaming a SQL Data Warehouse table looks like this:
RENAME OBJECT Customer TO CustomerOrig;
A new fundamental attribute introduced by distributed systems like SQL Data Warehouse is the distribution column. The distribution column is very much what it sounds like. It is the column that determines how to distribute, or divide, your data behind the scenes. When you create a table without specifying the distribution column, the table is automatically distributed using round robin. While round robin tables can be sufficient in some scenarios, defining distribution columns can greatly reduce data movement during queries, thus optimizing performance. In situations where there is a small amount of data in a table, choosing to create the table with the replicate distribution type copies data to each compute node and saves data movement at query execution time. See Distributing a Table to learn more about how to select a distribution column.
Indexing and partitioning tables
As you become more advanced in using SQL Data Warehouse and want to optimize performance, you'll want to learn more about Table Design. To learn more, see the articles on Table Data Types, Distributing a Table, Indexing a Table and Partitioning a Table.
Statistics are an extremely important to getting the best performance out of your SQL Data Warehouse. Since SQL Data Warehouse does not yet automatically create and update statistics for you, like you may have come to expect in Azure SQL Database, reading our article on Statistics might be one of the most important articles you read to ensure that you get the best performance from your queries.
Temporary tables are tables which only exist for the duration of your logon and cannot be seen by other users. Temporary tables can be a good way to prevent others from seeing temporary results and also reduce the need for cleanup. Since temporary tables also utilize local storage, they can offer faster performance for some operations. See the Temporary Table articles for more details about temporary tables.
External tables, also known as Polybase tables, are tables which can be queried from SQL Data Warehouse, but point to data external from SQL Data Warehouse. For example, you can create an external table which points to files on Azure Blob Storage. For more details on how to create and query an external table, see Load data with Polybase.
Unsupported table features
While SQL Data Warehouse contains many of the same table features offered by other databases, there are some features which are not yet supported. Below is a list of some of the table features which are not yet supported.
|Primary key, Foreign keys, Unique and Check Table Constraints|
Table size queries
One simple way to identify space and rows consumed by a table in each of the 60 distributions, is to use DBCC PDW_SHOWSPACEUSED.
However, using DBCC commands can be quite limiting. Dynamic management views (DMVs) will allow you to see much more detail as well as give you much greater control over the query results. Start by creating this view, which will be referred to by many of our examples in this and other articles.
CREATE VIEW dbo.vTableSizes AS WITH base AS ( SELECT GETDATE() AS [execution_time] , DB_NAME() AS [database_name] , s.name AS [schema_name] , t.name AS [table_name] , QUOTENAME(s.name)+'.'+QUOTENAME(t.name) AS [two_part_name] , nt.[name] AS [node_table_name] , ROW_NUMBER() OVER(PARTITION BY nt.[name] ORDER BY (SELECT NULL)) AS [node_table_name_seq] , tp.[distribution_policy_desc] AS [distribution_policy_name] , c.[name] AS [distribution_column] , nt.[distribution_id] AS [distribution_id] , i.[type] AS [index_type] , i.[type_desc] AS [index_type_desc] , nt.[pdw_node_id] AS [pdw_node_id] , pn.[type] AS [pdw_node_type] , pn.[name] AS [pdw_node_name] , di.name AS [dist_name] , di.position AS [dist_position] , nps.[partition_number] AS [partition_nmbr] , nps.[reserved_page_count] AS [reserved_space_page_count] , nps.[reserved_page_count] - nps.[used_page_count] AS [unused_space_page_count] , nps.[in_row_data_page_count] + nps.[row_overflow_used_page_count] + nps.[lob_used_page_count] AS [data_space_page_count] , nps.[reserved_page_count] - (nps.[reserved_page_count] - nps.[used_page_count]) - ([in_row_data_page_count] + [row_overflow_used_page_count]+[lob_used_page_count]) AS [index_space_page_count] , nps.[row_count] AS [row_count] from sys.schemas s INNER JOIN sys.tables t ON s.[schema_id] = t.[schema_id] INNER JOIN sys.indexes i ON t.[object_id] = i.[object_id] AND i.[index_id] <= 1 INNER JOIN sys.pdw_table_distribution_properties tp ON t.[object_id] = tp.[object_id] INNER JOIN sys.pdw_table_mappings tm ON t.[object_id] = tm.[object_id] INNER JOIN sys.pdw_nodes_tables nt ON tm.[physical_name] = nt.[name] INNER JOIN sys.dm_pdw_nodes pn ON nt.[pdw_node_id] = pn.[pdw_node_id] INNER JOIN sys.pdw_distributions di ON nt.[distribution_id] = di.[distribution_id] INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps ON nt.[object_id] = nps.[object_id] AND nt.[pdw_node_id] = nps.[pdw_node_id] AND nt.[distribution_id] = nps.[distribution_id] LEFT OUTER JOIN (select * from sys.pdw_column_distribution_properties where distribution_ordinal = 1) cdp ON t.[object_id] = cdp.[object_id] LEFT OUTER JOIN sys.columns c ON cdp.[object_id] = c.[object_id] AND cdp.[column_id] = c.[column_id] ) , size AS ( SELECT [execution_time] , [database_name] , [schema_name] , [table_name] , [two_part_name] , [node_table_name] , [node_table_name_seq] , [distribution_policy_name] , [distribution_column] , [distribution_id] , [index_type] , [index_type_desc] , [pdw_node_id] , [pdw_node_type] , [pdw_node_name] , [dist_name] , [dist_position] , [partition_nmbr] , [reserved_space_page_count] , [unused_space_page_count] , [data_space_page_count] , [index_space_page_count] , [row_count] , ([reserved_space_page_count] * 8.0) AS [reserved_space_KB] , ([reserved_space_page_count] * 8.0)/1000 AS [reserved_space_MB] , ([reserved_space_page_count] * 8.0)/1000000 AS [reserved_space_GB] , ([reserved_space_page_count] * 8.0)/1000000000 AS [reserved_space_TB] , ([unused_space_page_count] * 8.0) AS [unused_space_KB] , ([unused_space_page_count] * 8.0)/1000 AS [unused_space_MB] , ([unused_space_page_count] * 8.0)/1000000 AS [unused_space_GB] , ([unused_space_page_count] * 8.0)/1000000000 AS [unused_space_TB] , ([data_space_page_count] * 8.0) AS [data_space_KB] , ([data_space_page_count] * 8.0)/1000 AS [data_space_MB] , ([data_space_page_count] * 8.0)/1000000 AS [data_space_GB] , ([data_space_page_count] * 8.0)/1000000000 AS [data_space_TB] , ([index_space_page_count] * 8.0) AS [index_space_KB] , ([index_space_page_count] * 8.0)/1000 AS [index_space_MB] , ([index_space_page_count] * 8.0)/1000000 AS [index_space_GB] , ([index_space_page_count] * 8.0)/1000000000 AS [index_space_TB] FROM base ) SELECT * FROM size ;
Table space summary
This query returns the rows and space by table. It is a great query to see which tables are your largest tables and whether they are round robin, replicated or hash distributed. For hash distributed tables it also shows the distribution column. In most cases your largest tables should be hash distributed with a clustered columnstore index.
SELECT database_name , schema_name , table_name , distribution_policy_name , distribution_column , index_type_desc , COUNT(distinct partition_nmbr) as nbr_partitions , SUM(row_count) as table_row_count , SUM(reserved_space_GB) as table_reserved_space_GB , SUM(data_space_GB) as table_data_space_GB , SUM(index_space_GB) as table_index_space_GB , SUM(unused_space_GB) as table_unused_space_GB FROM dbo.vTableSizes GROUP BY database_name , schema_name , table_name , distribution_policy_name , distribution_column , index_type_desc ORDER BY table_reserved_space_GB desc ;
Table space by distribution type
SELECT distribution_policy_name , SUM(row_count) as table_type_row_count , SUM(reserved_space_GB) as table_type_reserved_space_GB , SUM(data_space_GB) as table_type_data_space_GB , SUM(index_space_GB) as table_type_index_space_GB , SUM(unused_space_GB) as table_type_unused_space_GB FROM dbo.vTableSizes GROUP BY distribution_policy_name ;
Table space by index type
SELECT index_type_desc , SUM(row_count) as table_type_row_count , SUM(reserved_space_GB) as table_type_reserved_space_GB , SUM(data_space_GB) as table_type_data_space_GB , SUM(index_space_GB) as table_type_index_space_GB , SUM(unused_space_GB) as table_type_unused_space_GB FROM dbo.vTableSizes GROUP BY index_type_desc ;
Distribution space summary
SELECT distribution_id , SUM(row_count) as total_node_distribution_row_count , SUM(reserved_space_MB) as total_node_distribution_reserved_space_MB , SUM(data_space_MB) as total_node_distribution_data_space_MB , SUM(index_space_MB) as total_node_distribution_index_space_MB , SUM(unused_space_MB) as total_node_distribution_unused_space_MB FROM dbo.vTableSizes GROUP BY distribution_id ORDER BY distribution_id ;
To learn more, see the articles on Table Data Types, Distributing a Table, Indexing a Table, Partitioning a Table, Maintaining Table Statistics and Temporary Tables. For more about best practices, see SQL Data Warehouse Best Practices.