Best practices for serverless SQL pool in Azure Synapse Analytics
In this article, you'll find a collection of best practices for using serverless SQL pool. Serverless SQL pool is a resource in Azure Synapse Analytics. If you're working with a dedicated SQL pool, see Best practices for dedicated SQL pools for specific guidance.
Serverless SQL pool allows you to query files in your Azure Storage accounts. It doesn't have local storage or ingestion capabilities. All files that the queries target are external to serverless SQL pool. Everything related to reading files from storage might affect query performance.
Some generic guidelines are:
- Make sure your client applications are collocated with serverless SQL pool.
- If you're using client applications outside Azure, make sure you're using serverless SQL pool in a region that's close to your client computer. Client application examples include Power BI Desktop, SQL Server Management Studio, and Azure Data Studio.
- Make sure the storage and serverless SQL pool are in the same region. Storage examples include Azure Data Lake Storage and Azure Cosmos DB.
- Try to optimize storage layout by using partitioning and keeping your files in the range between 100 MB and 10 GB.
- If you're returning a large number of results, make sure you're using SQL Server Management Studio or Azure Data Studio and not Azure Synapse Studio. Azure Synapse Studio is a web tool that isn't designed for large result sets.
- If you're filtering results by string column, try to use a
- Consider caching the results on the client side by using Power BI import mode or Azure Analysis Services, and periodically refresh them. Serverless SQL pools can't provide an interactive experience in Power BI Direct Query mode if you're using complex queries or processing a large amount of data.
Client applications and network connections
Make sure your client application is connected to the closest possible Azure Synapse workspace with the optimal connection.
- Colocate a client application with the Azure Synapse workspace. If you're using applications such as Power BI or Azure Analysis Service, make sure they're in the same region where you placed your Azure Synapse workspace. If needed, create the separate workspaces that are paired with your client applications. Placing a client application and the Azure Synapse workspace in different regions could cause bigger latency and slower streaming of results.
- If you're reading data from your on-premises application, make sure the Azure Synapse workspace is in the region that's close to your location.
- Make sure you don't have network bandwidth issues while reading a large amount of data.
- Don't use Azure Synapse Studio to return a large amount of data. Azure Synapse Studio is a web tool that uses the HTTPS protocol to transfer data. Use Azure Data Studio or SQL Server Management Studio to read a large amount of data.
Storage and content layout
Here are best practices for storage and content layout in serverless SQL pool.
Colocate your storage and serverless SQL pool
To minimize latency, colocate your Azure Storage account or Azure Cosmos DB analytic storage and your serverless SQL pool endpoint. Storage accounts and endpoints provisioned during workspace creation are located in the same region.
For optimal performance, if you access other storage accounts with serverless SQL pool, make sure they're in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote region and the endpoint's region.
Azure Storage throttling
Multiple applications and services might access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and serverless SQL pool workloads exceeds the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.
When throttling is detected, serverless SQL pool has built-in handling to resolve it. Serverless SQL pool makes requests to storage at a slower pace until throttling is resolved.
For optimal query execution, don't stress the storage account with other workloads during query execution.
Prepare files for querying
If possible, you can prepare files for better performance:
- Convert large CSV and JSON files to Parquet. Parquet is a columnar format. Because it's compressed, its file sizes are smaller than CSV or JSON files that contain the same data. Serverless SQL pool skips the columns and rows that aren't needed in a query if you're reading Parquet files. Serverless SQL pool needs less time and fewer storage requests to read it.
- If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.
- Try to keep your CSV file size between 100 MB and 10 GB.
- It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
- Partition your data by storing partitions to different folders or file names. See Use filename and filepath functions to target specific partitions.
Colocate your Azure Cosmos DB analytical storage and serverless SQL pool
Make sure your Azure Cosmos DB analytical storage is placed in the same region as an Azure Synapse workspace. Cross-region queries might cause huge latencies. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see Query Azure Cosmos DB by using serverless SQL pool):
'account=<database account name>;database=<database name>;region=<region name>'
Here are best practices for using CSV files in serverless SQL pool.
Use PARSER_VERSION 2.0 to query CSV files
You can use a performance-optimized parser when you query CSV files. For details, see PARSER_VERSION.
Manually create statistics for CSV files
Serverless SQL pool relies on statistics to generate optimal query execution plans. Statistics are automatically created for columns in Parquet files when needed. At this moment, statistics aren't automatically created for columns in CSV files. Create statistics manually for columns that you use in queries, particularly those used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY. Check statistics in serverless SQL pool for details.
Here are best practices for using data types in serverless SQL pool.
Use appropriate data types
The data types you use in your query affect performance and concurrency. You can get better performance if you follow these guidelines:
- Use the smallest data size that can accommodate the largest possible value.
- If the maximum character value length is 30 characters, use a character data type of length 30.
- If all character column values are of a fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
- If the maximum integer column value is 500, use smallint because it's the smallest data type that can accommodate this value. You can find integer data type ranges in this article.
- If possible, use varchar and char instead of nvarchar and nchar.
- Use the varchar type with some UTF8 collation if you're reading data from Parquet, Azure Cosmos DB, Delta Lake, or CSV with UTF-8 encoding.
- Use the varchar type without UTF8 collation if you're reading data from CSV non-Unicode files (for example, ASCII).
- Use the nvarchar type if you're reading data from a CSV UTF-16 file.
- Use integer-based data types if possible. SORT, JOIN, and GROUP BY operations complete faster on integers than on character data.
- If you're using schema inference, check inferred data types and override them explicitly with the smaller types if possible.
Check inferred data types
Schema inference helps you quickly write queries and explore data without knowing file schemas. The cost of this convenience is that inferred data types might be larger than the actual data types. This discrepancy happens when there isn't enough information in the source files to make sure the appropriate data type is used. For example, Parquet files don't contain metadata about maximum character column length. So serverless SQL pool infers it as varchar(8000).
You can use sp_describe_first_results_set to check the resulting data types of your query.
The following example shows how you can optimize inferred data types. This procedure is used to show the inferred data types:
EXEC sp_describe_first_result_set N' SELECT vendor_id, pickup_datetime, passenger_count FROM OPENROWSET( BULK ''https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*'', FORMAT=''PARQUET'' ) AS nyc';
Here's the result set:
After you know the inferred data types for the query, you can specify appropriate data types:
SELECT vendorID, tpepPickupDateTime, passengerCount FROM OPENROWSET( BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=2018/puMonth=*/*.snappy.parquet', FORMAT='PARQUET' ) WITH ( vendorID varchar(4), -- we used length of 4 instead of the inferred 8000 tpepPickupDateTime datetime2, passengerCount int ) AS nyc;
Here are best practices for using queries in serverless SQL pool.
Push wildcards to lower levels in the path
You can use wildcards in your path to query multiple files and folders. Serverless SQL pool lists files in your storage account, starting from the first asterisk (*), by using the storage API. It eliminates files that don't match the specified path. Reducing the initial list of files can improve performance if there are many files that match the specified path up to the first wildcard.
Use filename and filepath functions to target specific partitions
Data is often organized in partitions. You can instruct serverless SQL pool to query particular folders and files. Doing so reduces the number of files and the amount of data the query needs to read and process. An added bonus is that you'll achieve better performance.
Always cast the results of the filepath and filename functions to appropriate data types. If you use character data types, be sure to use the appropriate length.
Functions used for partition elimination, filepath and filename, aren't currently supported for external tables, other than those created automatically for each table created in Apache Spark for Azure Synapse Analytics.
If your stored data isn't partitioned, consider partitioning it. That way you can use these functions to optimize queries that target those files. When you query partitioned Apache Spark for Azure Synapse tables from serverless SQL pool, the query automatically targets only the necessary files.
Use proper collation to utilize predicate pushdown for character columns
Data in a Parquet file is organized in row groups. Serverless SQL pool skips row groups based on the specified predicate in the WHERE clause, which reduces IO. The result is increased query performance.
Predicate pushdown for character columns in Parquet files is supported for Latin1_General_100_BIN2_UTF8 collation only. You can specify collation for a particular column by using a WITH clause. If you don't specify this collation by using a WITH clause, the database collation is used.
Optimize repeating queries
Here are best practices for using CETAS in serverless SQL pool.
Use CETAS to enhance query performance and joins
CETAS is one of the most important features available in serverless SQL pool. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.
You can use CETAS to materialize frequently used parts of queries, like joined reference tables, to a new set of files. You can then join to this single external table instead of repeating common joins in multiple queries.
As CETAS generates Parquet files, statistics are automatically created when the first query targets this external table. The result is improved performance for subsequent queries targeting table generated with CETAS.
Review the troubleshooting article for solutions to common problems. If you're working with a dedicated SQL pool rather than serverless SQL pool, see Best practices for dedicated SQL pools for specific guidance.
Lähetä ja näytä palaute kohteelle