COPY (Transact-SQL) (preview)

APPLIES TO: Azure Synapse Analytics (SQL DW). Does not apply to SQL Server, Azure SQL Database, or Parallel Data Warehouse.

This article explains how to use the COPY statement in Azure SQL Data Warehouse for loading from external storage accounts. The COPY statement provides the most flexibility for high-throughput data ingestion into SQL Data Warehouse.

Note

The COPY statement is currently in public preview.

Syntax

COPY INTO [schema.]table_name
[(Column_list)]
FROM '<external_location>' [,...n]
WITH
 (
 [FILE_TYPE = {'CSV' | 'PARQUET' | 'ORC'} ]
 [,FILE_FORMAT = EXTERNAL FILE FORMAT OBJECT ]
 [,CREDENTIAL = (AZURE CREDENTIAL) ]
 [,ERRORFILE = '[http(s)://storageaccount/container]/errorfile_directory[/]' ]
 [,ERRORFILE_CREDENTIAL = (AZURE CREDENTIAL) ]
 [,MAXERRORS = max_errors ]
 [,COMPRESSION = {'Gzip' | 'DefaultCodec' | 'Snappy'} ]
 [,FIELDQUOTE = 'string_delimiter']
 [,FIELDTERMINATOR = 'field_terminator']
 [,ROWTERMINATOR = 'row_terminator']
 [,FIRSTROW = first_row]
 [,DATEFORMAT = 'date_format']
 [,ENCODING = {'UTF8' | 'UTF16'}]
 [,IDENTITY_INSERT = {'ON' | 'OFF'}]
)

Arguments

schema_name
Is optional if the default schema for the user performing the operation is the schema of the specified table. If schema is not specified, and the default schema of the user performing the COPY operation is different from the specified table, COPY will be canceled, and an error message will be returned.

table_name
Is the name of the table to COPY data into. The target table can be a temporary or permanent table and must already exist in the database.

(column_list)
Is an optional list of one or more columns used to map source data fields to target table columns for loading data. column_list must be enclosed in parentheses and delimited by commas. The column list is of the following format:

[(Column_name [Default_value] [Field_number] [,...n])]

  • Column_name - the name of the column in the target table.
  • Default_value - the default value that replaces any NULL value in the input file. The default value applies to all file formats. COPY loads NULL when a column is omitted from the column list or when an input file field is empty.
  • Field_number - the input file field number that is mapped to the target column name. Field indexing starts at 1.

When a column list is not specified, COPY will map columns based on the source and target ordinality: Input field 1 will go to target column 1, field 2 will go to column 2, etc.
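
For illustration, a minimal sketch (hypothetical table dbo.my_table and placeholder storage path) that maps input field 1 to col_a with a default value and input field 3 to col_b:

COPY INTO dbo.my_table (col_a default 'N/A' 1, col_b 3)
FROM 'https://<account>.blob.core.windows.net/<container>/folder1/'
WITH (FILE_TYPE = 'CSV')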

External location(s)
Is where the files containing the data are staged. Currently Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage are supported:

  • External location for Blob Storage: https://<account>.blob.core.windows.net/<container>/<path>
  • External location for ADLS Gen2: https://<account>.dfs.core.windows.net/<container>/<path>

Note

The blob endpoint is available for ADLS Gen2 and is only for backward compatibility. Use the dfs endpoint for ADLS Gen2 for best performance.

  • Account - The storage account name

  • Container - The blob container name

  • Path - the folder or file path for the data. The location starts from the container. If a folder is specified, COPY will retrieve all files from the folder and all its subfolders. COPY ignores hidden folders and doesn't return files that begin with an underscore (_) or a period (.) unless explicitly specified in the path. This behavior is the same even when specifying a path with a wildcard.

Wildcards can be included in the path, where:

  • Wildcard path name matching is case-sensitive
  • A wildcard can be escaped using the backslash character (\)
  • Wildcard expansion is applied recursively. For instance, all CSV files under Customer1 (including subdirectories of Customer1) will be loaded in the following example: 'Account/Container/Customer1/*.csv'

Note

For best performance, avoid specifying wildcards that would expand over a larger number of files. If possible, list multiple file locations instead of specifying wildcards.

Multiple file locations can only be specified from the same storage account and container via a comma-separated list such as:

  • 'https://<account>.blob.core.windows.net/<container>/<path>', 'https://<account>.blob.core.windows.net/<container>/<path>', …
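
For illustration, a minimal sketch (hypothetical table and placeholder paths) listing two locations from the same storage account and container:

COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/folder1/file1.csv',
     'https://<account>.blob.core.windows.net/<container>/folder2/file2.csv'
WITH (FILE_TYPE = 'CSV')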

FILE_TYPE = { 'CSV' | 'PARQUET' | 'ORC' }
FILE_TYPE specifies the format of the external data.

  • CSV: Specifies a comma-separated values file compliant to the RFC 4180 standard.
  • PARQUET: Specifies a Parquet format.
  • ORC: Specifies an Optimized Row Columnar (ORC) format.

Note

The file type 'Delimited Text' in PolyBase is replaced by the 'CSV' file type, where the default comma delimiter can be configured via the FIELDTERMINATOR parameter.

FILE_FORMAT = external_file_format_name
FILE_FORMAT applies to Parquet and ORC files only and specifies the name of the external file format object that stores the file type and compression method for the external data. To create an external file format, use CREATE EXTERNAL FILE FORMAT.

CREDENTIAL (IDENTITY = '<identity>', SECRET = '<secret>')
CREDENTIAL specifies the authentication mechanism to access the external storage account. Authentication methods are:

                         CSV                                 Parquet                             ORC
Azure Blob Storage       SAS/MSI/SERVICE PRINCIPAL/KEY/AAD   SAS/KEY                             SAS/KEY
Azure Data Lake Gen2     SAS/MSI/SERVICE PRINCIPAL/KEY/AAD   SAS/MSI/SERVICE PRINCIPAL/KEY/AAD   SAS/MSI/SERVICE PRINCIPAL/KEY/AAD

When authenticating using AAD or to a public storage account, CREDENTIAL does not need to be specified.

  • Authenticating with Shared Access Signatures (SAS)

    • IDENTITY: A constant with a value of 'Shared Access Signature'
    • SECRET: The shared access signature provides delegated access to resources in your storage account.
    • Minimum permissions required: READ and LIST

  • Authenticating with Service Principals

    • IDENTITY: <client_id>@<OAuth_2.0_Token_EndPoint>
    • SECRET: AAD Application Service Principal key
    • Minimum RBAC roles required: Storage blob data contributor, Storage blob data owner, or Storage blob data reader

    Note

    Use the OAuth 2.0 token endpoint V1

  • Authenticating with Storage account key

    • IDENTITY: A constant with a value of 'Storage Account Key'
    • SECRET: Storage account key

  • Authenticating with Managed Identity (VNet Service Endpoints)

    • IDENTITY: A constant with a value of 'Managed Identity'
    • Minimum RBAC roles required: Storage blob data contributor, Storage blob data owner, or Storage blob data reader for the AAD registered SQL Database server

  • Authenticating with an AAD user

    • CREDENTIAL is not required
    • Minimum RBAC roles required: Storage blob data contributor, Storage blob data owner, or Storage blob data reader for the AAD user
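
For illustration, a minimal sketch (hypothetical table and placeholder path) authenticating with Managed Identity, assuming the AAD registered SQL Database server has already been granted one of the RBAC roles listed above:

COPY INTO dbo.my_table
FROM 'https://<account>.dfs.core.windows.net/<container>/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)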

ERRORFILE = Directory Location
ERRORFILE only applies to CSV. Specifies the directory within the COPY statement where the rejected rows and the corresponding error file should be written. The full path from the storage account can be specified, or the path relative to the container can be specified. If the specified path doesn't exist, one will be created on your behalf. A child directory is created with the name "_rejectedrows". The "_" character ensures that the directory is escaped for other data processing unless explicitly named in the location parameter.

Within this directory, there's a folder created based on the time of load submission in the format YearMonthDay-HourMinuteSecond (for example, 20180330-173205). In this folder, two types of files are written: the reason (Error) file and the data (Row) file, each prefixed with the queryID, distributionID, and a file GUID. Because the data and the reason are in separate files, corresponding files have a matching prefix.

If ERRORFILE has the full path of the storage account defined, then the ERRORFILE_CREDENTIAL will be used to connect to that storage. Otherwise, the value mentioned for CREDENTIAL will be used.
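
For illustration, a minimal sketch (hypothetical table, placeholder paths and tokens) writing rejected rows to a different storage account, so ERRORFILE_CREDENTIAL is supplied alongside CREDENTIAL:

COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<load_sas_token>'),
    MAXERRORS = 100,
    ERRORFILE = 'https://<erroraccount>.blob.core.windows.net/<errorcontainer>/errorsfolder/',
    ERRORFILE_CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<error_sas_token>')
)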

ERRORFILE_CREDENTIAL = (IDENTITY = '<identity>', SECRET = '<secret>')
ERRORFILE_CREDENTIAL only applies to CSV files. Supported data source and authentication methods are:

  • Azure Blob Storage - SAS/SERVICE PRINCIPAL/KEY/AAD

  • Azure Data Lake Gen2 - SAS/MSI/SERVICE PRINCIPAL/KEY/AAD

  • Authenticating with Shared Access Signatures (SAS)

    • IDENTITY: A constant with a value of ‘Shared Access Signature’
    • SECRET: The shared access signature provides delegated access to resources in your storage account.
    • Minimum permissions required: READ, LIST, WRITE, CREATE, DELETE
  • Authenticating with Service Principals

    • IDENTITY: <client_id>@<OAuth_2.0_Token_EndPoint>
    • SECRET: AAD Application Service Principal key
    • Minimum RBAC roles required: Storage blob data contributor or Storage blob data owner

Note

Use the OAuth 2.0 token endpoint V1

  • Authenticating with Storage account key

    • IDENTITY: A constant with a value of ‘Storage Account Key’
    • SECRET: Storage account key
  • Authenticating with Managed Identity (VNet Service Endpoints)

    • IDENTITY: A constant with a value of ‘Managed Identity’
    • Minimum RBAC roles required: Storage blob data contributor or Storage blob data owner for the AAD registered SQL Database server
  • Authenticating with an AAD user

    • CREDENTIAL is not required
    • Minimum RBAC roles required: Storage blob data contributor or Storage blob data owner for the AAD user

Note

If you are using the same storage account for your ERRORFILE and specifying the ERRORFILE path relative to the root of the container, you do not need to specify the ERRORFILE_CREDENTIAL.

MAXERRORS = max_errors
MAXERRORS specifies the maximum number of rejected rows allowed in the load before the COPY operation is canceled. Each row that cannot be imported by the COPY operation is ignored and counted as one error. If max_errors is not specified, the default is 0.

COMPRESSION = { 'DefaultCodec' | 'Snappy' | 'GZIP' | 'NONE' }
COMPRESSION is optional and specifies the data compression method for the external data.

  • CSV supports GZIP
  • Parquet supports GZIP and Snappy
  • ORC supports DefaultCodec and Snappy.
  • Zlib is the default compression for ORC

The COPY command will autodetect the compression type based on the file extension when this parameter is not specified:

  • .gz - GZIP
  • .snappy – Snappy
  • .deflate - DefaultCodec (Parquet and ORC only)
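
For illustration, a minimal sketch (hypothetical table and placeholder path) where the file names carry no .gz extension, so the compression is declared explicitly:

COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/compressed_folder/'
WITH (
    FILE_TYPE = 'CSV',
    COMPRESSION = 'Gzip'
)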

FIELDQUOTE = 'field_quote'
FIELDQUOTE applies to CSV and specifies a single character that will be used as the quote character (string delimiter) in the CSV file. If not specified, the double quote character (") is used, as defined in the RFC 4180 standard. Extended ASCII characters are not supported with UTF-8 for FIELDQUOTE.

Note

A FIELDQUOTE character inside a string column is escaped by a double FIELDQUOTE (two consecutive FIELDQUOTE characters).
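
For illustration, a hypothetical input row under the default FIELDQUOTE ('"'); the doubled quote is read as a single literal quote character:

-- Input row:              1,"She said ""hello"" to me",2019-03-30
-- Second field loads as:  She said "hello" to me
COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/quoted.csv'
WITH (FILE_TYPE = 'CSV', FIELDQUOTE = '"')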

FIELDTERMINATOR = 'field_terminator'
FIELDTERMINATOR only applies to CSV. Specifies the field terminator that will be used in the CSV file. The field terminator can be specified using hexadecimal notation. The field terminator can be multi-character. The default field terminator is a comma (,). For more information, see Specify Field and Row Terminators (SQL Server).

ROWTERMINATOR = 'row_terminator'
ROWTERMINATOR only applies to CSV. Specifies the row terminator that will be used in the CSV file. The row terminator can be specified using hexadecimal notation. The row terminator can be multi-character. By default, the row terminator is \r\n.

The COPY command prefixes the \r character when specifying \n (newline) resulting in \r\n. To specify only the \n character, use hexadecimal notation (0x0A). When specifying multi-character row terminators in hexadecimal, do not specify 0x between each character.

For additional guidance on specifying row terminators, see Specify Field and Row Terminators (SQL Server).
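
For illustration, a minimal sketch (hypothetical table and placeholder path) loading files that end each row with a line feed only:

COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/unix_output.csv'
WITH (
    FILE_TYPE = 'CSV',
    ROWTERMINATOR = '0x0A'  -- line feed only; a \r\n terminator in hex would be '0x0D0A', with 0x written once
)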

FIRSTROW = First_row_int
FIRSTROW applies to CSV and specifies the row number that is read first in all files for the COPY command. Values start from 1, which is the default value. If the value is set to two, the first row in every file (header row) is skipped when the data is loaded. Rows are skipped based on the existence of row terminators.

DATEFORMAT = { 'mdy' | 'dmy' | 'ymd' | 'ydm' | 'myd' | 'dym' }
DATEFORMAT only applies to CSV and specifies the date format of the date mapping to SQL Server date formats. For an overview of all Transact-SQL date and time data types and functions, see Date and Time Data Types and Functions (Transact-SQL). DATEFORMAT within the COPY command takes precedence over DATEFORMAT configured at the session level.
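
For illustration, a minimal sketch (hypothetical table and placeholder path) where the session DATEFORMAT is dmy but the COPY statement's DATEFORMAT takes precedence for this load:

SET DATEFORMAT dmy;

COPY INTO dbo.my_table
FROM 'https://<account>.blob.core.windows.net/<container>/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    DATEFORMAT = 'ymd'  -- dates such as 2019-03-30 in the input are parsed as year-month-day
)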

ENCODING = 'UTF8' | 'UTF16'
ENCODING only applies to CSV. Default is UTF8. Specifies the data encoding standard for the files loaded by the COPY command.

IDENTITY_INSERT = 'ON' | 'OFF'
IDENTITY_INSERT specifies whether the identity value or values in the imported data file are to be used for the identity column. If IDENTITY_INSERT is OFF (default), the identity values for this column are verified, but not imported. SQL DW will automatically assign unique values based on the seed and increment values specified during table creation. Note the following behavior with the COPY command:

  • If IDENTITY_INSERT is OFF, and table has an identity column
    • A column list must be specified which does not map an input field to the identity column.
  • If IDENTITY_INSERT is ON, and table has an identity column
    • If a column list is passed, it must map an input field to the identity column.
  • Default value is not supported for the IDENTITY COLUMN in the column list.
  • IDENTITY_INSERT can only be set for one table at a time.
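
For illustration, a minimal sketch (hypothetical table dbo.orders with an identity column and placeholder paths) showing both settings:

-- Hypothetical target table with an identity column
CREATE TABLE dbo.orders (
    order_id   INT IDENTITY(1, 1) NOT NULL,
    order_desc VARCHAR(100)       NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);

-- IDENTITY_INSERT is OFF by default: the column list must not map an input field to order_id
COPY INTO dbo.orders (order_desc 2)
FROM 'https://<account>.blob.core.windows.net/<container>/orders.csv'
WITH (FILE_TYPE = 'CSV');

-- With IDENTITY_INSERT = 'ON', a column list (if supplied) must map an input field to order_id
COPY INTO dbo.orders (order_id 1, order_desc 2)
FROM 'https://<account>.blob.core.windows.net/<container>/orders.csv'
WITH (FILE_TYPE = 'CSV', IDENTITY_INSERT = 'ON');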

Permissions

The user executing the COPY command must have the following permissions:

Requires INSERT and ADMINISTER BULK OPERATIONS permissions. In Azure SQL Data Warehouse, INSERT and ADMINISTER DATABASE BULK OPERATIONS permissions are required.
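
For illustration, a sketch granting those permissions to a hypothetical loading user before it runs COPY into dbo.lineitem:

GRANT ADMINISTER DATABASE BULK OPERATIONS TO [loading_user];
GRANT INSERT ON dbo.lineitem TO [loading_user];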

Examples

A. Load from a public storage account

The following example is the simplest form of the COPY command, which loads data from a public storage account. For this example, the COPY statement's defaults match the format of the line item csv file.

COPY INTO dbo.[lineitem] FROM 'https://unsecureaccount.blob.core.windows.net/customerdatasets/folder1/lineitem.csv'

The default values of the COPY command are:

  • DATEFORMAT = Session DATEFORMAT

  • MAXERRORS = 0

  • COMPRESSION default is uncompressed

  • FIELDQUOTE = '"' (double quote)

  • FIELDTERMINATOR = “,”

  • ROWTERMINATOR = '\n'

Important

COPY treats ‘\n’ as ‘\r\n’ internally. For more information, see the ROWTERMINATOR section.

  • FIRSTROW = 1

  • ENCODING = ‘UTF8’

  • FILE_TYPE = ‘CSV’

  • IDENTITY_INSERT = ‘OFF’

B. Load authenticating via Shared Access Signature (SAS)

The following example loads files that use the line feed as a row terminator such as a UNIX output. This example also uses a SAS key to authenticate to Azure blob storage.

COPY INTO test_1
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>'),
    --CREDENTIAL should look something like this:
    --CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='?sv=2018-03-28&ss=bfqt&srt=sco&sp=rl&st=2016-10-17T20%3A14%3A55Z&se=2021-10-18T20%3A19%3A00Z&sig=IEoOdmeYnE9%2FKiJDSHFSYsz4AkNa%2F%2BTx61FuQ%2FfKHefqoBE%3D'),
    FIELDQUOTE = '"',
    FIELDTERMINATOR=';',
    ROWTERMINATOR='0x0A',
    ENCODING = 'UTF8',
    DATEFORMAT = 'ymd',
    MAXERRORS = 10,
    ERRORFILE = '/errorsfolder/', --path starting from the storage container
    IDENTITY_INSERT = 'ON'
)

C. Load with a column list with default values authenticating via Storage Account Key

This example loads files specifying a column list with default values.

--Note when specifying the column list, input field numbers start from 1
COPY INTO test_1 (Col_one default 'myStringDefault' 1, Col_two default 1 3)
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL=(IDENTITY= 'Storage Account Key', SECRET='<Your_Account_Key>'),
    --CREDENTIAL should look something like this:
    --CREDENTIAL=(IDENTITY= 'Storage Account Key', SECRET='x6RWv4It5F2msnjelv3H4DA80n0PQW0daPdw43jM0nyetx4c6CpDkdj3986DX5AHFMIf/YN4y6kkCnU8lb+Wx0Pj+6MDw=='),
    FIELDQUOTE = '"',
    FIELDTERMINATOR=',',
    ROWTERMINATOR='0x0A',
    ENCODING = 'UTF8',
    FIRSTROW = 2
)

D. Load Parquet or ORC using existing file format object

This example uses a wildcard to load all parquet files under a folder.

COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
    FILE_FORMAT = myFileFormat,
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)

E. Load specifying wild cards and multiple files

COPY INTO t1
FROM 
    'https://myaccount.blob.core.windows.net/myblobcontainer/folder0/*.txt',
    'https://myaccount.blob.core.windows.net/myblobcontainer/folder1'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL=(IDENTITY= '<client_id>@<OAuth_2.0_Token_EndPoint>', SECRET='<key>'),
    FIELDTERMINATOR = '|'
)

FAQ

What is the performance of the COPY command compared to PolyBase?

The COPY command will have better performance by the time the feature is generally available. For best loading performance during public preview, consider splitting your input into multiple files when loading CSV. Currently, COPY is on par with PolyBase in terms of performance when using INSERT SELECT.

What is the file splitting guidance for the COPY command loading CSV files?

Guidance on the number of files is outlined in the table below. Once the recommended number of files is reached, larger files give better performance. You won't need to split your non-compressed files when the COPY command is generally available.

DWU       #Files
100       60
200       60
300       60
400       60
500       60
1,000     120
1,500     180
2,000     240
2,500     300
3,000     360
5,000     600
6,000     720
7,500     900
10,000    1200
15,000    1800
30,000    3600

What is the file splitting guidance for the COPY command loading Parquet or ORC files?

There is no need to split Parquet and ORC files because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be 256 MB or larger for best performance.

When will the COPY command be generally available?

The COPY command will be generally available by the end of this calendar year (2020).

Are there any known issues with the COPY command?

  • LOB support such as (n)varchar(max) is not available in the COPY statement. This will be available early next year.

Please send any feedback or issues to the following distribution list: sqldwcopypreview@service.microsoft.com

See also

Loading overview with SQL Data Warehouse