Data can be moved from different sources into your SQL Data Warehouse with a variety tools. ADF Copy, SSIS, and bcp can all be used to achieve this goal. However, as the amount of data increases you should think about breaking down the data migration process into steps. This affords you the opportunity to optimize each step both for performance and for resilience to ensure a smooth data migration.
This article first discusses the simple migration scenarios of ADF Copy, SSIS, and bcp. It then look a little deeper into how the migration can be optimized.
Azure Data Factory (ADF) copy
ADF Copy is part of Azure Data Factory. You can use ADF Copy to export your data to flat files residing on local storage, to remote flat files held in Azure blob storage or directly into SQL Data Warehouse.
If your data starts in flat files, then you will first need to transfer it to Azure storage blob before initiating a load it into SQL Data Warehouse. Once the data is transferred into Azure blob storage you can choose to use ADF Copy again to push the data into SQL Data Warehouse.
PolyBase also provides a high-performance option for loading the data. However, that does mean using two tools instead of one. If you need the best performance then use PolyBase. If you want a single tool experience (and the data is not massive) then ADF is your answer.
Head over to the following article for some great ADF samples.
Integration Services (SSIS) is a powerful and flexible Extract Transform and Load (ETL) tool that supports complex workflows, data transformation, and several data loading options. Use SSIS to simply transfer data to Azure or as part of a broader migration.
SSIS can export to UTF-8 without the byte order mark in the file. To configure this you must first use the derived column component to convert the character data in the data flow to use the 65001 UTF-8 code page. Once the columns have been converted, write the data to the flat file destination adapter ensuring that 65001 has also been selected as the code page for the file.
SSIS connects to SQL Data Warehouse just as it would connect to a SQL Server deployment. However, your connections will need to be using an ADO.NET connection manager. You should also take care to configure the "Use bulk insert when available" setting to maximize throughput. Please refer to the ADO.NET destination adapter article to learn more about this property
Connecting to Azure SQL Data Warehouse by using OLEDB is not supported.
In addition, there is always the possibility that a package might fail due to throttling or network issues. Design packages so they can be resumed at the point of failure, without redoing work that completed before the failure.
For more information consult the SSIS documentation.
bcp is a command-line utility that is designed for flat file data import and export. Some transformation can take place during data export. To perform simple transformations use a query to select and transform the data. Once exported, the flat files can then be loaded directly into the target the SQL Data Warehouse database.
It is often a good idea to encapsulate the transformations used during data export in a view on the source system. This ensures that the logic is retained and the process is repeatable.
Advantages of bcp are:
- Simplicity. bcp commands are simple to build and execute
- Re-startable load process. Once exported the load can be executed any number of times
Limitations of bcp are:
- bcp works with tabulated flat files only. It does not work with files such as xml or JSON
- Data transformation capabilities are limited to the export stage only and are simple in nature
- bcp has not been adapted to be robust when loading data over the internet. Any network instability may cause a load error.
- bcp relies on the schema being present in the target database prior to the load
For more information, see Use bcp to load data into SQL Data Warehouse.
Optimizing data migration
A SQLDW data migration process can be effectively broken down into three discrete steps:
- Export of source data
- Transfer of data to Azure
- Load into the target SQLDW database
Each step can be individually optimized to create a robust, re-startable and resilient migration process that maximizes performance at each step.
Optimizing data load
Looking at these in reverse order for a moment; the fastest way to load data is via PolyBase. Optimizing for a PolyBase load process places prerequisites on the preceding steps so it's best to understand this upfront. They are:
- Encoding of data files
- Format of data files
- Location of data files
PolyBase requires data files to be UTF-8 or UTF-16FE.
Format of data files
PolyBase mandates a fixed row terminator of \n or newline. Your data files must conform to this standard. There aren't any restrictions on string or column terminators.
You will have to define every column in the file as part of your external table in PolyBase. Make sure that all exported columns are required and that the types conform to the required standards.
Please refer back to the [migrate your schema] article for detail on supported data types.
Location of data files
SQL Data Warehouse uses PolyBase to load data from Azure Blob Storage exclusively. Consequently, the data must have been first transferred into blob storage.
Optimizing data transfer
One of the slowest parts of data migration is the transfer of the data to Azure. Not only can network bandwidth be an issue but also network reliability can seriously hamper progress. By default migrating data to Azure is over the internet so the chances of transfer errors occurring are reasonably likely. However, these errors may require data to be re-sent either in whole or in part.
Fortunately you have several options to improve the speed and resilience of this process:
You may want to consider using ExpressRoute to speed up the transfer. ExpressRoute provides you with an established private connection to Azure so the connection does not go over the public internet. This is by no means a mandatory step. However, it will improve throughput when pushing data to Azure from an on-premises or co-location facility.
The benefits of using ExpressRoute are:
- Increased reliability
- Faster network speed
- Lower network latency
- higher network security
ExpressRoute is beneficial for a number of scenarios; not just the migration.
Interested? For more information and pricing please visit the ExpressRoute documentation.
Azure Import and Export Service
The Azure Import and Export Service is a data transfer process designed for large (GB++) to massive (TB++) transfers of data into Azure. It involves writing your data to disks and shipping them to an Azure data center. The disk contents will then be loaded into Azure Storage Blobs on your behalf.
A high-level view of the import export process is as follows:
- Configure an Azure Blob Storage container to receive the data
- Export your data to local storage
- Copy the data to 3.5 inch SATA II/III hard disk drives using the [Azure Import/Export Tool]
- Create an Import Job using the Azure Import and Export Service providing the journal files produced by the [Azure Import/Export Tool]
- Ship the disks your nominated Azure data center
- Your data is transferred to your Azure Blob Storage container
- Load the data into SQLDW using PolyBase
The AZCopy utility is a great tool for getting your data transferred into Azure Storage Blobs. It is designed for small (MB++) to very large (GB++) data transfers. AZCopy has also been designed to provide good resilient throughput when transferring data to Azure and so is a great choice for the data transfer step. Once transferred you can load the data using PolyBase into SQL Data Warehouse. You can also incorporate AZCopy into your SSIS packages using an "Execute Process" task.
To upload a file from your file system you will need a command like the one below:
AzCopy /Source:C:\myfolder /Dest:https://myaccount.blob.core.windows.net/mycontainer /DestKey:key /Pattern:abc.txt
A high-level process summary could be:
- Configure an Azure storage blob container to receive the data
- Export your data to local storage
- AZCopy your data in the Azure Blob Storage container
- Load the data into SQL Data Warehouse using PolyBase
Full documentation available: AZCopy.
Optimizing data export
In addition to ensuring that the export conforms to the requirements laid out by PolyBase you can also seek to optimize the export of the data to improve the process further.
PolyBase can read gzip compressed data. If you are able to compress your data to gzip files then you will minimize the amount of data being pushed over the network.
Breaking up large tables into several files not only helps to improve export speed, it also helps with transfer re-startability, and the overall manageability of the data once in the Azure blob storage. One of the many nice features of PolyBase is that it will read all the files inside a folder and treat it as one table. It is therefore a good idea to isolate the files for each table into its own folder.
PolyBase also supports a feature known as "recursive folder traversal". You can use this feature to further enhance the organization of your exported data to improve your data management.
To learn more about loading data with PolyBase, see Use PolyBase to load data into SQL Data Warehouse.