Azure Data Factory - Security considerations for data movement
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see data movement security considerations for Data Factory.
This article describes the basic security infrastructure that data movement services in Azure Data Factory use to secure your data. Azure Data Factory management resources are built on Azure security infrastructure and use the security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is available only in the West US, East US, and North Europe regions, the data movement service is available globally in several regions. The Data Factory service ensures that data does not leave a geographical area or region unless you explicitly instruct the service to use an alternate region, for example when the data movement service is not yet deployed to that region.
Azure Data Factory itself does not store any data except for linked service credentials for cloud data stores, which are encrypted using certificates. It lets you create data-driven workflows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.
Data movement using Azure Data Factory has been certified for:
If you are interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust Center.
In this article, we review security considerations in the following two data movement scenarios:
- Cloud scenario: In this scenario, both your source and destination are publicly accessible through the internet. These include managed cloud storage services such as Azure Storage, Azure SQL Data Warehouse, Azure SQL Database, Azure Data Lake Store, Amazon S3, and Amazon Redshift, SaaS services such as Salesforce, and web protocols such as FTP and OData. For the complete set, see the list of supported data sources.
- Hybrid scenario: In this scenario, either your source or your destination (most often the source) is not publicly accessible because it is behind a firewall, inside an on-premises corporate network, or in a private network or virtual network. Database servers hosted on virtual machines also fall under this scenario.
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.
Securing data store credentials
Azure Data Factory protects your data store credentials by encrypting them with certificates managed by Microsoft. These certificates are rotated every two years, which includes renewal of the certificate and migration of the credentials. The encrypted credentials are securely stored in Azure Storage managed by Azure Data Factory management services. For more information about Azure Storage security, see Azure Storage Security Overview.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory and the cloud data store go over a secure HTTPS or TLS channel.
All connections to Azure SQL Database and Azure SQL Data Warehouse require encryption (SSL/TLS) while data is in transit to and from the database. When you author a pipeline by using a JSON editor, add the encryption property to the connection string and set it to true. When you use the Copy Wizard, the wizard sets this property by default. For Azure Storage, you can use HTTPS in the connection string.
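As a rough illustration of the connection string settings described above, the following Python sketch checks whether a connection string asks for encryption in transit. It assumes the usual `key=value;` connection string layout, that the SQL encryption property is spelled `Encrypt`, and that Azure Storage selects HTTPS through `DefaultEndpointsProtocol=https`. The helper names are hypothetical and are not part of Data Factory.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split 'key=value;key=value' into a dict with lower-cased keys."""
    pairs = {}
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ';'
        # Split on the first '=' only, so values that contain '='
        # (for example base64 account keys) stay intact.
        key, _, value = part.partition("=")
        pairs[key.strip().lower()] = value.strip()
    return pairs


def sql_encryption_enabled(conn_str: str) -> bool:
    """True if the (assumed) Encrypt property is set to true."""
    return parse_connection_string(conn_str).get("encrypt", "").lower() == "true"


def storage_uses_https(conn_str: str) -> bool:
    """True if the storage connection string selects the https protocol."""
    settings = parse_connection_string(conn_str)
    return settings.get("defaultendpointsprotocol", "").lower() == "https"
```

A check like this could run before a linked service definition is deployed, so that a connection string without the encryption setting is caught early.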
Data encryption at rest
Some data stores support encryption of data at rest. We suggest that you enable the data encryption mechanism for those data stores.
Azure SQL Data Warehouse
Transparent Data Encryption (TDE) in Azure SQL Data Warehouse helps protect against the threat of malicious activity by performing real-time encryption and decryption of your data at rest. This behavior is transparent to the client. For more information, see Secure a database in SQL Data Warehouse.
Azure SQL Database
Azure SQL Database also supports Transparent Data Encryption (TDE), which helps protect against the threat of malicious activity by performing real-time encryption and decryption of the data without requiring changes to the application. This behavior is transparent to the client. For more information, see Transparent Data Encryption with Azure SQL Database.
Azure Data Lake Store
Azure Data Lake Store also provides encryption for data stored in the account. When encryption is enabled, Data Lake Store automatically encrypts data before persisting it and decrypts it before retrieval, which makes the encryption transparent to the client accessing the data. For more information, see Security in Azure Data Lake Store.
Azure Blob Storage and Azure Table Storage
Azure Blob Storage and Azure Table Storage support Storage Service Encryption (SSE), which automatically encrypts your data before persisting it to storage and decrypts it before retrieval. For more information, see Azure Storage Service Encryption for Data at Rest.
Amazon S3
Amazon S3 supports both client and server encryption of data at rest. For more information, see Protecting Data Using Encryption. Currently, Data Factory does not support Amazon S3 inside a virtual private cloud (VPC).
Amazon Redshift
Amazon Redshift supports cluster encryption for data at rest. For more information, see Amazon Redshift Database Encryption. Currently, Data Factory does not support Amazon Redshift inside a VPC.
Salesforce
Salesforce supports Shield Platform Encryption, which allows encryption of all files, attachments, and custom fields. For more information, see Understanding the Web Server OAuth Authentication Flow.
Hybrid Scenarios (using Data Management Gateway)
Hybrid scenarios require Data Management Gateway to be installed in an on-premises network or inside a virtual network (Azure) or a virtual private cloud (Amazon). The gateway must be able to access the local data stores. For more information about the gateway, see Data Management Gateway.
The command channel allows communication between data movement services in Data Factory and Data Management Gateway. The communication contains information related to the activity. The data channel is used for transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials for your on-premises data stores are stored locally (not in the cloud). They can be set in three different ways.
- Use plain text (less secure) via HTTPS from the Azure portal or Copy Wizard. The credentials are passed in plain text to the on-premises gateway.
- Use the click-once based credential manager app. The click-once application runs on the on-premises machine that has access to the gateway and sets credentials for the data store. This option and the next one are the most secure options. By default, the credential manager app uses port 8050 on the gateway machine for secure communication.
- Use the New-AzDataFactoryEncryptValue PowerShell cmdlet to encrypt credentials. The cmdlet encrypts the credentials with the certificate that the gateway is configured to use. Add the encrypted credentials returned by this cmdlet to the EncryptedCredential element of the connectionString in the JSON file that you use with the New-AzDataFactoryLinkedService cmdlet, or in the JSON snippet in the Data Factory Editor in the portal. This option and the click-once application are the most secure options.
Supported browsers: IE8, IE9, IE10, IE11, Microsoft Edge, and latest Firefox, Chrome, Opera, Safari browsers.
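To make the third option above more concrete, here is a minimal Python sketch that assembles a linked service definition whose connectionString carries an EncryptedCredential element. The JSON layout (the `OnPremisesSqlServer` type, the `typeProperties` and `gatewayName` fields, and the `Data Source` and `Initial Catalog` keys) is an assumption about what an on-premises SQL Server linked service looks like, and every value in angle brackets is a placeholder. The encrypted value itself would come from New-AzDataFactoryEncryptValue; this sketch does not perform any encryption.

```python
import json


def build_linked_service(name, gateway_name, server, database, encrypted_credential):
    """Return a linked service definition as a dict (assumed layout)."""
    connection_string = (
        f"Data Source={server};Initial Catalog={database};"
        f"Integrated Security=False;EncryptedCredential={encrypted_credential}"
    )
    return {
        "name": name,
        "properties": {
            "type": "OnPremisesSqlServer",  # assumed type name
            "typeProperties": {
                "connectionString": connection_string,
                "gatewayName": gateway_name,
            },
        },
    }


# All arguments are placeholders; substitute your own values.
definition = build_linked_service(
    name="<linked service name>",
    gateway_name="<gateway name>",
    server="<server name>",
    database="<database name>",
    encrypted_credential="<output of New-AzDataFactoryEncryptValue>",
)
print(json.dumps(definition, indent=2))
```

The printed JSON could then be saved to a file and passed to the New-AzDataFactoryLinkedService cmdlet, as described above.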
Click-once credentials manager app
You can launch the click-once based credential manager app from the Azure portal or Copy Wizard when authoring pipelines. This application ensures that credentials are not transferred in plain text over the wire. By default, it uses port 8050 on the gateway machine for secure communication. If necessary, this port can be changed.
Currently, Data Management Gateway uses a single certificate. This certificate is created during the gateway installation (applies to Data Management Gateway created after November 2016 and version 2.4.xxxx.x or later). You can replace this certificate with your own SSL/TLS certificate. The click-once credential manager application uses this certificate to connect securely to the gateway machine when setting data store credentials. The gateway stores data store credentials securely on-premises by using Windows DPAPI on the gateway machine.
Older gateways that were installed before November 2016 or that are version 2.3.xxxx.x continue to use credentials that are encrypted and stored in the cloud. Even if you upgrade the gateway to the latest version, the credentials are not migrated to an on-premises machine.
| Gateway version (during creation) | Credentials stored | Credential encryption/security |
| --- | --- | --- |
| <= 2.3.xxxx.x | On cloud | Encrypted using certificate (different from the one used by Credential manager app) |
| >= 2.4.xxxx.x | On premises | Secured via DPAPI |
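The rule in the table above can be expressed as a small lookup. The Python sketch below is only an illustration: the function name and return strings are hypothetical, and it looks only at the major and minor parts of the version the gateway had when it was created, so it does not capture the November 2016 installation date condition.

```python
def credential_storage(version_at_creation: str) -> str:
    """Map a gateway's creation-time version to where its credentials live.

    Only the first two version components are compared, so a value such as
    '2.4.xxxx.x' is accepted as written in the table.
    """
    major, minor = (int(part) for part in version_at_creation.split(".")[:2])
    if (major, minor) >= (2, 4):
        return "on-premises (secured via DPAPI)"
    return "cloud (encrypted using certificate)"
```

Because the table is keyed on the version at creation time, the value passed in should be the version the gateway was originally installed with, not the version it was later upgraded to.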
Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during communication with Azure services.
A virtual network is a logical representation of your network in the cloud. You can connect an on-premises network to your Azure virtual network (VNet) by setting up IPSec VPN (site-to-site) or ExpressRoute (Private Peering).
The following table summarizes the network and gateway configuration recommendations based on different combinations of source and destination locations for hybrid data movement.
| Source | Destination | Network configuration | Gateway setup |
| --- | --- | --- | --- |
| On-premises | Virtual machines and cloud services deployed in virtual networks | IPSec VPN (point-to-site or site-to-site) | Gateway can be installed either on-premises or on an Azure virtual machine (VM) in VNet |
| On-premises | Virtual machines and cloud services deployed in virtual networks | ExpressRoute (Private Peering) | Gateway can be installed either on-premises or on an Azure VM in VNet |
| On-premises | Azure-based services that have a public endpoint | ExpressRoute (Public Peering) | Gateway must be installed on-premises |
The following images show the usage of Data Management Gateway for moving data between an on-premises database and Azure services using ExpressRoute and IPSec VPN (with Virtual Network):
Firewall configurations and whitelisting IP address of gateway
Firewall requirements for on-premises/private network
In an enterprise, a corporate firewall runs on the central router of the organization, and Windows Firewall runs as a daemon on the local machine on which the gateway is installed.
The following table provides outbound port and domain requirements for the corporate firewall.
| Domain names | Outbound ports | Description |
| --- | --- | --- |
| | 443, 80 | Required by the gateway to connect to data movement services in Data Factory |
| | 443 | Used by the gateway to connect to Azure Storage Account when you use the staged copy feature. |
| | 443 | Required by the gateway to connect to the Azure Data Factory service. |
| | 1433 | (OPTIONAL) needed when your destination is Azure SQL Database/Azure SQL Data Warehouse. Use the staged copy feature to copy data to Azure SQL Database/Azure SQL Data Warehouse without opening the port 1433. |
| | 443 | (OPTIONAL) needed when your destination is Azure Data Lake store |
You may have to manage ports or whitelist domains at the corporate firewall level as required by the respective data sources. This table uses only Azure SQL Database, Azure SQL Data Warehouse, and Azure Data Lake Store as examples.
The following table provides inbound port requirements for Windows Firewall.
| Inbound ports | Description |
| --- | --- |
| 8050 (TCP) | Required by the credential manager application to securely set credentials for on-premises data stores on the gateway. |
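One way to sanity-check the port requirements listed above is to attempt a TCP connection from the gateway machine. The following Python sketch uses only the standard library. The function names are hypothetical, and the host names you pass in are up to you, since the required domains depend on your environment. A successful TCP connection shows only that the port is reachable, not that the service behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) can be opened.

    The connect parameter exists so the network call can be replaced
    in tests; by default it is socket.create_connection.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
    conn.close()
    return True


def check_endpoints(endpoints, connect=socket.create_connection):
    """Probe each (host, port) pair and return a dict of results."""
    return {
        (host, port): can_connect(host, port, connect=connect)
        for host, port in endpoints
    }
```

For example, calling `check_endpoints` with your own host names and ports 443, 80, and 1433 would probe the outbound rules, while `can_connect("localhost", 8050)` run on the gateway machine would indicate whether anything is listening on the credential manager port.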
IP configurations/ whitelisting in data store
Some data stores in the cloud also require whitelisting the IP address of the machine that accesses them. Ensure that the IP address of the gateway machine is whitelisted or configured in the firewall appropriately.
The following cloud data stores require whitelisting of IP address of the gateway machine. Some of these data stores, by default, may not require whitelisting of the IP address.
Frequently asked questions
Question: Can the gateway be shared across different data factories? Answer: We do not support this feature yet. We are actively working on it.
Question: What are the port requirements for the gateway to work? Answer: The gateway makes HTTP-based connections to the open internet. Outbound ports 443 and 80 must be open for the gateway to make these connections. Open inbound port 8050 only at the machine level (not at the corporate firewall level) for the credential manager application. If Azure SQL Database or Azure SQL Data Warehouse is used as the source or destination, you need to open port 1433 as well. For more information, see the Firewall configurations and whitelisting IP address of gateway section.
Question: What are the certificate requirements for the gateway? Answer: The current gateway requires a certificate that the credential manager application uses to securely set data store credentials. This certificate is a self-signed certificate created and configured by the gateway setup. You can use your own TLS/SSL certificate instead. For more information, see the Click-once credentials manager app section.
For information about performance of copy activity, see Copy activity performance and tuning guide.