Move data From ODBC data stores using Azure Data Factory

This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises ODBC data store. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

You can copy data from an ODBC data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently supports only moving data from an ODBC data store to other data stores, but not for moving data from other data stores to an ODBC data store.

Enabling connectivity

Data Factory service supports connecting to on-premises ODBC sources using the Data Management Gateway. See moving data between on-premises locations and cloud article to learn about Data Management Gateway and step-by-step instructions on setting up the gateway. Use the gateway to connect to an ODBC data store even if it is hosted in an Azure IaaS VM.

You can install the gateway on the same on-premises machine or the Azure VM as the ODBC data store. However, we recommend that you install the gateway on a separate machine/Azure IaaS VM to avoid resource contention and for better performance. When you install the gateway on a separate machine, the machine should be able to access the machine with the ODBC data store.

Apart from the Data Management Gateway, you also need to install the ODBC driver for the data store on the gateway machine.

Note

See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.

Getting started

You can create a pipeline with a copy activity that moves data from an ODBC data store by using different tools/APIs.

The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.

You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.

Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

  1. Create linked services to link input and output data stores to your data factory.
  2. Create datasets to represent input and output data for the copy operation.
  3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an ODBC data store, see JSON example: Copy data from ODBC data store to Azure Blob section of this article.

The following sections provide details about JSON properties that are used to define Data Factory entities specific to ODBC data store:

Linked service properties

The following table provides description for JSON elements specific to ODBC linked service.

Property Description Required
type The type property must be set to: OnPremisesOdbc Yes
connectionString The non-access credential portion of the connection string and an optional encrypted credential. See examples in the following sections. Yes
credential The access credential portion of the connection string specified in driver-specific property-value format. Example: “Uid=;Pwd=;RefreshToken=;”. No
authenticationType Type of authentication used to connect to the ODBC data store. Possible values are: Anonymous and Basic. Yes
username Specify user name if you are using Basic authentication. No
password Specify password for the user account you specified for the username. No
gatewayName Name of the gateway that the Data Factory service should use to connect to the ODBC data store. Yes

Using Basic authentication

{
    "name": "odbc",
    "properties":
    {
        "type": "OnPremisesOdbc",
        "typeProperties":
        {
            "authenticationType": "Basic",
            "connectionString": "Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;",
            "userName": "username",
            "password": "password",
            "gatewayName": "mygateway"
        }
    }
}

Using Basic authentication with encrypted credentials

You can encrypt the credentials using the New-AzureRMDataFactoryEncryptValue (1.0 version of Azure PowerShell) cmdlet or New-AzureDataFactoryEncryptValue (0.9 or earlier version of the Azure PowerShell).

{
    "name": "odbc",
    "properties":
    {
        "type": "OnPremisesOdbc",
        "typeProperties":
        {
            "authenticationType": "Basic",
            "connectionString": "Driver={SQL Server};Server=myserver.database.windows.net; Database=TestDatabase;;EncryptedCredential=eyJDb25uZWN0...........................",
            "gatewayName": "mygateway"
        }
    }
}

Using Anonymous authentication

{
    "name": "odbc",
    "properties":
    {
        "type": "OnPremisesOdbc",
        "typeProperties":
        {
            "authenticationType": "Anonymous",
            "connectionString": "Driver={SQL Server};Server={servername}.database.windows.net; Database=TestDatabase;",
            "credential": "UID={uid};PWD={pwd}",
            "gatewayName": "mygateway"
        }
    }
}

Dataset properties

For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).

The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes ODBC dataset) has the following properties

Property Description Required
tableName Name of the table in the ODBC data store. Yes

Copy activity properties

For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.

Properties available in the typeProperties section of the activity on the other hand vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks.

In copy activity, when source is of type RelationalSource (which includes ODBC), the following properties are available in typeProperties section:

Property Description Allowed values Required
query Use the custom query to read data. SQL query string. For example: select * from MyTable. Yes

JSON example: Copy data from ODBC data store to Azure Blob

This example provides JSON definitions that you can use to create a pipeline by using Azure portal or Visual Studio or Azure PowerShell. It shows how to copy data from an ODBC source to an Azure Blob Storage. However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory.

The sample has the following data factory entities:

  1. A linked service of type OnPremisesOdbc.
  2. A linked service of type AzureStorage.
  3. An input dataset of type RelationalTable.
  4. An output dataset of type AzureBlob.
  5. A pipeline with Copy Activity that uses RelationalSource and BlobSink.

The sample copies data from a query result in an ODBC data store to a blob every hour. The JSON properties used in these samples are described in sections following the samples.

As a first step, set up the data management gateway. The instructions are in the moving data between on-premises locations and cloud article.

ODBC linked service This example uses the Basic authentication. See ODBC linked service section for different types of authentication you can use.

{
    "name": "OnPremOdbcLinkedService",
    "properties":
    {
        "type": "OnPremisesOdbc",
        "typeProperties":
        {
            "authenticationType": "Basic",
            "connectionString": "Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;",
            "userName": "username",
            "password": "password",
            "gatewayName": "mygateway"
        }
    }
}

Azure Storage linked service

{
    "name": "AzureStorageLinkedService",
    "properties": {
    "type": "AzureStorage",
    "typeProperties": {
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
    }
}

ODBC input dataset

The sample assumes you have created a table “MyTable” in an ODBC database and it contains a column called “timestampcolumn” for time series data.

Setting “external”: ”true” informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.

{
    "name": "ODBCDataSet",
    "properties": {
        "published": false,
        "type": "RelationalTable",
        "linkedServiceName": "OnPremOdbcLinkedService",
        "typeProperties": {},
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true,
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}

Azure Blob output dataset

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.

{
    "name": "AzureBlobOdbcDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/odbc/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t"
            },
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                },
                {
                    "name": "Hour",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "HH"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Copy activity in a pipeline with ODBC source (RelationalSource) and Blob sink (BlobSink)

The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.

{
    "name": "CopyODBCToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "OdbcDataSet"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOdbcDataSet"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "OdbcToBlob"
            }
        ],
        "start": "2016-06-01T18:00:00Z",
        "end": "2016-06-01T19:00:00Z"
    }
}

Type mapping for ODBC

As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:

  1. Convert from native source types to .NET type
  2. Convert from .NET type to native sink type

When moving data from ODBC data stores, ODBC data types are mapped to .NET types as mentioned in the ODBC Data Type Mappings topic.

Map source to sink columns

To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory.

Repeatable read from relational sources

When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See Repeatable read from relational sources.

GE Historian store

You create an ODBC linked service to link a GE Proficy Historian (now GE Historian) data store to an Azure data factory as shown in the following example:

{
    "name": "HistorianLinkedService",
    "properties":
    {
        "type": "OnPremisesOdbc",
        "typeProperties":
        {
            "connectionString": "DSN=<name of the GE Historian store>",
            "gatewayName": "<gateway name>",
            "authenticationType": "Basic",
            "userName": "<user name>",
            "password": "<password>"
        }
    }
}

Install Data Management Gateway on an on-premises machine and register the gateway with the portal. The gateway installed on your on-premises computer uses the ODBC driver for GE Historian to connect to the GE Historian data store. Therefore, install the driver if it is not already installed on the gateway machine. See Enabling connectivity section for details.

Before you use the GE Historian store in a Data Factory solution, verify whether the gateway can connect to the data store using instructions in the next section.

Read the article from the beginning for a detailed overview of using ODBC data stores as source data stores in a copy operation.

Troubleshoot connectivity issues

To troubleshoot connection issues, use the Diagnostics tab of Data Management Gateway Configuration Manager.

  1. Launch Data Management Gateway Configuration Manager. You can either run "C:\Program Files\Microsoft Data Management Gateway\1.0\Shared\ConfigManager.exe" directly (or) search for Gateway to find a link to Microsoft Data Management Gateway application as shown in the following image.

    Search gateway

  2. Switch to the Diagnostics tab.

    Gateway diagnostics

  3. Select the type of data store (linked service).
  4. Specify authentication and enter credentials (or) enter connection string that is used to connect to the data store.
  5. Click Test connection to test the connection to the data store.

Performance and Tuning

See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.