Source transformation in mapping data flow
A source transformation configures your data source for the data flow. When you design a data flow, your first step is always configuring a source transformation. To add a source, select the Add Source box in the data flow canvas.
Every data flow requires at least one source transformation, but you can add as many sources as necessary to complete your data transformations. You can join those sources together with a join, lookup, or union transformation.
Each source transformation is associated with exactly one Data Factory dataset. The dataset defines the shape and location of the data you want to write to or read from. If using a file-based dataset, you can use wildcards and file lists in your source to work with more than one file at a time.
The first decision you make when creating a source transformation is whether your source information is defined inside a dataset object or within the source transformation. Most formats are available in only one or the other. To learn how to use a specific connector, see the appropriate connector documentation.
When a format is supported both inline and in a dataset object, there are benefits to each. Dataset objects are reusable entities that can be used in other data flows and in activities such as Copy. They're especially useful when you use a hardened schema. Datasets aren't based in Spark, so occasionally you might need to override certain settings or the schema projection in the source transformation.
Inline datasets are recommended when you use flexible schemas, one-off source instances, or parameterized sources. If your source is heavily parameterized, inline datasets let you avoid creating a "dummy" object. Inline datasets are based in Spark, and their properties are native to data flow.
To use an inline dataset, select the desired format in the Source type selector. Instead of selecting a source dataset, you select the linked service you wish to connect to.
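Whether the definition lives in a dataset object or inline, the configured source is expressed in the underlying data flow script as a source(...) step that names an output stream for downstream transformations. The following is a minimal sketch rather than a complete definition: the stream name MySource is a placeholder, and inline sources additionally carry their format and connector-specific properties in the script because there's no dataset object to hold them.

```
source(
    allowSchemaDrift: true,
    validateSchema: false) ~> MySource
```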
Supported inline dataset formats
For the list of formats that can be used inline, see the appropriate connector documentation.
Supported source datasets in mapping data flow
Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all in Azure. Currently the following datasets can be used in a source transformation:
- Azure Blob Storage (JSON, Avro, Text, Parquet)
- Azure Data Lake Storage Gen1 (JSON, Avro, Text, Parquet)
- Azure Data Lake Storage Gen2 (JSON, Avro, Text, Parquet)
- Azure Synapse Analytics
- Azure SQL Database
- Azure Cosmos DB
Settings specific to these connectors are located in the Source options tab. Information on these settings is located in the connector documentation.
Azure Data Factory has access to over 90 native connectors. To include data from those other sources in your data flow, use the Copy Activity to load that data into one of the supported staging areas.
Once you have added a source, configure it via the Source settings tab. Here you can pick or create the dataset your source points to. You can also select schema and sampling options for your data.
Test connection: Test whether the data flow's Spark service can successfully connect to the linked service used in your source dataset. Debug mode must be on for this feature to be enabled.
Schema drift: Schema drift is Data Factory's ability to natively handle flexible schemas in your data flows without needing to explicitly define column changes.
Check the Allow schema drift box if the source columns will change often. This setting allows all incoming source fields to flow through the transformations to the sink.
Choosing Infer drifted column types instructs Data Factory to detect and define data types for each new column discovered. With this feature turned off, all drifted columns are of type string.
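In data flow script terms, these checkboxes correspond to flags on the source. The sketch below is illustrative only; the stream name is a placeholder, and you should verify the exact property names against the script generated for your own data flow.

```
source(
    allowSchemaDrift: true,
    inferDriftedColumnTypes: true,
    validateSchema: false) ~> DriftingSource
```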
Validate schema: If validate schema is selected, the data flow will fail to run if the incoming source data doesn't match the defined schema of the dataset.
Skip line count: The skip line count field specifies how many lines to ignore at the beginning of the dataset.
Sampling: Enable sampling to limit the number of rows from your source. Use this setting when you test or sample data from your source for debugging purposes.
Multiline rows: Select Multiline rows if your source text file contains string values that span multiple lines, that is, values that contain newlines. This setting is only available in DelimitedText datasets.
To validate your source is configured correctly, turn on debug mode and fetch a data preview. For more information, see Debug mode.
When debug mode is turned on, the row limit configuration in debug settings overrides the sampling setting in the source during data preview.
Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the source data. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect the schema defined in a dataset. When your source files aren't strongly typed (for example, flat CSV files rather than Parquet files), you can define the data types for each field in the source transformation.
If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the data types. Select Define default format to autodetect the default data formats.
Reset schema resets the projection to what is defined in the referenced dataset.
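For a weakly typed source such as a delimited text file, the projection you define shows up as an output() block on the source in the data flow script. The following is a sketch for illustration only; the stream name, column names, and types are invented:

```
source(output(
        movieId as string,
        title as string,
        releaseDate as date
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> MoviesSource
```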
You can modify the column data types in a downstream derived-column transformation. Use a select transformation to modify the column names.
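As a rough sketch of that pattern in data flow script, assuming the MoviesSource stream from the preceding example: a derived column casts movieId with the toInteger() expression function, and a select maps columns to new names. The stream and column names are placeholders, and the rename syntax (output name on the left) reflects how generated script typically looks; treat the details as an assumption and check the script of your own data flow.

```
MoviesSource derive(movieId = toInteger(movieId)) ~> CastTypes
CastTypes select(mapColumn(
        id = movieId,
        movieTitle = title
    )) ~> RenameColumns
```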
The Import Schema button on the Projection tab allows you to use an active debug cluster to create a schema projection. Available in every source type, importing the schema here will override the projection defined in the dataset. The dataset object will not be changed.
This is useful in datasets like Avro and Azure Cosmos DB that support complex data structures and don't require schema definitions to exist in the dataset.
Optimize the source transformation
On the Optimize tab for the source transformation, you might see a Source partition type. This option is available only when your source is Azure SQL Database. This is because Data Factory can parallelize connections to run large queries against your SQL Database source.
You don't have to partition data on your SQL Database source, but partitions are useful for large queries. You can base your partition on a column or a query.
Use a column to partition data
From your source table, select a column to partition on. Also set the number of partitions.
Use a query to partition data
You can choose to partition the connections based on a query. Enter the contents of a WHERE predicate, for example, year > 1980.
For more information on optimization within mapping data flow, see the Optimize tab.