Parquet format in Azure Data Factory

Follow this article when you want to parse Parquet files or write data into Parquet format.

Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the Parquet dataset.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to Parquet. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. | Yes |
| compressionCodec | The compression codec to use when writing to Parquet files. When reading from Parquet files, Data Factory automatically determines the compression codec based on the file metadata. Supported types are "none", "gzip", "snappy" (default), and "lzo". Note that Copy activity currently doesn't support LZO. | No |

Note

White space in column names is not supported for Parquet files.

Below is an example of a Parquet dataset on Azure Blob Storage:

{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder",
            },
            "compressionCodec": "snappy"
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Parquet source and sink.

Parquet as source

The following properties are supported in the copy activity *source* section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to ParquetSource. | Yes |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. | No |
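
For illustration, here is a minimal sketch of a copy activity *source* section, assuming Azure Blob Storage as the store; the read settings and wildcard pattern are placeholder examples, not required values:

"source": {
    "type": "ParquetSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.parquet"
    }
}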

Parquet as sink

The following properties are supported in the copy activity *sink* section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity sink must be set to ParquetSink. | Yes |
| storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See details in the connector article -> Copy activity properties section. | No |
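
Similarly, here is a minimal sketch of a copy activity *sink* section, again assuming Azure Blob Storage; copyBehavior is optional and shown only as an example:

"sink": {
    "type": "ParquetSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings",
        "copyBehavior": "PreserveHierarchy"
    }
}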

Mapping data flow properties

Learn details from the source transformation and sink transformation sections in mapping data flows.

Data type support

Complex Parquet data types (for example, MAP, LIST, STRUCT) are currently not supported.

Using Self-hosted Integration Runtime

Important

For copy empowered by the Self-hosted Integration Runtime (for example, between on-premises and cloud data stores), if you are not copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine. See the following paragraph for more details.

For copy running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if that's not found, it then checks the system variable JAVA_HOME for OpenJDK.

  • To use JRE: The 64-bit IR requires a 64-bit JRE. You can download it from the Java website.
  • To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.

Tip

If you copy data to/from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable _JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM, and then rerun the pipeline.

Set JVM heap size on Self-hosted IR

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for the Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM starts with Xms amount of memory and can use a maximum of Xmx amount of memory. By default, ADF uses a min of 64 MB and a max of 1 GB.

Next steps