Get Metadata synapse activity returns wrong data type from parquet file

Nataliia Peresada 0 Reputation points
2024-04-22T11:35:21.6166667+00:00

When the Synapse pipeline creates an external table we need the column data types, so currently we're using the Get Metadata Synapse activity, which returns the column types from the Parquet file of the corresponding table.

 

For a couple of tables the data type returned by the Get Metadata activity is not correct: the database column is NUMBER, but the Get Metadata activity returns DECIMAL from the Parquet file metadata. The external table is therefore created with the wrong data type, and when the table is queried in the pipeline with OPENROWSET it throws an error such as:

 

Column 'Column name' of type 'BIGINT' is not compatible with external data type 'Parquet physical type: FIXED_LEN_BYTE_ARRAY, logical type: DECIMAL(38, 18)', please try with 'DECIMAL(38, 18)'. 

I tried to cast the columns as Number or Integer, but it didn't help.

 

Could you please suggest a solution to this problem?

Azure Synapse Analytics

1 answer

  1. phemanth 5,840 Reputation points Microsoft Vendor
    2024-04-24T10:28:09.5266667+00:00

    @Nataliia Peresada

    Here are two approaches to address the wrong data types in your Parquet files:

    1. Re-write the Parquet Files with Correct Data Types:

    If you have control over how the Parquet files are generated, you can rewrite them with the correct data types. This ensures future data ingestion also uses the proper format. Here's how to achieve this depending on your environment:

    • Spark: Use Spark SQL's cast function while writing the DataFrame to Parquet. Specify the desired data type (e.g., cast(col("column_name") as BIGINT)) before writing.
    • PySpark: Similar to Spark, use df.withColumn("column_name", col("column_name").cast("bigint")) before writing the DataFrame as Parquet.
    • Other Tools: Many Parquet writing tools offer options to specify data types during the writing process. Refer to the specific tool's documentation for details. If you'd rather stay in Synapse, a CETAS-based alternative is sketched just after this list.
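
    If Spark isn't part of your pipeline, the files can also be rewritten from the serverless SQL pool itself with a CETAS statement that reads the existing Parquet files, applies the cast, and writes new Parquet files. This isn't from the original answer, and the data source name MyDataSource, the file format ParquetFF, and the paths are placeholders for your own objects; it's a minimal sketch, and the data source credential must be allowed to write to the target folder.
    SQL
    CREATE EXTERNAL TABLE MyTable_Fixed
    WITH (
      LOCATION = 'curated/mytable/',     -- new folder for the rewritten files
      DATA_SOURCE = MyDataSource,        -- placeholder external data source
      FILE_FORMAT = ParquetFF            -- placeholder Parquet file format
    )
    AS
    SELECT CAST(col1 AS BIGINT) AS col1, col2   -- cast the mistyped column, keep the rest
    FROM OPENROWSET(
      BULK 'raw/mytable/*.parquet',
      DATA_SOURCE = 'MyDataSource',
      FORMAT = 'PARQUET'
    ) AS src;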

    2. Data Type Conversion During Query:

    If rewriting the Parquet files isn't feasible, you can convert the data types during the query itself in Synapse. Here are two options:

    • Drop and recreate the external table: This approach changes the external table definition permanently, so be cautious if other queries rely on the existing schema. Synapse doesn't support ALTER TABLE for changing an external table's columns, so drop the table and recreate it, declaring the column with the type the Parquet metadata actually reports (DECIMAL(38, 18) in this case) and casting it to BIGINT in the queries that need it. DATA_SOURCE and FILE_FORMAT refer to your existing external data source and Parquet file format objects.
    SQL
    DROP EXTERNAL TABLE MyExternalTable;

    CREATE EXTERNAL TABLE MyExternalTable (
      col1 DECIMAL(38, 18),   -- must match the Parquet logical type
      col2 VARCHAR(255)
      -- define all columns and their data types
    )
    WITH (
      LOCATION = 'path/to/parquet/files',
      DATA_SOURCE = MyDataSource,      -- your existing external data source
      FILE_FORMAT = MyParquetFormat    -- your existing Parquet file format
    );
    
    
    • CAST Function: This approach converts the data type within the query itself. It offers more flexibility but can impact performance for large datasets:
    SELECT CAST(col1 AS BIGINT) AS col1, col2, ...
    FROM MyExternalTable;
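
    If the pipeline queries the files directly with OPENROWSET (as in the error message), the same idea works there: declare the column with the Parquet type in the WITH clause and cast it in the SELECT. This is a minimal sketch; the path and the data source name MyDataSource are placeholders for your own objects.
    SQL
    SELECT CAST(src.col1 AS BIGINT) AS col1, src.col2
    FROM OPENROWSET(
      BULK 'path/to/parquet/files/*.parquet',
      DATA_SOURCE = 'MyDataSource',   -- placeholder external data source
      FORMAT = 'PARQUET'
    ) WITH (
      col1 DECIMAL(38, 18),           -- match the Parquet logical type
      col2 VARCHAR(255)
    ) AS src;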
    

    Hope this helps. Do let us know if you have any further queries.
