I have a Databricks job writing to Synapse, that I'm migrating off to use Polybase so that the writes are more performant.
One of the tables the job is writing to has an IDENTITY column.
Imagine that the table has the following DDL:
CREATE TABLE dbo.dummy_table (
generated_id int IDENTITY(1,1) NOT NULL,
other_column VARCHAR(128) NOT NULL
);
I do not want to write to this column - I want to let Synapse generate the identity. My Databricks job only attempts to write a dataframe that only has "other_column".
The Polybase write, however, fails with "An explicit value for the identity column in table 'dummy_table' can only be specified when a column list is used and IDENTITY_INSERT is ON.".
The code to write goes like this using the Synapse connector:
df.write \
.format("com.databricks.spark.sqldw") \
.mode("append") \
.option("url", "jdbc:sqlserver://host:1433;database=db_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
.option("dbTable", "dummy_table") \
.option("tempDir",
f"wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
.option("user", "jdbcuser") \
.option("password", "jdbcPassword") \
.option("truncate", "true") \
.option("forwardSparkAzureStorageCredentials", "true") \
.save()
On a closer look, I realized that the COPY INTO query that the connector executes against Synapse actually has an option to specify column list (https://docs.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#column_list). That option works:
COPY INTO "dummy_table" (other_column 2)
FROM 'wasbs://container@storageaccount.blob.core.windows.net/tempdir/2022-03-29/xxxx/yyyy/'
WITH
( FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '')
,COMPRESSION = 'Snappy',
MAXERRORS = 0)
OPTION (LABEL = 'Databricks Batch Load; Container Build c03ea50: "dummy_id"');
Questions:
How can I do this with the Databricks Synapse connector?
If it is not possible, how else can I use Polybase/COPY to write to a table that has an IDENTITY column from Databricks?