Synapse updating the data lake json file?

ever 1 Reputation point
2020-07-16T20:01:09.163+00:00

is it possible to update the json file from synapse? After reading, i need to update the json rows. When i tried to update/delete the json through external table, it's error out with, won't support the DML operation. Kindly advise how to achieve this?

Azure SQL Database
Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,352 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,405 questions
{count} votes

2 answers

Sort by: Most helpful
  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2020-07-20T20:58:41.833+00:00

    Hello @ever ,

    Thanks for the question and using the forum.

    There are two major reasons why you cannot apply a DML against external tables today:

    1. ADLS Gen2 is HDFS based. HDFS files are considered immutable once you have written them. This is so core to the platform, that processing engines like Spark and Hive have designed new “delta” formats such as Delta (for Parquet) and ACID (for ORC) to enable DML statements on the data. Note that these models are adding a layer to the core formats to enable this. So even if the second reason would not apply, this makes it impossible to just replace some structure or format efficiently inside a JSON file stored in HDFS. At most you can insert, if the insertion can be mapped to just adding a new file (which is not possible for all formats, see reason 2)
    2. Mapping a data format that is not designed to represent relational data, such as JSON, XML or images (to make an extreme example) into a relational representation is often non-reversible.

    Thus, even if we have a connector for mapping arbitrary JSON for example into a table, the mapping of the JSON into the flat rows is not always reversible. Thus, these external tables will disallow updates.

    You may say that your specific JSON file and mapping is reversible, e.g., the JSON is an array of flat structures, but the general framework does not know that and has to assume that the mapping may have been more complicated. For example, just adding a new row into a hierarchical JSON document cannot be done by adding a new file, since you would have to know how to identify the insertion point at a specific hierarchy location.

    The solutions are:

    1. You update the JSON documents using transformations: Read the file, change it and write out a new version of the file.
    2. You use a more appropriate format for the data (e.g., Delta, Parquet, CSV/TSV) where DML may be possible.
    3. Combine 1 and 2 by using the more appropriate format during the change step.

    Thanks & stay safe

    Himanshu

    Please do consider to click on "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members

    1 person found this answer helpful.
    0 comments No comments

  2. Girijesh Pandey 1 Reputation point
    2020-07-29T15:16:33.61+00:00

    Thanks @HimanshuSinha-msft , this is quite informative and yes you are right in saying the fact that ADGLS Gen2 is HDFS based and files are mostly immutable once data is written in them.

    0 comments No comments