Azure Data Factory - Self Hosted IR - Prevent Binary Import

Wolff Michael 1 Reputation point
2020-06-15T10:34:42.05+00:00

Requirement:
An internal compliance policy requires us to prevent binary data from being copied from the cloud to on-premises servers through Azure Data Factory.

Problem:
I have successfully set up an Azure Policy that blocks Binary datasets.
But besides Binary datasets, there are still other ways to move binary data through Azure Data Factory
(e.g. by mapping a binary source column into a binary column of a SQL Server table).
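
To make the column-mapping case concrete, here is a minimal sketch of a check that scans an exported pipeline definition for copy mappings that declare a binary type. The JSON shape (a `translator` with a `mappings` list of `source`/`sink` entries) is modeled on the copy activity's explicit mapping format, and the set of type names is an assumption to adjust. `find_binary_mappings` and the example pipeline are hypothetical, not part of any Azure SDK.

```python
from typing import Any, Dict, List

# Assumed type names that denote binary columns; adjust to your sources and sinks.
BINARY_TYPES = {"byte[]", "byte array", "binary", "varbinary", "image"}


def find_binary_mappings(pipeline: Dict[str, Any]) -> List[str]:
    """Return 'activity: source -> sink' for every explicit mapping with a binary type."""
    hits = []
    activities = pipeline.get("properties", {}).get("activities", [])
    for activity in activities:
        if activity.get("type") != "Copy":
            continue  # only copy activities carry a column mapping
        translator = activity.get("typeProperties", {}).get("translator", {})
        for mapping in translator.get("mappings", []):
            source = mapping.get("source", {})
            sink = mapping.get("sink", {})
            declared = {
                str(source.get("type", "")).lower(),
                str(sink.get("type", "")).lower(),
            }
            if declared & BINARY_TYPES:
                hits.append(
                    f"{activity.get('name')}: {source.get('name')} -> {sink.get('name')}"
                )
    return hits


# Hypothetical pipeline export used only for illustration.
example_pipeline = {
    "name": "CopyCloudToOnPrem",
    "properties": {
        "activities": [
            {
                "name": "CopyToOnPrem",
                "type": "Copy",
                "typeProperties": {
                    "translator": {
                        "type": "TabularTranslator",
                        "mappings": [
                            {"source": {"name": "Id", "type": "Int32"},
                             "sink": {"name": "Id"}},
                            {"source": {"name": "Payload", "type": "Byte[]"},
                             "sink": {"name": "Payload"}},
                        ],
                    }
                },
            }
        ]
    },
}

print(find_binary_mappings(example_pipeline))
```

A check like this only sees mappings that are written out explicitly with type information. A copy that relies on default column mapping would pass unnoticed, so it cannot be a complete control on its own.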

One idea would be to let the self-hosted integration runtime detect the transfer of binary data to on-premises systems and block it by some sort of rule.
Another idea would be to let Azure Data Factory itself detect the transfer of binary data and block it.
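
Either idea comes down to inspecting the content itself rather than the dataset type. As an illustration only, here is a minimal sketch of such a content check based on file signatures, NUL bytes, and UTF-8 validity. Where it would run (for example a custom validation step on a staging area) is an assumption; nothing here is a feature of Data Factory or the integration runtime. `sniff_binary` and the signature table are hypothetical.

```python
import codecs
from typing import Optional

# A few well-known file signatures (magic numbers); illustrative, not exhaustive.
# Short signatures such as b"MZ" can also match legitimate text that starts with
# the same characters, so expect false positives.
SIGNATURES = {
    b"MZ": "Windows executable (PE)",
    b"\x7fELF": "ELF executable",
    b"PK\x03\x04": "ZIP archive",
    b"%PDF": "PDF document",
}


def sniff_binary(head: bytes) -> Optional[str]:
    """Return a reason if the leading bytes of a file look binary, else None."""
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    if b"\x00" in head:
        return "contains NUL bytes"
    try:
        # final=False tolerates a multi-byte character cut off at the end of the sample.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "not valid UTF-8"
    return None


print(sniff_binary(b"id,name\n1,Anna\n"))
print(sniff_binary(b"\x7fELF\x02\x01\x01"))
```

This is a heuristic: it treats anything that is not clean UTF-8 text as suspicious, so text in other encodings would need its own handling, and binary data encoded as text (for example Base64) would not be caught.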

Question:
Is there any way to accomplish this, with or without Azure Policy?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. MartinJaffer-MSFT 26,011 Reputation points
    2020-06-24T22:10:26.563+00:00

    The difference between the Binary dataset and the other datasets (delimited text, Parquet, SQL, REST, JSON) is that Binary datasets do not attempt to parse the data. They just copy it as-is: no mapping, no schema, no data type. All other datasets try to parse the data so that you can then map it to the sink dataset columns.

    If you tried to push a compiled executable (binary) through the other dataset types, Data Factory would throw an error, because it can't parse the data into records.

    The Binary dataset is used to transport anything that cannot be parsed into records. (It also works on data that can.)

    I expect that what you are really trying to do is stop malware from entering through Data Factory. Please correct me if I am mistaken.

    The only other 'binary' I can think of besides the dataset is the data type, such as the one used in SQL. This type of 'binary' is safe unless your database has the ability to execute data as code, or to write it to disk for execution.
    @WolffMichael-0000,


  2. DvorakMichal 1 Reputation point
    2021-09-22T13:14:38.65+00:00

    Hello @MartinJaffer-MSFT ,

    I'm a colleague of @Wolff Michael, and we have reopened this topic in our company.
    You've written: "The difference between binary dataset and the other datasets ( delimited text, parquet, sql, rest, json ) , is that binary datasets do not attempt to parse the data. They just copy as-is, no mapping, no schema, no datatype. All other datasets try to parse the data so you can then map it to the sink dataset columns."

    Unfortunately, I've found that you can, for example, use two CSV datasets to copy binary data from one data lake to another. The data is not parsed. I therefore expect it to work the same way with the SHIR: you could use a CSV dataset to download binary files into internal file storage.

    Do you have any idea how to solve this challenge?

    Best regards,
    Michal Dvorak