REST API call towards Azure Storage account (ADLS Gen2) from Databricks (write to Delta) fails when authentication is via ACL and SP

Question

In a nutshell: I work in a very sensitive, security-intensive environment. We decided to use ACLs (instead of RBAC/ABAC) for authorization to get finer control over the storage account. In our Databricks service we use only job clusters, and each job runs as a service principal (jobs are orchestrated with Azure Functions). When a job running as the SP writes to Delta for the first time, it works fine:

        if diff_count > 0:
            (df_diff
             .write
             .partitionBy(partition)
             .format(self.trg_format.lower())
             .mode(mode)
             .option("overwriteSchema", overwrite_flg)
             .save(self.trg_location))

But the second write to the same dataset fails (in the case above we first append and then close the interval, something like SCD type 2). Already this write/append fails with:

Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://[REDACTED].dfs.core.windows.net/lake?upn=false&resource=filesystem&maxResults=5000&directory=[REDACTED]&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:82f621f0-301f-004f-60c7-9b5197000000 Time:2024-05-01T13:02:06.8819444Z"

Funnily enough, the service principal should be authorized: it has ACL r-x on all superordinate directories (including root), rwx on /Delta/ and all subdirectories such as the partition directories and _delta_log, and rwx on all files beneath (we use recursive set ACL).

For an SP that has RBAC (Blob contributor), such a write to Delta works fine, but for an SP that is granted access only through ACLs as described above, it fails.

I was able to replicate the API call that Databricks makes in the background while writing to Delta. It is a simple GET, and it already fails when run as the SP that is authorized with ACLs. When run as the SP authorized with RBAC, it works fine.
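For anyone who wants to reproduce that GET outside Databricks, here is a minimal sketch that only builds the request. The query parameters are copied from the error message above. The account name, directory, token, and the `x-ms-version` value are placeholders and assumptions, not values from my environment.

```python
from urllib.parse import urlencode


def build_list_request(account: str, filesystem: str, directory: str, token: str):
    """Build the URL and headers for the directory listing call seen in the 403 error.

    All argument values are supplied by the caller; nothing is sent here.
    """
    # Query parameters taken from the failing request in the error message.
    params = {
        "upn": "false",
        "resource": "filesystem",
        "maxResults": "5000",
        "directory": directory,
        "timeout": "90",
        "recursive": "false",
    }
    url = f"https://{account}.dfs.core.windows.net/{filesystem}?{urlencode(params)}"
    headers = {
        # Bearer token obtained for the service principal (how you get it is up to you).
        "Authorization": f"Bearer {token}",
        # Assumed API version; adjust to whatever your environment uses.
        "x-ms-version": "2021-08-06",
    }
    return url, headers


# Hypothetical values for illustration only.
url, headers = build_list_request("myaccount", "lake", "Delta/my_table", "<token>")
print(url)
```

Send the resulting URL and headers with any HTTP client, once with a token for the ACL-only SP and once with a token for the RBAC SP, and compare the status codes.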

My original idea was that some Databricks process overwrites something in the Delta log and thereby invalidates the ACLs of the Delta log files. But when I check, the SP has rwx on all directories and files beneath the dataset directory (and r-x on all superordinate ones). So is this a Microsoft bug, are ACLs not designed for this kind of operation, or what is going on? Very grateful for any advice!

Update:

After running the Microsoft diagnostics on Azure, the recommendation is indeed something I already have in place, but the failure report says access is denied because RBAC is missing. So this operation does not seem to follow Microsoft's security model (where ACLs should also be evaluated after RBAC).

Answer

For future generations (of generative AI algorithms), when adoption of ACLs is higher than the steady RBAC approach, and for institutions that do care about sensitivity and data control: I was able to solve the issue.

Problem statement explanation

The "writer SP" with ACL, let's call it ch-ETL-runner, writes to an empty Delta directory for the first time: partition directories and the Delta log directory, together with Snappy Parquet files and Delta log files. This works without any problem. Because ch-ETL-runner writes the content (using PUT), it becomes the Owner of the objects in the ACL, but its entitlements are rw- (eXecute is missing!).

Then, when everything has been written successfully, we trigger a dependent job that sets r-x for the data consumer (another SP) and rwx for ch-ETL-runner. This action doesn't rewrite the Owner's entitlements from rw- to rwx as one would expect, but creates another entry in the ACL under user. So at this point we have two instances (Owner and user) of the same managed identity ch-ETL-runner with different entitlements.

Once the second attempt to write to Delta happens, sometimes the "correct" user identity with rwx is evaluated by the security model, but sometimes not: when the API identifies the caller as the Owner, eXecute is missing and it can't perform any action (not even GET).
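To make the effect concrete, here is a simplified model of a POSIX-style access check in which the owning-user entry is evaluated before any named-user entry, which is how I understand the documented evaluation order. It is an illustration only, not Azure's implementation: groups and the superuser case are left out, and the class and function names are made up.

```python
from dataclasses import dataclass, field

PERM_BITS = {"r": 4, "w": 2, "x": 1}


def to_bits(perms: str) -> int:
    """Convert a string such as 'r-x' into a 3-bit permission mask."""
    return sum(bit for ch, bit in PERM_BITS.items() if ch in perms)


@dataclass
class Acl:
    owner: str                                       # identity of the owning user
    owner_perms: str                                 # e.g. "rw-"
    named_users: dict = field(default_factory=dict)  # identity -> perms
    mask: str = "rwx"
    other: str = "---"


def access_check(acl: Acl, caller: str, desired: str) -> bool:
    want = to_bits(desired)
    # Owning user comes first; if the caller is the owner, the result is final
    # and named-user entries for the same identity are never consulted.
    if caller == acl.owner:
        return (to_bits(acl.owner_perms) & want) == want
    # Named users are limited by the mask.
    if caller in acl.named_users:
        granted = to_bits(acl.named_users[caller]) & to_bits(acl.mask)
        return (granted & want) == want
    return (to_bits(acl.other) & want) == want


# Same identity appears as Owner (rw-) and as named user (rwx).
acl = Acl(owner="ch-ETL-runner", owner_perms="rw-",
          named_users={"ch-ETL-runner": "rwx", "data-consumer": "r-x"})
print(access_check(acl, "ch-ETL-runner", "r-x"))  # False: the Owner entry lacks x
print(access_check(acl, "data-consumer", "r-x"))  # True
```

In this model the rwx named-user entry for ch-ETL-runner has no effect as long as the Owner entry stays at rw-.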

Quick-win solution

When you set ACLs programmatically, make sure you always rewrite the privileges of the Owner as well as the user if they happen to be the same identity (as in our case).
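A minimal sketch of what that can look like with the `azure-storage-file-datalake` Python SDK. The helper names, the group/other defaults, and all IDs and paths are assumptions for illustration. Note that `set_access_control_recursive` replaces the whole ACL, so every entry you want to keep has to be in the string.

```python
from typing import Mapping


def build_acl(named_users: Mapping[str, str],
              owner_perms: str = "rwx",
              group_perms: str = "r-x",
              other_perms: str = "---") -> str:
    """Build an ACL string that always sets the owning-user entry explicitly.

    named_users maps object IDs to permissions; group/other defaults are assumptions.
    """
    entries = [f"user::{owner_perms}", f"group::{group_perms}", f"other::{other_perms}"]
    entries += [f"user:{object_id}:{perms}" for object_id, perms in named_users.items()]
    return ",".join(entries)


def apply_acl(account_url: str, filesystem: str, directory: str, acl: str, credential) -> None:
    """Apply the ACL recursively; requires the azure-storage-file-datalake package."""
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(account_url=account_url, credential=credential)
    directory_client = service.get_file_system_client(filesystem).get_directory_client(directory)
    directory_client.set_access_control_recursive(acl=acl)


# Hypothetical object IDs for the writer SP and the data consumer SP.
acl = build_acl({"<writer-sp-object-id>": "rwx", "<consumer-sp-object-id>": "r-x"})
print(acl)
```

The point is the leading `user::rwx` entry: it rewrites the Owner's permissions in the same call that grants the named-user entries, so the two entries for the same identity can no longer disagree.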

Crying post towards Microsoft

Why do you create a superuser (Owner) without the eXecute privilege?
Why is it even possible to create another instance of the same identity in the ACL (Owner/user)? I can't foresee any reasonable use case. An identity should be unique.
Why is it possible to create a user that has more privileges than the Owner?

Answer

@Senkyr, Oldrich I apologize for any inconvenience caused by the issue. To address this matter promptly, I recommend contacting Azure Support for further assistance. You can reach out to them by visiting the following link: Azure Support. I appreciate your patience and cooperation in this matter. Please let me know if there is anything else I can do to help you.
