question

JeffvG-0844 avatar image
0 Votes"
JeffvG-0844 asked MartinJaffer-MSFT answered

SHA256 produces hashes for columns that don't exist

In both cases I am deriving a column like this, within a dataflow:

 sha2(256, byNames($hashcols))

I define hashcols as the parameter

 hashcols = ['FirstName','Date of Birth','Ingest_TS']

The Derive Column step in my dataflow produces hashes, as expected. Whether those are the actual SHA256 hashes seems in question.

That's because, If I change that param to be

 hashcols = ['FirstName','Date of Birth','Ingest_Date']

I still get hashed output, but different hashes - even though "Ingest_Date" doesn't exist as a column.

The same problem obtains if I hash the array directly:

 sha2(256, byNames(['FirstName','Date of Birth','Ingest_Date']))

How can I have any faith that the actual columns are being used to generate the output hash, and not some string or array of strings? The hashes vary across different inputs, so it doesn't seem like the strings are being used as constants each time - but still the problem remains that a non-existent column is used and producing a hash in some way, and that output varies across inputs. How would I paste the 3 values referenced by $hashcols into something like https://passwordsgenerator.net/sha256-hash-generator/ to verify the correctness of the output?








azure-data-factory
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hello @JeffvG-0844 and welcome to Microsoft Q&A. Please allow me to explain what you are seeing.

First, let me establish what happens when a non-existent column name is used in byName or byNames. When I did

 toString(byName('missing_column'))

the output was NULL

The output of

 sha2(256,['hello'])
    
 sha2(256,['hello',null()])

is different. So referencing a non-existent column is not the same as not referencing a column at all.

0 Votes 0 ·

it makes sense that an additional null() changes the hash because null is two bytes (0xC0, 0x80). A column that doesn't exist is definitively zero bytes and therefore should not change the hash at all. The only additional expected behavior I would argue is acceptable here is failure to even validate, let alone debug or trigger. 'Ingest_Date' or any column that doesn't exist (by name) should error on validation of calls to byName function, and not be allowed to publish, in my opinion.

But since it must stay this way for now, how can I guarantee that the sha256 is generating the correct output? For example, if I have sha2(256,['a','b','c']) I would expect to have an output of
BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD because that is the hash of abc - is this thinking incorrect? I realize there may need to be some toStrings around the values or maybe the array adds something to the hash, as well?

0 Votes 0 ·
Kiran-MSFT avatar image
1 Vote"
Kiran-MSFT answered MartinJaffer-MSFT converted comment to answer

byNames will return null if a column does not exist. nulls do count towards hash computations and are not ignored. It is not a concatenated string representation of all column values but more of a byte representation of those values.

· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

So how can I produce a sha256 hash in Data Flows that is reproducible outside of Data Flows?

0 Votes 0 ·

Thank you for confirming that any missing column causes the entire byNames to fail. I thought I saw that happening, but between other issues, I wasn't sure.

0 Votes 0 ·

Does this mean that in order for someone else to recreate the hash even with fields they can see, they may need to add the two null bytes to their call to sha256? I need a hashing function that outputs reproducible hashes for de-identification of healthcare data.

0 Votes 0 ·
MartinJaffer-MSFT avatar image
0 Votes"
MartinJaffer-MSFT answered

The original problem was that it was trying to work with bad column names, right @JeffvG-0844 ?

How about this expression to scrub out all column names which do not exist in a stream?

 mapIf($badnames, in(columnNames('source1'),#item), #item)

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.