question

Xin-9239 avatar image
1 Vote"
Xin-9239 asked Xin-9239 commented

Power Query lookup with "contain" condition on large dataset

Hi there,

Just wondering if anyone could assist me on the below exercise:

  1. Table A has only Clean ID (123456789) or (414141)

  2. Table B has column1, complex ID (**123456789*abc5489 or qwe*414141) and column 2 which is result I want to return (e.g. Rain, Sunny etc)

  3. Would like to perform "vlookup" for Table A and search for items in table B column 1 only rows that contain EXACT string in table A and, if in such case, return value in column 2

  4. So effective would see
    ID..........................Results
    123456789.........Rain
    414141.................Sunny

  5. So far have tried:
    a. standard merge - doesn't work as not exact match
    b. merge with fuzzy match - doesnt work even threshold is 0.1, just doesnt return good results
    c. List contain - doesnt work as it will list final table A each row so many times and return wrong values for some very weird reason..
    d. interestingly, xlookup with ""&column A&"" does work .. but the data set 1 is around 1.5million rows and data set 2 is over 10millions rows so cant perform such xlookup

Highly appreciate for any suggestions!!

power-query-not-supported
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi @Xin-9239

Given the challenge with >10M rows, if it was me, I would provide more info., i.e.:
- Provide at least 20 different representative [Complex ID]
- Clarify if only the 1st series (or more) of digits should be found/kept (there are 2 in "**123456789*abc5489")
- Indicate where the data is stored (DBRMS, CSV...) as this might open doors you didn't think about
- Mention if PowerBI Desktop (more functionnalities than Power Query) could be an option
...

Hope this makes sense

1 Vote 1 ·

Hi, thanks for your reply.
1. some of the example IDs (bold are the ones that match column A clean ID):
-actual ID after "-", such as 1234OM-123456789
-actual ID before "." such as 123456789.25.1234
-actual ID itself such as 123456789
-actual ID before character such as "**123456789**opn2e2"
I hope this clarifies, but anyway the idea is table 2's complex would have the same text string as column A but just either have more characters or special characters like dash

  1. That's a good Q, I should have be more specific. Generally (99.9%) the case, even we see things like 123456789abc89, we would expect column A text string completely appear in 2nd dataset (i.e. the bit in front of abc). Ideally we wouldnt care if "89" appear as we would only want exact match of table A's 123456789 not part of it.

  2. Data A is excel but can be saved as any format. Table B is tsv actually but 1.8GB large and 12m rows


0 Votes 0 ·

Also PowerBI is an option but eventually we'd like the data to be in Excel for further processing (while I know PowerBI can export table and allow copy and paste but Excell seems one less step).
Given Table A is ALWAYS below 1 million rows, we should be able to load to table and copy and paste away.

Again thanks for your help

0 Votes 0 ·
Lz-3068 avatar image
0 Votes"
Lz-3068 answered Xin-9239 commented

Hi @Xin-9239

Technically the method is still try to cleanse table B to "exact" match with table A, just wondering if there is any function that would perform "contain" search (i.e. sub text string of B matches A). Not sure if that would drag the performance by too much?
Any existing or custom function involving something like "contains" can not work. As an example, if TableA has [Clean ID] 12345 and somewhere in TableB a [Complex ID] is something like "**123456789*abc5489", 12345 would be found in that [Complex ID] and what you want is an EXACT match

Just wondering I can see a fx-rephrase being added. Therefore I assume this is to cleanse the ID in table B before can be looked up?
Not sure where you found fx-rephrase in what I uploaded. The function I created was named fxParseComplexID. If that's what you talk about, then Yes that's what the function does, inside query UnMatchedComplexClean


I think it works well so far but just also thinking there are probably 10,20 different ways how column B ID can be presented therefore shall I just keep adding cleanse methods?
- I told you to upload at least 20 different and representative [Complex ID]. You did not, so I did with the few example you provided
- Sorry but I don't really understand, especially "shall I just keep adding cleanse methods". Please clarify

· 5
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thank you very much for your reply. Just I was thinking within xlookup or vlookup, there is a way to write the formula like

111555-image.png

which basically allows me to lookup for same string as A1 cell but allow front and after to have some other text string, but didn't seem to find such similar lookup function in PQ? Basically trying to achieve lookup but allow front or after has other characters as long as part of the text string is A1.

Sorry, I didnot want to take too much of your time therefore just more asking (or exploring) for a possible approach like above rather than having to trouble you write up everything :) . I think if the above isnt possible in PQ for 1m column A + 12m column B then possibly we can break column B to like 12 different columns and perform such lookup 12 times in normal excel (just could be really slow and inefficient).

0 Votes 0 ·
image.png (2.0 KiB)

Hi @Xin-9239
I didn't come back on that point where you initially said this was working with XLOOKUP and wilcards. There's obvioulsy something I don't understand about your data. From what I understood re. what you wanted to acheive and with the few [Complex ID] you submitted, for me this can't work

111568-demo.png
in C5 =VLOOKUP("" & A2 & "", C2:D3, 2, FALSE )

Given that you want an exact/perfect match, with the above example "World" should be return. But again I don't have a view on the actual data

Anyway. There isn't (as of today) any PQ function that allows using wildcards as above. Hence my approach with a custom function


0 Votes 0 ·
demo.png (15.8 KiB)

"shall I just keep adding cleanse methods" - I had a look at the fxParseComplexID formula and it seems to cleanse the data depends on how it's arranged? Then I just meant if I wan it to perform data cleansing over 20 possible forms of text string then just keep adding rows to structure different ways to cleanse different data form then?

0 Votes 0 ·

Then I just meant if I wan it to perform data cleansing over 20 possible forms of text string then just keep adding rows to structure different ways to cleanse different data form then?
Again, it's not top clear to me - sorry. From what I seem to understand I would say Yes, but again that's based on what I think I understand

0 Votes 0 ·

Apologies for the delay reply. Yes we did take the approach as in clean the ID in multiple ways!

Thank you so much for your help!

0 Votes 0 ·
Lz-3068 avatar image
0 Votes"
Lz-3068 answered Xin-9239 commented

Hi @Xin-9239

Based on the very few examples of [Complex ID] you provided the query in this workbook does here

Test
I created a CSV of 1 Million of rows to test the performances of the Parsing function and in average the latter did the job in 12s (Intel Core I7 + SSD), including the preceeding Table.ReplaceValue ("*" with "-")

Settings
In the Query Options/CURRENT WORKBOOK
- Unchecked "Allow data previews to download in the background"
- Seelected option "Ignore the Privacy Levels and potentially improve performance"
If by the greatest of chance the approach does +/- what you expect, before applying it in real conditions, ensure you use the above settings

Suggestion
You intially said "data set 1 is around 1.5million rows", later you said "Given Table A is ALWAYS below 1 million rows"... In the event where some [Clean ID] would match more than one record (as in my example with [Clean ID] 123456789) in your TableB you'll likely exceed the max. number of rows of an Excel worksheet ==> Better you load to the Data Model (you can export tables from the Data Model as CSV with DAX Studio if necessary) from where you can setup a flattened Pivot Table that will be the same as the Table returned by the query


· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi @Xin-9239. How's your testing goind with the above proposal?

0 Votes 0 ·

Hi there, thanks so much for your help!

Just wondering I can see a fx-rephrase being added. Therefore I assume this is to cleanse the ID in table B before can be looked up? I think it works well so far but just also thinking there are probably 10,20 different ways how column B ID can be presented therefore shall I just keep adding cleanse methods?

Technically the method is still try to cleanse table B to "exact" match with table A, just wondering if there is any function that would perform "contain" search (i.e. sub text string of B matches A). Not sure if that would drag the performance by too much?

0 Votes 0 ·
Shinka-3536 avatar image
1 Vote"
Shinka-3536 answered Xin-9239 commented

Hi @Xin-9239,

As long as CleanID in Table A is always numbers, then I'd suggest trying the following (always in PQ):

1) in Table B add a costum column with the follwing formula: Text.Select([ComplexID], {"0".."9"}). This formula will extract only the numbers, ignoring everything else, into the new column.
2) You should then have [NewColumn1], the the CleanID, and [Column2], the info you want, in the same table.

Hope that works.

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thanks so much. We are trying that as well, the only prob is ID is not always just number so quite annoying dataset we have

0 Votes 0 ·