question

AntoineLeca-1299 avatar image
0 Votes"
AntoineLeca-1299 asked AntoineLeca-1299 published

How to split concatenated CSV into several sources

As input I have a (rather large) text file which is the concatenation of several CSV files, each having its own header line then all the data lines. As a rule the first column is the name of the "subCSV"; the first field of the to-be-header line looks like xxxx_CAB while the other left fields for the xxxx sub-table are xxxx_DAT.

A small extract would look like this (in reality there are about 22 sub-tables, some with thousands of lines, and up to 20 fields):

 FACT_CAB    CICLO    FACT_NO    FECHA_EMISION
 FACT_DAT    20200922    CI0916783512    20200922
 RCFE_CAB    FACT_NO    CAT_CONCEP    TIPO_CONCEP
 RCFE_DAT    CI0916783512    Cuotas    Voz
 RCFE_DAT    CI0916783512    Cuotas    Internet y Datos
 RCFE_DAT    CI0916783512    Consumo    Voz

I would like to transform all of this into several "sources" (or tables, or queries.) And I do not know how to start with this in M.

I have read about reading all the lines then some functions like Text.StartsWith()/Text.Split()/Text.Select(), but I see it as inefficient if I should repeat the same stuff 22 times; plus dealing with lacking sub-tables, or a newly added sub-table and the like.
Is there some smarter way to deal with that?

power-query-not-supported
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@AntoineLeca-1299

First important thing to remember/to understand is you cannot get 22 output from a query. 1 query = 1 output, otherwise VBA must be involved. This is true, no doubt at all, when talking about Excel/PQ. I have no idea if PowerBI allows getting several output with a single query in a way or another

In principal there's a smarter way to do things when you need to do the same things :) n times. In a nutshell, you do things once and "convert" your query as a function that you will call as many time as you need

That said, I'm afraid I cannot help you further than that if you cannot upload & share (i.e with OneDrive or any other file sharing service) a small representative sample of your data

Hope this makes sense & helps a bit in the meantime


0 Votes 0 ·

Thanks. Even it is not what I would like to hear, what you are writing makes sense.

Of course I never even thought on writing the same code 22 times; in fact when I am writing something twice I am already thinking where could I "factorize" operations (i.e. writing subprograms if you prefer); and just as in English, I can can consider thrice, but there are no words besides :-)
What I was writing was more about avoiding to have to deal with 22 queries, which will read 22 times the same content, filter it (to different subsets) and then perform similar steps: I believe the costs of all those full-size readings are going to be rather prohibitive.

As you rightly said, looks like VBA is a more obvious choice to perform that task.

0 Votes 0 ·

1 Answer

Lz-3068 avatar image
1 Vote"
Lz-3068 answered AntoineLeca-1299 published

Hi @AntoineLeca-1299

What I was writing was more about avoiding to have to deal with 22 queries, which will read 22 times the same content, filter it (to different subsets) and then perform similar steps: I believe the costs of all those full-size readings are going to be rather prohibitive

Not really. Actually once you get your big CSV in Power Query (I bypass a couple of required steps) you Group that big table by xxx_CAB, yyy_CAB... In your case you end-up with 22 nested table, something like:

116322-demo.png

and load that query as a Connection only - this is a kind of staging query. Then, with VBA (or manually) you access each nested table to load it as a new query/table on a sheet

Hope this clarify things a bit. Now, if this is a one time exercise and you know how to do it with VBA that's probably the best approach as doing what I describe above and then write some VBA to iterate over each nested table or doing it manually will necessarily take more time than a VBA script

Any question or need for clarification let me know


demo.png (8.5 KiB)
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Now it gets interesting.
In fact, it gets interesting on two counts!

First, because you are actually providing a peek at what I was after, particularly the Table content inside the so-called staging query (I understand what a staging query is and what it is about, but I am not proficient enough about groups, that I need to learn.)

Next and probably more importantly on short term because as you are saying, after that I would have to load the 22 tables on the workbook (or somewhere) in order to be able to access the content of the sub-tables later, with more PowerQueries.
And while I was thinking about it, I realized doing it with VBA is going to be pretty easy: one Workbooks.OpenText with tuned parameters, then going through all the left cells from top to bottom, at each change, selection of the rectangle (using .End(xlToRight) is perfect here) and .ListObjects.Add() to create an Excel “Table” object; rinse and repeat. At the end I have the same result as above: a bunch of Excel areas with the data, ready to be processed with PowerQueries; and I am pretty sure I shall be done with VBA far earlier than going the M way :-)

0 Votes 0 ·