question

0 Votes
SQLWhisper-8590 asked ·

Split pipe delimited columns in TSQL

What's the best way to split pipe-delimited columns in SQL Server 2017?

Input:

 declare @table table (sno int,sname varchar(200),sname1 varchar(200))
    
 Insert into @Table(sno,sname,sname1)
 Values (1,'Mark|James','Dallas|Houston')

Expected Output:

1 Mark Dallas
1 James Houston

Thank you.

sql-server-transact-sql

Hi @SQLWhisper-8590,

Your post got quite a lot of traction.
Did you have a chance to try at least a couple of the proposed solutions you received?

0 Votes 0 ·

Hi @SQLWhisper-8590,

Could you please validate all the answers so far and provide any update?

Please remember to accept the answers if they helped. Your action would be helpful to other users who encounter the same issue and read this thread. 

Thank you for understanding!


Best regards
Melissa

0 Votes 0 ·
0 Votes
pituach answered ·

Good day,

As of SQL Server 2016 you have a built-in function named STRING_SPLIT.
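
A minimal illustration on one of the values from the question (note that, as shown in the comments below, calling it on each column independently does not keep the two lists paired):

 -- split a single pipe-delimited string into rows
 SELECT value
 FROM   STRING_SPLIT('Mark|James', '|');
 -- returns two rows: Mark, James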

With that being said, this question suggests a very bad design of your relational database, and you should really rethink your database design and/or the type of database you chose to use.


Thanks Pituach for the response. Unfortunately, we receive this data in this format from a vendor. Our hands are tied with regard to the file format.

STRING_SPLIT does not produce the expected results.



 declare @table table (sno int,sname varchar(200),sname1 varchar(200))
    
 Insert into @Table(sno,sname,sname1)
 Values (1,'Mark|James','Dallas|Houston')
    
 Select sno,sname.value as sname,sname1.value as sname1
 from @Table 
      cross apply string_split(sname,'|') as sname
      cross apply string_split(sname1,'|') as sname1
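      -- note: the two CROSS APPLYs give every name with every city (2 x 2 = 4 rows), so the values are not paired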


34833-capture.jpg


0 Votes 0 ·

The limitation of STRING_SPLIT is that the order is not guaranteed... We might need to use JSON or XML for the split here (in this case a SQLCLR function is probably your best option).

Let's clarify the situation better...

(1) Is the number of names in each value always 2 or can it be more?
(2) Is the number of names always the same in the sname and sname1 columns?
(3) Is the total number of names in the same value guaranteed to be less than 4?

Can all of these exist, for example:

Values (1,'A|B|C','A|B|C')
Values (1,'A|null|C','A|B|C')
Values (1,'A|B|C','A|B')
Values (1,'A|B|C','A|B|')
Values (1,'A|B','A|B|C')
Values (1,'A|B|C|D|E','A|B|C|D|E')




0 Votes 0 ·

Thanks pituach.

  1. Yes, the number of values can be more than 2

  2. No, the values in the sname and sname1 fields are not the same.

  3. No, it can be higher.


The number of pipe-delimited values in sname and sname1 will be the same. There can be duplicate values in the pipe-delimited string.

Values (1,'A|B|C','A|B|C') -- possible
Values (1,'A|null|C','A|B|C') -- not possible - no nulls are expected
Values (1,'A|B|C','A|B') -- not possible - the number of delimited values will be the same in sname and sname1
Values (1,'A|B|C','A|B|') -- not possible
Values (1,'A|B','A|B|C') -- not possible
Values (1,'A|B|C|D|E','A|B|C|D|E') -- possible



0 Votes 0 ·
0 Votes
YitzhakKhabinsky-0887 answered ·

Hi @SQLWhisper-8590,

As @pituach already pointed out, it is better to use XML (or JSON) for such cases.

Please try the following solution that is based on XML and XQuery.
It is a two-step process:

  1. A CTE composes a proper XML document.

  2. A SELECT produces the desired relational/rectangular output.

The CTE emits XML structured as follows:

 <root>
   <r name="Mark" city="Dallas" />
   <r name="James" city="Houston" />
 </root>

SQL

 -- DDL and sample data population, start
 DECLARE @table table (sno INT, sname VARCHAR(200), sname1 VARCHAR(200))
 INSERT INTO @Table (sno,sname,sname1) VALUES 
 (1,'Mark|James','Dallas|Houston'),
 (2,'Mary|Katy|Paula','Miami|Fort Lauderdale|Orlando'); -- # of tokens could be dynamic
 -- DDL and sample data population, end
    
 DECLARE @separator CHAR(1) = '|';
    
 ;WITH rs AS
 (
  SELECT sno
  , TRY_CAST('<root><r>' + 
  REPLACE(sname + @separator + sname1, @separator, '</r><r>') + 
  '</r></root>' AS XML).query('<root>
  {
     let $counter := count(/root/r) div 2
     for $x in /root/r[position() le $counter]
     let $pos := count(/root/r[. << $x[1]]) + 1
     return <r name="{$x/text()}" city="{/root/r[$pos + $counter]/text()}"></r>
  }
  </root>') AS xmldata
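  -- $counter is half the node count; the name at position $pos is paired with the city at position $pos + $counter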
  FROM @table
 )
 SELECT rs.sno
  , n.value('@name','VARCHAR(30)') AS [name]
  , n.value('@city','VARCHAR(30)') AS [city]
 FROM rs 
  CROSS APPLY xmldata.nodes('/root/r') AS t(n)
 ORDER BY rs.sno;

Output

 +-----+-------+-----------------+
 | sno | name  |      city       |
 +-----+-------+-----------------+
 |   1 | Mark  | Dallas          |
 |   1 | James | Houston         |
 |   2 | Mary  | Miami           |
 |   2 | Katy  | Fort Lauderdale |
 |   2 | Paula | Orlando         |
 +-----+-------+-----------------+

0 Votes
pituach answered ·

Hi :-)

According to your answers to my clarification questions, the following should solve your need (please confirm).



First, I create a table for testing with more rows and options than the one you provided.

 DROP TABLE IF EXISTS T1
 GO
 CREATE table T1 (sno int,sname varchar(200),sname1 varchar(200))
 GO
    
 Insert into T1(sno,sname,sname1)
  Values 
  (1,'1A','1a'), (1,'2A|2B','2a|2b'), 
  (1,'3A|3B|3C','3a|3b|3c'), (1,'4A|4B|4C|4D','4a|4b|4c|4d')
 GO
    
 SELECT * FROM T1
 GO

34786-image.png


And here is the solution


 ;With MyCTE0 as (
  select sno, sname,sname1
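  -- wrap each pipe-delimited string as a JSON array, e.g. 'Mark|James' becomes ["Mark","James"]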
  ,s = '["' + Replace(sname, '|', '","') + '"]'
  ,s1 = '["' + Replace(sname1, '|', '","') + '"]'
  from T1
 ),
 MyCTE1 as (
  select sno, s, s1
  , k1 = t.[key], v1 = t.[value]
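  -- OPENJSON's [key] holds the zero-based position within the array; it is used below to pair s1 by position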
  from MyCTE0
  CROSS APPLY OPENJSON (s, N'$') t
 )
 SELECT sno, v1 , v2
 FROM MyCTE1
 CROSS APPLY (SELECT t1.[key] k2 , t1.[value] v2 FROM OPENJSON (s1, N'$') t1 where t1.[key] = MyCTE1.k1) t
 GO


34757-image.png




0 Votes
ErlandSommarskog answered ·

Here is a solution that does not use XML or JSON (which both can result in problems if the data includes characters with a special function in these formats):

 declare @Table table (sno int,sname varchar(200),sname1 varchar(200))
    
 Insert into @Table(sno,sname,sname1)
 Values (1,'Mark|James','Dallas|Houston')
    
 SELECT t.sno, s.str + ' ' + s1.str
 FROM   @Table t
 CROSS  APPLY iter_charlist_to_tbl(t.sname, '|') AS s
 CROSS  APPLY iter_charlist_to_tbl(t.sname1, '|') AS s1
 WHERE  s.listpos = s1.listpos

You find the code for iter_charlist_to_tbl here: http://www.sommarskog.se/arrays-in-sql-2005.html#iter-list-of-strings
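
For readers who do not want to click through: the idea is a splitter that returns each element together with its position, so that the two columns can be paired on that position. The following is only a rough sketch of such a loop-based function, not Erland's original from the article; the column names listpos, str and nstr simply mirror how the function is used in this thread.

 -- Sketch only; the real iter_charlist_to_tbl is in the linked article.
 CREATE FUNCTION dbo.iter_charlist_to_tbl
                 (@list nvarchar(MAX), @delimiter nchar(1))
 RETURNS @tbl TABLE (listpos int IDENTITY(1,1) NOT NULL,
                     str     varchar(4000)     NOT NULL,
                     nstr    nvarchar(2000)    NOT NULL) AS
 BEGIN
    DECLARE @pos      int = 1,
            @nextpos  int,
            @valuelen int
    
    WHILE @pos <= len(@list)
    BEGIN
       -- find the next delimiter (or pretend there is one just past the end)
       SELECT @nextpos = charindex(@delimiter, @list, @pos)
       IF @nextpos = 0
          SELECT @nextpos = len(@list) + 1
    
       SELECT @valuelen = @nextpos - @pos
       INSERT @tbl (str, nstr)
          VALUES (substring(@list, @pos, @valuelen),
                  substring(@list, @pos, @valuelen))
       SELECT @pos = @nextpos + 1
    END
    RETURN
 END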


This solution is based on loops, and even though we cannot always rely on comparing relative execution costs, I think that at first glance the image says something: 9% cost for my solution vs 91% using your function.

Yes, Erland is right, there might be limitations of XML/JSON solutions which should be considered in each case (for example, problematic characters which might need to be replaced first).

34737-image.png



0 Votes 0 ·

That is true; that is not the fastest string-split function there is. There are faster ones in the article I linked to. But this is the easiest one to set up, and I would suggest that you need a lot of data for the speed of the function to really matter.

And you should know better than comparing the estimated percentages in SSMS. They tell us nothing.

0 Votes 0 ·

34842-image.png

0 Votes 0 ·

Here is a solution that does not use XML or JSON (which both can result in problems if the data includes characters with a special function in these formats)

Obviously, it is a legitimate concern. And of course, I am using a CDATA section in the T-SQL answer on my computer. Unfortunately, this site doesn't allow their inclusion in answers. I complained about it many times in the feedback, to no avail.

34810-cdata.jpg
0 Votes 0 ·
Viorel-1 replied to YitzhakKhabinsky-0887 ·

The CDATA section helps in case of ‘a<b|c’, but some minor adjustments seem required to make it work in this rare case too: ‘a]]>b|c’.

Although this is not a subject for the present question.



0 Votes 0 ·


Here is a solution that does not use XML or JSON (which both can result in problems if the data includes characters with a special function in these formats) […]

I think that splitting functions can also result in problems if the separator is allowed inside the values using some special rules, like representing ‘|’ as ‘\|’ and ‘\’ as ‘\\’ when ‘|’ is a part of values.

Although this is not a subject for the present question.

0 Votes 0 ·

You are absolutely correct in that regard. However, one will have to assume that the vendor who specified this format never puts pipes into the data. If they did, their customers would be at a loss anyhow.

But when you start to introduce methods such as XML or JSON that the vendor did not have in mind, you are taking a bigger risk.

On the other hand, if I were to design a format for information interchange, I would absolutely go for XML or JSON from the start, exactly because they have methods to deal with meta-characters in the data. And this particular format that SQLWhisper is a victim of is really lousy, since it consists of two strings that must be kept in sync. It is certainly very brittle.

0 Votes 0 ·
0 Votes
ErlandSommarskog answered ·

Yet another solution. This one is based on a performance test I did, where I tried the solutions posted by Yitzhak, Ronen and me with a fairly large data set. I gave them all ten minutes to run before I killed them. Looking at the plans, Ronen's is probably the one that would have completed first, but in how many hours, I don't know.

While I was waiting, I wrote a new solution which completed in three minutes. With a faster string splitter the time can be reduced further. But the key is that the two strings need to be split separately, so that we can join over an index.

SQLWhisper will have to accept that the names have changed: id = sno, ShipName = sname, ShipAddress = sname1.

 CREATE TABLE #t1 (id  int NOT NULL,
                   pos int NOT NULL,
                   ShipName nvarchar(60) NOT NULL,
                   PRIMARY KEY (id, pos))
 CREATE TABLE #t2 (id  int NOT NULL,
                   pos int NOT NULL,
                   ShipAddress nvarchar(60) NOT NULL,
                   PRIMARY KEY (id, pos))
    
 INSERT #t1(id, pos, ShipName)
   SELECT t.id, s.listpos, s.nstr
   FROM   PipeData t
   CROSS  APPLY iter_charlist_to_tbl(t.ShipName, '|') AS s
    
 INSERT #t2(id, pos, ShipAddress)
   SELECT t.id, s.listpos, s.nstr
   FROM   PipeData t
   CROSS  APPLY iter_charlist_to_tbl(t.ShipAddress, '|') AS s
    
 SELECT t1.id, t1.ShipName + ' ' + t2.ShipAddress as str
 FROM   #t1 t1
 JOIN   #t2 t2 ON t1.id = t2.id
              AND t1.pos = t2.pos


For the curious, my test script is here: 34739-pipedatasplt.txt. The BigDB database that I load the data from is on http://www.sommarskog.se/present/BigDB.bak. Warning! This is a 3GB download, and the full database size is 20 GB. SQL 2016 or later is needed. (This is a demo database that I had uploaded already.)





Hi @ErlandSommarskog

The only reason your query is fast is that you used temporary tables; it has nothing to do with your solution (your function).

With such a table and such an amount of data it is simply better to use two temporary tables and then JOIN.

I executed your query using your data, but since my laptop is weak, I only filled 10k rows into the PipeData table using your query with TOP 10000.

I tested your solution on these 10k rows and it took more than 5 minutes.

I tested my solution (meaning using OPENJSON) using temporary tables.

And guess what?!

As I expected, my solution finished in 20 seconds!

Assuming I did not make a mistake in my test (and this is the result I expected), the difference is 20 seconds vs 5 minutes.

I will post the query using temp tables that I used in an answer, since it is too long for a comment.


0 Votes 0 ·
0 Votes
MelissaMa-msft answered ·

Hi @SQLWhisper-8590,

In your situation, a CLR function could be the best solution. You could take Adam's CLR function, compile the code to a DLL (using csc), and deploy just that file to the server if you would like to use a CLR function.

Besides, XML is also a good solution; you could refer to the other experts' queries.

I also found a good function, DelimitedSplit8k, which could also work for you (a simplified sketch of such a splitter is shown after the output below).

  declare @table table (sno int,sname varchar(200),sname1 varchar(200))
        
  Insert into @Table(sno,sname,sname1)
  Values 
   (1,'1A','1a'), 
   (2,'2A|2B','2a|2b'), 
   (3,'3A|3B|3C','3a|3b|3c'),
   (4,'4A|4B|4C|4D','4a|4b|4c|4d'),
   (5,'5A|5B|5C|5D|5E','5a|5b|5c|5d|5e')
    
 select a.sno,s.item sname,s1.item sname1 from @Table a
 CROSS APPLY dbo.DelimitedSplit8k(a.sname,'|') s
 CROSS APPLY dbo.DelimitedSplit8k(a.sname1, '|') AS s1
 WHERE  s.itemnumber = s1.itemnumber

Output:

 sno    sname    sname1
 1    1A    1a
 2    2A    2a
 2    2B    2b
 3    3A    3a
 3    3B    3b
 3    3C    3c
 4    4A    4a
 4    4B    4b
 4    4C    4c
 4    4D    4d
 5    5A    5a
 5    5B    5b
 5    5C    5c
 5    5D    5d
 5    5E    5e
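
For readers unfamiliar with it, DelimitedSplit8k is Jeff Moden's well-known tally-based splitter; the real code is in his article. Purely as an illustration of its general shape (not the original), a simplified sketch with the same interface (itemnumber, item) could look like this:

 -- Simplified sketch only, not the original DelimitedSplit8k.
 CREATE FUNCTION dbo.DelimitedSplit8K (@pString varchar(8000), @pDelimiter char(1))
 RETURNS TABLE AS
 RETURN
 WITH E1(N)      AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
      E4(N)      AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d),   -- 10,000 rows, enough for 8000 characters
      Tally(N)   AS (SELECT TOP (LEN(ISNULL(@pString, '')))
                            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4),
      Starts(N1) AS (SELECT 1
                     UNION ALL
                     SELECT N + 1 FROM Tally
                     WHERE SUBSTRING(@pString, N, 1) = @pDelimiter)
 SELECT itemnumber = ROW_NUMBER() OVER (ORDER BY s.N1),
        item       = SUBSTRING(@pString, s.N1,
                       ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000))
 FROM   Starts s;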

Regarding the performance of multiple methods, you could refer to "Split strings the right way – or the next best way" and choose the best one for you.


Best regards
Melissa


If the answer is helpful, please click "Accept Answer" and upvote it.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


0 Votes
pituach answered ·

Hi @ErlandSommarskog

The only reason your query is fast is that you used temporary tables; it has nothing to do with your solution (your function).

With such a table and such an amount of data it is simply better to use two temporary tables and then JOIN.

I executed your query using your data, but since my laptop is weak, I only filled 10k rows into the PipeData table using your query with TOP 10000:

 CREATE DATABASE [ErlandTest]
     CONTAINMENT = NONE
     ON  PRIMARY ( NAME = N'ErlandTest', FILENAME = N'E:\SQL_Files\ErlandTest.mdf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
     LOG ON ( NAME = N'ErlandTest_log', FILENAME = N'E:\SQL_Files\ErlandTest_log.ldf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
 GO
    
 USE [ErlandTest]
 go
    
 CREATE TABLE PipeData (id          int           NOT NULL, 
                        ShipName    nvarchar(MAX) NOT NULL, 
                        ShipAddress nvarchar(MAX) NOT NULL,
                        CONSTRAINT pk_PipeData PRIMARY KEY (id)
 )
 go

 -- THE ONLY DIFFERENCE IS THAT I ADDED "top 10000"
 INSERT PipeData(id, ShipName, ShipAddress)
    SELECT top 10000 CustomerID, string_agg(convert(nvarchar(MAX), ShipName), '|') WITHIN GROUP (ORDER BY OrderID), 
                       string_agg(convert(nvarchar(MAX), ShipAddress), '|') WITHIN GROUP (ORDER BY OrderID)
    FROM   BigDB..BigOrders
    GROUP  BY CustomerID
 go



Next, I tested your solution on these 10k rows and it took more than 5 minutes.

I tested my solution (meaning using OPENJSON) using temporary tables, and guess what?!

As I expected, my solution finished in 20 seconds!

Assuming I did not make a mistake in my test (and this is the result I expected), the difference is 20 seconds vs 5 minutes.

So here is my query using temp tables:

 DROP TABLE IF EXISTS #t1
 DROP TABLE IF EXISTS #t2
 DROP TABLE IF EXISTS #temp4
 go
    
    
 DECLARE @d datetime2 = sysdatetime()
 CREATE TABLE #t1 (id  int NOT NULL,
                   pos int NOT NULL,
                   ShipName nvarchar(60) NOT NULL,
                   PRIMARY KEY (id, pos))
 CREATE TABLE #t2 (id  int NOT NULL,
                   pos int NOT NULL,
                   ShipAddress nvarchar(60) NOT NULL,
                   PRIMARY KEY (id, pos))
    
 INSERT #t1(id, pos, ShipName)
  SELECT t.id,[key] listpos ,[value] nstr
  from PipeData t
  CROSS APPLY OPENJSON ('["' + REPLACE(t.ShipName,'|','","') + '"]') AS s
    
 INSERT #t2(id, pos, ShipAddress)
  SELECT t.id,[key] listpos ,[value] nstr
  from PipeData t
  CROSS APPLY OPENJSON ('["' + REPLACE(t.ShipAddress,'|','","') + '"]') AS s
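  -- note: OPENJSON's [key] is zero-based, but since both #t1 and #t2 use it, the join on pos still pairs the values correctly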
    
 SELECT t1.id, t1.ShipName, t2.ShipAddress as str
 INTO   #temp4
 FROM   #t1 t1
 JOIN   #t2 t2 ON t1.id = t2.id
              AND t1.pos = t2.pos
    
 SELECT datediff(ms, @d, sysdatetime())
 go -- finished in 20410 ms


Tested again after closing everything I could to free resources:

Erland solution: 128064.00 ms

Ronen Solution: 15439 ms


Note: First test was 303319 vs 20410

0 Votes 0 ·

The only reason your query is fast is that you used temporary tables; it has nothing to do with your solution (your function).

But that is exactly my point. While the speed of the string-splitter certainly affects the response time, the way you apply it is a lot more important. That way my slow function could complete within my patience timeout, while your faster OPENJSON did not.

I also pointed out that the time could be improved even further with a faster string-splitter. On my system your query with OPENJSON and two temp tables finished in 16 seconds. I also tried the suggestion from Melissa to use Adam Machanic's CLR solution, and it was slightly slower at 18 seconds. On the other hand, it will not break if there are JSON characters in the data. (And on the third hand, there is a lot more hassle with using the CLR, unless you are already using it.)

0 Votes 0 ·

Hi,

the way you apply it is a lot more important.

True in most cases, not in others, but it requires more information from the OP.

As I see it, this is outside the scope of the original question, or at least something that can be mentioned as another approach (given the information we have at this time). We cannot choose using temp tables as the generally better solution, since we have no information about the real scenario (DDL + DML + memory + CPU). Using temp tables will probably not provide a better solution in all cases. This is an important point! In some cases, with different data and amounts of data, using temp tables will probably cost more because of the overhead of using them.

The basic comparison we did at first is between using XML, using OPENJSON, or using a loop function, and OPENJSON is the best for performance here (with the limitations that should be dealt with).

--- I hate that we need to split responses ---


0 Votes 0 ·

As I see it, this is outside the scope of the original question, or at least something that can be mentioned as another approach (given the information we have at this time). We cannot choose using temp tables as the generally better solution, since we have no information about the real scenario.

Yeah, and this is why I went with a simple-minded approach initially. But then somebody started to talk about performance... (But, OK, I got a little carried away there.)

using a dedicated solution in SQLCLR which is designed for our scenario (two columns with the same number of parts to split).

That certainly sounds like a winner. But the CLR also comes with baggage, unless you are already into it.





0 Votes 0 ·