question

AndreasLiedhegener-1497 avatar image
0 Votes"
AndreasLiedhegener-1497 asked EchoLiu-msft commented

nvarchar and substring

Hello

we have to implement a subset of utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306). If I had a string like

'Hello C̨̆, how are you'

and I make a substring (N'Hello C̨̆, how are you',1,7) I get

Hello C

with substring (N'Hello C̨̆, how are you',1,255) I get

Hello C̨̆, how are you

How could I avoid this problem

Kind regards
Andreas

sql-server-transact-sql
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Do you have anyupdate?
Please also remember to accept the answers if they helped. Your action would be helpful to other users who encounter the same issue and read this thread.

0 Votes 0 ·
cheong00 avatar image
0 Votes"
cheong00 answered cheong00 edited

If you have SQL2019, you may try the following:

 select substring (N'Hello C̨̆, how are you' COLLATE Latin1_General_100_CI_AI_SC,1,7) 


5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

AndreasLiedhegener-1497 avatar image
0 Votes"
AndreasLiedhegener-1497 answered

Hello

sorry, I didn't tell it, we have SQL 2017

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Viorel-1 avatar image
0 Votes"
Viorel-1 answered

If it is difficult to solve the problem in SQL, I think that it is possible to write an external function in C#. There are corresponding classes and functions.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

OlafHelper-2800 avatar image
0 Votes"
OlafHelper-2800 answered

utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306)

SQL Server 2017 don't support UTF-8 encoding, only UniCode, that UTF-8 character gets change to two Unicode characters.





5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered ErlandSommarskog commented

No, that is not UTF-8. You have UTF-16. The above is not a valid UTF-8 sequence. And as you can see below, I can convert the string to binary, and the binary to the string to get that character. (Keep in mind that in the binary value we need to swap bytes, because Wintel is a little-endian architecture.

DECLARE @str nvarchar(10) = N'C̨̆'
DECLARE @str2 nvarchar(10) = convert(nvarchar(10), 0x430028030603)
SELECT @str, @str2, len(@str), len(@str2)


And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.

Here in my web browser, there is only one character it seems, but it is a capital C followed by two combining accents. You could of course argue that the combing accents count as part of the length, but SQL Server does not seem to be that smart.

· 4
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

This documentation suggest that on SQL2016 or later, if you use any collations that end with "_SC", it will cater for surrogate characters in UTF-16.

You may check the collation support by the following SQL statement:

 select * from sys.fn_helpcollations() where name like '%_SC'



0 Votes 0 ·

This documentation suggest that on SQL2016 or later, if you use any collations that end with "_SC", it will cater for surrogate characters in UTF-16.

There are no surrogates here. There are three characters, and they are all in the Unicode base plane. Surrogates are in the range D800-DFFFF.

0 Votes 0 ·

Oops, you're right. The latter 2 code points are "diacritical marks".

0 Votes 0 ·

And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.

I realised later that this depends on the font.. In the SSMS I tried, I had Courier, and with this font the character calls apart. It works better with Lucida Console.

I made some tests on Postgres, and it appears like SQL Server. That is finds three characters for the C + the two combining accents, and substring chops the character in the middle of the sequence.

There was an N missing in my example above, which I have corrected.

Finally, here is an illustration that SQL Server does understand the combining accent. @a and @b are considered equal, even if the collation is accent-sensitive - although they have different lengths.

DECLARE @a nvarchar(10) = convert(nvarchar(10), 0x41000003),
        @b nvarchar(10) = 'À'
SELECT @a, @b, IIF(@a = @b, 1, 0), IIF(len(@a) = len(@b), 1, 0)

0 Votes 0 ·