question

MarcFauser-5070 avatar image
0 Votes"
MarcFauser-5070 asked ErlandSommarskog commented

ODBC Driver 17 for SQL Server on Linux. Charset conversion problem with OTRS (bug?)

I updated from ODBC Driver 13 for SQL Server to version 17.
With version 13, everything was working fine, with version 17, I cannot find what's wrong.

We have OTRS running which connects to a SQL Server database to retrieve data for customers.
E.g. we transfer if the customer has a maintenance contract or not.
Yes = thumb up emoji which I cannot post in this forum ( 4 byte character )
No = thumb down emoji which I cannot post in this forum ( 4 byte character )
This was working fine until version 17.
The select (simplified) was
SELECT N'<thumb up emoji>' as maintenance
I changed the select to
SELECT cast( N'<thumb up emoji>' as nvarchar(10) ) as maintenance
and it works again.
In my odbc.ini, I have no charset defined. In OTRS I have defined UTF-8 as a source and destination charset.
Any hints on what could be wrong or is it a bug in the driver as it was working before?

sql-server-generalsql-server-transact-sql
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

So what happened when things went wrong?

0 Votes 0 ·
OlafHelper-2800 avatar image
0 Votes"
OlafHelper-2800 answered ErlandSommarskog commented

Yes = thumb up emoji which I cannot post in this forum ( 4 byte character )

4 byte char's are UTF-8, while MS SQL Server mainly supports 2 byte's char's = Unicode.
Which data type do you use to store the text and which SQL Server version are you using?

cast( N'<thumb up emoji>' as nvarchar(10) )

And the type nvarchar here is Unicode, not UTF-8.

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

4 byte char's are UTF-8, while MS SQL Server mainly supports 2 byte's char's = Unicode.

No that is not correct. Emojis are in one of the supplementary planes, that is the code point is beyond 65536. These characters are encoded with four bytes both in UTF-8 and UTF-16. In UTF-16, the are encoded in so-called surrogate pairs, where all bytes are in the range D800-DFFF.

SQL Server uses UTF_16 for the nvarchar data type, but it only handles surrogate pairs correctly with a collation with _SC in the name (or version number 140). That does not mean that you cannot store emojis with other collations, but with these collation, SQL Server thinks the four bytes in the surrogate pair are two characters.

0 Votes 0 ·
Yufeishao-msft avatar image
0 Votes"
Yufeishao-msft answered ErlandSommarskog commented

Hi @MarcFauser-5070,

Upon connection, the driver detects the current local of the process it is loaded in, if it used one of the support encoding, the driver uses that encoding for SQLCHAR data, otherwise, it defaults to UTF-8.
Since all process start in the ‘C’locale by default (and cause the driver to default to UTF-8), if an application needs to use one of the encodings, it should use the setlocale function to set the locale appropriately before connecting

In a typical Linux environment where the encoding is UTF-8, users of ODBC Driver 17 upgrading form 13 or 13.1 won’t observe any differences, however, applications that use a non-UTF-8 encoding need to use that encoding for data to/from the driver instead of UTF-8.
https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/programming-guidelines?view=sql-server-ver15

So you should check the character sets/encodings of ODBC 17.

and you should konw, UTF-8 is supported is SQL Server Version 2019, but not previous Versions, they support only ASCII and UniCode

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

and you should konw, UTF-8 is supported is SQL Server Version 2019, but not previous Versions, they support only ASCII and UniCode

Well, UTF-8 is Unicode, just a different encoding than UTF-16 that SQL Server uses for the nvarchar data type.

SQL Server also supports many more character sets than just plain ASCII, for instance CP-932 for Japanese. (If you have a Japanese collation.)

0 Votes 0 ·