question

SQLGuy10-3060 asked ·

varchar vs nvarchar

As a general rule, should you use varchar instead of nvarchar, if you won't ever be storing foreign language characters?

sql-server-general

Hi @SQLGuy10-3060,

We have not received a response from you. Did the replies help you? If a response helped, please "Accept Answer". By doing so, it will benefit all community members who are having a similar issue. Your contribution is highly appreciated.

Best regards,
Cathy

MartinCairney-6481 answered ·

If you never need to make use of the extended characters available in UNICODE then yes, it does make sense to only use VARCHAR.

However, if you are only using characters in the UNICODE range 0-65535 then they use the same storage space as VARCHAR.

See the discussion about NCHAR and NVARCHAR

Remember also that if you use Unicode then you MUST always prefix string literals with N; otherwise you will introduce an implicit conversion and can also affect the SARGability of your predicates.
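To illustrate the implicit-conversion point, here is a minimal sketch (the table, index, and collation are hypothetical, purely for illustration; whether a seek is lost depends on the collation and on which side of the comparison gets converted):

 -- Hypothetical varchar column under a SQL Server collation
 CREATE TABLE dbo.Customers (CustomerName varchar(100) COLLATE SQL_Latin1_General_CP1_CI_AS);
 CREATE INDEX IX_Customers_Name ON dbo.Customers (CustomerName);

 -- varchar literal vs varchar column: types match, an index seek is possible
 SELECT CustomerName FROM dbo.Customers WHERE CustomerName = 'Smith';

 -- nvarchar literal vs varchar column: nvarchar has higher type precedence,
 -- so the COLUMN is implicitly converted; under a SQL Server collation this
 -- typically turns the seek into a scan
 SELECT CustomerName FROM dbo.Customers WHERE CustomerName = N'Smith';

When the column itself is nvarchar, a missing N prefix converts only the literal, which is cheap, but it risks data loss if the literal contains characters outside the database's code page.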





Martin,

I'm confused by your statement below:

"However, if you are only using characters in the UNICODE range 0-65535 then they use the same storage space as VARCHAR."

The above is, AFAIK, true if you use row compression and the data isn't stored off-row. But without row compression, nvarchar uses double the storage compared to varchar:

 USE tempdb;

 /*
 DROP TABLE IF EXISTS t4;
 DROP TABLE IF EXISTS t5;
 */

 CREATE TABLE t4(c1 int IDENTITY, c2 varchar(40));
 CREATE TABLE t5(c1 int IDENTITY, c2 nvarchar(40));

 INSERT INTO t4 SELECT TOP(50000) REPLICATE('a', 40) FROM sys.columns AS a CROSS JOIN sys.columns AS b;
 INSERT INTO t5 SELECT TOP(50000) REPLICATE('a', 40) FROM sys.columns AS a CROSS JOIN sys.columns AS b;

 EXEC sp_spaceused 't4';
 EXEC sp_spaceused 't5';

Did I misunderstand you?


Hi @TiborKaraszi - see the link to the Docs page in my post in the Remarks section

"when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair. However, in higher Unicode ranges (65,536-1,114,111) one character may use two byte-pairs."

There's no mention of compression there, but after reading it again, it may also depend on the collation that you have set. I haven't experimented to see what differences there are.

On the test, however, I would use variables defined as VARCHAR(x) and NVARCHAR(x) and compare the length of each for the same string. I think for the page storage you would need to dig into DBCC PAGE and see exactly what has been written to validate that test.


"when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair."

A byte-pair is two bytes per character. In varchar, each character in the ASCII range takes up one byte. For characters beyond the ASCII range, it depends on the code page for the collation. But for collations for languages using Latin, Cyrillic or Greek script that are not based on UTF-8, that is one byte for every character.

It is different if you use UTF-8, or a collation for languages like Thai or Chinese.

TiborKaraszi replied to MartinCairney-6481 ·

Sorry if I state the obvious (after Erland's post), but the key word you probably overlooked is pair. I.e. a byte pair is indeed two bytes. :-)

As for a T-SQL example not using a table, here goes:

 SELECT DATALENGTH('Microsoft SQL Server     '), DATALENGTH(N'Microsoft SQL Server     ');

Result

 ----------- -----------
 25          50



"characters in the UNICODE range 0-65535...use the same storage space as VARCHAR" is incorrect. NVARCHAR uses 2 bytes per BMP character (0-65535) and 4 bytes per Supplementary Character (65536-1114111). For details please see my post: How Many Bytes Per Character in SQL Server.

Using the N prefix is great advice, but it's mainly to avoid data loss (when Unicode chars not available in the code page associated with the current database's default collation get converted to either ? or a "best fit" mapping of something similar). The performance issue is specific to a VARCHAR column using a SQL Server collation compared to an NVARCHAR value. For details please see my post: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types.
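A minimal sketch of that data-loss behavior (assuming the database's default collation uses a non-UTF-8 code page such as 1252; with a different code page or a UTF-8 collation the result differs):

 -- The Hebrew letter alef is not in code page 1252. Without the N prefix the
 -- literal is varchar, so the character is lost before it is ever stored.
 SELECT N'א' AS WithPrefix,    -- preserved (nvarchar literal)
        'א'  AS WithoutPrefix; -- returned as '?' (or a "best fit" substitute)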


Hi again Solomon. Did you also miss the word "pair" in that MS statement? I.e., "Unicode range 0-65,535, one character can be stored per each byte-pair." So unless I am out-of-the-ordinary confused, you seem to say the same thing as the documentation: that range uses two bytes.

The fact that both you and Martin (seem to have) missed that "pair" does indeed reflect that there's room for improvement in how MS phrases things, though!


Hello. No, I definitely did not miss that word ;-). I was actually responding to Martin, quoting part of this answer. However, my comment was long and I had to cut parts out to fit within the limit (which is not the 1000 that it claims to be, but closer to 985). I updated my comment earlier today to include more of the quote so it would be clearer what I was responding to. Yes, I'm saying the same thing as the documentation, except I prefer to use number of bytes for everything as it's not only clearer, but is consistent when speaking of the other options.

And yes, there are several areas of the documentation that need improvement, but only so much time in a day, and even less these days :-(

Cathyji-msft answered ·

Hi @SQLGuy10-3060,

If we know that the data to be stored in the column or variable doesn't have any Unicode characters, we can use varchar. But some experts recommend always using nvarchar, because all modern operating systems and development platforms use Unicode internally, so using nvarchar rather than varchar avoids encoding conversions every time you read from or write to the database.

The major differences between varchar and nvarchar:

  1. Nvarchar stores UNICODE data. If you have requirements to store UNICODE or multilingual data, nvarchar is the choice. Varchar stores ASCII data and should be your data type of choice for normal use.

  2. For nvarchar, when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair, however, in higher Unicode ranges (65,536-1,114,111) one character may use two byte-pairs. Whereas varchar only uses 1 byte.
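Point 2 is easy to verify with DATALENGTH; the emoji below is a supplementary character, stored as two byte-pairs (a surrogate pair):

 SELECT DATALENGTH(N'a')  AS BmpChar,   -- 2 bytes: one byte-pair
        DATALENGTH(N'€')  AS BmpChar2,  -- 2 bytes: still in the 0-65,535 range
        DATALENGTH(N'😀') AS SuppChar;  -- 4 bytes: two byte-pairs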

Advantages and disadvantages of varchar and nvarchar in SQL Server (see attached screenshot: 57968-screenshot-2021-01-19-105836.jpg).

By the way, starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues.
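As a sketch of opting in (the table and column names are made up; Latin1_General_100_CI_AS_SC_UTF8 is one of the _UTF8 collations added in SQL Server 2019):

 -- A varchar column that can store all of Unicode, encoded as UTF-8
 CREATE TABLE dbo.Products
 (
     ProductName varchar(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8
 );

Note that varchar(200) still means 200 bytes, not 200 characters, so non-ASCII text fits fewer characters than under a single-byte code page.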

Please refer to the links below for more information.

Difference Between Sql Server VARCHAR and NVARCHAR Data Type
What is the difference between varchar and nvarchar?
SQL Server differences of char, nchar, varchar and nvarchar data types

Best regards,
Cathy


If the response is helpful, please click "Accept Answer" and upvote it, thank you.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.




"By the way, starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues."

I would not go this route, unless I have some very special circumstances. I strongly encourage anyone considering using UTF-8 to first read this: https://www.sqlservercentral.com/blogs/native-utf-8-support-in-sql-server-2019-savior-or-false-prophet


Hello @TiborKaraszi (and @Cathyji-msft ). Thanks for referring to my post 😺. Unfortunately, the SQL Server Central version of it is woefully outdated as their blog mirroring no longer refreshes when the original is updated. For a lot of posts that doesn't matter so much. But for this one there have been many updates as I modified it to keep up with their changes across the CTPs through RTM, etc. Please review the source post which might have a slightly different recommendation as compared to the older SQL Server Central version: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

They ended up improving it a bit, so it's not so bad now, but it's still best suited to only a couple of scenarios. Please see my answer below for details.



Thanks @SolomonRutzky-6202 . I didn't know you had two versions of the article out there. I do point to your article in a couple of my classes, and that is indeed your QuantumLeap link. For the above post, I googled to find your blog quickly, and missed that it wasn't the same one I point to in my classes. Good to know there are two out there (and which one to point to).

ErlandSommarskog answered ·

"As a general rule, should you use varchar instead of nvarchar, if you won't ever be storing foreign language characters?"

The problem is: can you really be sure? All of a sudden your business expands, and the need arises.

If you start with varchar and need to change later, it will be painful. UTF-8 collations can save you, but not all the way, since you still need to review column lengths.
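As a sketch of what that later change involves (hypothetical table and column names; in practice you must also drop and recreate any indexes or constraints on the column and review the N-prefix discipline in all calling code):

 -- Widening an existing varchar column to nvarchar; the table data is rewritten
 ALTER TABLE dbo.Customers
     ALTER COLUMN CustomerName nvarchar(100) NOT NULL; -- restate the original nullability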


SolomonRutzky answered ·

General

It depends mainly on:

  1. What do you consider "foreign language characters"?

  2. How guaranteed are you that this requirement won't change?


Regarding "foreign language characters", will something like emojis ever be an option? Meaning, are you thinking in terms of "no characters outside of the local alphabet and punctuation", or "alphabet and punctuation are local, but emojis might get thrown in"?

For me, I would say that "codes" usually have a stable range of characters and can safely be VARCHAR. For example, postal codes, airline confirmation codes, credit card authorization/confirmation codes, ISO country codes, etc.

But names, whether for people or places, or even products, etc, can increasingly contain "foreign language characters". Even if it's reasonable to expect local street names to contain only local language characters, if you are dealing with a list of customers and the business isn't 100% local, then you can't guarantee only local language characters. So for names, etc it is better to go with NVARCHAR.


UTF-8

UTF-8, introduced in SQL Server 2019, is a tricky option as it helps some scenarios but hurts others. In general, it makes sense to use under one of the following conditions:

  1. You are dealing with app code that submits SQL using VARCHAR and/or string literals without the upper-case N prefix (whether it is a legacy app or an app that works with one or more non-SQL Server RDBMSs that allow for Unicode in VARCHAR and string literals missing the upper-case N prefix).

  2. Your data is primarily US English characters (i.e. standard ASCII / code points 1 - 127) but you either have a small amount of data that isn't standard ASCII or there's at least potential to get non-standard ASCII data, and a significant portion of the data is in NVARCHAR(MAX) columns (i.e. can't benefit from Unicode Compression that is used in Row Compression but doesn't work on NVARCHAR(MAX) ).

For full details, please see my post: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?


Bytes per Character

Now, how many bytes are used per character is also tricky as it depends on the datatype and collation. In general:

  • VARCHAR can be either 1 or 2, though when using a UTF-8 collation it can be anywhere from 1 to 4.

  • NVARCHAR can be either 2 or 4.

For full details, please see my post: How Many Bytes Per Character in SQL Server: a Completely Complete Guide
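The byte counts above can be observed with DATALENGTH (the _UTF8 collation requires SQL Server 2019 or later; the other collation shown uses code page 1252):

 SELECT
     DATALENGTH(CAST(N'é' COLLATE Latin1_General_100_CI_AS         AS varchar(10))) AS VarcharCp1252, -- 1 byte
     DATALENGTH(CAST(N'é' COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS varchar(10))) AS VarcharUtf8,   -- 2 bytes
     DATALENGTH(N'é')   AS NvarcharBmp,   -- 2 bytes
     DATALENGTH(N'😀')  AS NvarcharSupp;  -- 4 bytes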





Third comment - last one for now :-) For your point 2 above, it seems you are simplifying things, perhaps saying as a blanket statement that nvarchar with row compression is preferred? I.e., without qualifying all the points in the recommendation section of your article...


Hi @TiborKaraszi . I was missing a qualification that I just added to my answer (for point #2): the real issue is NVARCHAR(MAX) since that datatype can't make use of Unicode Compression. And in fact, it only benefits from Page Compression when the values are stored in-row. It would be wonderful if Unicode Compression worked in off-row NVARCHAR data (i.e. where one would have the most data that would benefit the most from compression, as opposed to short values in NVARCHAR(1 - 4000) columns), so please vote for: unicode compression nvarchar(max).



Hi @SolomonRutzky . Voted! :-)
