Introduce the Turkish I issue
Today, I will discuss the famous Turkish I issue. Below is one customer's problem related to Turkish collation?
When I use TURKISH_CI_AS collation, some of the field names in the database becomes case-sensitive, as a result queries are returning Invalid Column Name exception if there is a difference between the case of queries and case in the table definition.
This error is particularly evident with 'I' character. Any field name with 'I' character results in error. I'm not sure about what other characters will cause error.
I use following script to demo this issue.
create database TurkishDB collate turkish_ci_as;
create table Information(c1 int, c2 Int);
select * from Information;
(0 row(s) affected)
select * from information
Msg 208, Level 16, State 1, Line 1
Invalid object name 'information'.
As you can see from the result, querying table Information get correct result, but querying information
Does not. The problem here is that in Turkish, the upper case of character of i (lower case letter i with dot) is İ (capital i with dot). Also, the lower case of character of I (capital i without dot) is (lower case letter i without dot). In one word, i and I are not a match pair any more. In SQL Server's Turkish_CI_AS collation, even we do case insensitive comparison, i and I are not equal.
In SQL Server, the nvarchar columns of the system tables use the same collation as the database collation. As a result, when we search for Object information in the above database, it does not match the Object name Information any more. That is the root reason of the above issue.
This is one more trick in the above example. In the above creating statement, int and Int both works fine. Here is the answer from our dev:
Built-in function names (including min and MAX) are matched like TSQL keywords (English, case insensitive). Name binding for all user objects would follow the collation rules. Note that persisted system objects (from sys. Schema) also require name binding using current collation.
Internally, we use Latin1_General_CI_AS to match keywords. In the later stage of query compilation, we need bind a given object name with internal object id, we use the current database's collation for name matching.
Personal, I think the system table should use a language invariant collation to do name matching instead of using the current database's collation. That is another good reason for introducing language neutral, binary collation.
Finally, here is my suggestion of avoid the Turkish I issue. First, for object name, etc, try use either all lower case letter, or all upper case letter. In addition, when reference to the object during query, make sure you use the same object name in your query. Second, keep in mind, this Turkish I issue also happen in your data. For example, when you match input values in one of your stored procedure, and do different operations correspondingly. You might consider to use a more generic collation, such as latin_general_ci_as to do string match. Third, keep in mind, upper() and lower() built-in always was impacted by this issue, you might also use a more generic collation for this. For example, suppose your default collation is english collation
UPPER ('i') collate turkish_ci_as --will give you english upper/case rule, but the result is covered to turkish collation
select UPPER ('i' collate turkish_ci_as ) --will give your turkish collation rule, the result is also turkish collation
select @a = UPPER (@a collate turkish_ci_as ) --will give your turkish collation rule, the result is also english collation