Considerations when you use the SQL Server Full-Text search engine for the Japanese language

This article describes the considerations that apply when you use the SQL Server Full-Text Search engine for the Japanese language.

Applies to:   SQL Server
Original KB number:   2252955

Introduction

In the Japanese language, a phrase may consist of two or more words without spaces between those words. In Microsoft SQL Server, when you use the SQL Server Full-Text search engine to perform a prefix search for a Japanese phrase, the Full-Text search engine does not consider the phrase to be a prefix term. Instead, the Full-Text search engine considers the phrase to be word terms. This is because a word is defined as a string of characters without spaces or punctuation. Additionally, the search engine works only in the prefix-matching mode. The search engine does not work in the suffix-matching mode.

More information

For example, you create a table and insert some Japanese phrases by running the following statements in SQL Server:

CREATE TABLE test(c1 int PRIMARY KEY,c2 nvarchar(255))

INSERT test VALUES(1,N':::no-loc text="添付テスト":::')
INSERT test VALUES(2,N':::no-loc text="Fw: テスト":::')
INSERT test VALUES(3,N':::no-loc text="KK-Information:テスト":::')
INSERT test VALUES(4,N':::no-loc text="[Q] ポリシーテスト":::')
INSERT test VALUES(5,N':::no-loc text="KK-Information:タイトルフィルタテスト2":::')
INSERT test VALUES(6,N':::no-loc text="テスト":::')
INSERT test VALUES(7,N':::no-loc text="フィルタテスト3":::')
INSERT test VALUES(8,N':::no-loc text="テストフィルタ1":::')
INSERT test VALUES(9,N':::no-loc text="RE: テストメール":::')
INSERT test VALUES(10,N':::no-loc text="テストメール":::')
INSERT test VALUES(11,N':::no-loc text="White Listテスト":::')
INSERT test VALUES(12,N':::no-loc text="フィルタリングテスト":::')

CREATE FULLTEXT CATALOG test AS DEFAULT;
GO

CREATE FULLTEXT INDEX ON test(c2) KEY INDEX PK__<IndexName>;
GO

Then, you run the following three queries:

  • Query 1

    SELECT * FROM test WHERE CONTAINS(c2, N':::no-loc text="テスト":::')
    

    The result of Query 1 is as follows:

    c1c2

    2Fw: テスト
    3KK-Information:テスト
    6テスト

  • Query 2

    SELECT * FROM test WHERE CONTAINS(c2, '":::no-loc text="テスト*":::"')
    

    The result of Query 2 is as follows:

    c1c2
    2 Fw: テスト
    3 KK-Information:テスト
    6 テスト
    8 テストフィルタ1
    9 RE: テストメール
    10 テストメール

  • Query 3

    SELECT * FROM test WHERE CONTAINS(c2, '"*:::no-loc text="テスト":::*"')
    

    The result of Query 3 is as follows:

    c1c2
    2 Fw: テスト
    3 KK-Information:テスト
    6 テスト
    8 テストフィルタ1
    9 RE: テストメール
    10 テストメール

    From the results of the queries, you can find that the result of Query 2 is the same as the result of Query 3 because the Full-Text query does not work in the suffix-matching mode. Additionally, テスト is a token that differs from ポリシーテスト or from テスト in the matchings.

    To tokenize phrases, a word breaker for the language family must be used. Work breakers use spaces and other signs to recognize phrases. Therefore, some phrases cannot be recognized by the word breaker and cannot be searched by using Full-Text engine in the Japanese language. For more information about word breakers, see the Word Breakers and Stemmers topic in the Reference section.

    The best practice to use the Full-Text search engine in the Japanese language is to test the phrases to see whether the phrases are affected by the limitation. If a phrase consists of words without spaces, you cannot use the Full-Text functionality to search the phrase. Instead, you can use the LIKE keyword together with wildcard characters. However, the performance of the like operation is lower than the performance of the Full-Text searching. You must consider the performance effect for your application.

    The following are some sample queries of the like keyword to search for phrases.

  • Query 4

    SELECT * FROM test WHERE c2 like ':::no-loc text="テスト":::%'
    

    The result is as follows:

    c1c2
    6 テスト
    8 テストフィルタ1
    10 テストメール

  • Query 5

    SELECT * FROM test WHERE c2 like '%:::no-loc text="テスト":::%'
    

    The result is as follows:

    c1c2
    1 添付テスト
    2 Fw: テスト
    3 KK-Information:テスト
    4 [Q] ポリシーテスト
    5 KK-Information:タイトルフィルタテスト2
    6 テスト
    7 テストフィルタテスト3
    8 テストテストフィルタ1
    9 RE: テストテストメール
    10 テストテストメール
    11 テスト
    12 フィルタリングテスト

Note

If you use the Full-Text search engine in SQL Server, you can find more information about the content of a full-text index by using the following query:

SELECT * FROM sys.dm_fts_index_keywords(db_id('test'), object_id('test'))
GO

The result is as follows:

keyword display_term column_id document_count
0x00660077 fw 2 1
0x0069006E0066006F0072006D006100740069006F006E information 2 2
0x006B006B kk 2
0x006C00690073007430C630B930C8 listテスト 2 1
0x00770068006900740065 white 2 1
0x30BF30A430C830EB30D530A330EB30BF30C630B930C80032 タイトルフィルタテスト2 2 1
0x30C630B930C8 テスト 2 3
0x30C630B930C830D530A330EB30BF0031 テストフィルタ1 2 1
0x30C630B930C830E130FC30EB テストメール 2 2
0x30D530A330EB30BF30C630B930C80033 フィルタテスト3 2 1
0x30D530A330EB30BF30EA30F330B030C630B930C8 フィルタリングテスト 2 1
0x30DD30EA30B730FC30C630B930C8 ポリシーテスト 2 1
0x6DFB4ED830C630B930C8 添付テスト 2 1
0xFF END OF FILE 2 12
(14 row(s) affected)

In the sample result, only three rows contain the word テスト." The Full-Text search engine treats the word "テスト" as a different token from the word "テストメール. For more information about the SQL Server Full-Text search engine, visit the following Microsoft websites: