WordsSegmenter
Class
Definition
Segments provided text into words or word stems (depending on the particular language).
public : sealed class WordsSegmenter : IWordsSegmenter
public sealed class WordsSegmenter : IWordsSegmenter
Public NotInheritable Class WordsSegmenter Implements IWordsSegmenter
// You can use this class in JavaScript.
- Attributes
| Attribute | Value |
|---|---|
| Device family | Windows 10 (introduced v10.0.10240.0 - for Xbox, see UWP features that aren't yet supported on Xbox) |
| API contract | Windows.Foundation.UniversalApiContract (introduced v1) |
Remarks
For languages that do not use spaces between words (such as Japanese, Chinese, Korean, and Thai), use of a segmenter is the only way to obtain individual words for textual processing scenarios such as keyword search.
The language supplied when this object is constructed is matched against the languages with word breakers on the system, and the best word segmentation rules available are used. The language need not be one of the app's supported languages. If there are no supported language rules available specifically for that language, the language-neutral rules are used (an implementation of Unicode Standard Annex #29 Unicode Text Segmentation), and the ResolvedLanguage property is set to "und" (undetermined language).
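As a rough illustration of the fallback described above, the following JavaScript sketch reports whether a segmenter ended up on the language-neutral rules. The helper name `describeResolution` is hypothetical and not part of the API; it only reads `resolvedLanguage` (assumed here to be the JavaScript projection of the ResolvedLanguage property) from whatever segmenter object it is given.

```javascript
// Hypothetical helper (not part of the API): summarizes how the requested
// language tag was resolved by a WordsSegmenter-like object.
function describeResolution(segmenter, requestedLanguage) {
  const resolved = segmenter.resolvedLanguage;
  return {
    requested: requestedLanguage,
    resolved: resolved,
    // "und" means the language-neutral (UAX #29) rules are in use.
    usesNeutralRules: resolved === "und",
  };
}

// In a UWP JavaScript app, the segmenter would be created like this:
//   const segmenter = new Windows.Data.Text.WordsSegmenter("de-DE");
//   const info = describeResolution(segmenter, "de-DE");
```

Checking `usesNeutralRules` after construction is one way to notice that no language-specific word breaker was available for the requested tag.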
For keyword search scenarios, it is always recommended to request a segmenter in the language of the text contents.
For spellchecking scenarios, some language segmenters (such as German) may return multiple word stem segments for a single compound word. In contrast, the spellchecking APIs may expect the words to be kept together as a single word. For such languages, you may choose to force language-neutral segmenting rules by explicitly requesting the "und" (undetermined language) segmenter. However, doing so will greatly reduce the breaking quality of non-spaced languages. Therefore, it is recommended that you use the Language.Script API to determine if the content language uses one of the following non-spaced scripts:
| Script code | Script name |
|---|---|
| Bopo | Bopomofo |
| Brah | Brahmi |
| Egyp | Egyptian Hieroglyphs |
| Goth | Gothic |
| Hang | Hangul |
| Hira | Hiragana |
| Hang | Old Hangul |
| Hani | Han |
| Ital | Old Italic |
| Java | Javanese |
| Kana | Katakana |
| Khar | Kharoshthi |
| Khmr | Khmer |
| Laoo | Lao |
| Lisu | Lisu |
| Mymr | Myanmar |
| Talu | New Tai Lue |
| Thai | Thai |
| Tibt | Tibetan |
| Xsux | Cuneiform |
| Yiii | Yi |
If the content language uses none of these scripts, it should be safe to use "und" for segmentation in spellchecking scenarios.
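The decision described above could be sketched in JavaScript as follows. The set holds the script codes for the scripts named in the table (using the ISO 15924 code `Hira` for Hiragana). The function name `chooseSpellcheckSegmenterLanguage` is hypothetical, and the script code is passed in as a plain string so the logic does not depend on how it was obtained.

```javascript
// Script codes for the non-spaced scripts named in the table above.
// "Hira" is the ISO 15924 code for Hiragana.
const NON_SPACED_SCRIPTS = new Set([
  "Bopo", "Brah", "Egyp", "Goth", "Hang", "Hani", "Hira", "Ital", "Java",
  "Kana", "Khar", "Khmr", "Laoo", "Lisu", "Mymr", "Talu", "Thai", "Tibt",
  "Xsux", "Yiii",
]);

// Hypothetical helper: picks the language tag to request from the
// segmenter in a spellchecking scenario.
function chooseSpellcheckSegmenterLanguage(contentLanguage, scriptCode) {
  // Non-spaced scripts need language-specific rules to break well.
  if (NON_SPACED_SCRIPTS.has(scriptCode)) {
    return contentLanguage;
  }
  // Otherwise language-neutral rules keep compound words together.
  return "und";
}

// In a UWP JavaScript app, the script code would come from Language.Script:
//   const script = new Windows.Globalization.Language("th").script;
//   const tag = chooseSpellcheckSegmenterLanguage("th", script);
```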
Constructors
WordsSegmenter(String)
Creates a WordsSegmenter object. See the introduction in WordsSegmenter for a description of how the language supplied to this constructor is used.
public : WordsSegmenter(Platform::String language)
public WordsSegmenter(String language)
Public Sub New(language As String)
// You can use this method in JavaScript.
- language
- Platform::String (C++/CX); String (C#, Visual Basic, JavaScript)
A BCP-47 language tag.
Properties
ResolvedLanguage
Gets the language of the rules used by this WordsSegmenter object.
Returns "und" (undetermined) when language-neutral rules are in use.
public : Platform::String ResolvedLanguage { get; }
public string ResolvedLanguage { get; }
Public ReadOnly Property ResolvedLanguage As String
// You can use this property in JavaScript.
- Value
- Platform::String (C++/CX); String (C#, Visual Basic, JavaScript)
The BCP-47 language tag of the rules employed.
Methods
GetTokenAt(String, UInt32)
Determines and returns the word or word stem which contains or follows a specified index into the provided text.
public : WordSegment GetTokenAt(Platform::String text, unsigned int startIndex)
public WordSegment GetTokenAt(String text, UInt32 startIndex)
Public Function GetTokenAt(text As String, startIndex As UInt32) As WordSegment
// You can use this method in JavaScript.
- text
- Platform::String (C++/CX); String (C#, Visual Basic, JavaScript)
Provided text from which the word or word stem is to be returned.
- startIndex
- unsigned int (C++/CX); UInt32 (C#, Visual Basic, JavaScript)
A zero-based index into text. It must be less than the length of text.
- Returns
- WordSegment
A WordSegment that represents the word or word stem.
Remarks
Some languages (such as Japanese or Chinese) do not use spaces between words, and the segmenters for some languages (such as German) may return multiple word stems for a single compound word.
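A minimal JavaScript sketch of calling this method is shown below. The helper name `wordTextAt` is hypothetical, and `getTokenAt` and `text` are assumed to be the JavaScript projections of GetTokenAt and of the Text property of WordSegment. The helper guards the documented precondition that the index is smaller than the length of the text.

```javascript
// Hypothetical helper: returns the text of the word or word stem that
// contains or follows `index`, or null when `index` is out of range.
function wordTextAt(segmenter, text, index) {
  // startIndex must be a zero-based index smaller than the text length.
  if (!Number.isInteger(index) || index < 0 || index >= text.length) {
    return null;
  }
  const token = segmenter.getTokenAt(text, index);
  // Defensive check: treat a missing token as "no word found".
  return token ? token.text : null;
}

// In a UWP JavaScript app:
//   const segmenter = new Windows.Data.Text.WordsSegmenter("en-US");
//   const word = wordTextAt(segmenter, "Hello world", 7);
```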
GetTokens(String)
Determines and returns all of the words or word stems in the provided text.
public : IVectorView<WordSegment> GetTokens(Platform::String text)
public IReadOnlyList<WordSegment> GetTokens(String text)
Public Function GetTokens(text As String) As IReadOnlyList(Of WordSegment)
// You can use this method in JavaScript.
- text
- Platform::String (C++/CX); String (C#, Visual Basic, JavaScript)
Provided text containing words or word stems to be returned.
- Returns
- IVectorView<WordSegment> (C++/CX); IReadOnlyList<WordSegment> (C#, Visual Basic)
A collection of WordSegment objects that represent the words or word stems.
Remarks
Some languages (such as Japanese or Chinese) do not use spaces between words, and the segmenters for some languages (such as German) may return multiple word stems for a single compound word.
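The following JavaScript sketch shows one way to collect the results of this method. The helper name `listWords` is hypothetical. It assumes the returned collection can be indexed like an array in JavaScript, and that each WordSegment exposes `text` and `sourceTextSegment` (with `startPosition` and `length`) as the projections of its Text and SourceTextSegment members.

```javascript
// Hypothetical helper: copies each token's text and source position into
// plain objects that are easy to log or search.
function listWords(segmenter, text) {
  const tokens = segmenter.getTokens(text);
  const words = [];
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    words.push({
      text: token.text,
      start: token.sourceTextSegment.startPosition,
      length: token.sourceTextSegment.length,
    });
  }
  return words;
}

// In a UWP JavaScript app:
//   const segmenter = new Windows.Data.Text.WordsSegmenter("ja");
//   const words = listWords(segmenter, someJapaneseText);
```

Keeping the start position and length alongside the text makes it possible to map each word back to its place in the original string, for example to highlight keyword matches.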
Tokenize(String, UInt32, WordSegmentsTokenizingHandler)
Calls the provided handler with two iterators that iterate through the words prior to and following a given index into the provided text.
public : void Tokenize(Platform::String text, unsigned int startIndex, WordSegmentsTokenizingHandler handler)
public void Tokenize(String text, UInt32 startIndex, WordSegmentsTokenizingHandler handler)
Public Sub Tokenize(text As String, startIndex As UInt32, handler As WordSegmentsTokenizingHandler)
// You can use this method in JavaScript.
- text
- Platform::String (C++/CX); String (C#, Visual Basic, JavaScript)
Provided text containing words to be returned.
- startIndex
- unsigned int (C++/CX); UInt32 (C#, Visual Basic, JavaScript)
A zero-based index into text. It must be less than the length of text.
- handler
- WordSegmentsTokenizingHandler
The function that receives the iterators.
Remarks
The iterators in WordSegmentsTokenizingHandler are lazy and evaluate small chunks of text at a time.
The handler is called at most once per call to Tokenize. The handler is not called if there are no selectable words in text.
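Because the iterators are lazy, a handler can stop after reading only the words it needs. The JavaScript sketch below takes a few words on each side of the index. The helper names `take` and `wordsAround` are hypothetical. The sketch assumes that each iterable is projected with `first()` returning an iterator that exposes `hasCurrent`, `current`, and `moveNext()`, and that the handler runs before `tokenize` returns.

```javascript
// Reads the text of at most `max` words from a WinRT-style iterable
// (assumed shape: first() -> { hasCurrent, current, moveNext() }).
function take(iterable, max) {
  const items = [];
  const it = iterable.first();
  while (items.length < max && it.hasCurrent) {
    items.push(it.current.text);
    it.moveNext();
  }
  return items;
}

// Hypothetical helper: collects up to `count` words before and after
// `index`, in whatever order each iterator yields them.
function wordsAround(segmenter, text, index, count) {
  // Stays empty when the handler is never called (no selectable words).
  const result = { before: [], after: [] };
  segmenter.tokenize(text, index, (precedingWords, words) => {
    result.before = take(precedingWords, count);
    result.after = take(words, count);
  });
  return result;
}

// In a UWP JavaScript app:
//   const segmenter = new Windows.Data.Text.WordsSegmenter("en-US");
//   const nearby = wordsAround(segmenter, someText, 10, 3);
```

Starting both results as empty arrays covers the documented case where the handler is not called at all.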