WordsSegmenter Class


Definition

Namespace: Windows.Data.Text

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.


A segmenter class that splits provided text into words or word stems (depending on the particular language).

public ref class WordsSegmenter sealed

/// [Windows.Foundation.Metadata.Activatable(Windows.Data.Text.IWordsSegmenterFactory, 65536, Windows.Foundation.UniversalApiContract)]
/// [Windows.Foundation.Metadata.ContractVersion(Windows.Foundation.UniversalApiContract, 65536)]
/// [Windows.Foundation.Metadata.MarshalingBehavior(Windows.Foundation.Metadata.MarshalingType.Agile)]
/// [Windows.Foundation.Metadata.Threading(Windows.Foundation.Metadata.ThreadingModel.Both)]
class WordsSegmenter final

/// [Windows.Foundation.Metadata.ContractVersion(Windows.Foundation.UniversalApiContract, 65536)]
/// [Windows.Foundation.Metadata.MarshalingBehavior(Windows.Foundation.Metadata.MarshalingType.Agile)]
/// [Windows.Foundation.Metadata.Threading(Windows.Foundation.Metadata.ThreadingModel.Both)]
/// [Windows.Foundation.Metadata.Activatable(Windows.Data.Text.IWordsSegmenterFactory, 65536, "Windows.Foundation.UniversalApiContract")]
class WordsSegmenter final

[Windows.Foundation.Metadata.Activatable(typeof(Windows.Data.Text.IWordsSegmenterFactory), 65536, typeof(Windows.Foundation.UniversalApiContract))]
[Windows.Foundation.Metadata.ContractVersion(typeof(Windows.Foundation.UniversalApiContract), 65536)]
[Windows.Foundation.Metadata.MarshalingBehavior(Windows.Foundation.Metadata.MarshalingType.Agile)]
[Windows.Foundation.Metadata.Threading(Windows.Foundation.Metadata.ThreadingModel.Both)]
public sealed class WordsSegmenter

[Windows.Foundation.Metadata.ContractVersion(typeof(Windows.Foundation.UniversalApiContract), 65536)]
[Windows.Foundation.Metadata.MarshalingBehavior(Windows.Foundation.Metadata.MarshalingType.Agile)]
[Windows.Foundation.Metadata.Threading(Windows.Foundation.Metadata.ThreadingModel.Both)]
[Windows.Foundation.Metadata.Activatable(typeof(Windows.Data.Text.IWordsSegmenterFactory), 65536, "Windows.Foundation.UniversalApiContract")]
public sealed class WordsSegmenter

function WordsSegmenter(language)

Public NotInheritable Class WordsSegmenter

Inheritance: Object (Platform::Object, IInspectable) → WordsSegmenter

Attributes: ActivatableAttribute ContractVersionAttribute MarshalingBehaviorAttribute ThreadingAttribute

Windows requirements

Device family	Windows 10 (introduced in 10.0.10240.0 - for Xbox, see UWP features that aren't yet supported on Xbox)
API contract	Windows.Foundation.UniversalApiContract (introduced in v1.0)

Remarks

For languages that do not use spaces between words (such as Japanese, Chinese, Korean, and Thai), use of a segmenter is the only way to obtain individual words for textual processing scenarios such as keyword search.

The language supplied when this object is constructed is matched against the languages with word breakers on the system, and the best word segmentation rules available are used. The language need not be one of the app's supported languages. If there are no supported language rules available specifically for that language, the language-neutral rules are used (an implementation of Unicode Standard Annex #29 Unicode Text Segmentation), and the ResolvedLanguage property is set to "und" (undetermined language).
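A minimal sketch of how the fallback described above might be detected from JavaScript. It assumes the standard WinRT JavaScript projection, in which the ResolvedLanguage property is exposed as `resolvedLanguage`. The helper name `describeResolution` is hypothetical, and the segmenter is passed in so that the logic does not depend on how the object was created.

```javascript
// Hypothetical helper: reports whether a segmenter fell back to the
// language-neutral (UAX #29) rules for the requested language.
// `segmenter` is assumed to expose `resolvedLanguage`, the JavaScript
// projection of the ResolvedLanguage property.
function describeResolution(requestedLanguage, segmenter) {
    const resolved = segmenter.resolvedLanguage;
    return {
        requested: requestedLanguage,
        resolved: resolved,
        // "und" means no language-specific word breaker was matched.
        usedNeutralRules: resolved === "und"
    };
}

// In a Windows app (assumption: the Windows namespace is available):
//   const segmenter = new Windows.Data.Text.WordsSegmenter("ja-JP");
//   const info = describeResolution("ja-JP", segmenter);
//   if (info.usedNeutralRules) { /* no language-specific rules were found */ }
```

Checking the resolved language once, right after construction, lets an app decide early whether segmentation quality is likely to be acceptable for its scenario.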

For keyword search scenarios, always request a segmenter for the language of the text content.

For spellchecking scenarios, some language segmenters (such as German) may return multiple word-stem segments for a single compound word, whereas the spellchecking APIs may expect the compound to be kept together as a single word. For such languages, you can force language-neutral segmenting rules by explicitly requesting the "und" (undetermined language) segmenter. However, doing so greatly reduces word-breaking quality for languages written without spaces. Therefore, first use the Language.Script API to determine whether the content language uses one of the following non-spaced scripts:

Script code	Script name
Bopo	Bopomofo
Brah	Brahmi
Egyp	Egyptian Hieroglyphs
Goth	Gothic
Hang	Hangul
Hira	Hiragana
Hang	Old Hangul
Hani	Han
Ital	Old Italic
Java	Javanese
Kana	Katakana
Khar	Kharoshthi
Khmr	Khmer
Laoo	Lao
Lisu	Lisu
Mymr	Myanmar
Talu	New Tai Lue
Thai	Thai
Tibt	Tibetan
Xsux	Cuneiform
Yiii	Yi

If the content language uses none of these scripts, it should be safe to use "und" for segmentation in spellchecking scenarios.
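The recommendation above can be sketched as a small decision helper. This is an illustration under assumptions, not part of the API: `chooseSpellcheckLanguage` is a hypothetical name, and the script code is assumed to be the four-letter ISO 15924 code reported by Language.Script. Only the codes listed in the table are treated as non-spaced, with Hiragana under its ISO 15924 code `Hira`.

```javascript
// Script codes from the table above (ISO 15924). Hiragana is listed
// under its ISO 15924 code "Hira".
const NON_SPACED_SCRIPTS = new Set([
    "Bopo", "Brah", "Egyp", "Goth", "Hang", "Hira", "Hani", "Ital",
    "Java", "Kana", "Khar", "Khmr", "Laoo", "Lisu", "Mymr", "Talu",
    "Thai", "Tibt", "Xsux", "Yiii"
]);

// Hypothetical helper: picks the language tag to pass to the
// WordsSegmenter constructor in a spellchecking scenario.
function chooseSpellcheckLanguage(contentLanguage, scriptCode) {
    // Non-spaced scripts need language-specific rules to break well.
    if (NON_SPACED_SCRIPTS.has(scriptCode)) {
        return contentLanguage;
    }
    // Otherwise force language-neutral rules so compounds stay whole.
    return "und";
}

// In a Windows app (assumption: the Windows namespace is available):
//   const script = new Windows.Globalization.Language("de-DE").script;
//   const tag = chooseSpellcheckLanguage("de-DE", script);
//   const segmenter = new Windows.Data.Text.WordsSegmenter(tag);
```

Keeping the script list in one set makes it straightforward to adjust if testing with real content shows that other script codes need language-specific rules.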

Constructors

WordsSegmenter(String)

Creates a WordsSegmenter object. See the Remarks for WordsSegmenter for a description of how the language supplied to this constructor is used.

Properties

ResolvedLanguage

Gets the language of the rules used by this WordsSegmenter object.

"und" (undetermined) is returned if language-neutral rules are in use.

Methods

GetTokenAt(String, UInt32)	Determines and returns the word or word stem which contains or follows a specified index into the provided text.
GetTokens(String)	Determines and returns all of the words or word stems in the provided text.
Tokenize(String, UInt32, WordSegmentsTokenizingHandler)	Calls the provided handler with two iterators that iterate through the words prior to and following a given index into the provided text.
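A brief sketch of how GetTokens might feed a keyword search index. It assumes the JavaScript projection of the method (`getTokens`), that the result is an array-like collection, and that each returned segment exposes its text as `text`, as the Windows.Data.Text.WordSegment class does. `extractKeywords` is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: returns the distinct, lower-cased tokens of `text`
// in order of first appearance.
// `segmenter` is assumed to expose getTokens(text), returning an
// array-like collection of segments that each have a `text` property.
function extractKeywords(segmenter, text) {
    const seen = new Set();
    const tokens = segmenter.getTokens(text);
    for (let i = 0; i < tokens.length; i++) {
        seen.add(tokens[i].text.toLowerCase());
    }
    return Array.from(seen);
}

// In a Windows app (assumption: the Windows namespace is available):
//   const segmenter = new Windows.Data.Text.WordsSegmenter("en-US");
//   const keywords = extractKeywords(segmenter, "Segment this text");
```

Passing the segmenter in as a parameter keeps the helper independent of the language chosen at construction, so the same function can serve content in several languages.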
