Preprocess Text

This article describes a module of the visual interface (preview) for Azure Machine Learning service.

Use the Preprocess Text module to clean and simplify text. It supports these common text processing operations:

  • Removal of stop words
  • Search and replace of specific target strings by using regular expressions
  • Lemmatization, which converts multiple related words to a single canonical form
  • Case normalization
  • Removal of certain classes of characters, such as numbers, special characters, and sequences of repeated characters such as "aaaa"
  • Identification and removal of emails and URLs

The Preprocess Text module currently supports only English.
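
The module is configured entirely in the visual interface and needs no code. Purely as an illustration, a few of the operations listed above can be approximated in plain Python with the standard `re` module. The function name `simple_preprocess`, the tiny `STOP_WORDS` set, and both regular expressions are assumptions made for this sketch; they are not the module's implementation and will not reproduce its output exactly. Lemmatization is left out because it needs language resources beyond the standard library.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}

# Rough patterns; assumptions for this sketch, not the module's actual rules.
EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"\b(?:https?://|ftp://|www\.)\S+")


def simple_preprocess(text, pattern=None, replacement=""):
    """Approximate a few of the listed operations on a single string."""
    # Remove anything that looks like an email address or a URL.
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    # Optional custom find-and-replace with a regular expression.
    if pattern is not None:
        text = re.sub(pattern, replacement, text)
    # Case normalization, then stop-word removal on whitespace tokens.
    text = text.lower()
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `simple_preprocess("The cat is at www.example.com")` returns `"cat at"`: the URL is dropped, the text is lowercased, and "the" and "is" are filtered out.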

Configure Text Preprocessing

  1. Add the Preprocess Text module to your experiment in Azure Machine Learning service. You can find this module under Text Analytics.

  2. Connect a dataset that has at least one column containing text.

  3. Select the language from the Language dropdown list.

  4. Text column to clean: Select the column that you want to preprocess.

  5. Remove stop words: Select this option if you want to apply a predefined stopword list to the text column.

    Stopword lists are language-dependent and customizable.

  6. Lemmatization: Select this option if you want words to be represented in their canonical form. This option is useful for reducing the number of unique occurrences of otherwise similar text tokens.

    The lemmatization process is highly language-dependent.

  7. Detect sentences: Select this option if you want the module to insert a sentence boundary mark when performing analysis.

    This module uses a series of three pipe characters ||| to represent the sentence terminator.

  8. Perform optional find-and-replace operations using regular expressions.

    • Custom regular expression: Define the text you're searching for.
    • Custom replacement string: Define a single replacement value.

  9. Normalize case to lowercase: Select this option if you want to convert ASCII uppercase characters to their lowercase forms.

    If characters aren't normalized, the same word in uppercase and lowercase letters is considered two different words.

  10. You can also remove the following types of characters or character sequences from the processed output text:

    • Remove numbers: Select this option to remove all numeric characters for the specified language. Identification numbers are domain-dependent and language-dependent. If numeric characters are an integral part of a known word, the number might not be removed.

    • Remove special characters: Use this option to remove any non-alphanumeric special characters.

    • Remove duplicate characters: Select this option to remove the extra characters from any sequence in which a character repeats more than twice. For example, a sequence like "aaaaa" would be reduced to "aa".

    • Remove email addresses: Select this option to remove any sequence of the format <string>@<string>.

    • Remove URLs: Select this option to remove any sequence that includes one of the following URL prefixes: http, https, ftp, www.

  11. Expand verb contractions: This option applies only to languages that use verb contractions; currently, English only.

    For example, by selecting this option, you could replace the phrase "wouldn't stay there" with "would not stay there".

  12. Normalize backslashes to slashes: Select this option to map all instances of \\ to /.

  13. Split tokens on special characters: Select this option if you want to break words on characters such as &, -, and so forth. This option can also collapse a special character that repeats more than twice.

    For example, the string MS---WORD would be separated into three tokens, MS, -, and WORD.

  14. Run the experiment.
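
Several of the options above come with concrete examples ("aaaaa", "wouldn't stay there", MS---WORD). The sketch below reproduces just those examples in plain Python so the intended effect is easier to see. All function names, the small `CONTRACTIONS` table, and the regular expressions are hypothetical and chosen for this illustration; the module's own rules are not documented here and may differ, for instance in how it treats case or which characters count as special.

```python
import re

# Hypothetical, very small contraction table; a real list would be far longer.
CONTRACTIONS = {"wouldn't": "would not", "can't": "cannot", "it's": "it is"}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def reduce_repeats(text):
    """Shorten any character repeated more than twice to two copies."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def expand_contractions(text):
    """Replace known contractions with their (lowercase) expanded forms."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def normalize_backslashes(text):
    """Map every backslash to a forward slash."""
    return text.replace("\\", "/")


def split_on_special(text):
    """Split into word tokens and single special-character tokens.

    Assumption: a special character repeated more than twice is first
    collapsed to one occurrence, matching the MS---WORD example.
    """
    collapsed = re.sub(r"([^\w\s])\1{2,}", r"\1", text)
    return re.findall(r"\w+|[^\w\s]", collapsed)
```

With these definitions, `reduce_repeats("aaaaa")` returns `"aa"`, `expand_contractions("wouldn't stay there")` returns `"would not stay there"`, and `split_on_special("MS---WORD")` returns `["MS", "-", "WORD"]`.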

Next steps

See the set of modules available to Azure Machine Learning service.