Extract N-Gram Features from Text module reference

This article describes a module of the visual interface (preview) for the Azure Machine Learning service. Use the Extract N-Gram Features from Text module to featurize unstructured text data.

Configuration of the Extract N-Gram Features from Text module

The module supports the following scenarios for using an n-gram dictionary:

Create a new n-gram dictionary

  1. Add the Extract N-Gram Features from Text module to your experiment, and connect the dataset that has the text you want to process.

  2. Use Text column to choose a column of string type that contains the text from which you want to extract n-grams. Because results are verbose, you can process only a single column at a time.

  3. Set Vocabulary mode to Create to indicate that you're creating a new list of n-gram features.

  4. Set N-Grams size to indicate the maximum size of the n-grams to extract and store.

    For example, if you enter 3, unigrams, bigrams, and trigrams will be created.

  5. Use Weighting function to specify how to build the document feature vector and how to extract vocabulary from documents.

    • Binary Weight: Assigns a binary presence value to the extracted n-grams. The value for each n-gram is 1 when it exists in the document, and 0 otherwise.

    • TF Weight: Assigns a term frequency (TF) score to the extracted n-grams. The value for each n-gram is its occurrence frequency in the document.

    • IDF Weight: Assigns an inverse document frequency (IDF) score to the extracted n-grams. The value for each n-gram is the log of the corpus size divided by the n-gram's document frequency in the whole corpus.

      IDF = log(corpus_size / document_frequency)

    • TF-IDF Weight: Assigns a term frequency/inverse document frequency (TF/IDF) score to the extracted n-grams. The value for each n-gram is its TF score multiplied by its IDF score.

  6. Set Minimum word length to the minimum number of letters that can be used in any single word in an n-gram.

  7. Use Maximum word length to set the maximum number of letters that can be used in any single word in an n-gram.

    By default, up to 25 characters per word or token are allowed.

  8. Use Minimum n-gram document absolute frequency to set the minimum occurrences required for any n-gram to be included in the n-gram dictionary.

    For example, if you use the default value of 5, any n-gram must appear at least five times in the corpus to be included in the n-gram dictionary.

  9. Set Maximum n-gram document ratio to the maximum ratio of the number of rows that contain a particular n-gram, over the number of rows in the overall corpus.

    For example, a ratio of 1 would indicate that, even if a specific n-gram is present in every row, the n-gram can be added to the n-gram dictionary. More typically, a word that occurs in every row would be considered a noise word and would be removed. To filter out domain-dependent noise words, try reducing this ratio.

    Important

    The rate of occurrence of particular words is not uniform. It varies from document to document. For example, if you're analyzing customer comments about a specific product, the product name might occur so frequently that it behaves like a noise word, even though it would be a significant term in other contexts.

  10. Select the option Normalize n-gram feature vectors to normalize the feature vectors. If this option is enabled, each n-gram feature vector is divided by its L2 norm.

  11. Run the experiment.
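
The settings in the preceding steps can be illustrated with a minimal Python sketch. It is not the module's implementation. The function names (tokenize, ngrams, build_vocabulary, featurize), whitespace tokenization with lowercasing, the natural logarithm, counting frequency as the number of rows that contain an n-gram, dropping out-of-range words before n-grams are formed, and giving absent n-grams an IDF weight of 0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text, min_len=3, max_len=25):
    # Assumption: lowercase, split on whitespace, keep words within the length bounds.
    return [w for w in text.lower().split() if min_len <= len(w) <= max_len]


def ngrams(tokens, max_n):
    # All n-grams of size 1..max_n, with words joined by underscores.
    return ["_".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, max_n=2, min_df=5, max_ratio=1.0, min_len=3, max_len=25):
    # Assumption: DF is the number of rows (documents) that contain the n-gram.
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(tokenize(doc, min_len, max_len), max_n)))
    n_docs = len(docs)
    # Keep n-grams that are frequent enough but not present in too many rows.
    return {gram: {"DF": count, "IDF": math.log(n_docs / count)}
            for gram, count in sorted(df.items())
            if count >= min_df and count / n_docs <= max_ratio}


def featurize(doc, vocab, max_n=2, weighting="tfidf", normalize=False,
              min_len=3, max_len=25):
    tf = Counter(ngrams(tokenize(doc, min_len, max_len), max_n))
    vector = []
    for gram, stats in vocab.items():
        if weighting == "binary":
            vector.append(1.0 if tf[gram] else 0.0)
        elif weighting == "tf":
            vector.append(float(tf[gram]))
        elif weighting == "idf":
            vector.append(stats["IDF"] if tf[gram] else 0.0)
        else:  # "tfidf": TF score multiplied by IDF score
            vector.append(tf[gram] * stats["IDF"])
    if normalize:
        # Divide the feature vector by its L2 norm (skip all-zero vectors).
        norm = math.sqrt(sum(v * v for v in vector))
        if norm > 0:
            vector = [v / norm for v in vector]
    return vector


docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocabulary(docs, max_n=2, min_df=2, max_ratio=0.9, min_len=1)
features = featurize("the cat ran", vocab, weighting="tfidf", min_len=1)
```

In this toy corpus, lowering max_ratio removes n-grams that appear in a large share of rows, and raising min_df removes rare ones. Both filters act on the vocabulary, not on the individual feature vectors.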

Use an existing n-gram dictionary

  1. Add the Extract N-Gram Features from Text module to your experiment, and connect the dataset that has the text you want to process to the Dataset port.

  2. Use Text column to select the text column that contains the text you want to featurize. By default, the module selects all columns of type string. For best results, process a single column at a time.

  3. Add the saved dataset that contains a previously generated n-gram dictionary, and connect it to the Input vocabulary port. You can also connect the Result vocabulary output of an upstream instance of the Extract N-Gram Features from Text module.

  4. For Vocabulary mode, select the ReadOnly update option from the drop-down list.

    With the ReadOnly option, the input vocabulary stands in for the original corpus. Rather than computing term frequencies from the new text dataset (on the left input), the module applies the n-gram weights from the input vocabulary as is.

    Tip

    Use this option when you're scoring a text classifier.

  5. For all other options, see the property descriptions in the previous section.

  6. Run the experiment.
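
As a rough illustration of ReadOnly mode, the following Python sketch featurizes new text against a fixed vocabulary. The VOCAB rows and their values are hypothetical, and featurize_readonly is an illustrative helper, not part of the module. The sketch assumes whitespace tokenization and TF-IDF weighting: n-grams are counted in the new text, but the IDF weights come unchanged from the vocabulary, and n-grams that aren't in the vocabulary are ignored.

```python
from collections import Counter

# Hypothetical saved vocabulary in the ID / NGram / DF / IDF layout.
VOCAB = [
    {"ID": 0, "NGram": "battery", "DF": 12, "IDF": 1.2},
    {"ID": 1, "NGram": "battery_life", "DF": 7, "IDF": 1.7},
    {"ID": 2, "NGram": "screen", "DF": 20, "IDF": 0.7},
]


def featurize_readonly(text, vocab_rows, max_n=2):
    # Assumption: lowercase and split on whitespace.
    tokens = text.lower().split()
    counts = Counter("_".join(tokens[i:i + n])
                     for n in range(1, max_n + 1)
                     for i in range(len(tokens) - n + 1))
    # The vocabulary is never extended; IDF values are reused as is.
    return [counts[row["NGram"]] * row["IDF"] for row in vocab_rows]


vector = featurize_readonly("Battery life beats screen size", VOCAB)
```

Because the vocabulary is fixed, every row that's featurized this way has the same columns in the same order as the data used for training.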

Score or publish a model that uses n-grams

  1. Copy the Extract N-Gram Features from Text module from the training dataflow to the scoring dataflow.

  2. Connect the Result Vocabulary output from the training dataflow to Input Vocabulary on the scoring dataflow.

  3. In the scoring dataflow, modify the Extract N-Gram Features from Text module and set the Vocabulary mode parameter to ReadOnly. Leave all other settings unchanged.

  4. To publish the experiment, save Result Vocabulary as a dataset.

  5. Connect the saved dataset to the Extract N-Gram Features from Text module in your scoring graph.

Results

The Extract N-Gram Features from Text module creates two types of output:

  • Result dataset: This output is a summary of the analyzed text combined with the n-grams that were extracted. Columns that you didn't select in the Text column option are passed through to the output. For each column of text that you analyze, the module generates these columns:

    • Matrix of n-gram occurrences: The module generates a column for each n-gram found in the total corpus and adds a score in each column to indicate the weight of the n-gram for that row.
  • Result vocabulary: The vocabulary contains the actual n-gram dictionary, together with the term frequency scores that are generated as part of the analysis. You can save the dataset for reuse with a different set of inputs, or for a later update. You can also reuse the vocabulary for modeling and scoring.

Result vocabulary

The vocabulary contains the n-gram dictionary with the term frequency scores that are generated as part of the analysis. The DF and IDF scores are generated regardless of other options.

  • ID: An identifier generated for each unique n-gram.
  • NGram: The n-gram. Spaces or other word separators are replaced by the underscore character.
  • DF: The document frequency score for the n-gram in the original corpus.
  • IDF: The inverse document frequency score for the n-gram in the original corpus.

You can manually update this dataset, but you might introduce errors. For example:

  • An error is raised if the module finds duplicate rows with the same key in the input vocabulary. Be sure that no two rows in the vocabulary have the same n-gram.
  • The input schema of the vocabulary datasets must match exactly, including column names and column types.
  • The ID column and DF column must be of the integer type.
  • The IDF column must be of the float type.
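
Before you reconnect a manually edited vocabulary, checks like the ones in this list can be run ahead of time. The following Python sketch is one way to do that. The validate_vocabulary helper, the EXPECTED_SCHEMA mapping, and the representation of rows as dictionaries are assumptions for illustration. The module performs its own validation and might report errors differently.

```python
# Column names and Python types that mirror the vocabulary layout described above.
EXPECTED_SCHEMA = {"ID": int, "NGram": str, "DF": int, "IDF": float}


def validate_vocabulary(rows):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            problems.append(f"row {i}: columns {sorted(row)} do not match the schema")
            continue
        for col, expected in EXPECTED_SCHEMA.items():
            # Exact type comparison, so a bool isn't accepted as an int
            # and an int isn't accepted as a float.
            if type(row[col]) is not expected:
                problems.append(f"row {i}: column {col} should be {expected.__name__}")
        if row["NGram"] in seen:
            problems.append(f"row {i}: duplicate n-gram {row['NGram']!r}")
        seen.add(row["NGram"])
    return problems


rows = [
    {"ID": 0, "NGram": "battery", "DF": 12, "IDF": 1.2},
    {"ID": 1, "NGram": "battery_life", "DF": 7, "IDF": 1.7},
]
problems = validate_vocabulary(rows)
```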

Note

Don't connect the data output to the Train Model module directly. Remove free text columns before they're fed into the Train Model module. Otherwise, the free text columns will be treated as categorical features.
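
The effect of removing free text columns can be sketched in Python as follows. The rows, the column names (review, label, battery, battery_life), and the drop_text_columns helper are hypothetical. In the visual interface, you would do this step with a column-selection module, not with code.

```python
def drop_text_columns(rows, text_columns):
    # Return new rows without the free text columns; the input rows are not modified.
    drop = set(text_columns)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Hypothetical result rows: the original text, a label, and two n-gram features.
result_rows = [
    {"review": "great battery life", "label": 1, "battery": 1.0, "battery_life": 1.0},
    {"review": "screen cracked", "label": 0, "battery": 0.0, "battery_life": 0.0},
]
training_rows = drop_text_columns(result_rows, ["review"])
```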

Next steps

See the set of modules available to the Azure Machine Learning service.