Azure Machine Learning Package for Text Analytics

The Azure Machine Learning Package for Text Analytics (AMLPTA) is a Python package that simplifies building and deploying high-quality machine learning and deep learning text analytics models in Azure Machine Learning.

The AMLPTA package supports the following scenarios:

  • Text Classification

    • Binary, multi-class, or multi-label
    • Scikit-learn traditional machine learning algorithms
    • Keras/TensorFlow convolutional or recurrent neural networks
  • Custom Entity Extraction

    • Keras/TensorFlow recurrent neural networks
    • Conditional Random Fields (CRF) models
  • Word Embedding

    • Word2Vec word embedding model training
    • FastText word embedding model training

For more information about each module and class in this package, see the AMLPTA reference documentation.

Why use this package?

The main advantages of Azure Machine Learning Package for Text Analytics are:

  • High quality. AMLPTA supports state-of-the-art algorithms and includes default pipelines that have been shown to work well across a wide variety of tasks, so you can build accurate text analytics models.

  • Rapid time to solution. The AMLPTA API automates important text analytics tasks so that you can build and deploy a text analytics pipeline more easily and quickly.

  • Flexibility. The powerful and composable Python API gives you out-of-the-box text analytics capabilities with the full power to customize and fine-tune your models.

What's included

Feature Engineering

Feature engineering modules create additional relevant features from the existing raw features in the data to increase the predictive power of the learning algorithm.

The input features of the classifier include:

  • Word n-grams
  • Character n-grams
  • Word2Vec embedding features
  • FastText embedding features
  • Dictionary-based features
  • Pre-trained model prediction as features
  • Part-of-speech tags
  • Orthographic (capitalization) features
  • Embedding clustering
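
As an illustration of the first two feature types in the list above, the following minimal sketch computes word and character n-gram counts with the Python standard library only. It does not use the AMLPTA API; the function names (`word_ngrams`, `char_ngrams`, `ngram_counts`) and the `w:`/`c:` key prefixes are hypothetical and chosen only for this example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the list of character n-grams in `text`, spaces included."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, word_n=2, char_n=3):
    """Combine word and character n-gram counts into one sparse feature dict."""
    features = Counter()
    for gram in word_ngrams(text, word_n):
        features["w:" + gram] += 1  # hypothetical prefix for word n-grams
    for gram in char_ngrams(text, char_n):
        features["c:" + gram] += 1  # hypothetical prefix for character n-grams
    return features


features = ngram_counts("the cat sat on the mat")
print(features.most_common(3))
```

A classifier would typically consume such counts as a sparse vector, one dimension per distinct n-gram seen in the training data.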

Text Preprocessing

The text preprocessing steps include:

  • Tokenization
  • Case normalization
  • Lemmatization
  • Stemming
  • Stopword removal
  • Number removal
  • Special character removal
  • Dictionary-based normalization
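
To make several of the steps above concrete, here is a minimal standard-library sketch of tokenization, case normalization, stopword removal, number removal, and special character removal. It is not the AMLPTA implementation: the `preprocess` function, its parameters, and the tiny `STOPWORDS` set are assumptions made for illustration, the regular expression handles ASCII text only, and lemmatization, stemming, and dictionary-based normalization are left out because they need language resources.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text, remove_numbers=True, remove_stopwords=True):
    """Lowercase, strip special characters, tokenize, and filter tokens."""
    text = text.lower()                       # case normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special character removal (ASCII only)
    tokens = text.split()                     # whitespace tokenization
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]     # number removal
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return tokens


print(preprocess("The 3 models are trained in Azure, on GPUs!"))
# prints ['models', 'are', 'trained', 'azure', 'gpus']
```

The order of the steps matters: lowercasing first lets the stopword lookup use a lowercase-only list, and stripping punctuation before tokenizing keeps tokens such as "azure," from escaping the filters.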

Prerequisites

You will need the following to install and use Azure Machine Learning Package for Text Analytics:

  1. An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.

  2. A provisioned Windows Deep Learning Virtual Machine (DLVM) on Azure or a Data Science Virtual Machine (DSVM).

    • To provision a DLVM, follow these steps.
    • To provision a DSVM, follow these steps.
  3. The following accounts must be set up and the following application installed:

    • An Azure Machine Learning Experimentation account
    • An Azure Machine Learning Model Management account
    • Azure Machine Learning Workbench installed on the DLVM

    If these accounts don't exist yet or if you haven't installed Workbench, follow the Azure Machine Learning Quickstart and Workbench installation article.

    The Azure Machine Learning Package for Text Analytics runs on Azure Machine Learning Workbench. The DLVM comes with a pre-downloaded installable Workbench. Click on the AzureML Workbench Setup.msi icon on the DLVM desktop to install Workbench.

Install this package

After completing the prerequisites, you can download and install the Azure Machine Learning Package for Text Analytics.

  1. Log in to the Deep Learning Virtual Machine (DLVM) or Data Science Virtual Machine (DSVM).

  2. Download the package onto the DLVM or DSVM.

  3. Navigate to the download location and unzip the file to get distribution artifacts under the following directory structure. Notice the install folder.

    install/
        setup.bat
    notebooks/
    README.md
    <licenses...>
    
  4. Launch Azure Machine Learning Workbench. This package must be installed inside the Workbench environment.

  5. From the Workbench menu, select File > Open Command Prompt.

  6. Navigate to the install folder under the unzipped distribution.

  7. At the prompt, run the setup.bat script to install the dependencies and package.

    • On a DLVM, enter:

      pip install tensorflow-gpu==1.6
      setup.bat
      
    • On a DSVM, enter:

      pip install tensorflow==1.6
      setup.bat
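
After the install commands above finish, you may want to confirm which TensorFlow distribution ended up in the environment. The sketch below is one way to do that, assuming a Python 3.8 or later interpreter where `importlib.metadata` is available; on older interpreters, `pip show tensorflow` reports the same information. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The install steps pin TensorFlow 1.6: tensorflow-gpu on a DLVM, tensorflow on a DSVM.
for name in ("tensorflow-gpu", "tensorflow"):
    version = installed_version(name)
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

Run the script from the Workbench command prompt so that it inspects the same environment that setup.bat installed into.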
      

Get started using Jupyter notebooks

The best way to familiarize yourself with the package is through the sample Jupyter notebooks and Python scripts distributed with it. The sample notebooks explain how to build a model with the package and how to customize that model to your data and needs.

The notebooks cover these scenarios:

  • Text classification
  • Custom entity extraction

The notebooks should be run in the Azure Machine Learning Workbench environment.

To open the text analytics project in Workbench:

  1. Launch Azure Machine Learning Workbench.

  2. From the Workbench menu, select File > Add Existing Folder as Project.

  3. In the Project Directory field, navigate to the notebooks directory. To access the data and documentation referenced by the notebooks, open the entire notebooks directory.

    Note

    Because Workbench projects are capped at 25 MB, store your datasets in a different location from the project files. For good practices on file storage within AML Workbench, refer to this document.
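
To check a folder against the 25 MB project cap mentioned in the note above before adding it as a project, a short standard-library sketch such as the following can help. The function names and the constant are hypothetical, and interpreting "25 MB" as 25 × 1024 × 1024 bytes is an assumption; how Workbench itself measures project size is not specified here.

```python
import os

# 25 MB cap from the note above; treating MB as 1024 * 1024 bytes is an assumption.
PROJECT_SIZE_CAP_BYTES = 25 * 1024 * 1024


def folder_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def is_within_cap(root, cap=PROJECT_SIZE_CAP_BYTES):
    """Return True if the folder at `root` is no larger than `cap` bytes."""
    return folder_size_bytes(root) <= cap
```

For example, calling `is_within_cap("notebooks")` from the directory that contains the unzipped distribution reports whether the notebooks folder fits under the cap; if it does not, move the datasets elsewhere first.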

To launch a notebook session from the command prompt:

  1. From the Workbench menu, select File > Open Command Prompt.

  2. At the prompt, go to the project root folder.

  3. At the prompt, start a notebook session using this Azure ML CLI command:

    az ml notebook start
    

Your default browser launches automatically, with the Jupyter server pointing to the project home directory. You can now open the notebook of your choice and start interacting with it.

By running samples this way, you get the advantages of the Run History and Logging capabilities of Workbench. Additionally, you can store larger files in an output folder that gets special treatment from Workbench.

Learn more about running Jupyter notebooks from Workbench in this document.

Learn how to build and deploy a text classification model in this document.

For more information about each module and class in this package, see the AMLPTA reference documentation.

Reporting issues

For help with any issues you encounter using this package, contact amltap@microsoft.com.