Topic Classification using Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a statistical model that classifies a document as a mixture of topics.

The sample uses an HttpTrigger to accept a dataset from a blob and performs the following tasks:

  • Tokenizes the entire set of documents using NLTK
  • Removes stop words and lemmatizes the documents using NLTK
  • Classifies documents into topics using the LDA APIs from the gensim Python library
  • Returns a visualization of topics from the dataset using the pyLDAvis Python library

Getting Started

Deploy to Azure


  • Install Python 3.6+
  • Install Functions Core Tools
  • Install Docker
  • Note: If running on Windows, use Ubuntu WSL to run the deploy script


  • Click the Deploy to Azure button to deploy resources



  • Deploy through Azure CLI

    • Open the Azure CLI and run `az group create -l [region] -n [resourceGroupName]` to create a resource group in your Azure subscription ([region] can be westus2, eastus, etc.)
    • Run `az group deployment create --name [deploymentName] --resource-group [resourceGroupName] --template-file azuredeploy.json`
  • Run `pip install nltk` to install the NLTK Python package

  • Run `python3 deploy/` to download the dataset, tokenizers, and stop words from NLTK. Typically these are downloaded to `$HOME/nltk_data`
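For reference, the download step amounts to fetching NLTK resources like the sketch below (hypothetical — the actual script path is truncated above, and the exact resource names are assumed):

```python
import nltk

# Tokenizer models, stop-word lists, and lemmatizer data (assumed resource names).
RESOURCES = ["punkt", "stopwords", "wordnet"]

def download_nltk_resources(resources=RESOURCES):
    # nltk.download stores data under $HOME/nltk_data by default
    # and is a no-op if the resource is already present.
    for name in resources:
        nltk.download(name)

# download_nltk_resources()  # requires network access
```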

  • Make sure you have a service principal created. Follow instructions here

  • Run `sh deploy/` (in Ubuntu WSL or any shell) to deploy function code and content to blob containers.

  • Deploy Function App


  • Send the following body in an HTTP POST request:

        {
            "container_name" : "dataset",
            "num_topics" : "5"
        }

  • Sample response:

        {
            "lda_model_url": "",
            "token_data_url": ""
        }
  • Visualizing topics through PyLDAVis

    • Open the Jupyter notebook `VisualizeTopics.ipynb` using the instructions here

    • In the notebook, plug in the values from the sample response for LDA_MODEL_BLOB_URL and TOKEN_DATA_URL
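The request/response exchange above can be driven from Python. This is a hypothetical client sketch — the function URL is a placeholder for your own deployed Function App endpoint:

```python
import json
import urllib.request

def classify_topics(function_url, container_name, num_topics):
    """POST the request body shown above and return the parsed JSON response."""
    body = json.dumps({
        "container_name": container_name,
        "num_topics": str(num_topics),
    }).encode("utf-8")
    req = urllib.request.Request(
        function_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Expected keys per the sample response: lda_model_url, token_data_url
        return json.load(resp)

# Example (requires a deployed Function App; URL is a placeholder):
# result = classify_topics(
#     "https://<your-function-app>.azurewebsites.net/api/<function-name>",
#     "dataset", 5)
# print(result["lda_model_url"], result["token_data_url"])
```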
