Topic Classification using Latent Dirichlet Allocation

Code Sample
10/15/2019

Browse code

Latent Dirichlet Allocation (LDA) is a statistical model that classifies a document as a mixture of topics.

The sample uses a HttpTrigger to accept a dataset from a blob and performs the following tasks:

Tokenization of the entire set of documents using NLTK
Removes stop words and performs lemmatization on the documents using NLTK.
Classifies documents into topics using LDA API's from gensim Python library
Returns a visualization of topics from the dataset using PyLDAVis Python library

Getting Started

Deploy to Azure

Prerequisites

Install Python 3.6+
Install Functions Core Tools
Install Docker
Note: If run on Windows, use Ubuntu WSL to run deploy script

Steps

Click Deploy to Azure Button to deploy resources

Deploy through Azure CLI
- Open AZ CLI and run az group create -l [region] -n [resourceGroupName] to create a resource group in your Azure subscription (i.e. [region] could be westus2, eastus, etc.)
- Run az group deployment create --name [deploymentName] --resource-group [resourceGroupName] --template-file azuredeploy.json
Run pip install nltk to install the NLTK Python package
Run python3 deploy/download.py to download dataset, tokenizers and stopwords from NLTK. Typically this will get downloaded to $HOME/nltk_data
Make sure you have a service principal created. Follow instructions here
Run sh deploy/deploy.sh (in Ubuntu WSL or any shell) to deploy function code and content to blob containers.
Deploy Function App
- Create/Activate virtual environment
- Run func azure functionapp publish [functionAppName] --build-native-deps

Test

Send the following body in a HTTP POST request

{
    "container_name" : "dataset",
    "num_topics" : "5" 
}

Sample response

{
    "lda_model_url": "https://ldamdlstore.blob.core.windows.net/ldamodel/ldamodel",
    "token_data_url": "https://ldamdlstore.blob.core.windows.net/ldamodel/token_data"
}

Visualizing topics through PyLDAVis
- Open the jupyter notebook VisualizeTopics.ipynb file using instructions here
- In the notebook, plugin values from sample response for LDA_MODEL_BLOB_URL and TOKEN_DATA_URL

Inline-style: alt text