Topic Classification using Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative statistical model that represents each document as a mixture of topics.
The sample uses an HttpTrigger to accept a dataset from a blob and performs the following tasks:
- Tokenizes the entire set of documents using NLTK
- Removes stop words and lemmatizes the documents using NLTK
- Classifies documents into topics using the LDA APIs from the gensim Python library
- Returns a visualization of topics from the dataset using the pyLDAvis Python library
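The tasks above can be sketched roughly as follows. This is a minimal illustration, not the sample's actual function code: the helper names `preprocess`, `nltk_preprocess` and `train_lda` are hypothetical, the `random_state=0` setting is an assumption made for reproducibility, and the NLTK tokenizer, stopwords and WordNet data are assumed to be downloaded already.

```python
from typing import Callable, Iterable, List


def preprocess(text: str,
               tokenize: Callable[[str], List[str]],
               stop_words: Iterable[str],
               lemmatize: Callable[[str], str]) -> List[str]:
    """Tokenize, drop stop words and non-alphabetic tokens, then lemmatize."""
    stops = set(stop_words)
    tokens = [token.lower() for token in tokenize(text)]
    return [lemmatize(t) for t in tokens if t.isalpha() and t not in stops]


def nltk_preprocess(text: str) -> List[str]:
    """preprocess() wired to NLTK (hypothetical helper; needs NLTK data)."""
    # Imported lazily so that preprocess() itself has no NLTK dependency.
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    return preprocess(text, word_tokenize, stopwords.words("english"),
                      WordNetLemmatizer().lemmatize)


def train_lda(token_lists: List[List[str]], num_topics: int):
    """Fit a gensim LDA model on already preprocessed documents."""
    from gensim import corpora, models  # requires `pip install gensim`

    # Map each unique token to an integer id.
    dictionary = corpora.Dictionary(token_lists)
    # Convert every document to a bag-of-words list of (token_id, count).
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    model = models.LdaModel(corpus=corpus, id2word=dictionary,
                            num_topics=num_topics, random_state=0)
    return model, dictionary, corpus
```

With NLTK and gensim installed, `train_lda([nltk_preprocess(doc) for doc in documents], num_topics=5)` would fit a five-topic model on a list of document strings.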
Getting Started
Deploy to Azure
Prerequisites
- Install Python 3.6+
- Install Functions Core Tools
- Install Docker
- Note: On Windows, use Ubuntu WSL to run the deploy script
Steps
- Click the Deploy to Azure button to deploy resources
or
Deploy through Azure CLI
- Open AZ CLI and run
az group create -l [region] -n [resourceGroupName]
to create a resource group in your Azure subscription (e.g. [region] could be westus2, eastus, etc.)
- Run
az group deployment create --name [deploymentName] --resource-group [resourceGroupName] --template-file azuredeploy.json
- Run
pip install nltk
to install the NLTK Python package
- Run
python3 deploy/download.py
to download the dataset, tokenizers and stopwords from NLTK. Typically these are downloaded to $HOME/nltk_data
- Make sure you have a service principal created. Follow instructions here
- Run
sh deploy/deploy.sh
(in Ubuntu WSL or any shell) to deploy function code and content to blob containers.
Deploy Function App
- Create/Activate virtual environment
- Run
func azure functionapp publish [functionAppName] --build-native-deps
Test
- Send the following body in an HTTP POST request
{
"container_name" : "dataset",
"num_topics" : "5"
}
- Sample response
{
"lda_model_url": "https://ldamdlstore.blob.core.windows.net/ldamodel/ldamodel",
"token_data_url": "https://ldamdlstore.blob.core.windows.net/ldamodel/token_data"
}
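The request and response above could be exercised from Python along these lines. This is a sketch using only the standard library: the helper names are hypothetical, the endpoint URL is a placeholder you must replace with your function's actual URL, and depending on the function's authorization level a function key may also need to be supplied.

```python
import json
import urllib.request
from typing import Tuple

# Placeholder endpoint (assumption): substitute your function's real URL.
FUNCTION_URL = "https://example.invalid/api/your-function"


def build_request(url: str, container_name: str,
                  num_topics: int) -> urllib.request.Request:
    """Build the HTTP POST request carrying the JSON body shown above."""
    # The sample body passes num_topics as a string, so it is kept that way.
    body = {"container_name": container_name, "num_topics": str(num_topics)}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw: bytes) -> Tuple[str, str]:
    """Return (lda_model_url, token_data_url) from the JSON response."""
    payload = json.loads(raw)
    return payload["lda_model_url"], payload["token_data_url"]


def send(request: urllib.request.Request) -> Tuple[str, str]:
    """Send the request and parse the reply (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read())
```

Calling `send(build_request(FUNCTION_URL, "dataset", 5))` against a deployed function would return the two blob URLs used in the notebook.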
Visualizing topics through pyLDAvis
Open the Jupyter notebook VisualizeTopics.ipynb using the instructions here
In the notebook, plug in the values from the sample response for LDA_MODEL_BLOB_URL and TOKEN_DATA_URL