Solution ideas
This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.
This article describes how you can use Microsoft AI to improve website content tagging accuracy by combining deep learning and natural language processing (NLP) with data on site-specific search terms.
Architecture
Download a Visio file of this architecture.
Dataflow
Data is stored in various formats, depending on its original source. Data can be stored as files within Azure Data Lake Storage or in tabular form in Azure Synapse or Azure SQL Database.
Azure Machine Learning (Azure ML) can connect to and read from these sources to ingest the data into the NLP pipeline for pre-processing, model training, and post-processing.
NLP pre-processing consists of several steps that consume the data and generalize the text. After the text is split into sentences, NLP techniques such as lemmatization or stemming reduce the tokens to a generalized form.
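The pre-processing step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline the article's architecture uses: the function names and the toy suffix list are assumptions, and a production pipeline would rely on a proper stemmer or lemmatizer from an NLP library.

```python
import re

# Toy suffix list for illustration only (an assumption, not a real stemmer).
_SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(text):
    """Split text on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed tokens per sentence."""
    return [[naive_stem(t) for t in tokenize(s)] for s in split_sentences(text)]


if __name__ == "__main__":
    print(preprocess("Users tagged posts. Searching failed!"))
```

Stemming is crude by design: "tagged" becomes "tagg" here, which is acceptable when the goal is only to map word variants to a shared form before classification.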
Because pre-trained NLP models are already available, the transfer learning approach is to download language-specific embeddings and use an industry-standard model, such as a variation of BERT, for multi-class text classification.
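The idea behind transfer learning can be shown with a deliberately small stand-in: the pre-trained representation stays frozen and only a classification head is trained. The sketch below is not BERT. It swaps in a hypothetical two-dimensional embedding table and a softmax regression head, and every name, vector, and label in it is an assumption made for illustration.

```python
import math

# Hypothetical frozen "pre-trained" word vectors; in practice these would be
# downloaded language-specific embeddings or the outputs of a BERT encoder.
EMBEDDINGS = {
    "python": [1.0, 0.0], "code": [0.9, 0.1], "bug": [0.8, 0.2],
    "recipe": [0.0, 1.0], "oven": [0.1, 0.9], "bake": [0.2, 0.8],
}
LABELS = ["programming", "cooking"]
DIM = 2


def embed(tokens):
    """Average the frozen vectors of known tokens (zeros if none are known)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def softmax(scores):
    """Convert raw scores to probabilities in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(x, weights, bias):
    """Apply the linear head followed by softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, bias)])


def train_head(examples, epochs=200, lr=0.5):
    """Train only the classification head with SGD; embeddings stay frozen."""
    weights = [[0.0] * DIM for _ in LABELS]
    bias = [0.0] * len(LABELS)
    for _ in range(epochs):
        for tokens, label in examples:
            x = embed(tokens)
            probs = class_probs(x, weights, bias)
            for k, name in enumerate(LABELS):
                # Gradient of cross-entropy loss with respect to score k.
                grad = probs[k] - (1.0 if name == label else 0.0)
                bias[k] -= lr * grad
                for j in range(DIM):
                    weights[k][j] -= lr * grad * x[j]
    return weights, bias


def predict(tokens, weights, bias):
    """Return the most probable label for a list of tokens."""
    probs = class_probs(embed(tokens), weights, bias)
    return LABELS[probs.index(max(probs))]


if __name__ == "__main__":
    data = [(["python", "bug"], "programming"), (["code"], "programming"),
            (["recipe", "oven"], "cooking"), (["bake"], "cooking")]
    w, b = train_head(data)
    print(predict(["code", "bug"], w, b))
```

With a real pre-trained model, the same split applies: the encoder supplies the text representation, and the task-specific head is what gets trained on the site's tagged content.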
For NLP post-processing, store the model in a model registry in Azure ML to track model metrics. Text can also be post-processed with deterministic business rules that are defined according to the business goals. Microsoft recommends using ethical AI tools to detect biased language, which helps support the fair training of a language model.
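As an illustration of deterministic business rules, the sketch below filters and normalizes predicted tags. The specific rules (a confidence threshold, a blocklist, and an alias map) and the function name are assumptions chosen for the example. The article does not prescribe any particular rule set.

```python
def apply_business_rules(predictions, threshold=0.6, blocked=frozenset(), aliases=None):
    """Filter and normalize (tag, score) pairs with deterministic rules.

    Returns tag names ordered by descending score, then alphabetically.
    """
    aliases = aliases or {}
    kept = {}
    for tag, score in predictions:
        # Normalize case, then map known aliases to a canonical tag.
        tag = aliases.get(tag.lower(), tag.lower())
        if score < threshold or tag in blocked:
            continue
        # Keep the best score when several predictions map to the same tag.
        kept[tag] = max(score, kept.get(tag, 0.0))
    return sorted(kept, key=lambda t: (-kept[t], t))


if __name__ == "__main__":
    raw = [("JS", 0.9), ("javascript", 0.7), ("html", 0.8), ("spam", 0.95), ("css", 0.4)]
    print(apply_business_rules(raw, blocked={"spam"}, aliases={"js": "javascript"}))
```

Because the rules are plain functions of the model output, the same input always yields the same tags, which makes this stage easy to audit and to test separately from the model.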
The model can be deployed as real-time endpoints through Azure Kubernetes Service, which runs a Kubernetes-managed cluster where the containers are deployed from images that are stored in Azure Container Registry. The endpoints can be made available to a front-end application.
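A real-time endpoint in Azure ML is typically backed by an entry script that exposes an `init()` function, called once when the container starts, and a `run()` function, called for each request. The sketch below follows that convention but uses a hypothetical `KeywordTagger` class in place of a trained model, and the request shape `{"texts": [...]}` is an assumption.

```python
import json

model = None  # populated once per container by init()


class KeywordTagger:
    """Hypothetical stand-in for a trained classifier."""

    KEYWORDS = {"kubernetes": "containers", "sql": "databases"}

    def predict(self, text):
        words = text.lower().split()
        return sorted({tag for word, tag in self.KEYWORDS.items() if word in words})


def init():
    """Load the model once at startup.

    A real script would load the registered model from disk here instead of
    constructing a stand-in.
    """
    global model
    model = KeywordTagger()


def run(raw_data):
    """Score one JSON request of the assumed form {"texts": [...]}."""
    try:
        texts = json.loads(raw_data)["texts"]
        return {"tags": [model.predict(t) for t in texts]}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or a missing "texts" key yields an error payload.
        return {"error": str(exc)}


if __name__ == "__main__":
    init()
    print(run(json.dumps({"texts": ["Deploy to Kubernetes", "sql tuning"]})))
```

Loading the model in `init()` rather than in `run()` keeps per-request latency low, because the load cost is paid once per container instead of once per call.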
Model results can be written to a storage option in file or tabular format and then indexed by Azure Cognitive Search. In this case, the model runs as batch inference and stores the results in the respective datastore.
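The batch path can be sketched as a function that tags a set of records and serializes the results in tabular form, ready to be written to a datastore and picked up by an indexer. The `classify` function is a hypothetical placeholder for the trained model, and the field names `id`, `content`, and `tag` are assumptions rather than a required index schema.

```python
import csv
import io


def classify(text):
    """Hypothetical stand-in for the trained model's prediction call."""
    return "databases" if "sql" in text.lower() else "general"


def batch_tag(records):
    """Tag every (doc_id, text) record and return index-ready rows."""
    return [{"id": doc_id, "content": text, "tag": classify(text)}
            for doc_id, text in records]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "content", "tag"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = batch_tag([("1", "Slow SQL query"), ("2", "Hello world")])
    print(to_csv(rows))
```

Keeping a stable document key such as `id` in every row lets a later batch run update existing entries instead of creating duplicates.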
Components
- Data Lake Storage for Big Data Analytics
- Azure Machine Learning
- Azure Cognitive Search
- Azure Container Registry
- Azure Kubernetes Service (AKS)
Scenario details
Social sites, forums, and other text-heavy Q&A services rely heavily on content tagging, which enables good indexing and user search. Often, however, content tagging is left to users' discretion. Because users don't have lists of commonly searched terms or a deep understanding of the site structure, they frequently mislabel content. Mislabeled content is difficult or impossible to find when it's needed later.
Potential use cases
By using natural language processing (NLP) with deep learning for content tagging, you get a scalable way to create tags across content. As users search for content by keywords, this multi-class classification process enriches untagged content with labels that make substantial portions of text searchable, which improves information retrieval. New incoming content is tagged appropriately by running NLP inference.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Louis Li | Senior Customer Engineer
Next steps
See the product documentation:
- Azure Data Lake Storage Gen2 Introduction
- Azure Machine Learning
- Azure Cognitive Search documentation
- Learn more about Azure Container Registry
- Azure Kubernetes Service
Try these Microsoft Learn modules:
- Introduction to Natural Language Processing with PyTorch
- Train and evaluate deep learning models
- Implement knowledge mining with Azure Cognitive Search
Related resources
See the following related architectural articles:
- Natural language processing technology
- Build a delta lake to support ad hoc queries in online leisure and travel booking
- Query a data lake or lakehouse by using Azure Synapse serverless
- Machine learning operations (MLOps) framework to upscale machine learning lifecycle with Azure Machine Learning
- Introduction to predictive maintenance in manufacturing
- Predictive maintenance solution