LatentDirichletAllocationEstimator Class

Definition

The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.

public sealed class LatentDirichletAllocationEstimator : Microsoft.ML.IEstimator<Microsoft.ML.Transforms.Text.LatentDirichletAllocationTransformer>
type LatentDirichletAllocationEstimator = class
    interface IEstimator<LatentDirichletAllocationTransformer>
Public NotInheritable Class LatentDirichletAllocationEstimator
Implements IEstimator(Of LatentDirichletAllocationTransformer)
Inheritance
LatentDirichletAllocationEstimator
Implements

Remarks

Estimator Characteristics

Does this estimator need to look at the data to train its parameters?: Yes
Input column data type: Vector of Single
Output column data type: Vector of Single
Exportable to ONNX: No

Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers semantic structure from text data and ultimately helps answer the question "what is this document about?". It can be used to featurize any text field as a low-dimensional topical vector. LightLDA is an extremely efficient implementation of LDA that incorporates a number of optimization techniques. With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with a 1-million-word vocabulary on a 1-billion-token document set on a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters). The most significant innovation is a super-efficient $O(1)$ Metropolis-Hastings sampling algorithm, whose running cost is agnostic of model size, allowing it to converge nearly an order of magnitude faster than other Gibbs samplers.

In an ML.NET pipeline, this estimator requires the output of some preprocessing as its input. A typical pipeline operating on text would apply text normalization, tokenization, and n-gram production before passing the result to the LDA estimator. See the See Also section for usage examples.

If we take the following three examples of text as data points and use the LDA transform with the number of topics set to 3, we get the results displayed in the table below. Example documents:

  • I like to eat bananas.
  • I eat bananas everyday.
  • First celebrated in 1970, Earth Day now includes events in more than 193 countries, which are now coordinated globally by the Earth Day Network.

Notice how similar the values in the first and second rows are compared to those in the third, and how those values reflect the similarity between the first two (small) bodies of text.

Topic 1  Topic 2  Topic 3
0.5714   0.0000   0.4286
0.5714   0.0000   0.4286
0.2400   0.3200   0.4400
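
The preprocessing steps and the three example documents above can be put together roughly as follows. This is a minimal sketch, not a definitive implementation: the `TextData` and `TransformedTextData` classes and the column names (`Text`, `NormalizedText`, `Tokens`, `Features`) are assumptions chosen for illustration, and the stop-word removal and key-mapping steps are one reasonable way to prepare tokens for n-gram production. The exact topic values depend on the preprocessing and the sampler, so they may differ from the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class LdaExample
{
    // Hypothetical input type: one raw text field per document.
    public class TextData
    {
        public string Text { get; set; }
    }

    // Hypothetical output type: adds the per-document topic vector.
    public class TransformedTextData : TextData
    {
        public float[] Features { get; set; }
    }

    // Fits the pipeline on the three example documents and returns
    // one topic vector per document.
    public static float[][] TopicVectors()
    {
        var mlContext = new MLContext(seed: 0);

        var samples = new List<TextData>
        {
            new TextData { Text = "I like to eat bananas." },
            new TextData { Text = "I eat bananas everyday." },
            new TextData { Text = "First celebrated in 1970, Earth Day now " +
                "includes events in more than 193 countries, which are now " +
                "coordinated globally by the Earth Day Network." },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(samples);

        // Normalize, tokenize, drop stop words, map tokens to keys,
        // produce n-grams, then featurize with LDA (3 topics).
        var pipeline = mlContext.Transforms.Text
            .NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Features", "Tokens", numberOfTopics: 3));

        var transformer = pipeline.Fit(data);
        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(transformer);

        return samples.Select(s => engine.Predict(s).Features).ToArray();
    }

    public static void Main()
    {
        // Print one row of topic weights per document.
        foreach (var vector in TopicVectors())
        {
            Console.WriteLine(string.Join("  ", vector.Select(v => v.ToString("F4"))));
        }
    }
}
```

Each printed row corresponds to one document, with one column per topic, which is the same layout as the table above.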

For more technical details you can consult the following papers.

Check the See Also section for links to usage examples.

Methods

Fit(IDataView)

Trains and returns a LatentDirichletAllocationTransformer.

GetOutputSchema(SchemaShape)

Returns the SchemaShape of the schema which will be produced by the transformer. Used for schema propagation and verification in a pipeline.

Extension Methods

AppendCacheCheckpoint<TTrans>(IEstimator<TTrans>, IHostEnvironment)

Append a 'caching checkpoint' to the estimator chain. This will ensure that the downstream estimators will be trained against cached data. It is helpful to have a caching checkpoint before trainers that take multiple data passes.
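
As a rough illustration of where a caching checkpoint could go, the sketch below places one between the text preprocessing steps and the LDA estimator. The class name, method name, and column names are assumptions made for this example, and whether caching actually helps depends on how many passes the downstream estimators make over the data.

```csharp
using Microsoft.ML;

public static class CachedLdaPipeline
{
    // Builds a text pipeline with a caching checkpoint before LDA.
    // Column names are hypothetical; the input is expected to have a
    // string column named "Text".
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        return mlContext.Transforms.Text
            .NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
            // Estimators appended after this point train on cached data.
            .AppendCacheCheckpoint(mlContext)
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Features", "Tokens", numberOfTopics: 3));
    }
}
```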

WithOnFitDelegate<TTransformer>(IEstimator<TTransformer>, Action<TTransformer>)

Given an estimator, return a wrapping object that will call a delegate once Fit(IDataView) is called. It is often important for an estimator to return information about what was fit, which is why the Fit(IDataView) method returns a specifically typed object rather than just a general ITransformer. At the same time, IEstimator<TTransformer> instances are often formed into pipelines with many objects, so we may need to build a chain of estimators via EstimatorChain<TLastTransformer> in which the estimator whose transformer we want is buried somewhere in the chain. For that scenario, this method lets us attach a delegate that will be called once Fit is called.
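
The scenario described here might look like the sketch below, where the LDA estimator sits at the end of a chain and a delegate captures its typed transformer during fitting. The class name, method name, and column names are assumptions for illustration only, and the input data is expected to have a string column named `Text`.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class OnFitExample
{
    // Fits a text pipeline and returns the LDA transformer captured
    // by the delegate, rather than the whole fitted chain.
    public static LatentDirichletAllocationTransformer FitAndCapture(
        MLContext mlContext, IDataView data)
    {
        LatentDirichletAllocationTransformer captured = null;

        // Wrap the LDA estimator so the delegate runs when it is fit.
        var lda = mlContext.Transforms.Text
            .LatentDirichletAllocation("Features", "Tokens", numberOfTopics: 3)
            .WithOnFitDelegate(fitted => captured = fitted);

        var pipeline = mlContext.Transforms.Text
            .NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
            .Append(lda);

        // Fitting the chain triggers the delegate for the LDA step.
        pipeline.Fit(data);
        return captured;
    }
}
```

Without the delegate, the fitted LDA transformer would only be reachable by digging through the transformer chain returned from Fit.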

Applies to

See also