Microsoft.Spark.ML.Feature Namespace

Reference

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Classes

Bucketizer	Bucketizer maps a column of continuous features to a column of feature buckets. Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an exception is thrown. The splits parameter is used only for single-column usage, and splitsArray is used for multiple columns.
CountVectorizer
CountVectorizerModel
FeatureBase<T>	FeatureBase shares code among all of the ML.Feature objects. The Scala code implements a few interfaces across all of these objects, so a common base class should make it faster to write additional objects.
FeatureHasher
HashingTF	A HashingTF maps a sequence of terms to their term frequencies using the hashing trick. Currently, Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) is used to calculate the hash code value for the term object. Because a simple modulo is used to transform the hash value to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
IDF	Inverse document frequency (IDF). The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t. This implementation supports filtering out terms that do not appear in a minimum number of documents (controlled by the minDocFreq parameter). For terms that appear in fewer than minDocFreq documents, the IDF is set to 0, resulting in TF-IDFs of 0.
IDFModel	An IDFModel is the model produced by fitting IDF. It rescales term-frequency vectors by the inverse document frequencies to produce TF-IDF vectors.
Tokenizer	A Tokenizer that converts the input string to lowercase and then splits it by white spaces.
Word2Vec
Word2VecModel
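
The descriptions above for Bucketizer, Tokenizer, HashingTF, and IDF can be illustrated with a small, library-independent sketch. This is plain Python, not the Microsoft.Spark API: the function names (`bucketize`, `tokenize`, `hashing_tf`, `idf`) are hypothetical, `zlib.crc32` stands in for MurmurHash3_x86_32, and the bucket boundary rule (each bucket includes its lower bound, and the last bucket also includes its upper bound) is an assumption taken from the Apache Spark documentation rather than from this page.

```python
import math
import zlib
from bisect import bisect_right
from collections import Counter


def bucketize(value, splits):
    """Map a continuous value to a bucket index (hypothetical helper).

    `splits` must be strictly increasing; n + 1 split points define n buckets.
    Assumption: bucket i covers [splits[i], splits[i + 1]), and the last
    bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the range covered by splits")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


def tokenize(text):
    """Lowercase the input and split it on whitespace.

    Note: str.split() collapses runs of whitespace, which may differ from
    the library's behavior for consecutive spaces.
    """
    return text.lower().split()


def hashing_tf(terms, num_features):
    """Count term frequencies per hashed column index (the hashing trick).

    Assumption: zlib.crc32 is only a stand-in hash here; the library
    description names MurmurHash3_x86_32 instead.
    """
    counts = Counter()
    for term in terms:
        # A simple modulo turns the hash value into a column index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


def idf(num_docs, doc_freq, min_doc_freq=0):
    """Compute log((m + 1) / (d(t) + 1)), or 0 below the min_doc_freq filter."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))


if __name__ == "__main__":
    print(bucketize(0.5, [0.0, 1.0, 2.0]))
    terms = tokenize("Spark makes Spark jobs simple")
    print(hashing_tf(terms, num_features=16))  # 16 is a power of two
    print(idf(num_docs=3, doc_freq=1))
```

Choosing a power of two for `num_features` follows the advice in the HashingTF description; distinct terms can still collide on the same index, which is inherent to the hashing trick.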

Interfaces
