Microsoft.Spark.ML.Feature Namespace

Classes

Bucketizer

Bucketizer maps a column of continuous features to a column of feature buckets.

Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is used only for single-column usage; splitsArray is used for multiple columns.
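The bucketing idea can be sketched in a few lines of plain Python. This is only an illustration, not the Microsoft.Spark API: the function name bucketize and the sample split points are hypothetical, and it assumes each bucket covers the half-open range [splits[i], splits[i+1]), with the final bucket also including its upper bound.

```python
import bisect
import math


def bucketize(value, splits):
    """Return the index of the bucket that contains value.

    Assumption: splits is strictly increasing and bucket i covers
    [splits[i], splits[i + 1]); the last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    if value == splits[-1]:
        return len(splits) - 2
    # bisect_right counts the split points <= value; subtract 1 for the bucket index.
    return bisect.bisect_right(splits, value) - 1


# Hypothetical split points: three buckets covering the whole real line.
splits = [-math.inf, 0.0, 10.0, math.inf]
buckets = [bucketize(v, splits) for v in [-5.0, 0.0, 3.2, 10.0, 42.0]]
```

A multi-column variant would apply the same function once per column, pairing each column with its own list of split points, which is the role splitsArray plays.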

CountVectorizer
CountVectorizerModel
FeatureBase<T>

FeatureBase shares code among all of the ML.Feature objects. The Scala code implements a few interfaces across all of these objects, so implementing them once here should make it faster to write additional objects.

FeatureHasher
HashingTF

HashingTF maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash value to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
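The hashing trick itself can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it substitutes CRC32 from Python's zlib module for MurmurHash3_x86_32, so the resulting column indices will not match Spark's, and the function name hashing_tf is hypothetical.

```python
import zlib


def hashing_tf(terms, num_features=16):
    """Count terms by hashed column index (a sparse term-frequency vector).

    Assumption: CRC32 stands in for MurmurHash3_x86_32 purely for illustration.
    """
    counts = {}
    for term in terms:
        # A simple modulo maps the hash value to a column index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts


terms = ["a", "b", "a", "c"]
tf = hashing_tf(terms, num_features=16)
```

Distinct terms can collide in the same column, so a larger numFeatures reduces collisions at the cost of a wider vector.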

IDF

Inverse document frequency (IDF). The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.

This implementation supports filtering out terms that do not appear in a minimum number of documents (controlled by the minDocFreq parameter). For terms that appear in fewer than minDocFreq documents, the IDF is set to 0, resulting in TF-IDFs of 0.
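The formula and the minDocFreq filter can be worked through in a small Python sketch. The function name idf_weights and the sample documents are hypothetical; only the formula idf = log((m + 1) / (d(t) + 1)) and the zeroing rule come from the description above.

```python
import math


def idf_weights(docs, min_doc_freq=0):
    """Compute idf = log((m + 1) / (d(t) + 1)) for every term in docs.

    Terms that appear in fewer than min_doc_freq documents get an IDF of 0.
    """
    m = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term at most once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((m + 1) / (d + 1)) if d >= min_doc_freq else 0.0
        for term, d in doc_freq.items()
    }


# Hypothetical corpus of three tokenized documents.
docs = [["spark", "ml"], ["spark", "feature"], ["spark"]]
unfiltered = idf_weights(docs)
filtered = idf_weights(docs, min_doc_freq=2)
```

Here "spark" appears in all three documents, so its IDF is log(4 / 4) = 0, while "ml" appears in one document and gets log(4 / 2) unless minDocFreq filters it out.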

IDFModel

Model fitted by IDF.

Tokenizer

A Tokenizer that converts the input string to lowercase and then splits it on whitespace.
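As a rough sketch of this behavior in plain Python (the function name tokenize is hypothetical, and handling of repeated or leading whitespace may differ from the library's):

```python
def tokenize(text):
    """Lowercase the input, then split it on runs of whitespace."""
    return text.lower().split()


tokens = tokenize("Hello Spark ML")
```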

Word2Vec
Word2VecModel

Interfaces

Identifiable