OneHotVectorizer Class

Categorical transform that can be performed on data before training a model.

Inheritance
nimbusml.internal.core.feature_extraction.categorical._onehotvectorizer.OneHotVectorizer
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
OneHotVectorizer

Constructor

OneHotVectorizer(max_num_terms=1000000, output_kind='Indicator', term=None, sort='ByOccurrence', text_key_values=True, columns=None, **params)

Parameters

columns

A dictionary of key-value pairs, where the key is the output column name and the value is the input column name.

  • Multiple key-value pairs are allowed.

  • Input column type: numeric or string.

  • Output column type: Vector Type.

  • If the output column names are the same as the input column names, simply specify columns as a list of strings.

The << operator can be used to set this value (see Column Operator).

For example:

  • OneHotVectorizer(columns={'out1':'input1', 'out2':'input2'})

  • OneHotVectorizer() << {'out1':'input1', 'out2':'input2'}

For more details see Columns.

max_num_terms

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 1000000.

output_kind

A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer id (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector which is the binary representation of the category.

The default value is "Ind" (written as 'Indicator' in the constructor signature above).
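The difference between the "Bag", "Ind" and "Key" output kinds can be illustrated with a small plain-Python sketch. This is not nimbusml's implementation: the helper names build_dictionary and encode are hypothetical, the dictionary is ordered by first appearance purely for simplicity, and "Bin" is left out because its bit layout is not specified here.

```python
from collections import Counter


def build_dictionary(values):
    """Assign each distinct category a 0-based slot, in order of first appearance."""
    slots = {}
    for value in values:
        slots.setdefault(value, len(slots))
    return slots


def encode(categories, slots, kind):
    """Encode one input row (a list of categories) for the given output kind."""
    if kind == "Key":
        # Integer id between 1 and the number of categories in the dictionary.
        return [slots[c] + 1 for c in categories]
    if kind == "Ind":
        # One indicator vector per slot of the input vector.
        vectors = []
        for c in categories:
            vec = [0] * len(slots)
            vec[slots[c]] = 1
            vectors.append(vec)
        return vectors
    if kind == "Bag":
        # A single vector holding the number of occurrences of each category.
        counts = Counter(categories)
        return [counts.get(c, 0) for c in slots]
    raise ValueError("unsupported kind in this sketch: %r" % kind)


slots = build_dictionary(["red", "green", "blue"])
row = ["red", "blue", "red"]

print(encode(row, slots, "Bag"))  # [2, 0, 1]
print(encode(row, slots, "Ind"))  # [[1, 0, 0], [0, 0, 1], [1, 0, 0]]
print(encode(row, slots, "Key"))  # [1, 3, 1]
```

For a row holding a single category, such as ["green"], the bag vector and the one indicator vector coincide, as noted for "Bag" above.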

term

Optional character vector of terms or categories.

sort

A character string that specifies the sorting criteria.

  • "Occurrence": Sort categories by occurrences. Most frequent is first.

  • "Value": Sort categories by values.

text_key_values

Whether key value metadata should be text, regardless of the actual input type.

params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # OneHotVectorizer
   from nimbusml import FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.categorical import OneHotVectorizer

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()
   data = FileDataStream.read_csv(
       path, sep=',', dtype={
           'spontaneous': str})  # Error with numeric input for ohhv
   print(data.head())
   #    age  case education  induced  parity ... row_num  spontaneous  ...
   # 0   26     1    0-5yrs        1       6 ...       1            2  ...
   # 1   42     1    0-5yrs        1       1 ...       2            0  ...
   # 2   39     1    0-5yrs        2       6 ...       3            0  ...
   # 3   34     1    0-5yrs        2       4 ...       4            0  ...
   # 4   35     1   6-11yrs        1       3 ...       5            1  ...

   # transform usage
   # set output_kind = "Bag" to featurize slots independently for vector type
   # columns
   xf = OneHotVectorizer(
       columns={
           'edu': 'education',
           'in': 'induced',
           'sp': 'spontaneous'})

   # fit and transform
   features = xf.fit_transform(data)
   print(features.head())

   #    age  case  edu.0-5yrs  edu.12+ yrs  edu.6-11yrs education   in.0  in.1 ...
   # 0   26     1         1.0          0.0          0.0    0-5yrs    0.0   1.0 ...
   # 1   42     1         1.0          0.0          0.0    0-5yrs    0.0   1.0 ...
   # 2   39     1         1.0          0.0          0.0    0-5yrs    0.0   0.0 ...

Remarks

The OneHotVectorizer transform passes through a data set, operating on text columns, to build a dictionary of categories. For each row, the entire text string appearing in the input column is defined as a category. The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary. The categorical transform can be applied to one or more columns, in which case it builds a separate dictionary for each column that it is applied to.
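The behaviour described above (one dictionary per column, one slot per category) can be sketched in plain Python. This is only an illustration under stated assumptions, not how nimbusml is implemented: fit_dictionaries and transform are hypothetical helpers, the rows are made-up sample values, and slots are assigned in order of first appearance for simplicity.

```python
def fit_dictionaries(rows, columns):
    """Build a separate category dictionary for each input column."""
    dictionaries = {}
    for out_name, in_name in columns.items():
        slots = {}
        for row in rows:
            # The entire string in the cell is treated as one category.
            slots.setdefault(str(row[in_name]), len(slots))
        dictionaries[out_name] = slots
    return dictionaries


def transform(rows, columns, dictionaries):
    """Replace each input value by an indicator vector over its column's dictionary."""
    result = []
    for row in rows:
        encoded = {}
        for out_name, in_name in columns.items():
            slots = dictionaries[out_name]
            vec = [0.0] * len(slots)  # vector length equals the dictionary size
            vec[slots[str(row[in_name])]] = 1.0
            encoded[out_name] = vec
        result.append(encoded)
    return result


# Made-up sample rows; keys are output column names, values are input column names.
rows = [
    {"education": "0-5yrs", "spontaneous": "2"},
    {"education": "0-5yrs", "spontaneous": "0"},
    {"education": "6-11yrs", "spontaneous": "1"},
]
columns = {"edu": "education", "sp": "spontaneous"}

dictionaries = fit_dictionaries(rows, columns)
features = transform(rows, columns, dictionaries)
print(dictionaries)
print(features[0])
```

Each output column gets a vector whose length matches its own dictionary: two slots for 'edu' and three for 'sp' in this sample.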

OneHotVectorizer does not currently support factor data.

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep
default value: False