OneHotVectorizer Class
Categorical transform that can be performed on data before training a model.
Inheritance
OneHotVectorizer derives from:
- nimbusml.internal.core.feature_extraction.categorical._onehotvectorizer.OneHotVectorizer
- nimbusml.base_transform.BaseTransform
- sklearn.base.TransformerMixin
Constructor
OneHotVectorizer(max_num_terms=1000000, output_kind='Indicator', term=None, sort='ByOccurrence', text_key_values=True, columns=None, **params)
Parameters
- columns
A dictionary of key-value pairs, where each key is an output column name and each value is the corresponding input column name.
Multiple key-value pairs are allowed.
Input column type: numeric or string.
Output column type:
If the output column names are the same as the input column names, simply specify columns as a list of strings.
The << operator can also be used to set this value (see Column Operator).
For example:
OneHotVectorizer(columns={'out1':'input1', 'out2':'input2'})
OneHotVectorizer() << {'out1':'input1', 'out2':'input2'}
For more details see Columns.
- max_num_terms
An integer that specifies the maximum number of categories to include in the dictionary. The default value is 1000000.
- output_kind
A character string that specifies the kind of output.
"Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.
"Indicator": Outputs an indicator vector. If the input column is a vector of categories, the output contains one indicator vector per slot in the input column.
"Key": Outputs an index. The output is an integer id (between 1 and the number of categories in the dictionary) of the category.
"Binary": Outputs a vector which is the binary representation of the category.
The default value is "Indicator".
- term
Optional character vector of terms or categories.
- sort
A character string that specifies the sorting criteria.
"ByOccurrence": Sort categories by occurrences. Most frequent is first. This is the default.
"ByValue": Sort categories by values.
- text_key_values
Whether key value metadata should be text, regardless of the actual input type.
- params
Additional arguments sent to compute engine.
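The difference between the key, indicator, and bag output kinds described under output_kind can be illustrated with a small sketch. This is plain Python written for illustration, not nimbusml code: the helper names (to_key, to_indicator, to_bag) and the dictionary order are assumptions, and the binary kind is left out because its exact bit layout is not described here.

```python
# Illustrative only: plain Python, not nimbusml internals.
# The dictionary contents and their order are assumptions made for this sketch.
dictionary = ["0-5yrs", "6-11yrs", "12+ yrs"]


def to_key(category):
    """Key output: 1-based integer id of the category in the dictionary."""
    return dictionary.index(category) + 1


def to_indicator(category):
    """Indicator output: one slot per dictionary entry, 1.0 in the category's slot."""
    return [1.0 if term == category else 0.0 for term in dictionary]


def to_bag(categories):
    """Bag output: one slot per dictionary entry, holding the occurrence count."""
    return [float(categories.count(term)) for term in dictionary]


print(to_key("6-11yrs"))
print(to_indicator("6-11yrs"))
print(to_bag(["6-11yrs", "0-5yrs", "6-11yrs"]))
# For a single category, the bag and indicator vectors coincide.
print(to_bag(["6-11yrs"]) == to_indicator("6-11yrs"))
```

The last line checks the statement made for "Bag": when the input holds a single category, the bag vector and the indicator vector are the same.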
Examples
###############################################################################
# OneHotVectorizer
from nimbusml import FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.categorical import OneHotVectorizer
# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(
    path, sep=',', dtype={
        'spontaneous': str})  # Error with numeric input for ohhv
print(data.head())
# age case education induced parity ... row_num spontaneous ...
# 0 26 1 0-5yrs 1 6 ... 1 2 ...
# 1 42 1 0-5yrs 1 1 ... 2 0 ...
# 2 39 1 0-5yrs 2 6 ... 3 0 ...
# 3 34 1 0-5yrs 2 4 ... 4 0 ...
# 4 35 1 6-11yrs 1 3 ... 5 1 ...
# transform usage
# set output_kind = "Bag" to featurize slots independently for vector type
# columns
xf = OneHotVectorizer(
    columns={
        'edu': 'education',
        'in': 'induced',
        'sp': 'spontaneous'})
# fit and transform
features = xf.fit_transform(data)
print(features.head())
# age case edu.0-5yrs edu.12+ yrs edu.6-11yrs education in.0 in.1 ...
# 0 26 1 1.0 0.0 0.0 0-5yrs 0.0 1.0 ...
# 1 42 1 1.0 0.0 0.0 0-5yrs 0.0 1.0 ...
# 2 39 1 1.0 0.0 0.0 0-5yrs 0.0 0.0 ...
Remarks
The OneHotVectorizer transform passes through a data set, operating on text columns, to build a dictionary of categories. For each row, the entire text string appearing in the input column is defined as a category. The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary. The categorical transform can be applied to one or more columns, in which case it builds a separate dictionary for each column that it is applied to.
OneHotVectorizer does not currently support factor data.
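The behavior described in these remarks can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how nimbusml implements the transform: the function names, the toy rows, and the first-appearance dictionary order are all assumptions made for the example.

```python
# Conceptual sketch only; not nimbusml code.
def fit_dictionaries(rows, columns):
    """Build one category dictionary per column.

    Terms are stored in order of first appearance (an assumption for this sketch).
    """
    dictionaries = {col: [] for col in columns}
    for row in rows:
        for col in columns:
            value = str(row[col])  # the entire string is treated as the category
            if value not in dictionaries[col]:
                dictionaries[col].append(value)
    return dictionaries


def transform(rows, dictionaries):
    """Encode each row as one indicator vector per column."""
    encoded_rows = []
    for row in rows:
        encoded = {}
        for col, terms in dictionaries.items():
            value = str(row[col])
            # One slot per dictionary entry, so the length equals the dictionary size.
            encoded[col] = [1.0 if term == value else 0.0 for term in terms]
        encoded_rows.append(encoded)
    return encoded_rows


# Hypothetical toy rows, made up for this example.
rows = [
    {"education": "0-5yrs", "induced": "1"},
    {"education": "6-11yrs", "induced": "2"},
    {"education": "0-5yrs", "induced": "0"},
    {"education": "12+ yrs", "induced": "1"},
]

dictionaries = fit_dictionaries(rows, ["education", "induced"])
encoded = transform(rows, dictionaries)
print(dictionaries)
print(encoded[1])
```

Each column gets its own dictionary, so the same string appearing in two different columns would be encoded independently in each.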
Methods
get_params | Get the parameters for this operator.
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
- deep