Binner Class

Normalizes columns as specified below.

Inheritance
nimbusml.internal.core.preprocessing.normalization._binner.Binner
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
Binner

Constructor

Binner(num_bins=1024, fix_zero=True, max_training_examples=1000000000, columns=None, **params)

Parameters

columns

A dictionary of key-value pairs, where each key is an output column name and each value is the corresponding input column name.

  • Multiple key-value pairs are allowed.

  • Input column type: numeric or Vector Type.

  • Output column type: numeric or Vector Type.

  • If the output column names are the same as the input column names, simply specify columns as a list of strings.

The << operator can be used to set this value (see Column Operator).

For example:

  • Binner(columns={'out1':'input1', 'out2':'input2'})

  • Binner() << {'out1':'input1', 'out2':'input2'}

For more details see Columns.

num_bins

Maximum number of bins; a power of 2 is recommended. The default is 1024.

fix_zero

Whether to map zero to zero, preserving sparsity.

max_training_examples

Max number of examples used to train the normalizer.

params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # Binner
   import numpy
   from nimbusml import FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.preprocessing.normalization import Binner

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()
   data = FileDataStream.read_csv(
       path,
       sep=',',
       numeric_dtype=numpy.float32)  # read as float32; integer input causes an error
   print(data.head())
   #    age  case education  induced  parity  pooled.stratum  row_num  ...
   # 0  26.0   1.0    0-5yrs      1.0     6.0             3.0      1.0  ...
   # 1  42.0   1.0    0-5yrs      1.0     1.0             1.0      2.0  ...
   # 2  39.0   1.0    0-5yrs      2.0     6.0             4.0      3.0  ...
   # 3  34.0   1.0    0-5yrs      2.0     4.0             2.0      4.0  ...
   # 4  35.0   1.0   6-11yrs      1.0     3.0            32.0      5.0  ...

   xf = Binner(columns={'in': 'induced', 'sp': 'spontaneous'})

   # fit and transform
   features = xf.fit_transform(data)

   # print features
   print(features.head())
   #    age  case education   in  induced  parity  ... row_num   sp  ...
   # 0  26.0   1.0    0-5yrs  0.5      1.0     6.0 ...     1.0  1.0  ...
   # 1  42.0   1.0    0-5yrs  0.5      1.0     1.0 ...     2.0  0.0  ...
   # 2  39.0   1.0    0-5yrs  1.0      2.0     6.0 ...     3.0  0.0  ...
   # 3  34.0   1.0    0-5yrs  1.0      2.0     4.0 ...     4.0  0.0  ...
   # 4  35.0   1.0   6-11yrs  0.5      1.0     3.0 ...     5.0  0.5  ...

Remarks

In linear classification algorithms, instances are viewed as vectors in a multi-dimensional space. Because the range of raw feature values can vary widely, some objective functions do not work properly without normalization. For example, if one feature has a broad range of values, the distance between points is governed by that feature. The range of every feature should therefore be normalized so that each contributes approximately proportionately to the final distance. This can provide significant speedup and accuracy benefits. All the linear algorithms in nimbusml (LogisticRegressionClassifier, AveragedPerceptronBinaryClassifier, etc.) normalize features before training by default.

Binner creates equi-density bins and then normalizes every value in a bin by dividing by the total number of bins. The number of bins can be set by the user through num_bins; the default is 1024.
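To make the idea of equi-density binning concrete, here is a minimal NumPy sketch. It is not the nimbusml implementation: the function name equi_density_bin is hypothetical, the bin edges are taken from plain quantiles, and the bin index is divided by the number of bins minus one so that the outputs span [0, 1]. That scaling is an assumption chosen because it is consistent with the sample output in the Examples section, where a column with three distinct values maps to 0, 0.5 and 1; the library's exact edge placement and scaling may differ.

```python
import numpy as np


def equi_density_bin(values, num_bins=4):
    """Illustrative quantile binning; a sketch, not nimbusml's algorithm."""
    values = np.asarray(values, dtype=np.float32)
    # Interior quantile cut points, so each bin holds roughly the same
    # number of examples. Duplicate cut points collapse when the data
    # has only a few distinct values.
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    edges = np.unique(np.quantile(values, quantiles))
    # Bin index of each value = number of cut points at or below it.
    idx = np.searchsorted(edges, values, side="right")
    n_bins = len(edges) + 1
    # Assumed scaling: divide by (bins - 1) so the result lies in [0, 1].
    return idx / max(n_bins - 1, 1)


scaled = equi_density_bin([0, 1, 2, 0, 1, 2], num_bins=3)
print(scaled)
```

Because the cut points come from quantiles rather than from evenly spaced values, a heavily skewed column still spreads its examples across the bins instead of crowding most of them into one.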

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep
default value: False