GamBinaryClassifier Class

Generalized Additive Models

Inheritance
nimbusml.internal.core.ensemble._gambinaryclassifier.GamBinaryClassifier
nimbusml.base_predictor.BasePredictor
sklearn.base.ClassifierMixin
GamBinaryClassifier

Constructor

GamBinaryClassifier(number_of_iterations=9500, minimum_example_count_per_leaf=10, learning_rate=0.002, normalize='Auto', caching='Auto', unbalanced_sets=False, entropy_coefficient=0.0, gain_conf_level=0, number_of_threads=None, disk_transpose=None, maximum_bin_count_per_feature=255, maximum_tree_output=inf, get_derivatives_sample_rate=1, random_state=123, feature_flocks=True, enable_pruning=True, feature=None, label=None, weight=None, **params)

Parameters

feature

see Columns.

label

see Columns.

weight

see Columns.

number_of_iterations

Total number of iterations over all features.

minimum_example_count_per_leaf

Minimum number of training instances required to form a leaf. That is, the minimal number of documents allowed in a leaf of a regression tree, out of the sub-sampled data. A 'split' means that features in each level of the tree (node) are randomly divided.

learning_rate

Determines the size of the step taken in the direction of the gradient in each step of the learning process. This determines how fast or slow the learner converges on the optimal solution. If the step size is too big, you might overshoot the optimal solution. If the step size is too small, training takes longer to converge to the best solution.
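The effect of the step size can be seen on a toy problem. The following is a minimal sketch of plain gradient descent on f(x) = x**2; it is not the GAM learner's training loop, and the function name `descend` and the three rates tried are chosen purely for illustration.

```python
def descend(learning_rate, start=5.0, steps=20):
    """Run plain gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # move against the gradient
    return x


too_small = descend(0.01)  # shrinks slowly, still far from the minimum at 0
moderate = descend(0.4)    # ends very close to 0
too_large = descend(1.1)   # every step overshoots, so |x| grows

for name, value in [("0.01", too_small), ("0.4", moderate), ("1.1", too_large)]:
    print(f"learning_rate={name}: x after 20 steps = {value:.6g}")
```

With the same number of steps, the small rate has not yet converged, the moderate rate has, and the large rate moves away from the solution.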

normalize

Specifies the type of automatic normalization used:

  • "Auto": if normalization is needed, it is performed automatically. This is the default choice.

  • "No": no normalization is performed.

  • "Yes": normalization is performed.

  • "Warn": if normalization is needed, a warning message is displayed, but normalization is not performed.

Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures that the distances between data points are proportional and enables various optimization methods, such as gradient descent, to converge much faster. If normalization is performed, a MaxMin normalizer is used. It normalizes values in an interval [a, b] where -1 <= a <= 0 and 0 <= b <= 1 and b - a = 1. This normalizer preserves sparsity by mapping zero to zero.
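As an illustration of the interval property described above, here is a minimal sketch of one scaling that satisfies it: dividing each value by the column's range. This is an assumption made for the example, not necessarily the computation nimbusml performs, and the helper name `scale_preserving_zero` is hypothetical. It only applies to a column whose values span zero.

```python
def scale_preserving_zero(values):
    """Divide by (max - min) so the output interval has width 1 and 0 maps to 0.

    Illustrative only: assumes min <= 0 <= max and a non-constant column.
    """
    lo, hi = min(values), max(values)
    if hi == lo or not (lo <= 0 <= hi):
        raise ValueError("sketch only handles non-constant columns that span zero")
    width = hi - lo
    return [v / width for v in values]


column = [-2.0, 0.0, 6.0]
scaled = scale_preserving_zero(column)
a, b = min(scaled), max(scaled)
print(scaled, a, b)  # [-0.25, 0.0, 0.75] -0.25 0.75
```

Here a = -0.25 and b = 0.75, so b - a = 1, and the zero entry stays zero, which is what keeps sparse data sparse.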

caching

Whether the trainer should cache the input training data.

unbalanced_sets

Whether to use derivatives optimized for unbalanced sets.

entropy_coefficient

The entropy (regularization) coefficient between 0 and 1.

gain_conf_level

Tree fitting gain confidence requirement. Should be in the range [0, 1).

number_of_threads

The number of threads to use.

disk_transpose

Whether to utilize the disk or the data's native transposition facilities (where applicable) when performing the transpose.

maximum_bin_count_per_feature

Maximum number of distinct values (bins) per feature.

maximum_tree_output

Upper bound on absolute value of single output.

get_derivatives_sample_rate

Sample each query 1 in k times in the GetDerivatives function.

random_state

The seed of the random number generator.

feature_flocks

Whether to collectivize features during dataset preparation to speed up training.

enable_pruning

Enable post-training pruning to avoid overfitting (a validation set is required).

params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # GamBinaryClassifier
   from nimbusml import Pipeline, FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.ensemble import GamBinaryClassifier
   from nimbusml.feature_extraction.categorical import OneHotVectorizer

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()
   data = FileDataStream.read_csv(path)
   print(data.head())
   #   age  case education  induced  parity  ... row_num  spontaneous  ...
   # 0   26     1    0-5yrs        1       6 ...       1            2  ...
   # 1   42     1    0-5yrs        1       1 ...       2            0  ...
   # 2   39     1    0-5yrs        2       6 ...       3            0  ...
   # 3   34     1    0-5yrs        2       4 ...       4            0  ...
   # 4   35     1   6-11yrs        1       3 ...       5            1  ...

   # define the training pipeline
   pipeline = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       GamBinaryClassifier(feature=['age', 'edu'], label='case')
   ])

   # train, predict, and evaluate
   metrics, predictions = pipeline.fit(data).test(data, output_scores=True)

   # print predictions
   print(predictions.head())
   #   PredictedLabel     Score
   # 0               0 -0.050461
   # 1               0 -0.049737
   # 2               0 -0.049737
   # 3               0 -0.050461
   # 4               0 -0.050552

   # print evaluation metrics
   print(metrics)
   #        AUC  Accuracy  Positive precision  Positive recall  ...
   # 0  0.502957  0.665323                   0                0  ...

Remarks

Generalized additive models (referred to throughout as GAM) are a class of models expressible as an independent sum of individual functions. nimbusml's GAM learner comes in both binary classification (using logit-boosting) and regression (using least squares) flavors.

In contrast to many formal definitions of GAM, this implementation found it convenient to represent learning over stepwise functions, which departs from the usual intention that a GAM's components be smooth functions. In particular, the learner first discretizes features, and the "step" functions learned will step between the discretization boundaries.
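To make the stepwise representation concrete, here is a minimal sketch of how a model of this shape could be evaluated: each feature is mapped to a bin, each bin has a learned value, and the score is the sum of the per-feature values plus an intercept. The boundaries, values, intercept, and helper names below are invented for illustration and are not taken from a trained nimbusml model or its internals.

```python
from bisect import bisect_right

# Hypothetical learned step functions: n boundaries define n + 1 bins per feature.
shape_functions = {
    "age": {"boundaries": [30.0, 40.0], "values": [-0.2, 0.1, 0.4]},
    "parity": {"boundaries": [2.0], "values": [-0.1, 0.3]},
}
intercept = -0.05  # hypothetical bias term


def step_value(shape, x):
    """Look up the constant value of the bin that x falls into."""
    bin_index = bisect_right(shape["boundaries"], x)
    return shape["values"][bin_index]


def gam_score(row):
    """Additive model: intercept plus one independent term per feature."""
    return intercept + sum(
        step_value(shape, row[name]) for name, shape in shape_functions.items()
    )


row = {"age": 34.0, "parity": 1.0}
score = gam_score(row)
print(score)
```

Because each term depends on a single feature, the per-bin values can be inspected or plotted feature by feature, which is the source of the interpretability discussed in these remarks. Any two inputs in the same bin, such as ages 31 and 39 here, receive the same contribution.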

This implementation is based on the paper listed under Reference, but diverges from it in several important respects. Most significantly, in each round of boosting, rather than handling one feature at a time, it makes a round on all features simultaneously. In each round, it chooses only one split point of each feature to change.

In its current form, the GAM learner has the following advantages and disadvantages: on the one hand, it offers ready interpretability combined with expressive power, but on the other, it is currently slow. We recommend using it when the key criterion is interpretability.

Let's talk a bit more about interpretability. The next most interpretable model, we might say, is a linear model. But really, let's say that you have a feature with a coefficient of 3.9293, or something. What do you know? You know that generally, perhaps, larger values for that feature are "better." Great. But is 4 better than 3? Is 5 better than 4? To what degree? Are there "shapes" in the distributions hidden because of the reduction of a complex quantity to a single value? These are questions a linear model fundamentally cannot answer, but a GAM might.

Reference

Generalized additive models, Intelligible Models for Classification and Regression

Methods

decision_function

Returns score values

get_params

Get the parameters for this operator.

predict_proba

Returns probabilities

decision_function

Returns score values

decision_function(X, **params)

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep
default value: False

predict_proba

Returns probabilities

predict_proba(X, **params)