PcaAnomalyDetector Class

Trains an anomaly detection model using approximate PCA computed with a randomized SVD algorithm.

Inheritance
nimbusml.internal.core.decomposition._pcaanomalydetector.PcaAnomalyDetector
nimbusml.base_predictor.BasePredictor
sklearn.base.ClassifierMixin
PcaAnomalyDetector

Constructor

PcaAnomalyDetector(normalize='Auto', caching='Auto', rank=20, oversampling=20, center=True, random_state=None, feature=None, weight=None, **params)

Parameters

feature

see Columns.

weight

see Columns.

normalize

Specifies the type of automatic normalization used:

  • "Auto": if normalization is needed, it is performed automatically. This is the default choice.

  • "No": no normalization is performed.

  • "Yes": normalization is performed.

  • "Warn": if normalization is needed, a warning message is displayed, but normalization is not performed.

Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures that the distances between data points are proportional and enables optimization methods such as gradient descent to converge much faster. If normalization is performed, a MaxMin normalizer is used. It maps values into an interval [a, b] where -1 <= a <= 0 and 0 <= b <= 1 and b - a = 1. This normalizer preserves sparsity by mapping zero to zero.
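The interval constraints above can be illustrated with a small NumPy sketch. It is not the nimbusml implementation. It assumes one simple scheme that satisfies the stated properties: each column is divided by the width of its range after that range has been extended to include zero. The function name zero_preserving_range_scale is hypothetical.

```python
import numpy as np

def zero_preserving_range_scale(X):
    """Divide each column by the width of its range extended to include zero.

    Assumed scheme for illustration only: it yields an interval [a, b] with
    -1 <= a <= 0 <= b <= 1 and b - a = 1, and it maps zero to zero.
    """
    X = np.asarray(X, dtype=float)
    lo = np.minimum(X.min(axis=0), 0.0)   # lower end of the range, at most 0
    hi = np.maximum(X.max(axis=0), 0.0)   # upper end of the range, at least 0
    width = hi - lo
    width[width == 0.0] = 1.0             # all-zero columns are left unchanged
    return X / width

X = np.array([[-2.0, 10.0, 0.0],
              [ 0.0, 40.0, 0.0],
              [ 6.0, 20.0, 0.0]])
scaled = zero_preserving_range_scale(X)
print(scaled)
```

Because the mapping is a pure rescaling with no shift, zero entries stay zero, which is why sparse inputs remain sparse.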

caching

Whether the trainer should cache the input training data.

rank

The number of components in the PCA.

oversampling

Oversampling parameter for randomized PCA training.

center

If enabled, the data is centered to have zero mean.

random_state

The seed for random number generation.

params

Additional arguments sent to the compute engine.

Examples


   ###############################################################################
   # PcaAnomalyDetector
   from nimbusml import Pipeline, FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.decomposition import PcaAnomalyDetector
   from nimbusml.feature_extraction.categorical import OneHotVectorizer

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()

   data = FileDataStream.read_csv(path)
   print(data.head())
   #    age  case education  induced  parity ... row_num  spontaneous  ...
   # 0   26     1    0-5yrs        1       6 ...       1            2  ...
   # 1   42     1    0-5yrs        1       1 ...       2            0  ...
   # 2   39     1    0-5yrs        2       6 ...       3            0  ...
   # 3   34     1    0-5yrs        2       4 ...       4            0  ...
   # 4   35     1   6-11yrs        1       3 ...       5            1  ...

   # define the training pipeline
   pipeline = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       PcaAnomalyDetector(rank=3, feature=['induced', 'edu'])
   ])

   # train, predict, and evaluate
   metrics, predictions = pipeline.fit(data).test(
       data, 'case', output_scores=True)

   # print predictions
   print(predictions.head())
   #      Score
   # 0  0.026155
   # 1  0.026155
   # 2  0.018055
   # 3  0.018055
   # 4  0.004043

   # print evaluation metrics
   print(metrics)
   #        AUC  DR @K FP  DR @P FPR  DR @NumPos  Threshold @K FP  ...
   # 0  0.547718  0.084337          0    0.433735         0.009589  ...

Remarks

PcaAnomalyDetector uses an approximate singular value decomposition (SVD) of the data covariance matrix to find the principal components (eigenvectors). A randomized algorithm makes the factorization efficient for large datasets.

PCA results in a low-rank approximation of a matrix containing the data to be analyzed. Since most of the variance in the data is captured in the subspace spanned by the principal components, the distance to the subspace can be used as a measure to detect outlier instances.

The rank argument specifies how many of the largest principal components are used to approximate the data matrix. A larger score at prediction time indicates that the instance lies farther from the subspace spanned by those components, and is therefore more likely to be an outlier.
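The idea described above can be sketched in plain NumPy. This is a minimal illustration, not the nimbusml implementation: the helper names fit_randomized_pca and anomaly_scores are hypothetical, the randomized range finder shown is one common variant, and the score is assumed to be the Euclidean distance from the centered instance to the subspace of the top rank components.

```python
import numpy as np

def fit_randomized_pca(X, rank, oversampling=20, center=True, seed=0):
    """Approximate the top `rank` principal directions with a random sketch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0) if center else np.zeros(d)
    Xc = X - mean
    # Sketch the row space with rank + oversampling random combinations of rows.
    k = min(rank + oversampling, d, n)
    omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(Xc.T @ omega)          # (d, k) orthonormal basis
    # Solve the small problem exactly, then map back to the original space.
    _, _, Vt = np.linalg.svd(Xc @ Q, full_matrices=False)
    components = (Q @ Vt.T)[:, :rank].T        # (rank, d), orthonormal rows
    return mean, components

def anomaly_scores(X, mean, components):
    """Distance from each centered row to the subspace spanned by `components`."""
    Xc = X - mean
    projected = Xc @ components.T @ components
    return np.linalg.norm(Xc - projected, axis=1)

# Synthetic inliers lie close to a 2-D plane inside a 5-D space.
rng = np.random.default_rng(42)
basis = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0, 0.0]])
inliers = (3.0 * rng.standard_normal((200, 2)) @ basis
           + 0.01 * rng.standard_normal((200, 5)))
mean, components = fit_randomized_pca(inliers, rank=2)

# The last row sits far off the plane, so it should receive the largest score.
new_points = np.vstack([inliers[:5], [[0.0, 0.0, 5.0, 0.0, 0.0]]])
scores = anomaly_scores(new_points, mean, components)
print(scores)
```

In this sketch the oversampling value only widens the random sketch. A wider sketch generally improves the approximation at the cost of more computation.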

Normalization of the dimensions (columns) is required and is turned on by default. Setting the normalize argument to "No" is therefore likely to result in poor performance.

Reference

  • Randomized Methods for Computing the Singular Value Decomposition (SVD) of very large matrices

  • A randomized algorithm for principal component analysis

  • Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep
default value: False