PcaAnomalyDetector Class
Trains an anomaly detection model using approximate PCA computed with a randomized SVD algorithm.
- Inheritance
nimbusml.internal.core.decomposition._pcaanomalydetector.PcaAnomalyDetector -> PcaAnomalyDetector
nimbusml.base_predictor.BasePredictor -> PcaAnomalyDetector
sklearn.base.ClassifierMixin -> PcaAnomalyDetector
Constructor
PcaAnomalyDetector(normalize='Auto', caching='Auto', rank=20, oversampling=20, center=True, random_state=None, feature=None, weight=None, **params)
Parameters
- feature
see Columns.
- weight
see Columns.
- normalize
Specifies the type of automatic normalization used:
"Auto": if normalization is needed, it is performed automatically. This is the default choice.
"No": no normalization is performed.
"Yes": normalization is performed.
"Warn": if normalization is needed, a warning message is displayed, but normalization is not performed.
Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures the distances between data points are proportional and enables various optimization methods such as gradient descent to converge much faster. If normalization is performed, a MaxMin normalizer is used. It scales values into an interval [a, b], where -1 <= a <= 0, 0 <= b <= 1, and b - a = 1. This normalizer preserves sparsity by mapping zero to zero.
- caching
Whether the trainer should cache the input training data.
- rank
The number of components in the PCA.
- oversampling
Oversampling parameter for randomized PCA training.
- center
If enabled, the data is centered to have zero mean.
- random_state
The seed for random number generation.
- params
Additional arguments sent to the compute engine.
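The interval described for the normalize parameter can be illustrated with a small sketch. This is not nimbusml's implementation; the function name scale_preserving_zero is hypothetical, and the approach (dividing by the value range without shifting) is an assumption that happens to satisfy the stated constraints: b - a = 1, -1 <= a <= 0, 0 <= b <= 1, and zero maps to zero.

```python
def scale_preserving_zero(values):
    """Divide a column by its range (extended to include 0) without shifting.

    Hypothetical helper for illustration only; not part of nimbusml.
    """
    lo = min(min(values), 0.0)   # lower end of the range, never above 0
    hi = max(max(values), 0.0)   # upper end of the range, never below 0
    span = hi - lo
    if span == 0.0:
        # All values are zero: nothing to scale.
        return [0.0 for _ in values]
    return [v / span for v in values]


column = [-2.0, 0.0, 1.0, 6.0]
scaled = scale_preserving_zero(column)
print(scaled)  # [-0.25, 0.0, 0.125, 0.75]
```

Because the data is only divided and never shifted, zero entries stay zero, which is why a normalizer of this kind keeps sparse inputs sparse.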
Examples
###############################################################################
# PcaAnomalyDetector
from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.decomposition import PcaAnomalyDetector
from nimbusml.feature_extraction.categorical import OneHotVectorizer
# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path)
print(data.head())
# age case education induced parity ... row_num spontaneous ...
# 0 26 1 0-5yrs 1 6 ... 1 2 ...
# 1 42 1 0-5yrs 1 1 ... 2 0 ...
# 2 39 1 0-5yrs 2 6 ... 3 0 ...
# 3 34 1 0-5yrs 2 4 ... 4 0 ...
# 4 35 1 6-11yrs 1 3 ... 5 1 ...
# define the training pipeline
pipeline = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    PcaAnomalyDetector(rank=3, feature=['induced', 'edu'])
])
# train, predict, and evaluate
metrics, predictions = pipeline.fit(data).test(
    data, 'case', output_scores=True)
# print predictions
print(predictions.head())
#       Score
# 0  0.026155
# 1  0.026155
# 2  0.018055
# 3  0.018055
# 4  0.004043
# print evaluation metrics
print(metrics)
# AUC DR @K FP DR @P FPR DR @NumPos Threshold @K FP ...
# 0 0.547718 0.084337 0 0.433735 0.009589 ...
Remarks
PcaAnomalyDetector computes an approximate SVD of the data covariance matrix to find the principal components (eigenvectors), using a randomized algorithm so that large datasets can be factorized efficiently.
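The following is a minimal NumPy sketch of a randomized range-finder SVD in the spirit of the references listed under Reference. It is not nimbusml's implementation. The function randomized_pca is hypothetical; its rank, oversampling, and center arguments only mirror the constructor parameters by name. The sketch factorizes the centered data matrix directly, whose right singular vectors coincide with the eigenvectors of the covariance matrix.

```python
import numpy as np


def randomized_pca(X, rank, oversampling=20, center=True, seed=0):
    """Approximate the top `rank` principal components of X (rows = instances).

    Hypothetical helper for illustration only; not part of nimbusml.
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0) if center else np.zeros(X.shape[1])
    Xc = X - mean

    # Sketch size: target rank plus extra columns, capped by the matrix size.
    k = min(rank + oversampling, min(Xc.shape))

    # A random projection captures the dominant column space of Xc.
    omega = rng.standard_normal((Xc.shape[1], k))
    Q, _ = np.linalg.qr(Xc @ omega)

    # An exact SVD of the small k x d matrix yields the approximate components.
    B = Q.T @ Xc
    _, singular_values, Vt = np.linalg.svd(B, full_matrices=False)
    return mean, Vt[:rank], singular_values[:rank]


# Synthetic data: 500 instances near a 2-D subspace of a 5-D space.
rng = np.random.default_rng(42)
latent = rng.standard_normal((500, 2))
mixing = rng.standard_normal((2, 5))
X = latent @ mixing + 0.01 * rng.standard_normal((500, 5))

mean, components, singular_values = randomized_pca(X, rank=2, oversampling=20)
exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:2]
print(components.shape)
print(np.allclose(singular_values, exact))
```

The oversampling columns make the random sketch more likely to capture the leading directions; only the small matrix B is decomposed exactly, which is where the savings on large datasets come from.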
PCA results in a low-rank approximation of a matrix containing the data to be analyzed. Since most of the variance in the data is captured in the subspace spanned by the principal components, the distance to the subspace can be used as a measure to detect outlier instances.
The rank argument specifies how many of the largest principal components are used to approximate the data matrix. A larger score at prediction time indicates that the instance lies farther from the subspace spanned by those components, and is more likely to be an outlier.
Normalization of the dimensions (columns) is required and is turned on by default. Setting the normalize argument to "No" will therefore result in poor performance.
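The idea of scoring an instance by its distance to the principal subspace can be sketched as follows. This is an illustration under stated assumptions, not the library's scoring formula: it uses an exact SVD instead of the randomized one, takes the raw Euclidean norm of the residual as the score, and the name subspace_distance is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: instances lying close to a single direction in 3-D space.
direction = np.array([[1.0, 2.0, -1.0]])
t = rng.standard_normal((200, 1))
train = t @ direction + 0.01 * rng.standard_normal((200, 3))

# Fit: center the data and keep the top principal component (rank = 1).
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
components = Vt[:1]


def subspace_distance(points):
    """Distance from each point to the subspace spanned by `components`.

    Hypothetical helper; the raw Euclidean distance is an assumption.
    """
    centered = points - mean
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)


inlier = 3.0 * direction                 # lies along the training direction
outlier = np.array([[2.0, -1.0, 0.0]])   # orthogonal to the training direction
scores = subspace_distance(np.vstack([inlier, outlier]))
print(scores)
```

The second score is far larger than the first because the outlier has almost no component along the learned direction, so nearly all of its length remains as residual.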
Reference
- Randomized Methods for Computing the Singular Value Decomposition (SVD) of very large matrices
- A randomized algorithm for principal component analysis
- Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
Methods
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
- deep