PCA-Based Anomaly Detection
Creates an anomaly detection model using Principal Component Analysis
Category: Anomaly Detection
This article describes how to use the PCA-Based Anomaly Detection module in Azure Machine Learning Studio to create an anomaly detection model based on Principal Component Analysis (PCA).
This module helps you build a model in scenarios where it is easy to obtain training data from one class, such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.
For example, to detect fraudulent transactions, very often you don't have enough examples of fraud to train on, but have many examples of good transactions. The PCA-Based Anomaly Detection module solves the problem by analyzing available features to determine what constitutes a "normal" class, and applying distance metrics to identify cases that represent anomalies. This lets you train a model using existing imbalanced data.
More about Principal Component Analysis
Principal Component Analysis, which is frequently abbreviated to PCA, is an established technique in machine learning. PCA is frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data.
PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.
For anomaly detection, each new input is analyzed, and the anomaly detection algorithm computes its projection on the eigenvectors, together with a normalized reconstruction error. The normalized error is used as the anomaly score. The higher the error, the more anomalous the instance is.
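The scoring step described above can be written out as follows. This is a sketch under stated assumptions: the symbols $\mu$ (mean of the training data), $V_k$ (matrix whose orthonormal columns are the top $k$ eigenvectors), and the choice to normalize by the squared norm of the centered input are illustrative. This article does not specify the exact normalization the module uses.

```latex
\begin{aligned}
z &= V_k^{\top}(x - \mu) && \text{projection on the eigenvectors} \\
\hat{x} &= \mu + V_k z && \text{reconstruction from the projection} \\
s(x) &= \frac{\lVert x - \hat{x} \rVert_2^{2}}{\lVert x - \mu \rVert_2^{2}} && \text{normalized reconstruction error}
\end{aligned}
```

With this particular normalization, $s(x)$ lies in $[0,1]$: it is near 0 when $x$ lies in the subspace spanned by the components and near 1 when $x$ is almost orthogonal to it.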
For additional information about how PCA works, and about the implementation for anomaly detection, see these papers:
A randomized algorithm for principal component analysis. Rokhlin, Szlam and Tygert.
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (PDF download). Halko, Martinsson and Tropp.
How to configure PCA Anomaly Detection
Add the PCA-Based Anomaly Detection module to your experiment in Studio. You can find this module under Machine Learning, Initialize Model, in the Anomaly Detection category.
In the Properties pane for the PCA-Based Anomaly Detection module, click the Training mode option, and indicate whether you want to train the model using a specific set of parameters, or use a parameter sweep to find the best parameters.
Single Parameter: Select this option if you know how you want to configure the model, and provide a specific set of values as arguments.
Parameter Range: Select this option if you are not sure of the best parameters and want to use a parameter sweep, using the Tune Model Hyperparameters module. The trainer iterates over a range of settings you specify, and determines the combination of settings that produces the optimal results.
Number of components to use in PCA, Range for number of PCA components: Specify the number of output features, or components, that you want the model to produce.
The decision of how many components to include is an important part of experiment design using PCA. General guidance is that you should not include the same number of PCA components as there are variables. Instead, you should start with some smaller number of components and increase them until some criterion is met.
If you are unsure of what the optimum value might be, we recommend that you train the anomaly detection model using the Parameter Range option.
The best results are obtained when the number of output components is less than the number of feature columns available in the dataset.
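One possible criterion for stopping is the cumulative fraction of variance explained. The following sketch runs outside Studio and uses NumPy; the function names and the 0.95 threshold are assumptions chosen for illustration, not settings of the module.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the eigenvalues of the covariance matrix.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

def smallest_k(X, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Clamp in case rounding keeps the final sum just below the threshold.
    return min(k, cumulative.size)

# Synthetic data: six feature columns driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 6))

k = smallest_k(X)
print(k, "components out of", X.shape[1], "feature columns")
```

Here the selected number of components is smaller than the number of feature columns, which matches the guidance above.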
Specify the amount of oversampling to perform during randomized PCA training. In anomaly detection problems, imbalanced data makes it difficult to apply standard PCA techniques. By specifying some amount of oversampling, you can increase the number of target instances.
If you specify 1, no oversampling is performed. If you specify any value higher than 1, additional samples are generated to use in training the model.
There are two options, depending on whether you are using a parameter sweep or not:
- Oversampling parameter for randomized PCA: Type a single whole number that represents the ratio of oversampling of the minority class over the normal class. (Available when using the Single Parameter training method.)
- Range for the oversampling parameter used in randomized PCA: Type a series of numbers to try, or use the Range Builder to select values using a slider. (Available only when using the Parameter Range training method.)
You cannot view the oversampled data set. For additional details of how oversampling is used with PCA, see Technical notes.
Enable input feature mean normalization: Select this option to normalize all input features to a mean of zero. Normalization or scaling to zero is generally recommended for PCA, because the goal of PCA is to maximize variance among variables.
This option is selected by default. Deselect this option if values have already been normalized using a different method or scale.
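To see why centering matters, consider data where one feature has a large mean but little spread. The sketch below uses NumPy on synthetic data; the helper `top_direction` is a hypothetical name introduced for illustration and is not part of the module.

```python
import numpy as np

def top_direction(X, center):
    """Return the first principal direction of X, optionally centering first."""
    data = X - X.mean(axis=0) if center else X
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
# Feature 0 has a large mean but little spread; feature 1 carries the variation.
X = np.column_stack([
    100.0 + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
])

uncentered = top_direction(X, center=False)  # dominated by the mean offset
centered = top_direction(X, center=True)     # follows the high-variance feature
print(np.round(np.abs(uncentered), 3), np.round(np.abs(centered), 3))
```

Without centering, the leading direction points toward the mean of the data rather than along the direction of greatest variance.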
Connect a tagged training dataset, and one of the training modules:
- If you set the Create trainer mode option to Single Parameter, use the Train Model module.
- If you set the Create trainer mode option to Parameter Range, use the Tune Model Hyperparameters module.
If you pass a parameter range to Train Model, it uses only the first value in the parameter range list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range of settings for each parameter, it ignores the values and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value is used throughout the sweep, even if other parameters change across a range of values.
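The last rule, where a single value is held fixed while other parameters vary, can be pictured as a grid of candidate settings. The sketch below is plain Python and only illustrates that behavior; `build_grid` and the parameter names are hypothetical, and the Tune Model Hyperparameters module may sample candidates rather than enumerate a full grid.

```python
from itertools import product

def build_grid(**ranges):
    """Expand parameter ranges into candidate settings.

    A parameter given as a single value is treated as a one-element range,
    so it stays constant across every candidate.
    """
    names = list(ranges)
    value_lists = [
        list(v) if isinstance(v, (list, tuple)) else [v]
        for v in ranges.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# num_components sweeps over a range; oversampling is a single fixed value.
grid = build_grid(num_components=[2, 4, 6], oversampling=2)
for candidate in grid:
    print(candidate)
```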
Run the experiment, or select the module and click Run selected.
When training is complete, you can either save the trained model, or connect it to the Score Model module to predict anomaly scores.
Evaluating the results of an anomaly detection model requires some additional steps:
Ensure that a score column is available in both datasets
If you try to evaluate an anomaly detection model and get the error, "There is no score column in scored dataset to compare", it means you are using a typical evaluation dataset that contains a label column but no probability scores. You need to choose a dataset that matches the schema output for anomaly detection models, which includes the Scored Labels and Scored Probabilities columns.
Ensure that label columns are marked
Sometimes the metadata associated with the label column is removed in the experiment graph. If this happens, when you use the Evaluate Model module to compare the results of two anomaly detection models, you might get the error, "There is no label column in scored dataset", or "There is no label column in scored dataset to compare".
Normalize scores from different model types
Predictions from the PCA anomaly detection model are always in the range [0,1]. In contrast, output from the One-Class SVM module consists of uncalibrated scores that may be unbounded.
Therefore, if you are comparing models based on different algorithms, you must always normalize scores. For an example of normalization among different anomaly detection models, see the Azure AI Gallery.
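As a minimal sketch of one way to put unbounded scores on the same [0,1] scale, the following plain Python function applies min-max scaling. The choice of min-max scaling and the sample values are assumptions for illustration; the Gallery example may use a different method.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so there is no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical uncalibrated scores, such as a One-Class SVM might produce.
svm_scores = [-3.2, 0.5, 7.9, 1.1]
normalized = min_max_normalize(svm_scores)
print(normalized)
```

Min-max scaling preserves the ranking of the instances, so rank-based comparisons between models are unaffected by the rescaling.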
For examples of how PCA is used in anomaly detection, see the Azure AI Gallery:
- Anomaly detection: credit risk: Illustrates how to find outliers in data. This example uses a parameter sweep to find the optimal model. It then applies that model to new data to identify risky transactions that might represent fraud, comparing two different anomaly detection models.
This algorithm uses PCA to approximate the subspace containing the normal class. The subspace is spanned by eigenvectors associated with the top eigenvalues of the data covariance matrix. For each new input, the anomaly detector first computes its projection on the eigenvectors, and then computes the normalized reconstruction error. This error is the anomaly score. The higher the error, the more anomalous the instance. For details on how the normal space is computed, see Wikipedia: Principal Component Analysis.
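The procedure in this paragraph can be sketched with NumPy as shown below. This is not the module's implementation: it uses an exact eigendecomposition rather than randomized PCA, the function names are hypothetical, and dividing the residual by the squared norm of the centered input is one assumed way to normalize the error.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Return the training mean and the top eigenvectors of the covariance matrix."""
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]

def anomaly_score(X, mean, components):
    """Normalized reconstruction error for each row of X (assumed normalization)."""
    Xc = X - mean
    projected = Xc @ components              # projection on the eigenvectors
    reconstructed = projected @ components.T
    residual = np.sum((Xc - reconstructed) ** 2, axis=1)
    total = np.sum(Xc ** 2, axis=1)
    # Guard against division by zero for an input equal to the mean.
    return residual / np.maximum(total, 1e-12)

# "Normal" training data lies close to the line y = 2x.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
normal = np.hstack([t, 2 * t]) + 0.05 * rng.normal(size=(200, 2))
mean, components = fit_pca(normal, n_components=1)

on_line = np.array([[1.0, 2.0]])    # consistent with the normal class
off_line = np.array([[2.0, -1.0]])  # roughly orthogonal to the normal subspace
scores = anomaly_score(np.vstack([on_line, off_line]), mean, components)
print(scores)
```

The point that follows the training pattern receives a score near 0, and the point that departs from it receives a score near 1.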
|Training mode||CreateLearnerMode||List:Single Parameter|Parameter Range||Required||Single Parameter||Specify learner options. Use the SingleParameter option to manually specify all values. Use the ParameterRange option to sweep over tunable parameters.|
|Number of components to use in PCA||Integer||mode:Single Parameter||2||Specify the number of components to use in PCA.|
|Oversampling parameter for randomized PCA||Integer||mode:Single Parameter||2||Specify the accuracy parameter for randomized PCA training.|
|Enable input feature mean normalization||Logic type||List:True|False||Required||False||Specify if the input data is normalized to have zero mean.|
|Range for number of PCA components||ParameterRangeSettings||[1;100]||mode:Parameter Range||2; 4; 6; 8; 10||Specify the range for number of components to use in PCA.|
|Range for the oversampling parameter used in randomized PCA||ParameterRangeSettings||[1;100]||mode:Parameter Range||2; 4; 6; 8; 10||Specify the range for accuracy parameter used in randomized PCA training.|
|Untrained model||ILearner interface||An untrained PCA-based anomaly detection model|
|Error 0017||Exception occurs if one or more specified columns have type unsupported by current module.|
|Error 0062||Exception occurs when attempting to compare two models with different learner types.|
|Error 0047||Exception occurs if number of feature columns in some of the datasets passed to the module is too small.|
For a list of errors specific to Studio modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.