Principal Component Analysis

Article
05/06/2019

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.

See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
Learn more about Azure Machine Learning.

ML Studio (classic) documentation is being retired and may not be updated in the future.

Computes a set of features with reduced dimensionality for more efficient learning

Category: Data Transformation / Sample and Split

Note

Applies to: Machine Learning Studio (classic) only

Similar drag-and-drop modules are available in Azure Machine Learning designer.

Module overview

This article describes how to use the Principal Component Analysis module in Machine Learning Studio (classic) to reduce the dimensionality of your training data. The module analyzes your data and creates a reduced feature set that captures all the information contained in the dataset, but in a smaller number of features.

The module also creates a transformation that you can apply to new data, to achieve a similar reduction in dimensionality and compression of features, without requiring additional training.

More about Principal Component Analysis

Principal Component Analysis (PCA) is a popular technique in machine learning. It relies on the fact that many types of vector-space data are compressible, and that compression can be most efficiently achieved by sampling.

Added benefits of PCA are improved data visualization, and optimization of resource use by the learning algorithm.

The Principal Component Analysis module in Machine Learning Studio (classic) takes a set of feature columns in the provided dataset, and creates a projection of the feature space that has lower dimensionality. The algorithm uses randomization techniques to identify a feature subspace that captures most of the information in the complete feature matrix. Hence, the transformed data matrices capture the variance in the original data while reducing the effect of noise and minimizing the risk of overfitting.

For general information about principal component analysis (PCA) see this Wikipedia article. For information about the PCA approaches used in this module, see these articles:

Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. Halko, Martinsson, and Tropp, 2010.
Combining Structured and Unstructured Randomness in Large Scale PCACombining Structured and Unstructured Randomness in Large Scale PCA. Karampatziakis and Mineiro, 2013.

How to configure Principal Component Analysis

Add the Principal Component Analysis module to your experiment. You can find it in under Data Transformation, in the Scale and Reduce category.
Connect the dataset you want to transform, and choose the feature columns to analyze.

If it is not already clear which columns are features and which are labels, we recommend that you use the Edit Metadata module to mark the columns in advance.
Number of dimensions to reduce to: Type the desired number of columns in the final output. Each column represents a dimension capturing some part of the information in the input columns.

For example, if the source dataset has eight columns and you type 3, three new columns are returned that capture the information of the eight selected columns. The columns are named Col1, Col2, and Col3. These columns do not map directly to the source columns; instead, the columns contain an approximation of the feature space described by the original columns 1-8.

Tip

The algorithm functions optimally when the number of reduced dimensions is much smaller than the original dimensions.
Normalize dense dataset to zero mean: Select this option if the dataset is dense, meaning it contains few missing values. If selected, the module normalizes the values in the columns to a mean of zero before any other processing.

For sparse datasets, this option should not be selected. If a sparse dataset is detected, the parameter is overridden.
Run the experiment.

Results

The module outputs a reduced set of columns that you can use in creating a model. You can save the output as a new dataset or use it in your experiment.

Optionally, you can save the analysis process as a saved transform, to apply to another dataset using Apply Transformation.

The dataset you apply the transformation to must have the same schema as the original dataset.

Examples

For examples of how Principal Component Analysis is used in machine learning, see the Azure AI Gallery:

Clustering: Find Similar Companies: Uses Principal Component Analysis to reduce the number of values from text mining to a manageable number of features.

Although in this sample PCA is applied using a custom R script, it illustrates how PCA is typically used.

Technical notes

There are two stages to computation of the lower-dimensional components.

The first is to construct a low-dimensional subspace that captures the action of the matrix.
The second is to restrict the matrix to the subspace and then compute a standard factorization of the reduced matrix.

Expected inputs

Name	Type	Description
Dataset	Data Table	Dataset whose dimensions are to be reduced

Module parameters

Name	Type	Range	Optional	Description	Default
Selected columns	ColumnSelection		Required		Selected columns to apply PCA to
Number of dimensions to reduce to	Integer	>=1	Required		The number of desired dimensions in the reduced dataset
Normalize dense dataset to zero mean	Boolean		Required	true	Indicate whether the input columns will be mean normalized for dense datasets (for sparse data parameter is ignored)

Outputs

Name	Type	Description
Results dataset	Data Table	Dataset with reduced dimensions
PCA Transformation	ITransform interface	Transformation which when applied to dataset will give new dataset with reduced dimensions

Exceptions

Exception	Description
Error 0001	Exception occurs if one or more specified columns of data set couldn't be found.
Error 0003	Exception occurs if one or more of inputs are null or empty.
Error 0004	Exception occurs if parameter is less than or equal to specific value.

For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.