Compute Elementary Statistics
Calculates specified summary statistics for selected dataset columns
Category: Statistical Functions
Applies to: Machine Learning Studio
This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.
This article describes how to use the Compute Elementary Statistics module in Azure Machine Learning Studio, to generate a summary report for your dataset that lists key statistics such as mean, standard deviation, and the range of values for each of the selected columns.
This report is useful for analyzing the central tendency, dispersion, and shape of data.
How to configure Compute Elementary Statistics
Connect a dataset that contains the columns you want to analyze.
Click the Method dropdown list, and choose the type of value that you want to calculate for each column.
See the Supported Statistics section for a full list of available statistics and what they mean.
By default, the value you selected in the Method dropdown list is calculated for all columns in the dataset that have a numeric data type. If any column has values that prevent the statistic from being calculated, an error is raised and the report is not created.
To avoid this error, use the column selector to pick the numeric columns for which you want a report. All columns that you choose must be numeric.
Run the experiment.
The generated report includes the name of each column and the statistic that was calculated. For example, the following table shows statistics generated for the mpg column.
Each time you run Compute Elementary Statistics, it can generate only a single summary statistic for each of the selected columns. However, you can use the Add Columns or Add Rows modules to merge the results into a single table, as in the preceding example.
This module supports the following standard descriptive statistics.
Calculates the sum of the squared deviations of the column values from their mean. Also known as the sum of squares.
Squared deviation is a measure of how far values are dispersed from the mean.
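As a rough illustration of this statistic, the following Python sketch sums the squared distances of each value from the arithmetic mean. The function name and the sample values are hypothetical, and the code is not taken from the module itself.

```python
def deviation_squared(values):
    """Sum of squared deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


# Hypothetical sample column.
mpg = [21.0, 22.8, 18.7, 24.4]
print(deviation_squared(mpg))
```

A column in which every value is identical gives a result of 0.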
Calculates the geometric mean of the column values.
The geometric mean can be used to measure the central tendency of a set of numbers. Compared to the arithmetic mean, it is less affected by a small number of extreme values. It can also be used to compare measurements on different scales, since it effectively normalizes the scales of the numbers being compared. The geometric mean is sometimes used to estimate compound annual growth rates.
The equivalent function in Excel is GEOMEAN.
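The compound growth use case can be sketched in Python as follows. This is a minimal example that assumes strictly positive inputs; the growth factors are made-up values, and the function is an illustration rather than the module's implementation.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers, computed via logarithms."""
    if not values or any(x <= 0 for x in values):
        raise ValueError("geometric mean needs at least one value, all positive")
    return math.exp(sum(math.log(x) for x in values) / len(values))


# Hypothetical yearly growth factors: +10%, +20%, -10%.
growth_factors = [1.10, 1.20, 0.90]
average_factor = geometric_mean(growth_factors)
print(f"Average yearly growth: {average_factor - 1:.2%}")
```

Applying the average factor three times reproduces the same overall growth as the three individual factors.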
Calculates the harmonic mean of the column values.
To compute the harmonic mean, all values are converted to their reciprocals, and then the mean is taken of those values. The harmonic mean is the reciprocal of that mean. If the column values are positive, larger numbers are weighted less than smaller numbers.
For positive values, the harmonic mean is never greater than the geometric mean, which in turn is never greater than the arithmetic mean; the three are equal only when all values are identical. The harmonic mean is useful for averaging variables that represent rates, such as speed (distance over time) or sales per quarter.
The equivalent function in Excel is HARMEAN.
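The reciprocal-based calculation described above might look like this in Python. The speeds are hypothetical, and the sketch assumes all values are positive.

```python
import statistics


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals (positive values only)."""
    return len(values) / sum(1.0 / x for x in values)


# Hypothetical trip: the same distance driven at 40 and then at 60 (km/h).
speeds = [40.0, 60.0]
hm = harmonic_mean(speeds)
gm = statistics.geometric_mean(speeds)
am = statistics.mean(speeds)
print(hm, gm, am)
```

For a trip made of equal distances, the harmonic mean is the true average speed, which is why it is lower than the arithmetic mean of the two speeds.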
Calculates the interquartile range: the difference between the third quartile (75th percentile) and the first quartile (25th percentile) of the column values. Also called the quartile range. When the quartile falls between two numbers, the quartile value is the average of the two values on either side of the cut.
The quartile value divides the column of values into four groups with an equal number of values. Thus, one quarter of the values are less than or equal to the 25th percentile. Three quarters of the values are less than or equal to the 75th percentile. By reviewing the quartile range you can get an idea of how widely spread the data values are.
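Quartile conventions differ between tools, and this article does not state exactly which one the module uses. The following Python sketch assumes one common convention: split the sorted values into a lower and an upper half (leaving out the middle value when the count is odd) and take the median of each half, so that a cut falling between two numbers is averaged.

```python
import statistics


def interquartile_range(values):
    """Q3 minus Q1, using the median of each half of the sorted data."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]               # values below the middle
    upper = data[len(data) - half:]   # values above the middle
    return statistics.median(upper) - statistics.median(lower)


print(interquartile_range([1, 2, 3, 4, 5, 6, 7, 8]))
```

Other conventions, such as interpolated percentiles, can give slightly different results on small columns.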
K-th central moment
Calculates K-th central moment for the column values.
When calculating the K-th central moment, you must also specify the Order, meaning the value of k. The Order must be an integer of 1 or greater (the default is 3); higher-order values are generally not meaningful.
Generally, in descriptive statistics, a moment is a measure that describes the shape of a set of points. Central moments are moments about the mean, and are usually preferred because they describe the shape of the distribution independently of its location. The first central moment is always zero (the mean is the first raw moment), the second central moment is the variance, and the fourth central moment is used to compute kurtosis. Taken together, the moments characterize the distribution of values in the column.
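A k-th central moment can be sketched in Python as the average of the k-th powers of the deviations from the mean. The sketch assumes population-style normalization (dividing by the number of values); the normalization used by the module is not stated in this article.

```python
def central_moment(values, k):
    """Average of (x - mean)**k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


data = [1.0, 2.0, 3.0, 4.0]
print(central_moment(data, 1))  # deviations from the mean cancel out
print(central_moment(data, 2))  # population variance
print(central_moment(data, 3))  # the default Order of 3
```

With this normalization, an order of 2 gives the population variance of the column.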
Finds the maximum value in the column.
Calculates the arithmetic mean of the column values.
The equivalent function in Excel is AVERAGE.
Calculates the mean absolute deviation for the column values.
That is, the mean is computed for the column, and the deviation computed for each value in the column. The average of the absolute values of the individual deviation values is the mean deviation.
This statistic tells you how spread out from the mean your column of numbers is.
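The two steps described above translate into a short Python sketch; the function name is hypothetical.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


print(mean_absolute_deviation([2, 4, 6, 8]))
```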
Returns the median of the column values.
The median is the number in the middle of a column of numbers. If there is an even number of numbers in the column, the median is the average of the two numbers in the middle.
The median, together with the mean and the mode, is one of three statistics that measures central tendency. If the values are symmetrical around the mean, the three numbers will be about the same. However, the median is more robust to outliers than the mean.
Calculates the median deviation for the column.
That is, the median is computed for the column, and the deviation computed for each value in the column. The median value of the absolute values of the individual deviation values is taken.
The median absolute deviation is also known as MAD, and is used to describe the variability of a sample of numbers. MAD tells you how spread out from the median your column of numbers is, and it is more robust to outliers than the standard deviation.
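As a minimal Python sketch of this calculation (the function name and data are illustrative only), the median is taken twice: once over the values and once over their absolute deviations.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the column median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


# The single extreme value (100) barely affects the result.
print(median_absolute_deviation([1, 2, 3, 4, 100]))
```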
Returns the minimum value of the column values.
Finds all modes for the column.
The mode is the value that appears the most in the column. If several values appear the same number of times, the column can have multiple modes.
As a measure of central tendency, mode is more robust to outliers than the mean, and can be used with nominal data too.
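Finding all modes, including ties, can be sketched in Python by counting occurrences and keeping every value that reaches the highest count. The helper name is hypothetical, and the sketch assumes a non-empty column.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs the maximum number of times, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3, 3]))            # two values tie for most frequent
print(modes(["red", "blue", "red"]))     # works for nominal data too
```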
Population standard deviation
Calculates the population standard deviation for the column values.
This statistic assumes that the column values represent the entire population. If your data is only a sample of the population, you must compute the standard deviation by using Sample standard deviation. However, in large datasets, the two statistics return approximately equal values.
The standard deviation is computed as the square root of the column variance. This statistic captures the amount of variability in the column.
Calculates the population variance for the column values.
Variance measures how much a set of numbers is spread out. If variance is zero, all numbers are the same.
This statistic assumes that the column of values represents the entire population. If your data contains only a sample of the values, you should compute variance by using Sample variance.
The equivalent Excel function is
Calculates the product of the column's elements.
To get the product, you multiply all the numbers in the column. The result is not in itself useful as a descriptive statistic, but the function is useful for a variety of other calculations.
Calculates the range of the column values. The range is defined as the maximum value minus the minimum value.
Calculates the sample kurtosis for the column values.
Kurtosis describes the shape of the distribution of values, that is, how peaked or flat the distribution is compared with the normal distribution.
On this scale, the normal distribution has a kurtosis of 0; in other words, the value reported is excess kurtosis.
High kurtosis values indicate that the probability mass is concentrated either around a peak, or in the tail of the distribution.
Negative kurtosis values indicate a relatively flat distribution.
Calculates the sample skewness for the column values.
Skew describes whether the bulk of the values are at the center, shifted to the left, or shifted to the right. Two distributions might have the same mean and standard deviation, yet be shaped very differently. You can use skewness and kurtosis to characterize the shape.
Negative skewness values mean the distribution is skewed to the left.
A skewness of 0 indicates a symmetric distribution, such as the normal distribution.
Positive skewness values mean the distribution is skewed to the right.
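Both shape statistics can be sketched in Python from central moments. This example uses the simple moment-based estimators (third moment over the variance to the power 1.5, and fourth moment over the squared variance minus 3); whether the module applies an additional small-sample correction is not stated in this article, so treat the formulas as an assumption.

```python
def _central_moment(values, k):
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness; undefined (division by zero) for a constant column."""
    m2 = _central_moment(values, 2)
    return _central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    m2 = _central_moment(values, 2)
    return _central_moment(values, 4) / m2 ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 10]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(long_right_tail))
```

The evenly spaced column is symmetric and flatter than a normal distribution, so its skewness is 0 and its excess kurtosis is negative; the column with one large value has positive skewness.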
Sample standard deviation
Calculates the sample standard deviation for the column values.
The standard deviation of the sample measures how spread out the values in the column are from the mean. It can be read as roughly the typical distance between the values in the set and the mean.
This statistic assumes that the column values represent a sample of the population. If your data represents the entire population, you must compute the standard deviation using Population standard deviation.
The equivalent Excel function is STDEV.S.
Calculates the sample variance for the column values.
This method assumes that the column values represent a sample of the population. If the column contains the entire population, you should use Population variance.
The equivalent Excel function is VAR.S.
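The difference between the population and sample versions comes down to the divisor: n for a population, n - 1 for a sample. The following Python sketch uses the standard library `statistics` module on made-up data to show both; it illustrates the definitions rather than the module's internals.

```python
import statistics

# Hypothetical column of eight values with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = statistics.pvariance(data)   # sum of squares divided by n
samp_var = statistics.variance(data)   # sum of squares divided by n - 1
pop_std = statistics.pstdev(data)      # square root of pop_var
samp_std = statistics.stdev(data)      # square root of samp_var

print(pop_var, samp_var)
print(pop_std, samp_std)
```

The sample statistics are always slightly larger than the population statistics, and the gap shrinks as the number of rows grows.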
Calculates the sum of the column values.
The following experiments in the Azure AI Gallery demonstrate how you can create a summary report that contains descriptive statistics for an entire dataset. The summary report contains only general statistics; however, you can save it as a dataset and then add more detailed statistics, using the options in Compute Elementary Statistics.
This section contains implementation details, tips, and answers to frequently asked questions.
The following conditions must be satisfied when using the Compute Elementary Statistics module:
- There must be a sufficient number of data points (rows) to compute the selected statistic. For example, computing Sample standard deviation requires at least two data points; otherwise, the result is NaN.
- Input columns must be numeric or Boolean.
By default, all numeric columns are selected. However, if any numeric columns are marked as categorical, you might get the following error: "Error 0056: Column with name <column name> is not in an allowed category." To correct the error, add an instance of the Edit Metadata module, select the column with the problem, and use the option Remove categorical.
Boolean columns are processed as follows:
- MIN is computed as logical AND.
- MAX is computed as logical OR.
- RANGE checks whether the number of unique values in the column equals 2.
- Missing values are ignored.
- For statistics that require floating-point calculations, True = 1.0 and False = 0.0.
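The Boolean rules above can be mimicked in a short Python sketch. The helper names are hypothetical, `None` stands in for a missing value, and the sketch assumes the column has at least one non-missing value.

```python
def _present(values):
    """Drop missing values (represented here as None)."""
    return [v for v in values if v is not None]


def bool_min(values):
    return all(_present(values))            # logical AND


def bool_max(values):
    return any(_present(values))            # logical OR


def bool_range(values):
    return len(set(_present(values))) == 2  # both True and False occur


def bool_mean(values):
    kept = _present(values)
    return sum(1.0 if v else 0.0 for v in kept) / len(kept)


column = [True, False, None, True]
print(bool_min(column), bool_max(column), bool_range(column), bool_mean(column))
```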
Expected inputs

|Name|Type|Description|
|---|---|---|
|Dataset|Data Table|Input dataset|

Module parameters

|Name|Range|Type|Default|Description|
|---|---|---|---|---|
|Method|List|Elementary statistics method||Selects a statistical method to use in calculations. See the How to use section for the list of values.|
|Column set|any|ColumnSelection|NumericAll|Selects the columns for which to calculate the statistic|
|Order|>=1|Integer|3|Specifies a value for the central moment order (used for the K-th central moment only)|

Outputs

|Name|Type|Description|
|---|---|---|
|Results dataset|Data Table|Output dataset|

Exceptions

|Exception|Description|
|---|---|
|Error 0017|Exception occurs if one or more specified columns have a type that is unsupported by the current module.|
For a list of errors specific to Studio modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.