# Evaluate Probability Function

*Fits a specified probability distribution function to a dataset*

Category: Statistical Functions

Note

**Applies to**: Machine Learning Studio

This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.

## Module overview

This article describes how to use the Evaluate Probability Function module in Azure Machine Learning Studio, to calculate statistical measures that describe a column’s distribution, such as the Bernoulli, Pareto, or Poisson distributions.

To use this module, connect a dataset that contains at least one column of numerical values, and choose a probability distribution to test. The module returns a data table that contains values from the specified probability function.

You can compute any of these values for the chosen probability distribution:

- cumulative distribution function (**cdf**)
- inverse cumulative distribution function (**InverseCdf**)
- probability density function (**Pdf**)
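As a rough illustration of what these three values mean (not the module's own implementation), the following Python sketch evaluates all three functions for a standard normal distribution with the standard library's `statistics.NormalDist`. The variable names are chosen for this example only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
dist = NormalDist(mu=0.0, sigma=1.0)

x = 1.0
cdf_value = dist.cdf(x)                  # probability that a sample is <= x
pdf_value = dist.pdf(x)                  # density of the distribution at x
inverse_value = dist.inv_cdf(cdf_value)  # value whose cdf equals cdf_value

print(f"cdf({x}) = {cdf_value:.6f}")
print(f"pdf({x}) = {pdf_value:.6f}")
print(f"inverse cdf({cdf_value:.6f}) = {inverse_value:.6f}")
```

Applying the inverse cdf to the cdf value recovers the original `x`, which is a quick way to check that the two functions are being used consistently.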

### Why is the probability distribution useful?

When you evaluate your data against a probability distribution, you are mapping column values against a set of values with known properties. By knowing whether your data corresponds to one of these well-known distributions, you might be able to infer other properties of your data. In general, you can get better predictions from a model when you can identify the distribution that fits the data best.

The choice of probability distribution function depends on the data and the variables that are being measured. For example, some distributions are designed to describe probabilities of discrete values; others are intended for use only with continuous numerical variables. For some distributions, you must also know in advance an expected mean, degrees of freedom, and so forth. For details, see Supported Probability Distributions.

## How to configure Evaluate Probability Function

All options change depending on the type of probability distribution you want to compute. If you change the probability distribution method, other selections you might have made are reset.

Therefore, be sure to choose the **Distribution** option first!

The dataset used as input should contain numerical data. Other types of data are ignored.

For each analysis, you can apply a single probability distribution method. To compute a different probability distribution, add a separate instance of the module for each distribution you intend to test.

Add the Evaluate Probability Function module to your experiment. You can find this module in the Statistical Functions category in Azure Machine Learning Studio.

Connect a dataset that contains at least one column of numbers.

Use the **Distribution** option to select the kind of probability distribution that you want to calculate. See Supported Probability Distributions for a list of options and their required arguments.

Set any parameters that are required by the distribution.

Choose one of three statistics to create: the cumulative distribution function (**cdf**), inverse cumulative distribution function (**InverseCdf**), or probability density function (**pdf**).

See the Technical notes section for definitions.

Use the column selector to choose the columns over which to compute the selected probability distribution.

All the columns you select must have a numerical data type.

The range of data in the column must also be valid, given the selected probability function. Otherwise, an error or NaN result may occur.

For sparse columns, any values that correspond to background zeros will not be processed.
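To make the point about valid ranges concrete, here is a minimal Python sketch. It assumes an exponential distribution parameterized by a rate, and it assumes that an out-of-range input surfaces as NaN; the function name and the choice of NaN over an error are illustrative, not a description of the module's internals.

```python
import math

def exponential_inverse_cdf(p, lam=1.0):
    """Inverse cdf of an exponential distribution with rate lam.

    Returns NaN when p is outside [0, 1), because such a value is not
    a cumulative probability that the inverse cdf can map back.
    """
    if not (0.0 <= p < 1.0):
        return math.nan
    return -math.log(1.0 - p) / lam

# The last two entries are not valid cumulative probabilities.
column = [0.1, 0.5, 0.9, 1.5, -0.2]
results = [exponential_inverse_cdf(p) for p in column]
print(results)
```

Checking a column's minimum and maximum against the domain of the chosen function before running the experiment is a cheap way to avoid this kind of result.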

Use the **Result mode** option to specify how to output the results. You can replace column values with the probability distribution values, append the new values to the dataset, or return only the probability distribution values.

Run the experiment, or right-click the Evaluate Probability Function module and click **Run selected**.
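The three result modes can be pictured with a small Python sketch. This is only an analogy using plain dictionaries: the `evaluate` function, the mode labels `"Append"` and `"Replace"`, and the decision to keep the original column name when replacing are all assumptions made for the example.

```python
from statistics import NormalDist

def evaluate(table, column, mode):
    """Apply the standard normal cdf to one column of a table.

    table -- dict mapping column names to lists of numbers
    mode  -- "ResultOnly", "Append", or "Replace" (illustrative labels)
    """
    name = f"StandardNormal.Cdf({column})"
    values = [NormalDist().cdf(v) for v in table[column]]
    if mode == "ResultOnly":
        # Return only the computed values.
        return {name: values}
    if mode == "Append":
        # Keep every input column and add the computed one.
        return {**table, name: values}
    if mode == "Replace":
        # Overwrite the selected column, leave the others untouched.
        return {k: (values if k == column else v) for k, v in table.items()}
    raise ValueError(f"unknown mode: {mode}")

data = {"temp": [8.2, 18.0], "wind": [6.7, 0.9]}
for mode in ("ResultOnly", "Append", "Replace"):
    print(mode, list(evaluate(data, "temp", mode).keys()))
```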

### Results

The following table contains an example of results, using the **Append** option, on a single temperature column from the **Forest Fires** sample dataset.

temp | StandardNormal.Cdf(temp) | StandardNormal.Pdf(temp) | FFisher.cdf(temp) | FFisher.cdf(temp) |
---|---|---|---|---|
8.2 | 1 | 1 | 0.984774 | 0.004349 |
18 | 1 | 1 | 0.997896 | 0.000311 |
14.6 | 1 | 1 | 0.996352 | 0.000648 |
8.3 | 1 | 1 | 0.985201 | 0.004187 |
11.4 | 1 | 1 | 0.993147 | 0.001502 |

The headings of the generated columns contain the probability distribution that was used.

If you are not sure which probability distribution is likely to suit your data, you can create a quick chart of cumulative distribution and probability density for any numeric column.

- Right-click the dataset or module output, and select **Visualize**.
- Select the column of interest, and in the **Histogram** pane, select **cumulative distribution** or **probability density**.
- A chart of the distribution is superimposed on the histogram representing the data.

## Supported probability distributions

The Evaluate Probability Function module supports the following distributions:

### Bernoulli

The Bernoulli distribution is a distribution over binary values: in other words, it models the expected distribution when only two values are possible.

To calculate, select **Bernoulli**, and set the following options:

**Probability of success**

The parameter *p* specifies the probability that a 1 is generated. Type a number (`float`) between 0.0 and 1.0 that specifies the probability of success. The default is 0.5.
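For reference, the Bernoulli probability mass function and cdf are simple enough to write out directly. The following Python sketch uses hypothetical function names and is not taken from the module.

```python
def bernoulli_pmf(k, p=0.5):
    """Probability that a Bernoulli(p) variable equals k (0 or 1)."""
    if k == 1:
        return p
    if k == 0:
        return 1.0 - p
    return 0.0  # any other value is impossible

def bernoulli_cdf(k, p=0.5):
    """Probability that a Bernoulli(p) variable is <= k."""
    if k < 0:
        return 0.0
    if k < 1:
        return 1.0 - p
    return 1.0

print(bernoulli_pmf(1, p=0.3), bernoulli_cdf(0, p=0.3))
```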

### Beta

The Beta distribution is a continuous univariate distribution.

To calculate, select **Beta**, and set the following options:

**Shape**

Type a value to change the shape of the distribution. A shape parameter is any parameter of a probability distribution that does not define its location or scale. Therefore, when you enter a value for shape, the parameter changes the shape of the distribution rather than moving, stretching, or shrinking it.

The value must be a number (`double`). The default is 1.0.

**Scale**

Type a number to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Upper bound**

Type a number (`double`) that represents the upper bound of the distribution. The default is 1.0.

**Lower bound**

Type a number (`double`) that represents the lower bound of the distribution. The default is 0.0.

### Binomial

The binomial distribution is a discrete univariate distribution. The binomial distribution is used to model the number of successes in a sample. Replacement is used when sampling. For sampling without replacement, use the Hypergeometric distribution.

To calculate, select **Binomial**, and set the following options:

**Probability of success**

Type a number (`float`) between 0.0 and 1.0 that indicates the probability of success. The default is 0.5.

**Number of trials**

Specify the number of trials. Use an `integer`, with a minimum value of 1. The default is 3.
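As a sketch of what the two parameters control, the binomial probability mass function and cdf can be written in a few lines of Python. The function names are hypothetical, and the defaults simply mirror the defaults listed above.

```python
import math

def binomial_pmf(k, n=3, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def binomial_cdf(k, n=3, p=0.5):
    """Probability of at most k successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(0, min(k, n) + 1))

for k in range(4):
    print(k, binomial_pmf(k), binomial_cdf(k))
```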

### Cauchy

The Cauchy distribution is a symmetric continuous probability distribution.

To calculate, select **Cauchy**, and set the following options:

**Location**

Type a number (`double`) that represents the location of the 0th element. By specifying a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### ChiSquare

The chi-square distribution is the distribution of a sum of the squares of k independent standard normal random variables.

To calculate, select **ChiSquare**, and set the following options:

**Number of degrees of freedom**

Type a number (`double`) to specify the degrees of freedom. The default is 1.0.

### ChiSquareRightTailed

This option provides a right-tailed chi-squared distribution.

To calculate, select **ChiSquareRightTailed**, and set the following options:

**Number of degrees of freedom**

Type a number (`double`) to specify the degrees of freedom. The default is 1.0.

### Exponential

The exponential distribution is a distribution over the real numbers parameterized by one non-negative parameter.

To calculate, select **Exponential**, and set the following options:

**Lambda**

Type a number (`double`) to use as the lambda parameter. The default is 1.0.
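The following Python sketch shows the exponential pdf and cdf under the assumption that lambda is the rate parameter (so the mean is 1/lambda). That parameterization and the function names are assumptions for illustration.

```python
import math

def exponential_pdf(x, lam=1.0):
    """Density of an exponential distribution with rate lam at x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam=1.0):
    """Probability that an exponential variable with rate lam is <= x."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

print(exponential_pdf(1.0), exponential_cdf(1.0))
```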

### FFisher

Generates the probability of the Fisher statistic for a sample, also known as the Fisher F-distribution. This distribution is two-tailed.

To calculate, select **FFisher**, and set the following options:

**Numerator degrees of freedom**

Type a number (`double`) to specify the degrees of freedom that is used in the numerator. The default is 3.0.

**Denominator degrees of freedom**

Type a number (`double`) to specify the degrees of freedom that is used in the denominator. The default is 6.0.

### FFisherRightTailed

Creates a right-tailed Fisher distribution. The Fisher distribution is also known as the Fisher F-distribution, Snedecor distribution, or Fisher-Snedecor distribution. This particular form of the distribution is right-tailed.

To calculate, select **FFisherRightTailed**, and set the following options:

**Numerator degrees of freedom**

Type a number (`double`) to specify the degrees of freedom that is used in the numerator. The default is 3.0.

**Denominator degrees of freedom**

Type a number (`double`) to specify the degrees of freedom that is used in the denominator. The default is 6.0.

### Gamma

The gamma distribution is a family of continuous probability distributions with two parameters. For example, chi-squared is a special case of the gamma distribution.

To calculate, select **Gamma**, and set the following options:

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Location**

Type a number (`double`) that represents the location of the 0th element. By specifying a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### GeneralizedExtremeValues

Creates a distribution developed to handle extreme values. The generalized extreme value (GEV) distribution is actually a group of continuous probability distributions that combines the Gumbel, Fréchet, and Weibull distributions (also known as type I, II, and III extreme value distributions).

For more information about extreme value theory, see this article in Wikipedia: Fisher-Tippett-Gnedenko theorem.

To calculate, select **GeneralizedExtremeValues**, and set the following options:

**Shape**

Type a value to change the shape of the distribution. A shape parameter is any parameter of a probability distribution that does not define its location or scale. Therefore, when you enter a value for shape, the parameter changes the shape of the distribution rather than moving, stretching, or shrinking it.

The value must be a number (`double`). The default is 1.0.

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Location**

Type a number (`double`) that represents the location of the 0th element. By typing a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### Geometric

The geometric distribution is a distribution over positive integers parameterized by one positive real number.

To calculate, select **Geometric**, and set the following options:

**Probability of success**

Type a number (`float`) between 0.0 and 1.0 that indicates the probability of success. The default is 0.5.

Note

This implementation of the geometric distribution does not generate zeros.
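One way to read the note above is that the distribution counts the number of trials up to and including the first success, so its support starts at 1. The Python sketch below follows that reading; it is an interpretation for illustration, with hypothetical function names.

```python
def geometric_pmf(k, p=0.5):
    """Probability that the first success occurs on trial k (k >= 1)."""
    if k < 1:
        return 0.0  # zero is outside the support under this convention
    return (1.0 - p) ** (k - 1) * p

def geometric_cdf(k, p=0.5):
    """Probability that the first success occurs on or before trial k."""
    if k < 1:
        return 0.0
    return 1.0 - (1.0 - p) ** k

for k in range(0, 4):
    print(k, geometric_pmf(k), geometric_cdf(k))
```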

### GumbelMax

The Gumbel distribution is one of several extreme value distributions. The **GumbelMax** option implements the Maximum Extreme Value Type 1 distribution.

To calculate, select **GumbelMax**, and set the following options:

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Location**

Type a number (`double`) that represents the location of the 0th element. By typing a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### GumbelMin

The Gumbel distribution is one of several extreme value distributions. The Gumbel distribution is also referred to as the Smallest Extreme Value (SEV) distribution or the Smallest Extreme Value (Type I) distribution. The **GumbelMin** option implements the Minimum Extreme Value Type 1 distribution.

To calculate, select **GumbelMin**, and set the following options:

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Location**

Type a number (`double`) that represents the location of the 0th element. By typing a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### Hypergeometric

The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement, just as the binomial distribution describes the number of successes for draws with replacement.

To calculate, select **Hypergeometric**, and set the following options:

**Number of samples**

Type an integer that indicates the number of samples to use. The default is 9.

**Number of success**

Type an integer that defines the value for success. The default is 24.

**Population size**

Specify the population size to use when estimating the hypergeometric distribution. The default is 52.
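The following Python sketch shows the hypergeometric probability mass function. It assumes that **Number of success** means the number of success items in the population and that **Number of samples** means the number of draws; that mapping, and the function name, are assumptions made for this example.

```python
import math

def hypergeometric_pmf(k, population=52, successes=24, draws=9):
    """Probability of exactly k successes in `draws` draws without replacement."""
    failures = population - successes
    # k is limited by the draws and by how many successes and failures exist.
    if k < max(0, draws - failures) or k > min(draws, successes):
        return 0.0
    return (
        math.comb(successes, k)
        * math.comb(failures, draws - k)
        / math.comb(population, draws)
    )

# The probabilities over all possible counts should add up to 1.
total = sum(hypergeometric_pmf(k) for k in range(0, 10))
print(total)
```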

### Laplace

The Laplace distribution is a distribution over the real numbers, parameterized by a mean and by a scale parameter.

To calculate, select **Laplace**, and set the following options:

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Location**

Type a number (`double`) that represents the location of the 0th element. By specifying a value for the **Location** parameter, you can shift the probability distribution up or down a numeric scale. The default is 0.0.

### Logistic

The logistic distribution is similar in shape to the normal distribution, but it has heavier tails. The logistic distribution is used in logistic regression and neural network models and for modeling life sciences data.

To calculate, select **Logistic**, and set the following options:

**Scale**

Type a value to use for scaling the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The default value is 1.0. Values must be positive numbers.

**Mean**

Type a number (`double`) that indicates the estimated mean value of the distribution. The default is 0.0.

### Lognormal

The lognormal distribution is a continuous univariate distribution.

To calculate, select **Lognormal**, and set the following options:

**Mean**

Type a number (`double`) that indicates the estimated mean value of the distribution. The default is 0.0.

**Standard deviation**

Type a positive number (`double`) that indicates the estimated standard deviation of the distribution. The default is 1.0.

### NegativeBinomial

The negative binomial distribution is a distribution over the natural numbers with two parameters (`r`, `p`). In the special case that `r` is an integer, you can interpret the distribution as the number of tails before the `r`th head when the probability of a head is `p`.

To calculate, select **NegativeBinomial**, and set the following options:

**Probability of success**

Type a number (`float`) between 0.0 and 1.0 that indicates the probability of success. The default is 0.5.

**Number of success**

Type an integer that specifies the value for success. The default is 24.

### Normal

The normal distribution is also known as the Gaussian distribution.

To calculate, select **Normal**, and set the following options:

**Mean**

Type a number (`double`) that indicates the estimated mean value of the distribution. The default is 0.0.

**Standard deviation**

Type a positive number (`double`) that indicates the estimated standard deviation of the distribution. The default is 1.0.
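The mean and standard deviation relate the **Normal** option to **StandardNormal** through standardization: subtracting the mean and dividing by the standard deviation. The Python sketch below illustrates this with the standard library; the parameter values are hypothetical.

```python
from statistics import NormalDist

mean, std_dev = 10.0, 2.0   # hypothetical parameter values
normal = NormalDist(mu=mean, sigma=std_dev)
standard = NormalDist()     # mean 0, standard deviation 1

x = 14.6
direct = normal.cdf(x)

# Standardize the value first, then use the standard normal cdf.
z = (x - mean) / std_dev
via_standard = standard.cdf(z)

print(direct, via_standard)
```

Both routes give the same cumulative probability, up to floating-point rounding.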

### Pareto

The Pareto distribution is a power-law probability distribution that coincides with social, scientific, geophysical, actuarial, and many other types of observable phenomena.

To calculate, select **Pareto**, and set the following options:

**Shape**

Type a value (optional) to change the shape of the distribution. A shape parameter is any parameter of a probability distribution that does not define its location or scale. Therefore, when you enter a value for shape, the parameter changes the shape of the distribution rather than moving, stretching, or shrinking it.

The value must be a number (`double`). The default is 1.0.

**Scale**

Type a value (optional) to change the scale of the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The value must be a number (`double`). The default is 1.0.

### Poisson

In this implementation, Knuth's method is used to generate Poisson distributed random variables. For more information about the Poisson distribution, see Poisson Regression.

To calculate, select **Poisson**, and set the following options:

**Mean**

Type a number (`double`) that indicates the estimated mean value of the distribution. The default is 0.0.
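For orientation, the Poisson probability mass function and cdf for a given mean can be sketched in Python as follows. This is the textbook formula rather than the module's code, and the function names and the mean of 3.0 are chosen for the example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the expected count is mean."""
    if k < 0:
        return 0.0
    return math.exp(-mean) * mean**k / math.factorial(k)

def poisson_cdf(k, mean):
    """Probability of observing at most k events."""
    return sum(poisson_pmf(i, mean) for i in range(0, k + 1))

for k in range(0, 6):
    print(k, poisson_pmf(k, 3.0), poisson_cdf(k, 3.0))
```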

### Rayleigh

The Rayleigh distribution is a continuous probability distribution. As an example of how it arises, the wind speed will have a Rayleigh distribution if the components of the two-dimensional wind velocity vector are uncorrelated and normally distributed with equal variance.

To calculate, select **Rayleigh**, and set the following options:

**Lower bound**

Type a number (`double`) that represents the lower bound of the distribution. The default is 0.0.

### StandardNormal

This option provides the standard normal distribution, with no other parameters.

To calculate, select **StandardNormal**, and select the columns.

### TStudent

This option implements the univariate Student’s t-distribution.

To calculate, select **TStudent**, and set the following options:

**Number of degrees of freedom**

Type a number (`double`) to specify the degrees of freedom. The default is 1.0.

### TStudentRightTailed

Implements the univariate Student’s t-distribution by using one right tail.

To calculate, select **TStudentRightTailed**, and set the following options:

**Number of degrees of freedom**

Type a number (`double`) to specify the degrees of freedom. The default is 1.0.

### TStudentTwoTailed

Implements a two-tailed Student’s t-distribution.

To calculate, select **TStudentTwoTailed**, and set the following options:

**Number of degrees of freedom**

Type a number (`double`) to specify the degrees of freedom. The default is 1.0.

### Uniform

The uniform distribution is also known as the rectangular distribution.

To calculate, select **Uniform**, and set the following options:

**Lower bound**

Type a number (`double`) that represents the lower limit of the distribution. The default is 0.0.

**Upper bound**

Type a number (`double`) that represents the upper limit of the distribution. The default is 1.0.
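The effect of the two bounds is easy to see in code. The following Python sketch gives the pdf, cdf, and inverse cdf of a continuous uniform distribution; the function names are hypothetical.

```python
def uniform_pdf(x, lower=0.0, upper=1.0):
    """Constant density between the bounds, zero elsewhere."""
    return 1.0 / (upper - lower) if lower <= x <= upper else 0.0

def uniform_cdf(x, lower=0.0, upper=1.0):
    """Fraction of the interval [lower, upper] that lies at or below x."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)

def uniform_inverse_cdf(p, lower=0.0, upper=1.0):
    """Value at which the cdf reaches probability p (0 <= p <= 1)."""
    return lower + p * (upper - lower)

print(uniform_pdf(3, 0, 10), uniform_cdf(5, 0, 10), uniform_inverse_cdf(0.5, 2, 4))
```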

### Weibull

The Weibull distribution is widely used in reliability engineering. You can use its **Shape** parameter to model many other distributions.

To calculate, select **Weibull**, and set the following options:

**Shape**

Type a value (optional) to change the shape of the distribution.

The value must be a number (`double`). The default is 1.0.

**Scale**

Type a value (optional) to change the scale of the distribution. By applying a scale value to the distribution, you can shrink or stretch it.

The value must be a number (`double`). The default is 1.0.
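To illustrate how the shape parameter lets the Weibull distribution stand in for other distributions, the Python sketch below uses the standard two-parameter Weibull cdf. With shape 1 it matches an exponential cdf with rate 1/scale, and with shape 2 it matches a Rayleigh cdf with sigma equal to scale divided by the square root of 2. The function name is hypothetical.

```python
import math

def weibull_cdf(x, shape=1.0, scale=1.0):
    """Probability that a Weibull(shape, scale) variable is <= x."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))

x = 2.0
# Shape 1: same as an exponential cdf with rate 1/scale.
exponential_like = weibull_cdf(x, shape=1.0, scale=1.0)
# Shape 2: same as a Rayleigh cdf with sigma = scale / sqrt(2).
rayleigh_like = weibull_cdf(x, shape=2.0, scale=1.0)

sigma = 1.0 / math.sqrt(2.0)
rayleigh_cdf = 1.0 - math.exp(-(x**2) / (2.0 * sigma**2))
print(exponential_like, rayleigh_like, rayleigh_cdf)
```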

## Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

### Implementation details

This module supports all distributions that are provided in the open source Math.NET Numerics library. For more information, see the documentation for the `MathNet.Numerics.Distributions` namespace.

Right-tailed and two-tailed distributions appear as separate distributions, not as parameterized versions of base distributions. The current behavior is to preserve compatibility with Excel.

### Definitions

This module supports calculating any of these values for the specified distribution:

**cdf**, or the *cumulative distribution function*

Returns the probability for a compound event, defined as the sum of occurrences when the random variable takes a value less than or equal to some specific value x.

In other words, it answers the question: "How common are samples that are less than or equal to this value?"

This function can be used with both continuous and discrete numeric variables.

**InverseCdf**, or the *inverse cumulative distribution function*

Returns the value associated with a specific cumulative probability value (cdf).

In other words, it answers the question: "What is the value of x at which the cdf function returns the cumulative probability y?"

**pdf**, or the *probability density function*

Describes the relative likelihood for a random variable to be a specific value.

In other words, it answers the question: "How common are samples at exactly this value?"
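The relationships among these three functions can be checked numerically. The Python sketch below uses a standard normal distribution as an example: the inverse cdf undoes the cdf, and for a continuous distribution the pdf is the slope of the cdf. The step size `h` and the test point are arbitrary choices for illustration.

```python
from statistics import NormalDist

dist = NormalDist()  # standard normal, used only as an example
x = 0.75

# InverseCdf undoes cdf: applying both in sequence recovers x.
round_trip = dist.inv_cdf(dist.cdf(x))

# For a continuous distribution the pdf is the slope of the cdf;
# a central finite difference approximates that slope.
h = 1e-5
slope = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

print(round_trip, slope, dist.pdf(x))
```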

## Expected inputs

Name | Type | Description |
---|---|---|
Dataset | Data Table | Input dataset |

## Module parameters

Name | Range | Type | Default | Description |
---|---|---|---|---|
Distribution | Any | ProbabilityDistribution | StandardNormal | Select the kind of probability distribution to generate. |
Method | Any | ProbabilityDistributionMethod | Cdf | Select the method to use when calculating the selected probability distribution. Options are the cumulative distribution function (cdf), the inverse cumulative distribution function (InverseCdf), and the probability density function or mass function (pdf). |
Negative binomial distribution method | Any | ProbabilityDistributionMethodForNegativeBinomial | Cdf | If you select the negative binomial distribution, specify the method used for evaluating the distribution. |
Probability of success | [0.0;1.0] | Float | 0.5 | Type a value to use as the probability of success. |
Shape | Any | Float | 1.0 | Type a value that modifies the shape of the distribution. |
Scale | >=0.0 | Float | 1.0 | Type a value that changes the scale of the distribution to expand or shrink it in size. |
Number of trials | >=1 | Integer | 3 | Specify the number of trials. |
Lower bound | Any | Float | 0.0 | Type a number to use as the lower limit of the distribution. |
Upper bound | Any | Float | 1.0 | Type a number to use as the upper limit of the distribution. |
Location | Any | Float | 0.0 | Type the location of the zero element in the distribution. |
Number of degrees of freedom | Any | Float | 1.0 | Specify the number of degrees of freedom. |
Numerator degrees of freedom | Any | Float | 3.0 | Specify the number of degrees of freedom in the numerator. |
Denominator degrees of freedom | Any | Float | 6.0 | Specify the number of degrees of freedom in the denominator. |
Lambda | >=0.0 | Float | 1.0 | Specify a value for the Lambda parameter. |
Number of samples | Any | Integer | 9 | Specify the number of samples. |
Number of success | Any | Integer | 24 | Type a value to use as the number of success. |
Population size | Any | Integer | 52 | Specify the population size. |
Mean | Any | Float | 0.0 | Type the estimated mean value. |
Standard deviation | >=0.0 | Float | 1.0 | Type the estimated standard deviation. |
Column set | Any | ColumnSelection | | Choose the columns over which to calculate the probability distribution. |
Result mode | Any | OutputTo | ResultOnly | Specify how the results are to be saved in the output dataset. The options are to append new columns, replace existing columns, or output only the results. |

## Output

Name | Type | Description |
---|---|---|
Results dataset | Data Table | Output dataset |

## Exceptions

For a complete list of error messages, see Module Error Codes.

Exception | Description |
---|---|
Error 0017 | Exception occurs if one or more specified columns have a type that is unsupported by the current module. |

For a list of errors specific to Studio modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.