Measuring Mining Model Accuracy (Analysis Services - Data Mining)
SQL Server Analysis Services provides a variety of ways to determine whether your mining models are accurate.
You can use charts to visually represent the accuracy of a data mining model: A lift chart compares the accuracy of the predictions of each model. A profit chart displays the theoretical increase in profit that is associated with using each model. A scatter plot compares actual values against predicted values, and is used for regression models or other models that predict continuous attributes by using continuous inputs.
You can use a classification matrix to tabulate accurate and inaccurate predictions.
You can use cross-validation to statistically validate the reliability of your mining model.
For more information, see Tools for Charting Model Accuracy (Analysis Services - Data Mining). All of the chart types described in this topic can also be created programmatically by using XML/A or AMO. For more information about programming Analysis Services solutions, see Developer's Guide (Analysis Services - Data Mining).
After you have determined that the model is reasonably accurate, you should evaluate the results of the model in the context of the business problem that the model was designed to answer.
This topic explains the general workflow in Business Intelligence Development Studio or SQL Server Management Studio for measuring the accuracy of a mining model. Depending on the type of mining model that you are using and the chart you choose, some options might be slightly different, unavailable, or configured by default, but the overall process is as follows:
Choose a model or structure to test.
Choose test data.
Optionally filter the test data.
Choose a chart type and open it in the appropriate viewer.
Optionally, customize the chart.
For a step-by-step procedure, see How to: Create an Accuracy Chart for a Mining Model.
Choosing a Model or Structure
It is common practice to build several data mining models for each data mining structure, and for each model to use a different algorithm or set of parameters. Analysis Services makes it easy to test multiple related mining models in the same chart. However, you can also select just a single mining model for output in an accuracy chart.
In Business Intelligence Development Studio, you select a mining structure and then click the Mining Accuracy Chart tab to open a design area for accuracy charts. In SQL Server Management Studio, you right-click the mining structure and then select View Lift Chart to open the same design area and create any kind of accuracy chart.
If you add multiple mining models to a chart, you must choose mining models that have the same predictable attribute. If you create a lift chart, profit chart, scatter plot, or classification matrix, you can choose which models to chart by using the Input Selection tab. However, if you create a cross-validation report, Analysis Services will analyze all models that have the same predictable attribute.
Choosing Test Data
Before you can create a lift chart, profit chart, or scatter plot, you must specify the data that you will use for evaluating the model. Because the data that you use for evaluation greatly affects the results of evaluation, SQL Server 2008 Analysis Services provides the following options for specifying test data:
Using a holdout partition that was defined as the test set when you created the data mining structure. By using a part of the mining structure data as the test data, you can measure all of the models in the structure consistently.
Defining an external data source to use as the test data.
Defining an external data source, and applying filters to restrict the data to a relevant subset of cases. By defining a filter, you can select test data that meets particular conditions, or contains cases that are of particular interest.
Using a filter applied to the training cases used by the model. By defining a model filter, you can restrict cases to a subset of the data and also ensure that a particular model is always assessed by using similar data.
If you create a cross-validation report in Business Intelligence Development Studio or SQL Server Management Studio, by default Analysis Services uses the holdout partition defined for the model, if one exists. If no holdout partition has been defined, Analysis Services uses the entire set of training cases.
If you create a lift chart, profit chart, scatter plot, or classification matrix in Business Intelligence Development Studio or SQL Server Management Studio, you use the Input Selection Tab on the Mining Accuracy Chart view of the Data Mining Designer to specify the data to use for testing the mining models. The options are as follows:
Use mining model test cases
Mining model test cases are taken from the same data that is in the mining structure, but a filter is applied to the model to restrict the cases used for testing. A model filter is a set of conditions that you create and then store together with the mining model. Because the filter condition is saved with the mining model, it is applied by default when you train the model. When you test a model, you can use the filter as is, or you can use a different set of data for testing and thereby bypass the model filter. If you want to change the filter that is applied to a mining model, you can modify the model filter and then reprocess the model. Alternatively, you can create a copy of the model and then build a different filter on the copy.
For more information about model filters, see Creating Filters for Mining Models (Analysis Services - Data Mining).
For a description of how to create a filter on a mining model, see How to: Apply a Filter to a Mining Model.
Use mining structure test cases
You can use this option to measure accuracy if you defined a testing data set when you created the mining structure, by defining either a holdout percentage or a maximum number of cases to use for holdout. The definition of the testing set is stored with the structure. Therefore, the testing set can be used with any model that is based on the structure.
You cannot create a filter directly on the mining structure holdout data. However, if you want to filter this data, as a workaround, you can re-use the original data source view as an external data source, and apply a filter to the external data source.
For more information, see Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining).
Specify a different data set
In SQL Server 2005, the only way to test the accuracy of a mining model was to use an external data set. This option is still provided in SQL Server 2008, but now you can also define a filter on the external data.
To use an external data source, any columns from the external data that you want to use for input must be mapped to the columns in the mining model. You can choose to ignore some columns, but the external data set must at least contain one column that can be mapped to the predictable column in the mining model. Depending on the model, you might also need to map column or columns in the external data that have the attributes required for prediction.
In Data Mining Designer, you use the Input Selection tab and the Specify Column Mapping dialog box to select the input table that the models are to be validated against. When you select an input table, the columns in the tables Mining Structure and Select Input Table(s) are automatically mapped together. You can modify the mappings as needed by clicking a column in the Mining Structure table and dragging it into the Select Input Table(s) table. If the input data contains a nested table, you can also include this table by using the Select nested table link.
The predictable column must always be mapped. Columns that are not mapped are fed as NULL values to the mining model.
After you have mapped the corresponding columns, you can optionally specify a target state for the predictable column. If you leave the state of the predictable column blank, the lift chart predicts how well the model performs regardless of the state of the predictable column. For more information about the differences between creating lift charts with or without a specified state of the predictable column, see Lift Chart (Analysis Services - Data Mining).
You also have the option of creating a filter on the external data. For a description of how to create a filter, see How to: Filter the Input Rows for an Accuracy Chart.
The Synchronize Prediction Columns and Values option coordinates the predictable attributes in the external data source and in the mining model so that, even if they have a different name, they are derived from the same predictable column during model training. This is useful when you have two mining structure columns that point to the same underlying data, but perhaps the columns are labeled differently.
If you clear the Synchronize Prediction Columns and Values check box, you can select any valid predictable column and value, and the results are plotted together, even if the results do not make sense. You might clear this option if you are trying to compare two predictable attributes that do not explicitly map to the same set of values. However, you should know that the two attributes might not be comparable with regard to accuracy. For example, a model that discretizes income in buckets as High, Middle, and Low might be comparable to a model that groups income in ranges of 150,000+, 50,000-100,000, and 10,000-50,000. However, before you clear the check box, you should verify that the attributes are comparable.
For More Information:How to: Select a Predictable Column for an Accuracy Chart, Using the Data Mining Tools
You can filter the data that is used to train and test a data mining model in the following ways:
Create a filter that is stored with the model.
Apply a filter to an external data source.
When you define a filter, you are essentially creating a WHERE clause on the incoming data. If you are filtering an input data set used for evaluating a model, the filter expression is translated to a Transact-SQL statement and applied to the input table when the chart is created. As a result, the number of test cases can be greatly reduced.
In contrast, when you apply a filter to a mining model, the filter expression that you create is translated to a Data Mining Extensions (DMX) statement, and applied to the individual model. When you apply a filter to a model, only a subset of the original data is used to train the model. If you defined a testing data set when you created the structure, the model cases used for training include only those cases that are in the mining structure training set and which meet the conditions of the filter. In addition, when you select the option Use mining model test cases, the testing cases include only cases that are in the mining structure test set and which meet the conditions of the filter. Filter conditions also apply to drillthrough queries on the model cases.
However, if you did not define a holdout data set, the model cases used for testing include all the cases in the data set that meet the filter conditions.
Therefore, multiple models, even if based on the same mining structure, can have different filters, and as a result they can use different data for training and testing. If you select the option Use mining model test cases when you create the accuracy chart, you should realize that the total number of cases in both the testing and training sets can vary greatly among the models being tested.
If you add a filter to an existing mining model, or change the filter conditions, you must reprocess the mining model to see the effects of the filter.
To review the actual training cases that were used, you can create a DMX content query, such as the following:
SELECT * from <model>.CASES WHERE IsTrainingCase()
SELECT * from <model>.CASES WHERE IsTestCase()
To compare these cases with the cases in the structure, create the following DMX content query:
SELECT * FROM <structure>.CASES WHERE IsTestCase()
To run a content query on the model cases, you must enable Drillthough on the model.
For information about the kinds of filters you can apply, and how filter expressions are evaluated, see Model Filter Syntax and Examples (Analysis Services - Data Mining).
Choosing a Chart Type and Viewing the Chart
Depending on the chart type that you choose, you have the ability to further configure options, to browse the chart, or copy the chart to the Clipboard and work with the data in Excel.
Note Charts and their definitions are not saved. If you close the window that contains a chart, you must create the chart again.
After you have configured the options for the models and the testing data, click the Lift Chart tab to view the results. You can also copy the chart to the Clipboard, or view details of individual trend lines or data points in the Mining Legend.
After you have configured the options for the models and the testing data, click the Lift Chart tab, select Profit Chart from the Chart Type list to set profit chart options, and then click OK to view the results. You can use the Profit Chart Settings dialog box as many times as you want to try different cost options and redisplay the chart. The Mining Legend contains detailed information about the estimated profit for each model. You can also copy the chart and the contents of the Mining Legend to the Clipboard to work with it in Excel.
For more information, see Profit Chart (Analysis Services - Data Mining) and Profit Chart Settings Dialog Box (Mining Accuracy Chart View).
If you have selected the appropriate type of model, when you click the Lift Chart tab, the chart type is automatically set to Scatter Plot and a scatter plot is displayed. No further configuration is possible. You can also copy the chart to the Clipboard and paste the chart as a graphic into Excel or another application.
For more information see Scatter Plot (Analysis Services - Data Mining).
For a classification matrix, use the Input Selection tab to choose the models and the testing data, and then click the Classification Matrix tab to view the results. The contents of a classification matrix are the same for all model types, and cannot be configured. You can also copy the data in the chart to the Clipboard and then work with it in Excel.
For more information, see Classification Matrix (Analysis Services - Data Mining) or Classification Matrix Tab (Mining Accuracy Chart View).
For a cross-validation report, after you have selected a mining structure or mining model in Solution Explorer, click the Cross Validation tab, configure all relevant options, and then click Get Results to generate the report. No further configuration is possible. The format of the cross-validation report is the same for all model types, and cannot be configured. However, the content of the report differs depending on the type of model that you are analyzing, and the data type of the predictable attribute. You can also copy the results of the report to the Clipboard and work with the data in Excel.
For more information, see Cross-Validation (Analysis Services - Data Mining) or Cross-Validation Report (Analysis Services - Data Mining).