When you create a query against a data mining model, you can create a content query, which provides details about the patterns discovered in analysis, or you can create a prediction query, which uses the patterns in the model to make predictions for new data. For example, a content query for a decision trees model might provide statistics about the number of cases at each level of the tree, or the rules that differentiate between cases. Alternatively, a prediction query maps the model to new data in order to generate recommendations, classifications, and so forth. You can also retrieve metadata about the model by using a query.
This section explains how to create queries for models that are based on the Microsoft Decision Trees algorithm.
To create meaningful queries on the content of a decision trees model, you should understand the structure of the model content, and which node types store what kind of information. For more information, see Mining Model Content for Decision Tree Models (Analysis Services - Data Mining).
By querying the data mining schema rowset, you can find metadata about the model, such as when it was created, when the model was last processed, the name of the mining structure that the model is based on, and the name of the column used as the predictable attribute. You can also return the parameters that were used when the model was first created.
select MINING_PARAMETERS from $system.DMSCHEMA_MINING_MODELS WHERE MODEL_NAME = 'TM_Decision Tree'
The following query returns some basic information about the decision trees that were created when you built the model in the Basic Data Mining Tutorial. Each tree structure is stored in its own node. Because this model contains a single predictable attribute, there is only one tree node. However, if you create an association model by using the Decision Trees algorithm, there might be hundreds of trees, one for each product.
This query returns all the nodes of type 2, which are the top level nodes of a tree that represents a particular predictable attribute.
The column, CHILDREN_CARDINALITY, must be enclosed in brackets to distinguish it from the MDX reserved keyword of the same name.
SELECT MODEL_NAME, NODE_NAME, NODE_CAPTION, NODE_SUPPORT, [CHILDREN_CARDINALITY] FROM TM_DecisionTrees.CONTENT WHERE NODE_TYPE = 2
What do these results tell you? In a decision trees model, the cardinality of a particular node tells you how many immediate children that node has. The cardinality for this node is 5, meaning that the model divided the target population of potential bike buyers into 5 subgroups.
The following related query returns the children for these five subgroups, together with the distribution of attributes and values in the child nodes. Because statistics such as support, probability, and variance are stored in the nested table, NODE_DISTRIBUTION, this example uses the
FLATTENED keyword to output the nested table columns.
The nested table column, SUPPORT, must be enclosed in brackets to distinguish it from the reserved keyword of the same name.
SELECT FLATTENED NODE_NAME, NODE_CAPTION, (SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, [SUPPORT] FROM NODE_DISTRIBUTION) AS t FROM TM_DecisionTree.CONTENT WHERE [PARENT_UNIQUE_NAME] = '000000001'
|00000000100||Number Cars Owned = 0||Bike Buyer||Missing||0|
|00000000100||Number Cars Owned = 0||Bike Buyer||0||1067|
|00000000100||Number Cars Owned = 0||Bike Buyer||1||1875|
|00000000101||Number Cars Owned = 3||Bike Buyer||Missing||0|
|00000000101||Number Cars Owned = 3||Bike Buyer||0||678|
|00000000101||Number Cars Owned = 3||Bike Buyer||1||473|
From these results, you can tell that of the customers who bought a bike ([Bike Buyer] = 1), 1067 customers had 0 cars and 473 customers had 3 cars.
Suppose you wanted to discover more about the customers who did buy a bike. You can view additional detail for any of the sub-trees by using the IsDescendant (DMX) function in the query, as shown in the following example. The query returns the count of bike purchasers by retrieving leaf nodes (NODE_TYPE = 4) from the tree that contains customers who are over 42 years of age. The query restricts rows from the nested table to those where Bike Buyer = 1.
SELECT FLATTENED NODE_NAME, NODE_CAPTION,NODE_TYPE, ( SELECT [SUPPORT] FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_NAME = 'Bike Buyer' AND ATTRIBUTE_VALUE = '1' ) AS t FROM TM_DecisionTree.CONTENT WHERE ISDESCENDANT('0000000010001') AND NODE_TYPE = 4
|000000001000100||Yearly Income >= 26000 and < 42000||266|
|00000000100010100||Total Children = 3||75|
|0000000010001010100||Number Children At Home = 1||75|
Making Predictions using a Decision Trees Model
Because decision trees can be used for various tasks, including classification, regression, and even association, when you create a prediction query on a decision tree model you have many options available to you. You must understand the purpose for which the model was created to understand the results of prediction. The following query samples illustrate three different scenarios:
Returning a prediction for a classification model, together with the probability of the prediction being correct, and then filtering the results by the probability;
Creating a singleton query to predict associations;
Retrieving the regression formula for a part of a decision tree where the relationship between the input and output is linear.
The following sample query uses the decision tree model that was created in the Basic Data Mining Tutorial. The query passes in a new set of sample data, from the table dbo.ProspectiveBuyers in AdventureWorks2012 DW, to predict which of the customers in the new data set will purchase a bike.
The query uses the prediction function PredictHistogram (DMX), which returns a nested table that contains useful information about the probabilities discovered by the model. The final WHERE clause of the query filters the results to return only those customers who are predicted as likely to buy a bike, with a probability greater than 0%.
SELECT [TM_DecisionTree].[Bike Buyer], PredictHistogram([Bike Buyer]) as Results From [TM_DecisionTree] PREDICTION JOIN OPENQUERY([Adventure Works DW Multidimensional 2012], 'SELECT [FirstName], [LastName], [MaritalStatus], [Gender], [YearlyIncome], [TotalChildren], [NumberChildrenAtHome], [HouseOwnerFlag], [NumberCarsOwned] FROM [dbo].[ProspectiveBuyer] ') AS t ON [TM_DecisionTree].[First Name] = t.[FirstName] AND [TM_DecisionTree].[Last Name] = t.[LastName] AND [TM_DecisionTree].[Marital Status] = t.[MaritalStatus] AND [TM_DecisionTree].[Gender] = t.[Gender] AND [TM_DecisionTree].[Yearly Income] = t.[YearlyIncome] AND [TM_DecisionTree].[Total Children] = t.[TotalChildren] AND [TM_DecisionTree].[Number Children At Home] = t.[NumberChildrenAtHome] AND [TM_DecisionTree].[House Owner Flag] = t.[HouseOwnerFlag] AND [TM_DecisionTree].[Number Cars Owned] = t.[NumberCarsOwned] WHERE [Bike Buyer] = 1 AND PredictProbability([Bike Buyer]) >'.05'
By default, Analysis Services returns nested tables with the column label, Expression. You can change this label by aliasing the column that is returned. If you do this, the alias (in this case, Results) is used as both the column heading and as the value in the nested table. You must expand the nested table to see the results.
Example results with Bike Buyer = 1:
If your provider does not support hierarchical rowsets, such as those shown here, you can use the FLATTENED keyword in the query to return the results as a table that contains nulls in place of the repeated column values. For more information, see Nested Tables (Analysis Services - Data Mining) or Understanding the DMX Select Statement.
The following sample query is based on the Association mining structure. To follow along with this example, you can add a new model to this mining structure, and select Microsoft Decision Trees as the algorithm. For more information on how to create the Association mining structure, see Lesson 3: Building a Market Basket Scenario (Intermediate Data Mining Tutorial).
The following sample query is a singleton query, which you can create easily in SQL Server Data Tools (SSDT) by choosing fields and then selecting values for those fields from a drop-down list.
SELECT PredictAssociation([DT_Association].[v Assoc Seq Line Items],3) FROM [DT_Association] NATURAL PREDICTION JOIN (SELECT (SELECT 'Patch kit' AS [Model]) AS [v Assoc Seq Line Items]) AS t
|Mountain Tire Tube|
|Touring Tire Tube|
The results tell you the three best products to recommend to customers who have purchased the Patch Kit product. You can also provide multiple products as input when you make recommendations, either by typing in values, or by using the Singleton Query Input dialog box and adding or removing values. The following sample query shows how the multiple values are provided, upon which to make a prediction. Values are connected by a UNION clause in the SELECT statement that defines the input values.
SELECT PredictAssociation([DT_Association].[v Assoc Seq Line Items],3) From [DT_Association] NATURAL PREDICTION JOIN (SELECT (SELECT 'Racing Socks' AS [Model] UNION SELECT 'Women''s Mountain Shorts' AS [Model]) AS [v Assoc Seq Line Items]) AS t
|Long-Sleeve Logo Jersey|
When you create a decision tree model that contains a regression on a continuous attribute, you can use the regression formula to make predictions, or you can extract information about the regression formula. For more information about queries on regression models, see Linear Regression Model Query Examples.
If a decision trees model contains a mixture of regression nodes and nodes that split on discrete attributes or ranges, you can create a query that returns only the regression node. The NODE_DISTRIBUTION table contains the details of the regression formula. In this example, the columns are flattened and the NODE_DISTRIBUTION table is aliased for easier viewing. However, in this model, no regressors were found to relate Income with other continuous attributes. In such cases, Analysis Services returns the mean value of the attribute and the total variance in the model for that attribute.
SELECT FLATTENED NODE_DISTRIBUTION AS t FROM DT_Predict. CONTENT WHERE NODE_TYPE = 25
For more information about the value types and the statistics used in regression models, see Mining Model Content for Linear Regression Models (Analysis Services - Data Mining).
List of Prediction Functions
All Microsoft algorithms support a common set of functions. However, the Microsoft Decision Trees algorithm supports the additional functions listed in the following table.
|IsDescendant (DMX)||Determines whether one node is a child of another node in the model.|
|IsInNode (DMX)||Indicates whether the specified node contains the current case.|
|PredictAdjustedProbability (DMX)||Returns the weighted probability.|
|PredictAssociation (DMX)||Predicts membership in an associative dataset.|
|PredictHistogram (DMX)||Returns a table of values related to the current predicted value.|
|PredictNodeId (DMX)||Returns the Node_ID for each case.|
|PredictProbability (DMX)||Returns probability for the predicted value.|
|PredictStdev (DMX)||Returns the predicted standard deviation for the specified column.|
|PredictSupport (DMX)||Returns the support value for a specified state.|
|PredictVariance (DMX)||Returns the variance of a specified column.|
For a list of the functions that are common to all Microsoft algorithms, see General Prediction Functions (DMX). For the syntax of specific functions, see Data Mining Extensions (DMX) Function Reference.