Predicting Customer Profitability – A First Data Mining Model
by Richard Lees. You can view Richard’s Web site at:http://RichardLees.com.au.
Background of Data Mining
Data mining was once a very expensive technology that was used by small elite teams in head offices of a few very large companies. Now that data mining algorithms have become a commodity, anyone can employ data mining algorithms at a very low cost. For example, data mining tools have been included in SQL Server since 1998, and new tools, such as the Data Mining Add-Ins for Excel, can be downloaded for free.
The purpose of this paper is to walk you through a complete, real-life scenario for using data mining to predict customer profitability. Imagine you are a commercial organization, continually receiving requests for goods or services from new customers. You want to predict how much business a new customer will bring you in the longer term. This might help you determine which customers deserve special consideration. To do this you will use your history of new customer orders and how the customers developed in the longer term. In this walkthrough, you will use a classic data mining method to analyze real-life data.
Sample Data for Data Mining
In data mining, you really need to have real data to discover interesting relationships. So for this paper we will use some real data that I have, although it is not customer data. The data is the HTTP web logs from my Internet site. The structure of the Internet log data is similar to what a commercial organization might have for new customer requests. Customer requests would typically have attributes such as the customer name and demographics, product and channel, whereas the http requests have such attributes as the client, the location from which the request was issued, agent, resource requested, etc. While a commercial organization might want to predict profitability, we will predict http response times. Note that the purpose of predicting web response times is only to serve as a good learning exercise and demonstration of data mining.
You can see a live demonstration of the model we create in this paper at the site, http://RichardLees.com.au. Just click on the BI Demonstations/Data Mining/Web Response Time links. The demonstration predicts web response times for the last 50 hits to the site using a data mining model based on decision trees with regressors. The model is reprocessed each weekend to keep it current. The online report compares actual versus predicted response times and plots them on a scatter graph.
Predicting Customer Profitability
Essentially our challenge is that we have a collection of attributes, some discrete, some continuous, and we want to predict a continuous variable (a number). A typical customer profitability model would use customer demographics, channel, product requested, volume etc to predict future profitability. Our model will use geographic location, client agent, resource etc to predict response time.
There are thousands of ways that you could use a very similar structure to predict a continuous number for a different real-world application. These applications include predicting delivery time, project time, project revenue, employee tenure, lease duration, residual value etc. Therefore, once you have gone through this exercise, it should be very easy for you to build your own customer profitability model or other similar model using your organization’s real data.
We will build a data mining model using the following attributes to predict response time
- Client operating system
- Client agent (browser and version)
- HTTP status
- HTTP operation
- Referring server
- Target resource
- Target resource type (e.g., .htm, .jpg, .zip)
- Bytes out
In a customer profitability prediction model, the attributes would be different, but in both cases they are a mix of discrete variables (such as country) and continuous variables (such as bytes out and ordered volume).
In any data mining exercise, one of the first tasks is to identify the input variables and the output (predicted) variable(s). Note, in some data mining tasks there is no desire to predict a variable, for example some clustering exercises do not make predictions, they just cluster.
Designing the Data Mining Model in SQL Server 2008
You can reproduce this data mining model by downloading the DataMiningWebSample SQL Server database from the download page on http://RichardLees.com.au. This download file contains a SQL Server 2005/2008 backup file of the raw web logs, and a VS solution that contains the completed data mining model. You shouldn’t need the completed VS solution but it is provided in case you have trouble and want to view a completed model.
Building the Data Mining Model
1. Start the SQL Server Business Intelligence Development Studio from Start/Programs/SQL Server 2008
2. From the File menu, click New/Project and select Analysis Services Project.
3. Add a Data Source that points to the SQL Server database that you have downloaded and restored.
4. Add a Data Source View, and add to it all the tables and views from the SQL database.
5. Right click Mining Structures and then click New Mining Structure.
6. Accept the default data mining technique of Decision Trees. Later you can add other models, based on different algorithms, and apply them to the same data.
7. Select InternetLog as the case table. We won’t be using a nested table for this model. Don’t select any other tables; click next. The key for InternetLog is the RID column.
8. Select all other columns as input columns, except ResponseTime and LogTime. We don’t need to select all columns for input, but because at this stage we are not sure which ones are useful, we can include them all as candidate columns.
9. Check the Predict column for ResponseTime. The diagram below show the dialog boxes provided by BIDS for choosing columns and setting the content types for each column.
10. Click Next and accept the default training size of 30 percent. If your model is taking too long to process, or if you are using a server without much memory or a slow CPU, you will want to change the model settings to reduce processing time. One way to reduce processing time is to increase the holdback percent, meaning you reduce the amount of data that is used for training. In a large data set, you can increase the size of holdback up to as high as 90 percent. Another way to speed up processing is to reduce the number of input variables. For example, bytes-in and city are probably of little use so you can remove them from your model.
11. Accept the default data mining structure name, but add DT to the end of the model name. This is to help you remember that that this model is using a decision trees algorithm.
12. Click Finish and process the data mining model.
Viewing the Data Mining Model
The data mining model has been processed and we can now view what the model has found.
Because our model is a decision tree structure, we can view the model in its tree view. There is one tree for every discrete outcome. Because this model has one continuous outcome (ResponseTime), there is only one tree. Note that your decision tree may look a little different from this picture, since SQL Server has randomly selected which records to hold back from training, which will be used to validate the model.
Step 1. Click on the Mining Model Viewer tab and select the nested Decision Tree tab. You will see the first five levels of the decision tree structure. Each node in the tree is represented in the viewer as a box with a decision heading such as ResourceType=gif. Also, if you focus on any one of the nodes, you can see the outcome at that node. If you focus on one of the nodes, you can read information about this node. For example, in the node labeled Resource=’/IndicatorImage.gif’, I can read the following:
- Criteria for this node : Bytes Out < 1821932 and OK not = 'NOK' and Resourcetype not = '.gif' and Resource = '/IndicatorImage.aspx'
- Support for this node: Existing Cases: 4122. This tells you how many records in the training data satisfy the criteria above.
- The prediction of our output variable: ResponseTime.Notice how the prediction is not just a number.This is because we had continuous input columns (BytesIn and BytesOut).Essentially the decision tree is using these continuous input variables as regressors to assist with the prediction.It is predicting the ResponseTime at this particular node to be 474.474+0.085*(Bytes In-1,017.390)-0.190*(Bytes Out-556.796).It is logical to me that it would use BytesIn and BytesOut in this way.The larger the number of bytes transferred, the longer response time, given other attributes are the same. Note that the units of ResponseTime are milliseconds, the same as the source data. ResponseTime = 474.474+0.085*(Bytes In-1,017.390)-0.190*(Bytes Out-556.796).
Step 2. Now click on the Dependency Network nested tab. This tab provides an extremely simplified view of the relative importance of the input variables in predicting our output variable. That is, the graph indicates how useful each of the variables is in predicting response time. I say it is extremely simplified because the attributes do not work alone, but they work in combinations. However, the diagram can be very useful: for example, I find this type of graph particularly useful in a basket analysis model, where it will highlight the dependencies between products.
Step 3. Click on Response Time. All of the input variables will change color to indicate that they help predicting response time.
Step 4. Now drag the slider bar on the left downward. Notice how some of the input variables are shaded out. These are the less important variables.
Step 5. Keep sliding the bar downward until you isolate the most important input variable. In my model it happens to be ResourceType. That is logical. For example, the response time will vary largely depending on whether the HTTP resource is a .jpb, aspx, zip.
SQL Server makes it very easy to try other algorithms for this model. I suggest that you try creating some additional models after you have completed this paper. To add a new model to existing data, simply click on the Create a related mining model button in the Mining Models tab and choose an appropriate algorithm. You won’t be able to use all the algorithms for this particular model. For example, the time series algorithm is not an appropriate choice, and some algorithms won’t like the continuous input and/or output variables. If you get an error, you can choose to ignore continuous input variables, or you might choose to discretize the continuous variables. SQL Server can help you do this.
Determining Model Accuracy
Microsoft SQL Server 2008 comes with tools to help you determine the accuracy of your model’s predictions. The diagram shows an accuracy chart that I created for the model in the preceding analysis.
In this response time model, we are predicting a continuous number. So how do we know if the model is good or bad, since it can’t really be ‘right’ or ‘wrong’ with a continuous number? A good way to validate and compare models is to determine the correlation coefficient between the actual response time and the predicted response time. The higher the correlation coefficient, the stronger the correlation, and therefore the better the model.
When testing the validity of your model it is important to test the model with data that it has not used for training. This data is also called ‘naive data’. When you create a data mining model, by default SQL Server will hold back 30 percent of the data for testing, though you can change this amount by setting the HoldOutMaxPercent property. To generate an accuracy chart, SQL Server will use this ‘held back’ data by default, but you can also opt to use your own dataset.
For this walkthrough, the sample database contains a table called InternetLogNew, which contains new HTTP records that the model also has not used for training. InternetLogNew is a view over log records that the model has not been trained on, and the view will randomly select 50 records from this table. Therefore, each time you refresh the scatter chart, a new 50 records will be selected, so the results might be slightly different depending on which records are chosen.
1. Click on Mining Accuracy Chart tab. In the Input Selection nested tab you are able to select the test data you wish to use. By default, it will use the data held back, but for this walkthrough, select another table, InternetLogNew.
2. Click on the Lift Chart and select Scatter Plot. SQL Server will plot predicted values against actual values for your test data against the model. If the data mining model was a perfect predictor, all the dots will be along the diagonal. If the data mining model is no good as a predictor, the dots will tend to be scattered randomly across the chart. Generally you will find that the dots will be scattered, with a tendency to group along the diagonal. However, do not expect to always see the data tightly clustered along the diagonal. Data mining models can be helpful, even though their correlation does not look very strong. Any correlation above 0 means that the model can help you make predictions more accurately than random.
Querying the Data Mining Model
This is an area where I believe Microsoft SQL Server has taken data mining to a new level. SQL Server makes it very easy to dynamically query the data using numbers entered at the client, or data coming from a relational database. The following procedure describes how to create queries to get predictions or get more details about the patterns in the model.
1. Click on Mining Model Prediction tab.
2. Click on Select Case Table button and select InternetLogNew.
3. In the lower grid, click on the first row below Source. In the drop down combo list select InternetLogNew table. In the Field column leave it at RID.
4. Repeat the above step for every column in the InternetLogNew table. Although it can be tedious, I suggest that you use the wizard this first time to build the basic query. After you learn how to edit the data mining query, you can bypass the UI and type the column names directly if you want.
5. Add one more row. This time, for source, select Prediction Function and in the Field column select Predict. There are actually two Predict functions, one for scalar and one for tabular arguments, but it won’t matter which one you choose.
6. Using your mouse, drag Response Time from the data mining model (not the SQL table) on the top left of your screen to the Criteria/Argument column. The wizard will fill out the field.
7. Click on the Switch to query result view button at the top left of the screen. You will now see the data from the new Internet records and the associated data mining prediction. That’s how easy it is to query your data mining model with fresh data. The actual query is not complex.
Click on the top left button and select query view. You will see the DMX query that the wizard has generated for you. The DMX language has been designed to be as close to the SQL language as possible. Of course the DMX language uses many functions that aren’t available in SQL, but I think you will find DMX a relatively easy language to learn. In my opinion it is much simpler to learn than MDX, which is the OLAP cube query language.
Note that the OpenQuery requests data from the relational database using the view, InternetLogNew. This view is a query that gets the latest Internet log records (or in our demonstration randomly selects log records that are not in the training data). The T-SQL statement for the view is provided here. You can see that it will get the top 50 records and that it is ordered by a random key, so that a different 50 records are retrieved on each request. In your production implementation, you would not order by a random key, but rather would select the last 50, or the top n records ordered by their predicted profitability, etc.
CREATE view [dbo].[InternetLogNew] as SELECT top 50 l.rid ,l.LogTime ,l.ResponseTime ,l.BytesIn ,l.BytesOut ,g.Country ,g.State ,g.City ,t.resource ,t.resourcetype ,ca.clientagnt ,ca.os ,case when l.ResponseTime>60000 and BytesOut=0 then 'NOK' else hc.OK end OK ,hc.http_status ,o.operation ,r.referringServer ,ch.ClientHost FROM InternetLogTableNew l inner join ClientHosts ch on ch.clienthostid=l.clienthostid inner join IPCountry ipc on ipc.IPCountryID=ch.IPCountryID inner join Geographies g on g.GeographyID=ipc.GeographyID inner join Targets t on t.Targetid=l.targetid inner join ClientAgents ca on ca.clientagentid=l.clientagentid inner join HTTP_codes hc on hc.httpstatuscode=l.Status inner join Operations o on o.operationid=l.operationid inner join Referrers r on r.ReferrerID=l.ReferrerID ORDER BY RAND(convert(float,bytesin)*DATEPART(ms,GETDATE())/1000.0) +RAND(convert(float,bytesout)*DATEPART(ms,GETDATE())/1000.0)
This particular query makes predictions for testing; however, another way customers are using data mining models is to make predictions in bulk. The predictions are then stored back in the data warehouse or in an OLAP cube for aggregated analysis. For example, you might predict summarized revenue by region, salesperson, or product group etc.
Sometimes data mining models are created only to perform an analysis and are then discarded. However, increasingly organizations are building data mining models, like the one above, that become part of the everyday applications available to a wide number of staff and/or customers. Because SQL Server has generated the query for you, it is a very easy task to put the query into a Reporting Services report, and you have a data mining prediction report. Use of data mining reports is transforming the way that data mining is employed. It is taking data mining from the exclusive preserve of elite head office teams to the whole organization.
For example, I created the following report based on the query above, and then created a scatter chart by using the Reporting Services charting component.
To create a Reporting Services report using the data mining model to make predictions from your model, follow these instructions. This will add a Reporting Services project to your data mining solution. I like the way BIDS (Business Intelligence Development Studio) can keep these projects in the same solution. It is not uncommon to have a BIDS solution with three projects:
• SSIS (Integration Services for data extraction and load)
• SSAS (Analysis Services for cube and/or data mining models)
• SSRS (Reporting Services).
1. In your Visual Studio project, right click on the solution and click Add New Project. Select Report Server Project, provide a name and click OK.
2. Right click on Shared Data Sources, and set up a new data source that points to the Analysis Services where your data mining model is deployed.
3. Right click on the Reports folder and select Add New Report.
4. Click Next.
5. Click Query Builder.
6. Click the right most button Design Mode.
7. Now paste the data mining query, the wizard created for you at step 8 above, into the query pane.
8. Click OK.
9. Click Next in the Design the Query pane.
10. Select Tabular as the Report Type.
11. Add all the Available Fields to the Details section of the report by highlighting the fields and clicking on the Details button.
12. Select your favorite Table Style, or leave to the default and click Next. You can change the format of the report later. Let’s just get the query and result set.
13. Enter a name for your report, such as Response Time Prediction.
14. Click Finish.
15. You are now able to preview your report in the Preview pane. You can change the formatting, edit the DMX query, add a scatter Plot chart etc to your report and deploy to your reporting services server.
16. Notice how fast the data mining report is to make 50 predictions. Generally, you will find that data mining predictions are very fast, it is just the model processing that can take some time.
Deploying and Updating the Data Mining Solution
When you create your production data mining model, you may want to schedule a regular update of the model. This is necessary because the factors that predict a profitable customer will change over time, and you want the data mining model to adapt to those changes. So, for example, you might have the mining model reprocess at the end of each week or month. During reprocessing, the original data mining model remains available for online querying.
In this short time we have created a data mining model, validated the model, and used the model to create a user-friendly report. When you embed data mining in a report, it also becomes accessible to non-technical or customer-facing employees who know little about data mining but can benefit from the predictions.
This white paper was intended to help you get started with data mining. Data mining is a wide and exciting area of business intelligence, and many organizations have yet to appreciate how they can profit from its use. I anticipate the application of data mining to grow exponentially over the next few years. By and large, the application of data mining is constrained only by our imagination.
For more information on Data Mining
• http://RichardLees.com.au (real time data mining demonstrations)
• http://RichardLees.blogspot.com (one of many data mining bloggers)
• http://msdn.microsoft.com/en-us/sqlserver/cc511476.aspx (Microsoft Technical Resources for Data Mining)
• http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742.html (good text about SQL Server Data Mining)
• http://www.microsoft.com/sqlserver/2008/en/us/data-mining.aspx (Microsoft data mining marketing page)
• http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/threads (SQL Server data mining forum)
• http://www.sqlserverdatamining.com (SQL Server Data Mining dev team’s site with tips/articles/whitepapers etc)
About the author. Richard Lees* *is a business intelligence developer and consultant. See his site for live demos based on SQL Server Data Mining.