Data Mining Concepts
Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases." In other words, data mining derives patterns and trends that exist in data. These patterns and trends can be collected together and defined as a mining model. Mining models can be applied to specific business scenarios, such as:
- Forecasting sales.
- Targeting mailings toward specific customers.
- Determining which products are likely to be sold together.
- Finding sequences in the order that customers add products to a shopping cart.
An important concept is that building a mining model is part of a larger process that includes everything from defining the basic problem that the model will solve, to deploying the model into a working environment. This process can be defined by using the following six basic steps:
- Defining the Problem
- Preparing Data
- Exploring Data
- Building Models
- Exploring and Validating Models
- Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in Microsoft SQL Server 2005 that you can use to complete each step.
Although the process that is illustrated in the diagram is circular, each step does not necessarily lead directly to the next step. Creating a data mining model is a dynamic and iterative process. After you explore the data, you may find that the data is insufficient to create the appropriate mining models, and that you therefore have to look for more data. You may build several models and realize that they do not answer the problem posed when you defined the problem, and that you therefore must redefine the problem. You may have to update the models after they have been deployed because more data has become available. It is therefore important to understand that creating a data mining model is a process, and that each step in the process may be repeated as many times as needed to create a good model.
SQL Server 2005 provides an integrated environment for creating and working with data mining models, called Business Intelligence Development Studio. The environment includes data mining algorithms and tools that make it easy to build a comprehensive solution for a variety of projects. For more information about using BI Development Studio, see Developing Analysis Services Solutions and Projects.
For more information about how to apply SQL Server tools to the business scenarios listed earlier in this topic, see Data Mining Tutorial.
Defining the Problem
The first step in the data mining process, as highlighted in the following diagram, is to clearly define the business problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining the final objective for the data mining project. These tasks translate into questions such as the following:
- What are you looking for?
- Which attribute of the dataset do you want to try to predict?
- What types of relationships are you trying to find?
- Do you want to make predictions from the data mining model or just look for interesting patterns and associations?
- How is the data distributed?
- How are the columns related, or if there are multiple tables, how are the tables related?
To answer these questions, you may have to conduct a data availability study, to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you may have to redefine the project.
The second step in the data mining process, as highlighted in the following diagram, is to consolidate and clean the data that was identified in the Defining the Problem step.
Microsoft SQL Server 2005 Integration Services (SSIS) contains all the tools that you need to complete this step, including transforms to automate data cleaning and consolidation.
Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as flawed or missing entries. For example, the data might show that a customer bought a product before that customer was actually even born, or that the customer shops regularly at a store located 2,000 miles from her home. Before you start to build models, you must fix these problems. Typically, you are working with a very large dataset and cannot look through every transaction. Therefore, you have to use some form of automation, such as in Integration Services, to explore the data and find the inconsistencies.
The third step in the data mining process, as highlighted in the following diagram, is to explore the prepared data.
You must understand the data in order to make appropriate decisions when you create the models. Exploration techniques include calculating the minimum and maximum values, calculating mean and standard deviations, and looking at the distribution of the data. After you explore the data, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems.
Data Source View Designer in BI Development Studio contains several tools that you can use to explore data.
The fourth step in the data mining process, as highlighted in the following diagram, is to build the mining models.
Before you build a model, you must randomly separate the prepared data into separate training and testing datasets. You use the training dataset to build the model, and the testing dataset to test the accuracy of the model by creating prediction queries. You can use the Percentage Sampling Transformation in Integration Services to split the dataset.
You will use the knowledge that you gain from the Exploring Data step to help define and create a mining model. A model typically contains input columns, an identifying column, and a predictable column. You can then define these columns in a new model by using the Data Mining Extensions (DMX) language, or the Data Mining Wizard in BI Development Studio. For more information about how to use DMX, see Data Mining Extensions (DMX) Reference. For more information about how to use the Data Mining Wizard, see Data Mining Wizard.
After you define the structure of the mining model, you process it, populating the empty structure with the patterns that describe the model. This is known as training the model. Patterns are found by passing the original data through a mathematical algorithm. SQL Server 2005 contains a different algorithm for each type of model that you can build. You can use parameters to adjust each algorithm.
A mining model is defined by a data mining structure object, a data mining model object, and a data mining algorithm.
Microsoft SQL Server 2005 Analysis Services (SSAS) includes the following algorithms:
- Microsoft Decision Trees Algorithm
- Microsoft Clustering Algorithm
- Microsoft Naive Bayes Algorithm
- Microsoft Association Algorithm
- Microsoft Sequence Clustering Algorithm
- Microsoft Time Series Algorithm
- Microsoft Neural Network Algorithm (SSAS)
- Microsoft Logistic Regression Algorithm
- Microsoft Linear Regression Algorithm
Exploring and Validating Models
The fifth step in the data mining process, as highlighted in the following diagram, is to explore the models that you have built and test their effectiveness.
You do not want to deploy a model into a production environment without first testing how well the model performs. Also, you may have created several models and will have to decide which model will perform the best. If none of the models that you created in the Building Models step perform well, you may have to return to a previous step in the process, either by redefining the problem or by reinvestigating the data in the original dataset.
You can explore the trends and patterns that the algorithms discover by using the viewers in Data Mining Designer in BI Development Studio. You can also test how well the models create predictions by using tools in the designer such as the lift chart and classification matrix. These tools require the testing data that you separated from the original dataset in the model-building step.
Deploying and Updating Models
The last step in the data mining process, as highlighted in the following diagram, is to deploy to a production environment the models that performed the best.
After the mining models exist in a production environment, you can perform many tasks, depending on your needs. Following are some of the tasks you can perform:
- Use the models to create predictions, which you can then use to make business decisions. SQL Server provides the DMX language that you can use to create prediction queries, and Prediction Query Builder to help you build the queries.
- Embed data mining functionality directly into an application. You can include Analysis Management Objects (AMO) or an assembly that contains a set of objects that your application can use to create, alter, process, and delete mining structures and mining models. Alternatively, you can send XML for Analysis (XMLA) messages directly to an instance of Analysis Services.
- Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables. For example, if a database is continually updated with potential customers, you could use a mining model together with Integration Services to split the incoming data into customers who are likely to purchase a product and customers who are likely to not purchase a product.
- Create a report that lets users directly query against an existing mining model.
Updating the model is part of the deployment strategy. As more data comes into the organization, you must reprocess the models, thereby improving their effectiveness.