The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure the development of your data science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed. If you are using another data science lifecycle, such as CRISP-DM, KDD or your organization's own custom process, you can still use the task-based TDSP in the context of those development lifecycles.
This lifecycle has been designed for data science projects that are intended to ship as part of intelligent applications. These applications deploy machine learning or artificial intelligence models for predictive analytics. Exploratory data science projects and ad hoc or one-off analytics projects can also benefit from using this process, but in such cases some of the steps described may not be needed.
Here is a visual representation of the Team Data Science Process lifecycle.
The TDSP lifecycle is composed of five major stages that are executed iteratively. These include:
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment
- Customer Acceptance
For each stage, we provide the following information:
- Goals: the specific objectives.
- How to do it: the specific tasks outlined and guidance provided on completing them.
- Artifacts: the deliverables and the support for producing them.
1. Business Understanding
- The key variables that serve as the model targets are specified, along with the related metrics that are used to determine the success of the project.
- The relevant data sources are identified that the business has access to or needs to obtain.
How to do it
There are two main tasks addressed in this stage:
- Define Objectives: Work with your customer and other stakeholders to understand and identify the business problems. Formulate questions that define the business goals and that data science techniques can target.
- Identify data sources: Find the relevant data that helps you answer the questions that define the objectives of the project.
1.1 Define Objectives
A central objective of this step is to identify the key business variables that the analysis needs to predict. These variables are referred to as the model targets, and the metrics associated with them are used to determine the success of the project. Two examples of such targets are a sales forecast and the probability of an order being fraudulent.
Define the project goals by asking and refining "sharp" questions that are relevant, specific, and unambiguous. Data science is the process of using names and numbers to answer such questions. For additional guidance on asking sharp questions, see the How to do Data Science blog. Data science and machine learning are typically used to answer five types of questions:
- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is this weird? (anomaly detection)
- Which option should be taken? (recommendation)
Determine which of these questions you are asking and how answering it achieves your business goals.
Define the project team by specifying the roles and responsibilities of its members. Develop a high-level milestone plan that you iterate on as more information is discovered.
Define success metrics. For example: Achieve customer churn prediction accuracy of X% by the end of this 3-month project, so that we can offer promotions to reduce churn. The metrics must be SMART:
- Specific
- Measurable
- Achievable
- Relevant
- Time-bound
1.2 Identify Data Sources
Identify data sources that contain known examples of answers to your sharp questions. Look for the following data:
- Data that is Relevant to the question. Do we have measures of the target and features that are related to the target?
- Data that is an Accurate measure of our model target and the features of interest.
It is not uncommon, for example, to find that existing systems need to collect and log additional kinds of data to address the problem and achieve the project goals. In this case, you may want to look for external data sources or update your systems to collect new data.
Here are the deliverables in this stage:
- Charter Document: A standard template is provided in the TDSP project structure definition. This is a living document that is updated throughout the project as new discoveries are made and as business requirements change. The key is to iterate upon this document, adding more detail, as you progress through the discovery process. Keep the customer and other stakeholders involved in making the changes and clearly communicate the reasons for the changes to them.
- Data Sources: This is the Raw Data Sources section of the Data Definitions report that is found in the TDSP project Data Report folder. It specifies the original and destination locations for the raw data. In later stages, you fill in additional details like scripts to move the data to your analytic environment.
- Data Dictionaries: This document provides descriptions of the data that is provided by the client. These descriptions include information about the schema (data types, information on validation rules, if any) and the entity-relation diagrams if available.
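As an illustration of what a data dictionary entry might capture, here is a minimal Python sketch that records a column's data type and an optional validation rule, then checks a record against those entries. The `ColumnSpec` class, the `validate_record` helper, and the `order_id` and `amount` columns are hypothetical names chosen for this example. They are not part of the TDSP templates.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """One data dictionary entry: name, expected type, description, optional rule."""
    name: str
    dtype: type
    description: str
    rule: Optional[Callable[[Any], bool]] = None  # validation rule, if any


# Hypothetical dictionary for an orders table (illustrative columns only).
ORDERS_DICTIONARY = [
    ColumnSpec("order_id", str, "Unique order identifier", lambda v: len(v) > 0),
    ColumnSpec("amount", float, "Order total in account currency", lambda v: v >= 0),
]


def validate_record(record: dict, specs: list) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for spec in specs:
        if spec.name not in record:
            problems.append(f"{spec.name}: missing")
            continue
        value = record[spec.name]
        if not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
        elif spec.rule is not None and not spec.rule(value):
            problems.append(f"{spec.name}: failed validation rule")
    return problems
```

Calling `validate_record({"order_id": "A1", "amount": 10.0}, ORDERS_DICTIONARY)` returns an empty list, while a record with a missing or invalid column returns one message per problem.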
2. Data Acquisition and Understanding
- A clean, high-quality dataset whose relationships to the target variables are understood, located in the appropriate analytics environment and ready for modeling.
- A solution architecture of the data pipeline to refresh and score data regularly has been developed.
How to do it
There are three main tasks addressed in this stage:
- Ingest the data into the target analytic environment.
- Explore the data to determine if the data quality is adequate to answer the question.
- Set up a data pipeline to score new or regularly refreshed data.
2.1 Ingest the data
Set up the process to move the data from source locations to the target locations where analytics operations like training and predictions are to be executed. For technical details and options on how to do this with various Azure data services, see Load data into storage environments for analytics.
2.2 Explore the data
Before you train your models, you need to develop a sound understanding of the data. Real-world datasets are often noisy or are missing values or have a host of other discrepancies. Data summarization and visualization can be used to audit the quality of your data and provide the information needed to process the data before it is ready for modeling. This process is often iterative.
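A minimal sketch of this kind of data summarization, using only the Python standard library, might look like the following. It assumes the table is a list of dictionaries and that missing values appear as `None` or empty strings. The function names and the sample columns (`age`, `plan`) are hypothetical, and this is not how IDEAR or any TDSP utility is implemented.

```python
import statistics


def summarize_column(values):
    """Summarize one column: missing count plus basic statistics for numeric values.

    Missing values are assumed to be represented as None or empty strings.
    """
    present = [v for v in values if v is not None and v != ""]
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only report numeric statistics when every present value is numeric.
    if numeric and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.fmean(numeric)
    return summary


def summarize_table(rows):
    """Summarize every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: summarize_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 51, "plan": ""},
    {"age": 29, "plan": "basic"},
]
report = summarize_table(rows)
```

Here `report["age"]` shows one missing value out of four, which is the kind of finding that feeds the cleaning step before modeling.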
TDSP provides an automated utility called IDEAR to help visualize the data and prepare data summary reports. We recommend starting with IDEAR to explore the data interactively and develop an initial understanding with no coding, and then writing custom code for data exploration and visualization. For guidance on cleaning the data, see Tasks to prepare data for enhanced machine learning.
Once you are satisfied with the quality of the cleansed data, the next step is to better understand the patterns that are inherent in the data that help you choose and develop an appropriate predictive model for your target. Look for evidence for how well connected the data is to the target and whether there is sufficient data to move forward with the next modeling steps. Again, this process is often iterative. You may need to find new data sources with more accurate or more relevant data to augment the dataset initially identified in the previous stage.
2.3 Set up a data pipeline
In addition to the initial ingestion and cleaning of the data, you typically need to set up a process to score new data or refresh the data regularly as part of an ongoing learning process. This can be done by setting up a data pipeline or workflow. Here is an example of how to set up a pipeline with Azure Data Factory.
A solution architecture of the data pipeline is developed in this stage. The pipeline is also developed in parallel with the following stages of the data science project. The pipeline may be batch-based or streaming/real-time or a hybrid depending on your business needs and the constraints of your existing systems into which this solution is being integrated.
The following are the deliverables in this stage.
- Data Quality Report: This report contains data summaries, the relationships between each attribute and the target, variable rankings, and so on. The IDEAR tool provided as part of TDSP can quickly generate this report on any tabular dataset such as a CSV file or a relational table.
- Solution Architecture: This can be a diagram or description of your data pipeline used to run scoring or predictions on new data once you have built a model. It also contains the pipeline to retrain your model based on new data. The document is stored in the Project directory when using the TDSP directory structure template.
- Checkpoint Decision: Before you begin full feature engineering and model building, you can reevaluate the project to determine whether the value expected is sufficient to continue pursuing it. You may, for example, be ready to proceed, need to collect more data, or abandon the project because the data does not exist to answer the question.
3. Modeling
- Optimal data features for the machine learning model.
- An informative ML model that predicts the target most accurately.
- An ML model that is suitable for production.
How to do it
There are three main tasks addressed in this stage:
- Feature Engineering: create data features from the raw data to facilitate model training.
- Model training: find the model that answers the question most accurately by comparing the success metrics of the candidate models.
- Determine if your model is suitable for production.
3.1 Feature Engineering
Feature engineering involves inclusion, aggregation and transformation of raw variables to create the features used in the analysis. If you want insight into what is driving a model, then you need to understand how features are related to each other and how the machine learning algorithms are to use those features. This step requires a creative combination of domain expertise and insights obtained from the data exploration step. This is a balancing act of finding and including informative variables while avoiding too many unrelated variables. Informative variables improve our result; unrelated variables introduce unnecessary noise into the model. You also need to generate these features for any new data obtained during scoring. So the generation of these features can only depend on data that is available at the time of scoring. For technical guidance on feature engineering when using various Azure data technologies, see Feature engineering in the Data Science Process.
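The constraint that features can depend only on data available at the time of scoring can be sketched as follows. This example aggregates raw transactions into features while ignoring anything recorded at or after a cutoff time. The `build_features` function, the record keys (`customer_id`, `timestamp`, `amount`), and the chosen aggregates are assumptions made for illustration.

```python
from datetime import datetime


def build_features(transactions, customer_id, as_of):
    """Aggregate raw transactions into features using only data known at `as_of`.

    `transactions` is a list of dicts with hypothetical keys
    "customer_id", "timestamp" (datetime), and "amount" (float).
    """
    history = [
        t for t in transactions
        if t["customer_id"] == customer_id and t["timestamp"] < as_of
    ]
    amounts = [t["amount"] for t in history]
    last_seen = max((t["timestamp"] for t in history), default=None)
    return {
        "txn_count": len(amounts),
        "total_amount": sum(amounts),
        "avg_amount": sum(amounts) / len(amounts) if amounts else 0.0,
        "days_since_last_txn": (as_of - last_seen).days if last_seen else None,
    }


transactions = [
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 5), "amount": 20.0},
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 20), "amount": 40.0},
    # Recorded after the cutoff below, so it must not influence the features.
    {"customer_id": "c1", "timestamp": datetime(2024, 2, 10), "amount": 99.0},
    {"customer_id": "c2", "timestamp": datetime(2024, 1, 7), "amount": 5.0},
]
features = build_features(transactions, "c1", as_of=datetime(2024, 2, 1))
```

Using the same function with the same cutoff logic for both training and scoring is one way to keep the two feature sets consistent.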
3.2 Model Training
Depending on the type of question you are trying to answer, there are many modeling algorithms available. For guidance on choosing the algorithms, see How to choose algorithms for Microsoft Azure Machine Learning. Although this article is written for Azure Machine Learning, the guidance it provides is useful for any machine learning project.
The process for model training includes the following steps:
- Split the input data randomly for modeling into a training data set and a test data set.
- Build the models using the training data set.
- Evaluate a series of competing machine learning algorithms on both the training and test datasets, along with their associated tuning parameters (known as a parameter sweep), that are geared toward answering the question of interest with the current data.
- Determine the “best” solution to answer the question by comparing the success metric between alternative methods.
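The steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the TDSP tooling: the data is synthetic, the model is a simple k-nearest-neighbors classifier on one feature, the success metric is accuracy, and the parameter sweep covers only the hypothetical candidate values of `k` shown.

```python
import random


def split_data(rows, test_fraction=0.25, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > len(neighbors) else 0


def accuracy(train, rows, k):
    """Fraction of rows whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in rows)
    return correct / len(rows)


def parameter_sweep(train, test, candidate_ks):
    """Evaluate each candidate k on both sets and return the results plus the best k."""
    results = {
        k: {"train": accuracy(train, train, k), "test": accuracy(train, test, k)}
        for k in candidate_ks
    }
    best_k = max(candidate_ks, key=lambda k: results[k]["test"])
    return results, best_k


# Synthetic data: the label is 1 when x > 5, with about 10% of labels flipped as noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

train_set, test_set = split_data(data)
results, best_k = parameter_sweep(train_set, test_set, candidate_ks=[1, 5, 15])
```

Comparing the `train` and `test` accuracy for each `k` shows how a model that fits the training data perfectly can still generalize worse than a smoother one. In practice, a separate validation set or cross-validation is commonly used to pick the parameters, so that the test set remains an unbiased check.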
Avoid leakage: Data leakage can be caused by the inclusion of data from outside the training dataset that allows a model or machine learning algorithm to make unrealistically good predictions. Leakage is a common reason why data scientists get nervous when they get predictive results that seem too good to be true. These dependencies can be hard to detect. Avoiding leakage often requires iterating between building an analysis dataset, creating a model, and evaluating the accuracy.
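One common form of leakage is computing preprocessing statistics on the full dataset before splitting it. The following sketch contrasts that with fitting the statistics on the training data only. The feature values and the helper names `fit_scaler` and `apply_scaler` are hypothetical, and standardization is used here only as a convenient example of a preprocessing step.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and population standard deviation from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values; the test set contains an extreme value.
train_values = [10.0, 12.0, 11.0, 13.0]
test_values = [12.0, 50.0]

# Leaky: statistics computed on train + test, so test information shapes the features.
leaky_mean, leaky_std = fit_scaler(train_values + test_values)

# Safe: statistics computed on the training data only, then reused for the test data.
safe_mean, safe_std = fit_scaler(train_values)
test_scaled = apply_scaler(test_values, safe_mean, safe_std)
```

In this example the leaky mean (18.0) differs markedly from the training-only mean (11.5), which shows how a single test value can influence features that the model is trained on.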
We provide an Automated Modeling and Reporting tool with TDSP that is able to run through multiple algorithms and parameter sweeps to produce a baseline model. It also produces a baseline modeling report summarizing performance of each model and parameter combination including variable importance. This process is also iterative as it can drive further feature engineering.
The artifacts produced in this stage include:
- Feature Sets: The features developed for the modeling are described in the Feature Sets section of the Data Definition report. It contains pointers to the code to generate the features and description on how the feature was generated.
- Model Report: For each model that is tried, a standard, template-based report that provides details on each experiment is produced.
- Checkpoint Decision: Evaluate whether the model is performing well enough to deploy it to a production system. Some key questions to ask are:
- Does the model answer the question with sufficient confidence given the test data?
- Should you try any alternative approaches: collect additional data, do more feature engineering, or experiment with other algorithms?
4. Deployment
- Models with a data pipeline are deployed to a production or production-like environment for final user acceptance.
How to do it
There is one main task addressed in this stage:
- Operationalize the model: Deploy the model and pipeline to a production or production-like environment for application consumption.
4.1 Operationalize a model
Once you have a set of models that perform well, they can be operationalized for other applications to consume. Depending on the business requirements, predictions are made either in real-time or on a batch basis. To be operationalized, the models have to be exposed with an open API interface that is easily consumed from various applications such as online websites, spreadsheets, dashboards, or line of business and backend applications. For examples of model operationalization with an Azure Machine Learning web service, see Deploy an Azure Machine Learning web service. It is also a best practice to build telemetry and monitoring into the production model and the data pipeline deployed to help with subsequent system status reporting and troubleshooting.
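As a rough sketch of exposing a model through an API for real-time predictions, the following example wraps a placeholder scoring function in a small JSON-over-HTTP endpoint built with the Python standard library. Everything specific here is an assumption made for illustration: the `predict` formula, the feature names `tenure_months` and `support_tickets`, and the local port. It does not represent an Azure Machine Learning web service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: a hypothetical linear score standing in for a trained model."""
    return 0.5 * features["tenure_months"] - 2.0 * features["support_tickets"]


def handle_request(body):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)
        score = predict(features)
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "invalid input"}).encode()
    return 200, json.dumps({"score": score}).encode()


class ScoringHandler(BaseHTTPRequestHandler):
    """Accepts POST requests whose body is a JSON object of feature values."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the scoring service on a hypothetical local port (blocks until stopped)."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Keeping the request handling in a plain function such as `handle_request` makes it straightforward to test the scoring logic without starting a server, and the same function could be reused from a batch job.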
The artifacts produced in this stage include:
- Status dashboard of system health and key metrics.
- Final modeling report with deployment details.
- Final solution architecture document.
5. Customer Acceptance
- Finalize the project deliverables: confirm that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives.
How to do it
There are two main tasks addressed in this stage:
- System validation: confirm the deployed model and pipeline are meeting customer needs.
- Project hand-off: to the entity that is to run the system in production.
The customer should validate that the system meets their business needs and that it answers the questions with accuracy acceptable for deploying the system to production for use by their client application. All the documentation is finalized and reviewed. A hand-off of the project to the entity responsible for operations is completed. This could be, for example, an IT or customer data science team, or an agent of the customer that is responsible for running the system in production.
The main artifact produced in this final stage is the Exit Report of Project for Customer. This is the technical report containing all the details of the project that are useful for learning about and operating the system. An Exit Report template is provided by TDSP that can be used as is or customized for specific client needs.
The Team Data Science Process lifecycle is modeled as a sequence of iterated steps that provide guidance on the tasks needed to use predictive models. These models can be deployed in a production environment and used to build intelligent applications. The goal of this process lifecycle is to keep moving a data science project toward a clear engagement end point. While it is true that data science is an exercise in research and discovery, being able to clearly communicate this to your team and your customers using a well-defined set of artifacts that employ standardized templates can help avoid misunderstandings and increase the chance of successfully completing a complex data science project.
Full end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios are also provided. They are listed and linked with thumbnail descriptions in the Team Data Science Process walkthroughs topic.