Understanding the Requirements for a Time Series Model (Intermediate Data Mining Tutorial)

When you are preparing data for use in a forecasting model, you must ensure that your data contains a single column that can be used to identify the steps in a time series. That column will be used as the Key Time column and must contain unique numeric values. If there are gaps in the data other than at the beginning or end of a series, you can use the MISSING_VALUE_SUBSTITUTION parameter to fill in the series. Analysis Services provides several options for replacing missing data with values, such as using means or constants.

If you plan to analyze multiple series of data, you must ensure that all series start and end on the same date.

In this task, you will explore the time series data by creating a simple pivot table in Business Intelligence Development Studio. You will also learn about ways to find and handle gaps in data.

To identify the time key for the forecasting model

  1. In the pane, SalesByRegion.dsv [Design], right-click the table vTimeSeries, and then select Explore Data.

    A new tab opens, titled Explore vTimeSeries Table. This tab contains four tabs: Table, Pivot Table, Chart, and Pivot Chart.

  2. On the Table tab, click the Sampling options button to the right of the tabs.

  3. In the Data Exploration Options dialog box, note that the default sampling method, under Sampling methods, is to use the Top count option and sample the first 5000 rows.

  4. Change Sampling method to Random sample, and change Sample count to 1000, then click OK.

    Note

    Changing the sampling options does not affect the data that is used for data mining. The charts and tables are only tools to help you browse and understand the data.

  5. On the Table tab, review the data that is used in the TimeIndex and Reporting Date columns.

    Both are sequences with unique values; however, you want to check the data type of the columns.

    Note

    If you do not see the Reporting Date column, you probably have an older version of the AdventureWorksDW database. This tutorial requires the AdventureWorksDW2008R2 database, to take advantage of new date and time functionality in SQL Server 2008. For more information about the changes in dates and times, see Breaking Changes to Database Engine Features in SQL Server 2008 R2.

  6. Click the tab, SalesByRegion.dsv [Design].

  7. Select the column, Reporting Date.

    The Properties window indicates that the TimeIndex field has the data type System.Int32, whereas the field Reporting Date has the data type System.DateTime. The Microsoft Time Series algorithm does not require a datetime data type, only that the values be distinct and ordered. Therefore, either column can be used as the time key for the forecasting model. However, because you want to represent your sales data forecast by using calendar dates, you will use the Reporting Date column as the unique series identifier.

To set the key in the data source view

  1. In the pane SalesByRegion.dsv, select the vTimeSeries table.

  2. Right-click the column, Reporting Date, and select Set Logical Primary Key.

To check for gaps in the data series by using a pivot chart

  1. In the pane SalesByRegion.dsv [Design], right-click the table vTimeSeries, and then select Explore Data.

  2. Click the Pivot Table tab.

  3. In the PivotTable Field List, dialog box, select the fields Quantity and Amount and drag them to the Drag Totals or Detail Fields Here area of the table.

  4. Similarly, drag the field TimeIndex to the box, Drop Column Fields Here.

    Note

    The TimeIndex field is not the field that you will use to represent the time series, but is provided for backward compatibility. Also, the TimeIndex field provides a simpler view of the series that you can use when browsing the data in the pivot chart.

  5. Drag the field ModelRegion to the box, Drop Row Fields Here.

    The pivot table that is created shows the list of products and regions in the left column, followed by columns that contain the sales quantity and amount for each calendar month and year.

  6. Drag the scroll bar to the right to see the starting date of sales data for the T1000 series.

    You can see from this table that there is no data for the T1000 product line until July 2003. An important requirement of the Microsoft Time Series algorithm in SQL Server 2008 is that any series that you include in a single model should have the same ending point. Because the T1000 bicycle model was introduced in 2003, the data for this series starts later than for other bicycle models, but the series ends on the same date so the data is acceptable.

  7. Change the field used to represent the time series by dragging TimeIndex out of the table.

  8. Drag one of the following fields from the PivotTable Field List to the box, Drop Column Fields Here. Try viewing the time series in different ways, by dragging columns out of the table, or by selecting different aggregations, such as quarter or month.

    • ReportingDate

    • ReportingDate By Week

    • ReportingDate By Month

Handling Missing Data (Optional)

If any series has missing data, you might get an error when you try to process the model. You have several ways to work around missing data:

  • Letting Analysis Services fill in missing values, by calculating a mean, or by using a previous value. You do this by setting a parameter when you create the mining model. For more information, see Microsoft Time Series Algorithm Technical Reference.

  • Altering the data source or filtering the underlying view to eliminate the series or to replace values. You can do this in the relational data source, or you can modify the data source view by creating customer named queries or named calculations. For more information, see Designing Data Source Views (Analysis Services). A later task in this lesson provides an example of how to build both a named query and a custom calculation.

For this scenario, some data is missing at the beginning of one series, but all series end on the same date, and there are no missing values. Therefore you do not need to make any additional changes, because the data meets the requirements for a time series model.

To close the data source view designer

  • Right-click the tab, Explore vTimeSeries Table, and select Close.