Explore and prepare data with the Dataset class (Preview)

Learn how to explore and prepare data with the azureml-datasets package in the Azure Machine Learning SDK. The Dataset class (preview) enables you to explore and prepare your data by providing functions such as sampling, summary statistics, and intelligent transformations. Transformation steps are saved in Dataset definitions, which can handle multiple large files of different schemas in a highly scalable manner.

Important

Some Dataset class (preview) methods have dependencies on the azureml-dataprep package (GA). While transformations can be performed directly with the GA'ed Data Prep functions, we recommend the Dataset wrappers described in this article if you are building a new solution. Azure Machine Learning Datasets (preview) allow you not only to transform your data, but also to snapshot data and store versioned dataset definitions. Datasets are the next version of the Data Prep SDK, offering expanded functionality for managing datasets in AI solutions.

Prerequisites

To explore and prepare your data, you'll need:

  • The Azure Machine Learning SDK for Python installed, which includes the azureml-datasets package.

  • The azureml-dataprep package (GA), which several of the transformation examples in this article import directly.

  • The sample data files referenced in the code, such as crime.csv and city.json.
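
As a quick sanity check (a minimal sketch, assuming the packages are already installed in your current environment), confirm that the SDK and Data Prep imports resolve:

# minimal environment check: both packages must import for the examples below
# assumes azureml-sdk (with azureml-datasets) and azureml-dataprep are installed
import azureml.core
import azureml.dataprep as dprep

print("Azure ML SDK version:", azureml.core.VERSION)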

Sampling

Take a sample of your data to get an initial understanding of its structure and content. At this time, the sample() method from the Dataset class supports Top N, Simple Random, and Stratified sampling strategies.

from azureml.core.dataset import Dataset
import random

# create an in-memory Dataset from a local file
dataset = Dataset.auto_read_files('./data/crime.csv')

# create seed for Simple Random and Stratified sampling
seed = random.randint(0, 4294967295)

Top N sample

For Top N sampling, the first n records of your Dataset are your sample. This is helpful if you are just trying to get an idea of what your data records look like or to see what fields are in your data.

top_n_sample_dataset = dataset.sample('top_n', {'n': 5})
top_n_sample_dataset.to_pandas_dataframe()
ID Case Number Date Block IUCR Primary Type ...
0 10498554 HZ239907 4/4/2016 23:56 007XX E 111TH ST 1153 DECEPTIVE PRACTICE ...
1 10516598 HZ258664 4/15/2016 17:00 082XX S MARSHFIELD AVE 890 THEFT ...
2 10519196 HZ261252 4/15/2016 10:00 104XX S SACRAMENTO AVE 1154 DECEPTIVE PRACTICE ...
3 10519591 HZ261534 4/15/2016 9:00 113XX S PRAIRIE AVE 1120 DECEPTIVE PRACTICE ...
4 10534446 HZ277630 4/15/2016 10:00 055XX N KEDZIE AVE 890 THEFT ...
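
If you just want a quick look at the first few records rather than a reusable sample Dataset, the head() method (used later in this article) returns them directly as a pandas DataFrame:

# quick peek at the first records without creating a separate sample Dataset
dataset.head(5)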

Simple random sample

In Simple Random sampling, every member of the data population has an equal chance of being selected for the sample. With the simple_random strategy, records are selected from your Dataset based on the specified probability, and a modified Dataset is returned. The seed parameter is optional.

simple_random_sample_dataset = dataset.sample('simple_random', {'probability':0.3, 'seed': seed})
simple_random_sample_dataset.to_pandas_dataframe()
ID Case Number Date Block IUCR Primary Type ...
0 10516598 HZ258664 4/15/2016 17:00 082XX S MARSHFIELD AVE 890 THEFT ...
1 10519196 HZ261252 4/15/2016 10:00 104XX S SACRAMENTO AVE 1154 DECEPTIVE PRACTICE ...
2 10534446 HZ277630 4/15/2016 10:00 055XX N KEDZIE AVE 890 THEFT ...
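
Note that with a probability-based strategy the sample size is itself random: a probability of 0.3 returns roughly, not exactly, 30% of the records, and the result varies between runs unless a seed is fixed. A quick check of the actual sample size:

# the returned record count is approximate; fixing `seed` makes it reproducible
print(len(simple_random_sample_dataset.to_pandas_dataframe()))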

Stratified sample

Stratified samples ensure that certain groups of a population are represented in the sample. In the stratified strategy, the population is divided into strata, or subgroups, based on similarities, and records are randomly selected from each stratum according to the strata weights indicated by the fractions parameter.

In the following example, records are grouped by the specified columns and included in the sample according to the weight assigned to that stratum in fractions. If a stratum is not specified or a record cannot be grouped, the default weight to sample is 0.

# take 50% of records with `Primary Type` as `THEFT` and 20% of records with `Primary Type` as `DECEPTIVE PRACTICE` into sample Dataset
fractions = {}
fractions[('THEFT',)] = 0.5
fractions[('DECEPTIVE PRACTICE',)] = 0.2

sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions, 'seed': seed})

sample_dataset.to_pandas_dataframe()
ID Case Number Date Block IUCR Primary Type ...
0 10516598 HZ258664 4/15/2016 17:00 082XX S MARSHFIELD AVE 890 THEFT ...
1 10534446 HZ277630 4/15/2016 10:00 055XX N KEDZIE AVE 890 THEFT ...
2 10535059 HZ278872 4/15/2016 4:30 004XX S KILBOURN AVE 810 THEFT ...
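
To sanity-check the strata proportions that made it into the sample, count the records per Primary Type with pandas (a quick check, not part of the Dataset API):

# count sampled records per stratum
sample_df = sample_dataset.to_pandas_dataframe()
print(sample_df['Primary Type'].value_counts())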

Explore with summary statistics

Detect anomalies, missing values, or error counts with the get_profile() method. This function gets the profile and summary statistics of your data, which in turn helps determine the necessary data preparation operations to apply.

dataset.get_profile()
Type Min Max Count Missing Count Not Missing Count Percent missing Error Count Empty count 0.1% Quantile 1% Quantile 5% Quantile 25% Quantile 50% Quantile 75% Quantile 95% Quantile 99% Quantile 99.9% Quantile Mean Standard Deviation Variance Skewness Kurtosis
ID FieldType.INTEGER 1.04986e+07 1.05351e+07 10.0 0.0 10.0 0.0 0.0 0.0 1.04986e+07 1.04992e+07 1.04986e+07 1.05166e+07 1.05209e+07 1.05259e+07 1.05351e+07 1.05351e+07 1.05351e+07 1.05195e+07 12302.7 1.51358e+08 -0.495701 -1.02814
Case Number FieldType.STRING HZ239907 HZ278872 10.0 0.0 10.0 0.0 0.0 0.0
Date FieldType.DATE 2016-04-04 23:56:00+00:00 2016-04-15 17:00:00+00:00 10.0 0.0 10.0 0.0 0.0 0.0
Block FieldType.STRING 004XX S KILBOURN AVE 113XX S PRAIRIE AVE 10.0 0.0 10.0 0.0 0.0 0.0
IUCR FieldType.INTEGER 810 1154 10.0 0.0 10.0 0.0 0.0 0.0 810 850 810 890 1136 1153 1154 1154 1154 1058.5 137.285 18847.2 -0.785501 -1.3543
Primary Type FieldType.STRING DECEPTIVE PRACTICE THEFT 10.0 0.0 10.0 0.0 0.0 0.0
Description FieldType.STRING BOGUS CHECK OVER $500 10.0 0.0 10.0 0.0 0.0 0.0
Location Description FieldType.STRING SCHOOL, PUBLIC, BUILDING 10.0 0.0 10.0 0.0 0.0 1.0
Arrest FieldType.BOOLEAN False False 10.0 0.0 10.0 0.0 0.0 0.0
Domestic FieldType.BOOLEAN False False 10.0 0.0 10.0 0.0 0.0 0.0
Beat FieldType.INTEGER 531 2433 10.0 0.0 10.0 0.0 0.0 0.0 531 531 531 614 1318.5 1911 2433 2433 2433 1371.1 692.094 478994 0.105418 -1.60684
District FieldType.INTEGER 5 24 10.0 0.0 10.0 0.0 0.0 0.0 5 5 5 6 13 19 24 24 24 13.5 6.94822 48.2778 0.0930109 -1.62325
Ward FieldType.INTEGER 1 48 10.0 0.0 10.0 0.0 0.0 0.0 1 5 1 9 22.5 40 48 48 48 24.5 16.2635 264.5 0.173723 -1.51271
Community Area FieldType.INTEGER 4 77 10.0 0.0 10.0 0.0 0.0 0.0 4 8.5 4 24 37.5 71 77 77 77 41.2 26.6366 709.511 0.112157 -1.73379
FBI Code FieldType.INTEGER 6 11 10.0 0.0 10.0 0.0 0.0 0.0 6 6 6 6 11 11 11 11 11 9.4 2.36643 5.6 -0.702685 -1.59582
X Coordinate FieldType.INTEGER 1.16309e+06 1.18336e+06 10.0 7.0 3.0 0.7 0.0 0.0 1.16309e+06 1.16309e+06 1.16309e+06 1.16401e+06 1.16678e+06 1.17921e+06 1.18336e+06 1.18336e+06 1.18336e+06 1.17108e+06 10793.5 1.165e+08 0.335126 -2.33333
Y Coordinate FieldType.INTEGER 1.8315e+06 1.908e+06 10.0 7.0 3.0 0.7 0.0 0.0 1.8315e+06 1.8315e+06 1.8315e+06 1.83614e+06 1.85005e+06 1.89352e+06 1.908e+06 1.908e+06 1.908e+06 1.86319e+06 39905.2 1.59243e+09 0.293465 -2.33333
Year FieldType.INTEGER 2016 2016 10.0 0.0 10.0 0.0 0.0 0.0 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 0 0 NaN NaN
Updated On FieldType.DATE 2016-05-11 15:48:00+00:00 2016-05-27 15:45:00+00:00 10.0 0.0 10.0 0.0 0.0 0.0
Latitude FieldType.DECIMAL 41.6928 41.9032 10.0 7.0 3.0 0.7 0.0 0.0 41.6928 41.6928 41.6928 41.7057 41.7441 41.8634 41.9032 41.9032 41.9032 41.78 0.109695 0.012033 0.292478 -2.33333
Longitude FieldType.DECIMAL -87.6764 -87.6043 10.0 7.0 3.0 0.7 0.0 0.0 -87.6764 -87.6764 -87.6764 -87.6734 -87.6645 -87.6194 -87.6043 -87.6043 -87.6043 -87.6484 0.0386264 0.001492 0.344429 -2.33333
Location FieldType.STRING (41.903206037, -87.676361925) 10.0 0.0 10.0 0.0 0.0 7.0
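
If you prefer to read specific statistics programmatically rather than scanning the full table, the returned profile exposes per-column statistics. The following is a sketch, assuming the azureml-dataprep DataProfile shape with a columns dictionary and per-column attributes such as missing_count:

profile = dataset.get_profile()

# assumption: the profile object exposes a `columns` dictionary of per-column
# statistics, as in the azureml-dataprep DataProfile
print(profile.columns['Latitude'].missing_count)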

Impute missing values

In Datasets, null values, NaNs, and values that contain no content are considered missing values. Missing values can impact the performance of your machine learning models or lead to invalid conclusions. Imputation is a common approach for addressing them, and is useful when you have a high percentage of records with missing values that you cannot simply delete.

From the Dataset profile generated in the preceding section, we see that the Latitude and Longitude columns have a high percentage of missing values. In this example, we calculate the mean and impute missing values for those two columns.

First, get the latest definition of the Dataset with get_definition() and pare down the data with keep_columns(), so we only view the columns we want to address.

from azureml.core.dataset import Dataset
import azureml.dataprep as dprep

# get the latest definition of Dataset
ds_def = dataset.get_definition()

# keep useful columns for this example
ds_def = ds_def.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])
ds_def.head(3)
ID Arrest Latitude Longitude
0 10498554 False 41.692834 -87.604319
1 10516598 False 41.744107 -87.664494
2 10519196 False NaN NaN

Next, check the MEAN value of the Latitude column using the summarize() function. This function accepts an array of columns in the group_by_columns parameter to specify the aggregation level. The summary_columns parameter accepts a list of SummaryColumnsValue objects, each of which specifies the current column name, the new calculated field name, and the SummaryFunction to perform.

lat_mean = ds_def.summarize(group_by_columns = ['Arrest'],
                            summary_columns = [dprep.SummaryColumnsValue(column_id = 'Latitude',
                                                                         summary_column_name = 'Latitude_MEAN',
                                                                         summary_function = dprep.SummaryFunction.MEAN)])
lat_mean = lat_mean.filter(lat_mean['Arrest'] == False)
lat_mean.head(1)
Arrest Latitude_MEAN
0 False 41.780049
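
As an optional cross-check outside the impute workflow, you can compute the same mean with pandas over the rows returned by head(), since this example Dataset has only 10 records:

# cross-check the MEAN with pandas; head() returns a pandas DataFrame
df = ds_def.head(10)
print(df.loc[df['Arrest'] == False, 'Latitude'].mean())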

Once you verify the values to impute, use ImputeMissingValuesBuilder to learn a fixed expression that imputes the columns with a calculated MIN, MAX, or MEAN value, or with a CUSTOM value. When group_by_columns is specified, missing values are imputed by group, with MIN, MAX, and MEAN calculated per group.

ImputeColumnArguments accepts a column_id string and a ReplaceValueFunction to specify the impute type. For the missing longitude values, impute them with -87 based on external knowledge.

# impute with MEAN
impute_mean = dprep.ImputeColumnArguments(column_id = 'Latitude',
                                          impute_function = dprep.ReplaceValueFunction.MEAN)

# impute with custom value -87
impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',
                                            custom_impute_value=-87)

Impute steps can be chained together into an ImputeMissingValuesBuilder object using the Builders class function impute_missing_values(). The impute_columns parameter accepts an array of ImputeColumnArguments objects.

# get instance of ImputeMissingValuesBuilder
impute_builder = ds_def.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],
                                                        group_by_columns=['Arrest'])

Call the learn() function to store the impute steps, and apply them to a dataflow object using to_dataflow().

impute_builder.learn()
ds_def = impute_builder.to_dataflow()
ds_def.head(3)

As shown in the following output table, the missing latitude was imputed with the MEAN value of the Arrest==False group, and the missing longitude was imputed with -87.

ID Arrest Latitude Longitude
0 10498554 False 41.692834 -87.604319
1 10516598 False 41.744107 -87.664494
2 10519196 False 41.780049 -87.000000

Update the Dataset definition with update_definition() to keep the transformation steps you performed.

dataset = dataset.update_definition(ds_def, 'Impute Missing')
dataset.head(3)
ID Arrest Latitude Longitude
0 10498554 False 41.692834 -87.604319
1 10516598 False 41.744107 -87.664494
2 10519196 False 41.780049 -87.000000
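
You can regenerate the profile to confirm that the imputation took effect; the Missing Count for Latitude and Longitude should now be 0.

# re-profile the updated Dataset; Latitude and Longitude should report no missing values
dataset.get_profile()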

Create assertion rules

Frequently, the data we work with while cleaning and preparing data is just a subset of the total data we need for production. As a result, some of the assumptions we make as part of our cleaning might turn out to be false. For example, in a data set that updates continuously, a column that originally only contained numbers within a certain range might contain a wider range of values in later executions. These errors often result in either broken pipelines or bad data.

Datasets support creating assertions on data, which are evaluated as the pipeline executes. These assertions let us verify that our assumptions on the data continue to be accurate and, when not, to handle failures accordingly.

For example, if you want to constrain Latitude and Longitude values in your Dataset to specific numeric ranges, the assert_value() method ensures this is always the case.

from azureml.dataprep import value
from azureml.core.dataset import Dataset

# get the latest definition of the Dataset
ds_def = dataset.get_definition()

# set assertion rules for `Latitude` and `Longitude` columns
ds_def = ds_def.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')
ds_def = ds_def.assert_value('Longitude', (value <= 180) & (value >= -87), error_code='InvalidLongitude')

ds_def.get_profile()
Type Min Max Count Missing Count Not Missing Count Percent missing Error Count Empty count 0.1% Quantile 1% Quantile 5% Quantile 25% Quantile 50% Quantile 75% Quantile 95% Quantile 99% Quantile 99.9% Quantile Mean Standard Deviation Variance Skewness Kurtosis
ID FieldType.INTEGER 1.04986e+07 1.05351e+07 10.0 0.0 10.0 0.0 0.0 0.0 1.04986e+07 1.04992e+07 1.04986e+07 1.05166e+07 1.05209e+07 1.05259e+07 1.05351e+07 1.05351e+07 1.05351e+07 1.05195e+07 12302.7 1.51358e+08 -0.495701 -1.02814
Arrest FieldType.BOOLEAN False False 10.0 0.0 10.0 0.0 0.0 0.0
Latitude FieldType.DECIMAL 41.6928 41.9032 10.0 0.0 10.0 0.0 0.0 0.0 41.6928 41.7185 41.6928 41.78 41.78 41.78 41.9032 41.9032 41.9032 41.78 0.0517107 0.002674 0.837593 1.05
Longitude FieldType.INTEGER -87 -87 10.0 0.0 10.0 0.0 3.0 0.0 -87 -87 -87 -87 -87 -87 -87 -87 -87 -87 0 0 NaN NaN

From the profile, you see that the Error Count for the Longitude column is 3. The following code filters the Dataset, retrieves an error, and shows which value caused the assertion to fail. From here, adjust your code or cleanse your data appropriately.

from azureml.dataprep import col

ds_error = ds_def.filter(col('Longitude').is_error())
error = ds_error.head(10)['Longitude'][0]

print(error.originalValue)
-87.60431945
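
In this case, the data itself is valid and the rule is too strict: Chicago longitudes sit near -87.6, below the -87 lower bound asserted above. One way to resolve it (a sketch reusing the same assert_value() call on a fresh definition) is to widen the allowed range:

# widen the longitude range so valid records no longer fail the assertion
# (reuses the `value` expression imported earlier from azureml.dataprep)
ds_def = dataset.get_definition()
ds_def = ds_def.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')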

Derive columns by example

One of the more advanced tools for Datasets is the ability to derive columns using examples of desired results. This lets you give the SDK an example, so it can generate code to achieve the intended transformations.

from azureml.core.dataset import Dataset

# create an in-memory Dataset from a local file
dataset = Dataset.auto_read_files('./data/crime.csv')
dataset.head(3)
ID Case Number Date Block ...
0 10498554 HZ239907 2016-04-04 23:56:00 007XX E 111TH ST ...
1 10516598 HZ258664 2016-04-15 17:00:00 082XX S MARSHFIELD AVE ...
2 10519196 HZ261252 2016-04-15 10:00:00 104XX S SACRAMENTO AVE ...

Say you need to transform the date and time format to '2016-04-04 10PM-12AM'. In the derive_column_by_example() call, provide examples of your desired output in the example_data parameter in this format: (original data, desired output).

The following code provides two examples of desired output, ("2016-04-04 23:56:00", "2016-04-04 10PM-12AM") and ("2016-04-15 17:00:00", "2016-04-15 4PM-6PM").

ds_def = dataset.get_definition()
ds_def = ds_def.derive_column_by_example(
        source_columns = "Date",
        new_column_name = "Date_Time_Range",
        example_data = [("2016-04-04 23:56:00", "2016-04-04 10PM-12AM"), ("2016-04-15 17:00:00", "2016-04-15 4PM-6PM")]
    )
ds_def.keep_columns(['ID','Date','Date_Time_Range']).head(3)

In the following table, notice that a new column, Date_Time_Range, contains records in the format specified.

ID Date Date_Time_Range
0 10498554 2016-04-04 23:56:00 2016-04-04 10PM-12AM
1 10516598 2016-04-15 17:00:00 2016-04-15 4PM-6PM
2 10519196 2016-04-15 10:00:00 2016-04-15 10AM-12PM
# update Dataset definition to keep the transformation steps performed.
dataset = dataset.update_definition(ds_def, 'Derive Date_Time_Range')

Fuzzy groupings

When you gather data from different sources, you might encounter variations in spelling, capitalization, or abbreviations of the same entities. Automatically standardize and reconcile these variants with the SDK's fuzzy grouping, or text clustering, functionality.

For example, the column inspections.business.city contains several forms of the city name "San Francisco".

from azureml.core.dataset import Dataset

# create an in-memory Dataset from a local json file
dataset = Dataset.auto_read_files('./data/city.json')
dataset.head(5)
inspections.business.business_id inspections.business_name inspections.business.address inspections.business.city ...
0 16162 Quick-N-Ezee Indian Foods 3861 24th St SF ...
1 67565 King of Thai Noodles Cafe 1541 TARAVAL St SAN FRANCISCO ...
2 67565 King of Thai Noodles Cafe 1541 TARAVAL St SAN FRANCISCO ...
3 68701 Grindz 832 Clement St SF ...
4 69186 Premier Catering & Events, Inc. 1255 22nd St S.F. ...

Let's use the fuzzy_group_column() method to add a column with the automatically detected canonical form of "San Francisco". The arguments source_column and new_column_name are required. You also have the option to:

  • Create a new column, similarity_score_column_name, that shows the similarity score between every pair of original and canonical values.

  • Provide a similarity_threshold, which is the minimum similarity score for the values to be grouped together.

# get the latest Dataset definition
ds_def = dataset.get_definition()
ds_def = ds_def.fuzzy_group_column(source_column='inspections.business.city',
                                   new_column_name='city_grouped',
                                   similarity_threshold=0.8,
                                   similarity_score_column_name='similarity_score')
ds_def.head(5)
inspections.business.business_id inspections.business_name inspections.business.address inspections.business.city city_grouped similarity_score ...
0 16162 Quick-N-Ezee Indian Foods 3861 24th St SF San Francisco 0.814806 ...
1 67565 King of Thai Noodles Cafe 1541 TARAVAL St SAN FRANCISCO San Francisco 1.000000 ...
2 67565 King of Thai Noodles Cafe 1541 TARAVAL St SAN FRANCISCO San Francisco 1.000000 ...
3 68701 Grindz 832 Clement St SF San Francisco 0.814806 ...
4 69186 Premier Catering & Events, Inc. 1255 22nd St S.F. San Francisco 0.814806 ...

In the resulting Dataset definition, all the different ways of representing "San Francisco" in the data were normalized to the same string, "San Francisco".

Save this fuzzy grouping step into the latest Dataset definition with update_definition().

dataset = dataset.update_definition(ds_def, 'fuzzy grouping')
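
To verify the result, list the distinct values in the new column (a quick pandas check):

# the grouped column should now contain a single canonical value
df = dataset.to_pandas_dataframe()
print(df['city_grouped'].unique())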

Next steps