Sample data in Azure blob storage

This document shows how to sample data stored in Azure blob storage: download it programmatically, then down-sample it using procedures written in Python.


Why sample your data? If the dataset you plan to analyze is large, it is usually a good idea to down-sample it to a smaller but still representative, more manageable size. Working on the sample makes data understanding, exploration, and feature engineering easier. Within the Cortana Analytics Process, sampling enables fast prototyping of the data processing functions and machine learning models.
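As a rough illustration of what "smaller but representative" means, the sketch below compares summary statistics of a full dataset with those of a 1 percent random sample. The data is synthetic and the column name `value` is a made-up placeholder, so treat this as an assumption-laden demonstration, not a measurement of any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
full = pd.DataFrame({"value": rng.normal(loc=50.0, scale=10.0, size=100_000)})

# Draw a 1 percent simple random sample without replacement
sample = full.sample(frac=0.01, random_state=0)

# The sample's summary statistics should be close to those of the full data
print(full["value"].describe())
print(sample["value"].describe())
```

A simple random sample like this tends to preserve overall distributions, but rare categories can still be under-represented, so it is worth checking the columns that matter before prototyping on the sample.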

This sampling task is a step in the Team Data Science Process (TDSP).

Download and down-sample data

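  1. Download the data from Azure blob storage using the blob service, as in the following sample Python code:

     # Uses the legacy Azure Storage SDK for Python (BlobService class)
     from azure.storage.blob import BlobService
     import time

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     # Download from blob, timing the transfer
     blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     t1 = time.time()
     blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
     t2 = time.time()
     print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))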
  2. Read the data into a pandas DataFrame from the file downloaded above.

     import pandas as pd
     # Read directly from the file on disk
     dataframe_blobdata = pd.read_csv(LOCALFILENAME)
  3. Down-sample the data using NumPy's random.choice as follows:

     import numpy as np
     # A 1 percent sample
     sample_ratio = 0.01
     sample_size = int(np.round(dataframe_blobdata.shape[0] * sample_ratio))
     # replace=False avoids drawing the same row more than once
     sample_rows = np.random.choice(dataframe_blobdata.index.values, sample_size, replace=False)
     dataframe_blobdata_sample = dataframe_blobdata.loc[sample_rows]

You can now use this data frame, which holds the 1 percent sample, for further exploration and feature generation.
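The sampling step can also be written with pandas' own `DataFrame.sample` method, which draws without replacement by default and accepts a `random_state` for repeatable results. The sketch below is one possible alternative, not the procedure this document prescribes; the synthetic frame stands in for `dataframe_blobdata`, and its columns `id` and `value` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataframe_blobdata (hypothetical columns, for illustration only)
dataframe_blobdata = pd.DataFrame({
    "id": np.arange(10_000),
    "value": np.random.default_rng(0).normal(size=10_000),
})

# Draw a 1 percent sample without replacement; random_state makes the draw repeatable
dataframe_blobdata_sample = dataframe_blobdata.sample(frac=0.01, random_state=42)

print(dataframe_blobdata_sample.shape)
```

Fixing `random_state` is useful while prototyping, because every rerun of the notebook or script then works on the same rows.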

Upload data and read it into Azure Machine Learning

You can use the following sample code to down-sample the data and use it directly in Azure Machine Learning:

  1. Write the data frame to a local file:

     import os
     dataframe_blobdata_sample.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
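  2. Upload the local file to an Azure blob using the following sample code:

     # Uses the legacy Azure Storage SDK for Python (BlobService class)
     from azure.storage.blob import BlobService
     import os

     STORAGEACCOUNTNAME = "<storage_account_name>"
     LOCALFILENAME = "<local_file_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)  # assuming file is in current working directory

     try:
         # Perform upload
         output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
     except Exception:
         print("Something went wrong with uploading to the blob:" + BLOBNAME)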
  3. Read the data from the Azure blob using the Azure Machine Learning Import Data module, as shown in the following image:

     [Image: reader blob]
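The download and upload steps in this document use the `BlobService` class from the legacy Azure Storage SDK for Python. With the current `azure-storage-blob` package (version 12 and later), the same two operations might look like the sketch below. The helper names (`account_url`, `get_blob_client`, `download_blob_to_file`, `upload_file_to_blob`) are illustrative and not part of any SDK, and the endpoint format assumes the default public Azure cloud.

```python
def account_url(account_name):
    """Build the default public blob endpoint for a storage account (assumed cloud)."""
    return f"https://{account_name}.blob.core.windows.net"


def get_blob_client(account_name, account_key, container_name, blob_name):
    """Return a client bound to one blob, authenticated with the account key."""
    # Imported here so the other helpers can be used without the SDK installed
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(account_url=account_url(account_name),
                                credential=account_key)
    return service.get_blob_client(container=container_name, blob=blob_name)


def download_blob_to_file(blob_client, local_path):
    """Download the blob's full contents into a local file."""
    with open(local_path, "wb") as fh:
        fh.write(blob_client.download_blob().readall())


def upload_file_to_blob(blob_client, local_path):
    """Upload a local file to the blob, replacing any existing content."""
    with open(local_path, "rb") as fh:
        blob_client.upload_blob(fh, overwrite=True)
```

A call such as `download_blob_to_file(get_blob_client(STORAGEACCOUNTNAME, STORAGEACCOUNTKEY, CONTAINERNAME, BLOBNAME), LOCALFILENAME)` would then take the place of the legacy download call, using the same placeholder constants as the steps above.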