Sample data in Azure blob storage
This article covers sampling data stored in Azure blob storage by downloading it programmatically and then sampling it using procedures written in Python.
Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. Sampling facilitates data understanding, exploration, and feature engineering. Its role in the Cortana Analytics Process is to enable fast prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
Download and down-sample data
Download the data from Azure blob storage using the Blob service, as in the following sample Python code:
    import time
    from azure.storage.blob import BlobService

    # Replace each placeholder with your own value
    STORAGEACCOUNTNAME = "<storage_account_name>"
    STORAGEACCOUNTKEY = "<storage_account_key>"
    LOCALFILENAME = "<local_file_name>"
    CONTAINERNAME = "<container_name>"
    BLOBNAME = "<blob_name>"

    # Download the blob to a local file and time the transfer
    t1 = time.time()
    blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
    t2 = time.time()
    print("It takes %s seconds to download %s" % (t2 - t1, BLOBNAME))
Read the data into a pandas DataFrame from the downloaded file:
    import pandas as pd

    # Read directly from the file on disk
    dataframe_blobdata = pd.read_csv(LOCALFILENAME)
Down-sample the data using NumPy's random.choice as follows:
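    import numpy as np

    # A 1 percent sample
    sample_ratio = 0.01
    # shape[0] is the row count; the size passed to np.random.choice must be an integer
    sample_size = int(np.round(dataframe_blobdata.shape[0] * sample_ratio))
    # np.random.choice samples with replacement by default; pass replace=False to avoid duplicate rows
    sample_rows = np.random.choice(dataframe_blobdata.index.values, sample_size)
    # .loc replaces .ix, which was removed in pandas 1.0
    dataframe_blobdata_sample = dataframe_blobdata.loc[sample_rows]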
You can now use this data frame, which holds the 1 percent sample, for further exploration and feature generation.
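As a point of comparison, the sampling step can also be sketched with pandas' built-in DataFrame.sample, which draws rows without replacement by default and accepts a random_state so that the same sample can be reproduced later. This is a minimal sketch, not the article's method: the synthetic data frame below is a hypothetical stand-in for dataframe_blobdata, and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data frame read from the downloaded file.
rng = np.random.default_rng(seed=0)
dataframe_blobdata = pd.DataFrame({
    "id": np.arange(10_000),
    "value": rng.normal(size=10_000),
})

# Draw a 1 percent sample without replacement.
# random_state (an arbitrary seed here) makes the draw repeatable.
sample_ratio = 0.01
dataframe_blobdata_sample = dataframe_blobdata.sample(
    frac=sample_ratio, random_state=42
)

print(dataframe_blobdata_sample.shape)
```

Fixing the seed is useful during prototyping, because feature engineering results can then be compared across runs on an identical sample.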
Upload data and read it into Azure Machine Learning
You can use the following sample code to save the down-sampled data and use it directly in Azure Machine Learning:
Write the data frame to a local file:
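    import os

    # Write the sampled data frame as a tab-separated file in the current working directory
    dataframe_blobdata_sample.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)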
Upload the local file to an Azure blob using the following sample code:
    import os
    from azure.storage.blob import BlobService

    # Replace each placeholder with your own value
    STORAGEACCOUNTNAME = "<storage_account_name>"
    LOCALFILENAME = "<local_file_name>"
    STORAGEACCOUNTKEY = "<storage_account_key>"
    CONTAINERNAME = "<container_name>"
    BLOBNAME = "<blob_name>"

    output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    # Assumes the file is in the current working directory
    localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)

    try:
        # Perform the upload
        output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
    except Exception as e:
        # Report the failure together with the underlying error
        print("Something went wrong with uploading to the blob: " + BLOBNAME)
        print(e)
Read the data from the Azure blob using Azure Machine Learning Import Data as shown in the image below: