Explore data in Azure blob storage with pandas

This article covers how to explore data that is stored in an Azure blob container by using the pandas Python package.

This task is a step in the Team Data Science Process.

Prerequisites

This article assumes that you have an Azure storage account, the associated account key, and data already uploaded to a container in Azure Blob storage.

Load the data into a pandas DataFrame

To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded into a pandas DataFrame. Here are the steps to follow for this procedure:

  1. Download the data from the Azure blob with the following Python code sample, which uses the Blob service. Replace the variables in the code with your specific values:
from azure.storage.blob import BlockBlobService
import time

STORAGEACCOUNTNAME = '<storage_account_name>'
STORAGEACCOUNTKEY = '<storage_account_key>'
LOCALFILENAME = '<local_file_name>'
CONTAINERNAME = '<container_name>'
BLOBNAME = '<blob_name>'

# Download the blob to a local file and time the transfer
t1 = time.time()
blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
t2 = time.time()
print(('It takes %s seconds to download ' + BLOBNAME) % (t2 - t1))
  2. Read the data into a pandas DataFrame from the downloaded file.
import pandas as pd

# LOCALFILENAME is the file path from step 1
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

Now you are ready to explore the data and generate features on this dataset.
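
The BlockBlobService client in step 1 comes from the legacy azure-storage SDK. If you have the newer azure-storage-blob (v12) package installed instead, the same download can be done with BlobServiceClient. The following is a minimal sketch that assumes the same variable values as in step 1:

from azure.storage.blob import BlobServiceClient

# Connect with the account URL and key (same placeholder values as step 1)
blob_service_client = BlobServiceClient(
    account_url='https://%s.blob.core.windows.net' % STORAGEACCOUNTNAME,
    credential=STORAGEACCOUNTKEY)
blob_client = blob_service_client.get_blob_client(container=CONTAINERNAME, blob=BLOBNAME)

# Stream the blob's contents to a local file
with open(LOCALFILENAME, 'wb') as local_file:
    local_file.write(blob_client.download_blob().readall())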

Examples of data exploration using pandas

Here are a few examples of ways to explore data using pandas:

  1. Inspect the number of rows and columns:
print('The size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
  2. Inspect the first or last few rows of the dataset:
dataframe_blobdata.head(10)

dataframe_blobdata.tail(10)
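
For a look that doesn't depend on row order, you can also draw a random sample of rows with the DataFrame's sample method:

# Inspect 10 randomly chosen rows
dataframe_blobdata.sample(10)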
  3. Check the data type that each column was imported as by using the following sample code:
for col in dataframe_blobdata.columns:
    print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
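
The same information is available in a single call through the DataFrame's dtypes attribute:

# One-line equivalent of the loop above
dataframe_blobdata.dtypes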
  4. Check the basic statistics for the columns in the dataset as follows:
dataframe_blobdata.describe()
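
By default, describe summarizes only the numeric columns. To include categorical and other non-numeric columns in the summary, pass include='all':

# Summarize all columns, not just the numeric ones
dataframe_blobdata.describe(include='all')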
  5. Look at the number of entries for each value of a column as follows:
dataframe_blobdata['<column_name>'].value_counts()
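
Note that value_counts excludes missing values by default; pass dropna=False if you want them counted as well:

# Count NaN entries alongside the other values
dataframe_blobdata['<column_name>'].value_counts(dropna=False)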
  6. Count the missing values versus the actual number of entries in each column by using the following sample code:
miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
print(miss_num)
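
An equivalent, slightly more direct way to get the same per-column counts is with isnull:

# Count missing values in each column directly
dataframe_blobdata.isnull().sum()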
  7. If you have missing values for a specific column in the data, you can drop them as follows:
dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape

Instead of dropping rows, you can replace missing values with the mode of a column by using the fillna function:

dataframe_blobdata_mode = dataframe_blobdata.fillna({'<column_name>':dataframe_blobdata['<column_name>'].mode()[0]})
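
For numeric columns, filling with the median (or mean) is a common alternative to the mode. Here's a minimal sketch, where <numeric_column_name> is a placeholder for one of your numeric columns:

# Replace missing values in a numeric column with that column's median
dataframe_blobdata_median = dataframe_blobdata.fillna(
    {'<numeric_column_name>': dataframe_blobdata['<numeric_column_name>'].median()})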
  8. Create a histogram with a variable number of bins to plot the distribution of a variable:
import numpy as np

# Bar plot of value counts, then a 50-bin histogram of the log-transformed column
dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')
np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
  9. Look at correlations between variables by using a scatter plot or the built-in correlation function:
import matplotlib.pyplot as plt

# Relationship between column_a and column_b, using a scatter plot
plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

# Correlation between column_a and column_b
dataframe_blobdata[['<column_a>', '<column_b>']].corr()
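
To scan for relationships across the whole dataset at once, you can compute the pairwise correlation matrix over all numeric columns. This sketch assumes a recent pandas version that supports the numeric_only flag:

# Pairwise Pearson correlations across all numeric columns
dataframe_blobdata.corr(numeric_only=True)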