Explore data in Azure Blob Storage with pandas
This article covers how to explore data that is stored in an Azure blob container using the pandas Python package.
This task is a step in the Team Data Science Process.
Prerequisites
This article assumes that you have:
- Created an Azure storage account. If you need instructions, see Create an Azure Storage account
- Stored your data in an Azure Blob Storage account. If you need instructions, see Moving data to and from Azure Storage
Load the data into a pandas DataFrame
To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded into a pandas DataFrame. Here are the steps to follow for this procedure:
Download the data from Azure Blob Storage with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:
```python
import time

import pandas as pd
from azure.storage.blob import BlockBlobService

STORAGEACCOUNTNAME = '<storage_account_name>'
STORAGEACCOUNTKEY = '<storage_account_key>'
LOCALFILENAME = '<local_file_name>'
CONTAINERNAME = '<container_name>'
BLOBNAME = '<blob_name>'

# download from blob
t1 = time.time()
blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
t2 = time.time()
print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
```
Read the data into a pandas DataFrame from the downloaded file.
```python
# LOCALFILENAME is the file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
```
Now you are ready to explore the data and generate features on this dataset.
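For instance, a minimal feature-generation sketch, assuming a hypothetical numeric column named `<numeric_column>` in your dataset, could bin a continuous variable into categories with `pd.cut`:

```python
# A minimal sketch of feature generation, assuming a hypothetical
# numeric column '<numeric_column>' exists in the DataFrame.
# pd.cut bins the continuous values into four equal-width intervals,
# which can serve as a categorical feature.
dataframe_blobdata['<numeric_column>_binned'] = pd.cut(
    dataframe_blobdata['<numeric_column>'],
    bins=4,
    labels=['low', 'medium-low', 'medium-high', 'high'])
```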
Examples of data exploration using pandas
Here are a few examples of ways to explore data using pandas:
Inspect the number of rows and columns
```python
print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
```
Inspect the first or last few rows of the dataset as follows:
```python
dataframe_blobdata.head(10)

dataframe_blobdata.tail(10)
```
Check the data type each column was imported as using the following sample code:
```python
for col in dataframe_blobdata.columns:
    print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
```
Check the basic stats for the columns in the dataset as follows:
```python
dataframe_blobdata.describe()
```
Look at the number of entries for each column value as follows:
```python
dataframe_blobdata['<column_name>'].value_counts()
```
Count missing values versus the actual number of entries in each column using the following sample code:
```python
miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
print(miss_num)
```
If you have missing values for a specific column in the data, you can drop them as follows:
```python
dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape
```
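If instead you only want to drop rows that are missing a value in one particular column, a minimal variant, using a placeholder column name, is:

```python
# Drop rows only where the placeholder column '<column_name>' is missing,
# keeping rows that have missing values in other columns.
dataframe_blobdata_noNA = dataframe_blobdata.dropna(subset=['<column_name>'])
```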
Another way to handle missing values is to replace them with the mode of the column:
```python
dataframe_blobdata_mode = dataframe_blobdata.fillna(
    {'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})
```
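For numeric columns, the column mean is a common alternative to the mode; a minimal sketch, again using a placeholder column name:

```python
# Fill missing values in a numeric placeholder column with the column mean.
dataframe_blobdata_mean = dataframe_blobdata.fillna(
    {'<column_name>': dataframe_blobdata['<column_name>'].mean()})
```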
Create a histogram plot using a variable number of bins to plot the distribution of a variable:
```python
import numpy as np
import matplotlib.pyplot as plt

dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
```
Look at correlations between variables using a scatter plot or using the built-in correlation function:
```python
# relationship between column_a and column_b using a scatter plot
plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

# correlation between column_a and column_b
dataframe_blobdata[['<column_a>', '<column_b>']].corr()
```
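To scan pairwise correlations across all numeric columns at once, you can also compute the full correlation matrix; this sketch assumes a recent pandas version that supports the numeric_only parameter:

```python
# Correlation matrix across all numeric columns; numeric_only=True
# skips non-numeric columns (supported in recent pandas versions).
dataframe_blobdata.corr(numeric_only=True)
```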