Explore data in Azure Blob Storage with pandas

This article covers how to explore data that is stored in an Azure blob container using the pandas Python package.

This task is a step in the Team Data Science Process.

Prerequisites

This article assumes that you have:

Load the data into a pandas DataFrame

To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded into a pandas DataFrame. Here are the steps to follow for this procedure:

  1. Download the data from an Azure blob with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:

    from azure.storage.blob import BlockBlobService
    import time
    import pandas as pd

    STORAGEACCOUNTNAME = '<storage_account_name>'
    STORAGEACCOUNTKEY = '<storage_account_key>'
    LOCALFILENAME = '<local_file_name>'
    CONTAINERNAME = '<container_name>'
    BLOBNAME = '<blob_name>'

    # download from blob
    t1 = time.time()
    blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
    t2 = time.time()
    print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
    
  2. Read the data into a pandas DataFrame from the downloaded file.

    # LOCALFILENAME is the local file path of the downloaded blob
    dataframe_blobdata = pd.read_csv(LOCALFILENAME)
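
For large downloaded files, `read_csv` can also process the file in chunks to limit memory use. The sketch below is illustrative and not from the original article; the in-memory CSV text stands in for the downloaded blob file:

```python
import io

import pandas as pd

# Illustrative CSV text standing in for the downloaded blob file
csv_text = "a,b\n1,x\n2,y\n3,z\n"

# Read the file in chunks of 2 rows, then concatenate into one DataFrame
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
dataframe_blobdata = pd.concat(chunks, ignore_index=True)

print(dataframe_blobdata.shape)  # → (3, 2)
```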
    

Now you are ready to explore the data and generate features on this dataset.
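
As a minimal illustration of what generating a feature can look like (the column names here are hypothetical, not from the original article), the snippet below derives a categorical feature by binning a numeric column:

```python
import pandas as pd

# Hypothetical dataset; in practice this would be dataframe_blobdata
df = pd.DataFrame({'trip_distance': [0.5, 2.3, 5.1, 12.0]})

# Derive a categorical feature by binning the numeric column
df['distance_band'] = pd.cut(df['trip_distance'],
                             bins=[0, 1, 5, 100],
                             labels=['short', 'medium', 'long'])

print(df['distance_band'].tolist())  # → ['short', 'medium', 'long', 'long']
```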

Examples of data exploration using pandas

Here are a few examples of ways to explore data using pandas:

  1. Inspect the number of rows and columns:

    print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    
  2. Inspect the first or last few rows of the dataset:

    dataframe_blobdata.head(10)
    
    dataframe_blobdata.tail(10)
    
  3. Check the data type each column was imported as, using the following sample code:

    for col in dataframe_blobdata.columns:
        print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    
  4. Check the basic statistics for the columns in the dataset as follows:

    dataframe_blobdata.describe()
    
  5. Look at the number of entries for each column value as follows:

    dataframe_blobdata['<column_name>'].value_counts()
    
  6. Count missing values versus the actual number of entries in each column, using the following sample code:

    miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
    print(miss_num)
    
  7. If you have missing values for a specific column in the data, you can drop them as follows:

    dataframe_blobdata_noNA = dataframe_blobdata.dropna()
    dataframe_blobdata_noNA.shape
    

    Another way to replace missing values is with the mode function:

    dataframe_blobdata_mode = dataframe_blobdata.fillna(
        {'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})
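
    For numeric columns, a common alternative not shown above is to fill missing values with the column median rather than the mode. A minimal sketch, using a hypothetical `fare` column:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with a missing entry
df = pd.DataFrame({'fare': [10.0, np.nan, 30.0]})

# Replace NaN with the column median (20.0 for this data)
df['fare'] = df['fare'].fillna(df['fare'].median())

print(df['fare'].tolist())  # → [10.0, 20.0, 30.0]
```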
    
  8. Create a histogram plot using a variable number of bins to plot the distribution of a variable:

    import numpy as np

    # bar plot of the counts for each category value
    dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

    # histogram of the log-transformed values, using 50 bins
    np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
    
  9. Look at correlations between variables using a scatter plot or the built-in correlation function:

    import matplotlib.pyplot as plt

    # relationship between column_a and column_b using a scatter plot
    plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])
    
    # correlation between column_a and column_b
    dataframe_blobdata[['<column_a>', '<column_b>']].corr()