PySpark analytics samples for Microsoft Academic Graph

Illustrates how to perform analytics for Microsoft Academic Graph using PySpark on Azure Databricks.

Sample projects

Prerequisites

Before running these examples, complete the following setup steps:

Gather the information that you need

Before you begin, you should have these items of information:

✔️ The name of your Azure Storage (AS) account containing the MAG dataset, from Get Microsoft Academic Graph on Azure storage.

✔️ The access key of your Azure Storage (AS) account, from Get Microsoft Academic Graph on Azure storage.

✔️ The name of the container in your Azure Storage (AS) account containing the MAG dataset.

✔️ The name of the output container in your Azure Storage (AS) account.

Import PySparkMagClass.py as a notebook

In this section, you import PySparkMagClass.py as a shared notebook in your Azure Databricks workspace. You can run this utility notebook from other notebooks later.

  1. Save samples\PySparkMagClass.py from the MAG dataset to your local drive.

  2. In the Azure portal, go to the Azure Databricks service that you created, and select Launch Workspace.

  3. On the left, select Workspace. From the Workspace > Shared drop-down, select Import.

    Import a notebook in Databricks

  4. Drag and drop PySparkMagClass.py to the Import Notebook dialog box.

    Provide details for a notebook in Databricks

  5. Select Import. This creates a notebook at the path "/Shared/PySparkMagClass". You do not need to run this notebook.

    Note

    If you import this notebook under the Shared folder, its full path is "/Shared/PySparkMagClass". If you import it under another folder, note the actual full path and use it in the following sections.

Create a new notebook

In this section, you create a new notebook in Azure Databricks workspace.

  1. On the left, select Workspace. From the Workspace drop-down, select Create > Notebook. Optionally, you can create this notebook at the Users level.

    Create a notebook in Databricks

  2. In the Create Notebook dialog box, enter a name for the notebook. Select Python as the language.

    Provide details for a notebook in Databricks

  3. Select Create.

Create first notebook cell

In this section, you create the first notebook cell to run the PySparkMagClass notebook.

  1. Copy and paste the following code block into the first cell.

    %run "/Shared/PySparkMagClass"
    
  2. Press the SHIFT + ENTER keys to run the code in this block. It defines the MicrosoftAcademicGraph class.

Define configuration variables

In this section, you add a new notebook cell and define configuration variables.

  1. Copy and paste the following code block into a new cell.

    # Define configuration variables
    AzureStorageAccount = '<AzureStorageAccount>'     # Azure Storage (AS) account containing MAG dataset
    AzureStorageAccessKey = '<AzureStorageAccessKey>' # Access Key of the Azure Storage (AS) account
    MagContainer = '<MagContainer>'                   # The container name in the Azure Storage (AS) account containing the MAG dataset, usually in the form of mag-yyyy-mm-dd
    OutputContainer = '<OutputContainer>'             # The container name in the Azure Storage (AS) account where the output is written
    
  2. In this code block, replace the <AzureStorageAccount>, <AzureStorageAccessKey>, <MagContainer>, and <OutputContainer> placeholder values with the values that you collected while completing the prerequisites of this sample.

    Value                    Description
    <AzureStorageAccount>    The name of your Azure Storage account.
    <AzureStorageAccessKey>  The access key of your Azure Storage account.
    <MagContainer>           The container name in the Azure Storage account containing the MAG dataset, usually in the form of mag-yyyy-mm-dd.
    <OutputContainer>        The container name in the Azure Storage account where the output is written.

  3. Press the SHIFT + ENTER keys to run the code in this block.
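
Before moving on, it can help to confirm that no placeholder was left unreplaced. The following is a minimal sketch, not part of PySparkMagClass: `check_config` is a hypothetical helper introduced here for illustration, and the mag-yyyy-mm-dd check only reflects the usual naming noted above, so a container with a different name may still be valid.

```python
import re

def check_config(values):
    """Return a list of problems found in a dict of configuration values.

    Hypothetical helper for illustration; it is not defined by PySparkMagClass.
    """
    problems = []
    for name, value in values.items():
        # A value such as '<MagContainer>' means the placeholder was never replaced.
        if not value or (value.startswith('<') and value.endswith('>')):
            problems.append(f'{name} still holds a placeholder or is empty')
    mag = values.get('MagContainer', '')
    # MAG containers are usually named mag-yyyy-mm-dd; flag anything else as unusual.
    if mag and not mag.startswith('<') and not re.fullmatch(r'mag-\d{4}-\d{2}-\d{2}', mag):
        problems.append('MagContainer does not look like mag-yyyy-mm-dd')
    return problems
```

In the notebook, you could pass a dict built from the four variables, for example `check_config({'MagContainer': MagContainer, 'OutputContainer': OutputContainer})`, and print whatever it returns. An empty list means nothing suspicious was found.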

Create MicrosoftAcademicGraph and AzureStorageUtil instances

In this section, you create a MicrosoftAcademicGraph instance to access the MAG dataset, and an AzureStorageUtil instance to access other Azure Storage files.

  1. Copy and paste the following code block in a new cell.

    # Create a MicrosoftAcademicGraph instance to access MAG dataset
    mag = MicrosoftAcademicGraph(container=MagContainer, account=AzureStorageAccount, key=AzureStorageAccessKey)
    
    # Create an AzureStorageUtil instance to access other Azure Storage files
    asu = AzureStorageUtil(container=OutputContainer, account=AzureStorageAccount, key=AzureStorageAccessKey)
    
  2. Press the SHIFT + ENTER keys to run the code in this block.
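
Both instances take a container, an account, and a key. As a rough illustration of how those three values typically map onto Azure Blob Storage access from Spark, the sketch below builds a wasbs:// URL and the name of the Spark setting that holds the account key. The helper names `wasbs_url` and `account_key_conf` are hypothetical, and this is an assumption about the general mechanism, not a description of how PySparkMagClass is implemented.

```python
def wasbs_url(container, account, path=''):
    """Build a wasbs:// URL for a blob path inside an Azure Storage container."""
    # Strip a leading slash so the URL never contains '//' before the path.
    return f'wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip("/")}'

def account_key_conf(account):
    """Return the name of the Spark/Hadoop setting that stores the account access key."""
    return f'fs.azure.account.key.{account}.blob.core.windows.net'

# In a Databricks notebook, where `spark` is predefined, the two helpers
# could be combined like this (left commented out so the block runs anywhere):
#   spark.conf.set(account_key_conf(AzureStorageAccount), AzureStorageAccessKey)
#   df = spark.read.text(wasbs_url(OutputContainer, AzureStorageAccount, 'folder/file.txt'))
```

The path `folder/file.txt` is only a placeholder. In practice, the mag and asu instances created above are meant to hide these details.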

Run scripts in the repository

  1. Copy the content of a script and paste it into a new cell.

  2. Press the SHIFT + ENTER keys to run the code in this cell. Note that some scripts take more than 10 minutes to complete.

View results with Microsoft Azure Storage Explorer


Resources