Connect to data in Azure Data Lake Storage

Ingest data into Dynamics 365 Customer Insights using your Azure Data Lake Storage Gen2 account. Data ingestion can be full or incremental.

Prerequisites

  • Data ingestion supports Azure Data Lake Storage Gen2 accounts exclusively. You can't use Data Lake Storage Gen1 accounts to ingest data.

  • The Azure Data Lake Storage account must have hierarchical namespace enabled. The data must be stored in a hierarchical folder format that defines the root folder and has a subfolder for each entity; each entity subfolder can contain full data or incremental data folders. The verification sketch after this list shows one way to check the namespace setting and the folder layout.

  • To authenticate with an Azure service principal, make sure it's configured in your tenant. For more information, see Connect to an Azure Data Lake Storage Gen2 account with an Azure service principal.

  • The Azure Data Lake Storage account you want to connect to and ingest data from must be in the same Azure region as the Dynamics 365 Customer Insights environment. Connections to a Common Data Model folder from a data lake in a different Azure region aren't supported. To find the Azure region of the environment, go to Admin > System > About in Customer Insights.

  • Data stored in online services may be stored in a different location than where data is processed or stored in Dynamics 365 Customer Insights. By importing or connecting to data stored in online services, you agree that data can be transferred to and stored with Dynamics 365 Customer Insights. Learn more at the Microsoft Trust Center.

  • The Customer Insights service principal must be in one of the following roles to access the storage account. For more information, see Grant permissions to the service principal to access the storage account.

    • Storage Blob Data Reader
    • Storage Blob Data Owner
    • Storage Blob Data Contributor
  • Data in your Data Lake Storage should follow the Common Data Model standard and include a Common Data Model manifest that represents the schema of the data files (*.csv or *.parquet). The manifest must provide the details of the entities, such as entity columns and data types, and the data file location and file type. For more information, see The Common Data Model manifest. If the manifest isn't present, admin users with Storage Blob Data Owner or Storage Blob Data Contributor access can define the schema when ingesting the data. A minimal schema example follows this list.
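The following is a minimal sketch of what such a schema can look like, written as a Python script that emits a model.json file. The entity, attribute, and path names are hypothetical; see The Common Data Model manifest for the authoritative format.

```python
import json

# Hypothetical model.json describing one "Customer" entity whose data lives
# in CSV files under the Customer/ subfolder of the container.
model = {
    "name": "ExampleModel",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "Customer",
            "attributes": [
                {"name": "CustomerId", "dataType": "guid"},
                {"name": "FullName", "dataType": "string"},
                {"name": "CreatedOn", "dataType": "dateTime"},
            ],
            # Partitions point at the entity's actual data files.
            "partitions": [
                {
                    "name": "CustomerPartition1",
                    "location": "https://<account>.dfs.core.windows.net/<container>/Customer/Customer1.csv",
                }
            ],
        }
    ],
}

with open("model.json", "w") as f:
    json.dump(model, f, indent=2)
```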
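To verify the prerequisites before creating the data source, you can check the hierarchical namespace setting and confirm that the service principal can read the container. This is a rough sketch using the Azure SDK for Python (azure-identity, azure-storage-blob, azure-storage-file-datalake); all tenant, application, account, and container values are hypothetical placeholders, and the is_hns_enabled field in the account information depends on a recent SDK and service version.

```python
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical service principal credentials.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<client-secret>",
)

# Confirm the account has a hierarchical namespace (Gen2 requirement).
blob_service = BlobServiceClient(
    "https://<account>.blob.core.windows.net", credential=credential
)
info = blob_service.get_account_information()
print("Hierarchical namespace enabled:", info.get("is_hns_enabled"))

# List the entity subfolders under the container root. Storage Blob Data
# Reader is sufficient for this read-only check.
service = DataLakeServiceClient(
    "https://<account>.dfs.core.windows.net", credential=credential
)
file_system = service.get_file_system_client("<container>")
for path in file_system.get_paths(recursive=False):
    print(path.name, "(folder)" if path.is_directory else "(file)")
```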

Connect to Azure Data Lake Storage

  1. Go to Data > Data sources.

  2. Select Add data source.

  3. Select Azure data lake storage.

    Dialog box to enter connection details for Azure Data Lake.

  4. Enter a Name for the data source and an optional Description. The name uniquely identifies the data source, is referenced in downstream processes, and can't be changed.

  5. Choose one of the following options for Connect your storage using. For more information, see Connect Customer Insights to an Azure Data Lake Storage Gen2 account with an Azure service principal.

    • Azure resource: Enter the Resource Id of the storage account (in the form /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<account-name>). Optionally, if you want to ingest data from a storage account through an Azure Private Link, select Enable Private Link. For more information, see Private Links.
    • Azure subscription: Select the Subscription and then the Resource group and Storage account. Optionally, if you want to ingest data from a storage account through an Azure Private Link, select Enable Private Link. For more information, see Private Links.

    Note

    You need one of the following roles on either the container or the storage account to create the data source:

    • Storage Blob Data Reader is sufficient to read from a storage account and ingest the data into Customer Insights.
    • Storage Blob Data Contributor or Storage Blob Data Owner is required if you want to edit the manifest files directly in Customer Insights.
  6. Choose the name of the Container that contains the data and schema (model.json or manifest.json file) to import data from, and select Next.

    Note

    Any model.json or manifest.json file associated with another data source in the environment won't show in the list. However, the same model.json or manifest.json file can be used for data sources in multiple environments.

  7. To create a new schema, go to Create a new schema file.

  8. To use an existing schema, navigate to the folder containing the model.json or manifest.cdm.json file. You can search within a directory to find the file.

  9. Select the JSON file and select Next. A list of available entities displays.

    Dialog box of a list of entities to select.

  10. Select the entities you want to include.

    Dialog box showing Required for Primary key.

    Tip

    To edit an entity in a JSON editing interface, select the entity and then Edit schema file. Make changes and select Save.

  11. For selected entities that require incremental ingestion, Required displays under Incremental refresh. For each of these entities, see Configure an incremental refresh for Azure Data Lake data sources.

  12. For selected entities where a primary key has not been defined, Required displays under Primary key. For each of these entities:

    1. Select Required. The Edit entity panel displays.
    2. Choose the Primary key. The primary key is an attribute unique to the entity. To be a valid primary key, an attribute must not include duplicate values, missing values, or null values. String, integer, and GUID data type attributes are supported as primary keys. A validation sketch follows these steps.
    3. Optionally, change the partition pattern.
    4. Select Close to save and close the panel.
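    Before choosing a primary key, you can sanity-check a candidate column for duplicates and missing values. This is a minimal sketch using pandas; the file path and column name are hypothetical:

    ```python
    import pandas as pd

    # Hypothetical entity data file and candidate key column.
    df = pd.read_csv("Customer/Customer1.csv")
    key = "CustomerId"

    has_nulls = df[key].isna().any()
    has_duplicates = df[key].duplicated().any()

    if has_nulls or has_duplicates:
        print(f"'{key}' is not a valid primary key "
              f"(nulls: {has_nulls}, duplicates: {has_duplicates})")
    else:
        print(f"'{key}' has no nulls or duplicates and can serve as a primary key")
    ```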
  13. Select the number of Attributes for each included entity. The Manage attributes page displays.

    Dialog box to select data profiling.

    1. Create new attributes, or edit or delete existing ones. You can change the name or the data format, or add a semantic type.
    2. To enable analytics and other capabilities, select Data profiling for the whole entity or for specific attributes. By default, no entity is enabled for data profiling.
    3. Select Done.
  14. Select Save. The Data sources page opens showing the new data source in Refreshing status.

    Tip

    There are statuses for tasks and processes. Most processes depend on other upstream processes, such as data sources and data profiling refreshes.

    Select the status to open the Progress details pane and view the progress of the tasks. To cancel the job, select Cancel job at the bottom of the pane.

    Under each task, you can select See details for more progress information, such as processing time, the last processing date, and any applicable errors and warnings associated with the task or process. Select View system status at the bottom of the panel to see other processes in the system.

Loading data can take time. After a successful refresh, the ingested data can be reviewed from the Entities page.

Create a new schema file

  1. Select New schema file.

  2. Enter a name for the file and select Save.

  3. Select New entity. The New Entity panel displays.

  4. Enter the entity name and choose the Data files location.

    • Multiple .csv or .parquet files: Browse to the root folder, select the pattern type, and enter the expression (see the pattern sketch below).
    • Single .csv or .parquet file: Browse to the .csv or .parquet file and select it.

    Dialog box to create a new entity with Data files location highlighted.
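    The exact pattern syntax is defined by Customer Insights, but assuming glob-style semantics, a quick local sanity check can preview which files under the root folder a pattern would match. The folder and pattern here are hypothetical:

    ```python
    import glob

    # Hypothetical root folder and glob-style pattern for an entity's files.
    root = "Customer"
    pattern = "*.csv"

    # Preview the data files the pattern would select, including subfolders.
    for path in glob.glob(f"{root}/**/{pattern}", recursive=True):
        print(path)
    ```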

  5. Select Save.

    Dialog box to define or auto generate attributes.

  6. Select define the attributes to add the attributes manually, or select auto generate them. To define the attributes, enter a name, select the data format, and optionally choose a semantic type. For auto-generated attributes:

    1. After the attributes are auto-generated, select Review attributes. The Manage attributes page displays.

    2. Ensure the data format is correct for each attribute.

    3. To enable analytics and other capabilities, select Data profiling for the whole entity or for specific attributes. By default, no entity is enabled for data profiling.

      Dialog box to select data profiling.

    4. Select Done. The Select entities page displays.

  7. Continue to add entities and attributes, if applicable.

  8. After all entities have been added, select Include to include the entities in the data source ingestion.

    Dialog box showing Required for Primary key.

  9. For selected entities that require incremental ingestion, Required displays under Incremental refresh. For each of these entities, see Configure an incremental refresh for Azure Data Lake data sources.

  10. For selected entities where a primary key has not been defined, Required displays under Primary key. For each of these entities:

    1. Select Required. The Edit entity panel displays.
    2. Choose the Primary key. The primary key is an attribute unique to the entity. To be a valid primary key, an attribute must not include duplicate values, missing values, or null values. String, integer, and GUID data type attributes are supported as primary keys.
    3. Optionally, change the partition pattern.
    4. Select Close to save and close the panel.
  11. Select Save. The Data sources page opens showing the new data source in Refreshing status.

    Tip

    There are statuses for tasks and processes. Most processes depend on other upstream processes, such as data sources and data profiling refreshes.

    Select the status to open the Progress details pane and view the progress of the tasks. To cancel the job, select Cancel job at the bottom of the pane.

    Under each task, you can select See details for more progress information, such as processing time, the last processing date, and any applicable errors and warnings associated with the task or process. Select View system status at the bottom of the panel to see other processes in the system.

Loading data can take time. After a successful refresh, the ingested data can be reviewed from the Entities page.

Edit an Azure Data Lake Storage data source

You can update the Connect your storage using option. For more information, see Connect Customer Insights to an Azure Data Lake Storage Gen2 account with an Azure service principal. To connect to a different container from your storage account or to change the account name, create a new data source connection.

  1. Go to Data > Data sources.

  2. Next to the data source you'd like to update, select Edit.

    Dialog box to edit Azure Data Lake data source.

  3. Change any of the following information:

    • Description

    • Connect your storage using and connection information. You can't change Container information when updating the connection.

      Note

      One of the following roles must be assigned to the storage account or container:

      • Storage Blob Data Reader
      • Storage Blob Data Owner
      • Storage Blob Data Contributor
    • Enable Private Link if you want to ingest data from a storage account through an Azure Private Link. For more information, see Private Links.

  4. Select Next.

  5. Change any of the following:

    • Navigate to a different model.json or manifest.json file with a different set of entities from the container.

    • To add additional entities to ingest, select New entity.

    • To remove an already selected entity that has no dependencies, select the entity and then Delete.

      Important

      If there are dependencies on the existing model.json or manifest.json file and its set of entities, an error message appears and you can't select a different model.json or manifest.json file. Remove those dependencies before changing the file, or create a new data source that uses the model.json or manifest.json file you want so that you don't have to remove the dependencies.

    • To change data file location or the primary key, select Edit.

    • To change the incremental ingestion data, see Configure an incremental refresh for Azure Data Lake data sources.

    • Only change the entity name to match the entity name in the .json file.

      Note

      Always keep the entity name in Customer Insights the same as the entity name inside the model.json or manifest.json file after ingestion. Customer Insights validates all entity names against the model.json or manifest.json file during every system refresh. If an entity name changes either inside or outside Customer Insights, an error occurs because Customer Insights can't find the new entity name in the .json file. If an ingested entity name was accidentally changed, edit the entity name in Customer Insights to match the name in the .json file.
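      To compare the names, you can list the entity names that the schema file declares. This is a small sketch assuming the model.json layout; *.manifest.cdm.json files use entityName instead of name:

      ```python
      import json

      # Print the entity names declared in the schema file so they can be
      # compared with the entity names shown in Customer Insights.
      with open("model.json") as f:
          model = json.load(f)

      for entity in model.get("entities", []):
          # model.json uses "name"; manifest.cdm.json files use "entityName".
          print(entity.get("name") or entity.get("entityName"))
      ```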

  6. Select Attributes to add or change attributes, or to enable data profiling. Then select Done.

  7. Select Save to apply your changes and return to the Data sources page.

    Tip

    There are statuses for tasks and processes. Most processes depend on other upstream processes, such as data sources and data profiling refreshes.

    Select the status to open the Progress details pane and view the progress of the tasks. To cancel the job, select Cancel job at the bottom of the pane.

    Under each task, you can select See details for more progress information, such as processing time, the last processing date, and any applicable errors and warnings associated with the task or process. Select View system status at the bottom of the panel to see other processes in the system.