Enterprise websites Microsoft Graph connector

The Enterprise websites Microsoft Graph connector allows your organization to index articles and content from its internal-facing websites. After you configure the connector and sync content from the website, end users can search for that content from any Microsoft Search client.

Note

Read the Set up Microsoft Graph connectors in the Microsoft 365 admin center article to understand the general connectors setup instructions.

This article is for anyone who configures, runs, and monitors an Enterprise websites connector. It supplements the general setup process and provides instructions that apply only to the Enterprise websites connector. This article also includes information about troubleshooting.

Step 1: Add a connector in the Microsoft 365 admin center

Follow the general setup instructions.

Step 2: Name the connection

Follow the general setup instructions.

Step 3: Configure the connection settings

To connect to your data source, enter the root URL of the website, select a crawl source, and choose the type of authentication you'd like to use: None, Basic Authentication, or OAuth 2.0 with Azure Active Directory (Azure AD). After you complete this information, select Test Connection to verify your settings.

URL

Use the URL field to specify the root of the website that you'd like to crawl. The Enterprise websites connector uses this URL as the starting point and follows all the links from this URL for its crawl.

Note

You can index up to 20 different site URLs in a single connection. In the URLs field, enter the site URLs separated by commas (,). For example, https://www.contoso.com,https://www.contosoelectronics.com.
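
As an illustration of the format described in this note, the following minimal Python sketch splits a comma-separated URLs value and enforces the 20-URL limit. The function name `parse_site_urls` and the rule that every entry must be an absolute http(s) URL are assumptions made for this example; they aren't part of the connector.

```python
from urllib.parse import urlparse

MAX_SITE_URLS = 20  # limit stated in the note above


def parse_site_urls(field_value: str) -> list[str]:
    """Split a comma-separated URLs value and validate each entry."""
    urls = [part.strip() for part in field_value.split(",") if part.strip()]
    if not urls:
        raise ValueError("at least one site URL is required")
    if len(urls) > MAX_SITE_URLS:
        raise ValueError(f"{len(urls)} URLs given; the limit is {MAX_SITE_URLS}")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are acceptable as site roots.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return urls


if __name__ == "__main__":
    print(parse_site_urls("https://www.contoso.com,https://www.contosoelectronics.com"))
```

Running a check like this before pasting the value into the admin center can catch a stray separator or a missing scheme early.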

Crawl websites listed in the sitemap

When selected, the connector crawls only the URLs listed in the sitemap. If this option isn't selected or no sitemap is found, the connector does a deep crawl of all the links found on the root URL of the site.
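
The choice described above can be sketched in Python as follows. This is only an illustration of the documented behavior, not the connector's implementation: `urls_from_sitemap` and `choose_crawl_targets` are hypothetical names, and the sketch assumes a sitemap in the standard sitemaps.org XML format that has already been downloaded.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


def choose_crawl_targets(root_url, sitemap_xml, sitemap_only):
    """Return (start_urls, follow_links) for a crawl.

    With the sitemap option selected and a sitemap available, only the
    listed URLs are crawled. Otherwise the crawl starts at the root URL
    and follows links (a deep crawl).
    """
    if sitemap_only and sitemap_xml:
        return urls_from_sitemap(sitemap_xml), False
    return [root_url], True
```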

Dynamic site configuration

If your website contains dynamic content, for example, webpages that live in content management systems like Confluence or Unily, you can enable a dynamic crawler. To turn it on, select Enable crawl for dynamic sites. The crawler will wait for dynamic content to render before it begins crawling.

Screenshot of Connection Settings pane for Enterprise Web connector.

In addition to the check box, there are three optional fields available:

  1. DOM Ready: Enter the DOM element the crawler should use as the signal that the content is fully rendered and the crawl should begin.
  2. Headers to Add: Specify which HTTP headers the crawler should include when sending requests to a specific web URL. You can set multiple headers for different websites. We suggest including auth token values.
  3. Headers to Skip: Specify any unnecessary headers that should be excluded from dynamic crawling requests.
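
To make the effect of the two header fields concrete, here is a small Python sketch that combines a set of default request headers with headers to add and headers to skip. The function name, the default headers, and the rule that a skipped header wins over an added one are all assumptions for illustration; the article doesn't specify how the crawler resolves conflicts.

```python
def build_request_headers(default_headers, headers_to_add, headers_to_skip):
    """Merge default and added headers, then drop the skipped ones.

    HTTP header names are case-insensitive, so names are compared in
    lowercase while the most recent spelling is kept for output.
    """
    merged = {name.lower(): (name, value) for name, value in default_headers.items()}
    for name, value in headers_to_add.items():
        merged[name.lower()] = (name, value)
    # Assumption: a header listed in "Headers to Skip" is always removed.
    for name in headers_to_skip:
        merged.pop(name.lower(), None)
    return dict(merged.values())


if __name__ == "__main__":
    headers = build_request_headers(
        {"User-Agent": "example-crawler", "Accept-Language": "en"},
        {"Authorization": "Bearer <token>"},  # placeholder auth token value
        ["accept-language"],
    )
    print(headers)
```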

Note

Dynamic crawling is only supported for Agent crawl mode.

Crawl mode: Cloud or On-premises

The crawl mode determines the type of websites you want to index, either cloud or on-premises. For your cloud websites, select Cloud as the crawl mode.

Also, the connector now supports crawling of on-premises websites. To access your on-premises data, you must first install and configure the connector agent. To learn more, see Microsoft Graph connector agent.

For your on-premises websites, select Agent as the crawl mode and in the On-prem Agent field, choose the Graph connector agent that you installed and configured earlier.

Authentication

Basic Authentication requires a username and password.

OAuth 2.0 with Azure AD requires a resource ID, a client ID, and a client secret. OAuth 2.0 only works with Cloud mode.

The resource ID, client ID, and client secret values depend on how you set up Azure Active Directory (Azure AD) based authentication for your website:

  1. If you're using an application both as an identity provider and the client app to access the website, the client ID and the resource ID will be the application ID of the app, and the client secret will be the secret that you generated in the app.

    After the client app is configured, make sure you create a new client secret by going to the Certificates & secrets section of the app. Copy the client secret value shown on the page because it won't be displayed again.

    In the following screenshots you can see the steps to obtain the client ID, client secret, and set up the app if you're creating the app on your own.

    • View of the settings on the branding section:

    • View of the settings on authentication section:

      Note

      Your website doesn't need to have the route specified above for the Redirect URI. You need the route only if your website uses the user token sent by Azure for authentication.

    • View of the client ID on the Essentials section:

    • View of the client secret on the Certificates & secrets section:

  2. If you're using an application as an identity provider for your website as the resource, and a different application to access the website, the client ID will be the application ID of your second app and the client secret will be the secret configured in the second app. However, the resource ID will be the ID of your first app.

    You don't need to configure a client secret in this application, but you'll need to add an app role in the App roles section of the app which will later be assigned to your client application. In the following screenshots you can see how to add an app role.

    • Creating a new app role:

    • Editing the new app role:

      After configuring the resource app, you need to create the client app and give it permissions to access the resource app by adding the app role configured above in the API permissions of the client app.

      Note

      To see how to grant permissions to the client app see Quickstart: Configure a client application to access a web API.

    The following screenshots show the section to grant permissions to the client app.

    • Adding a permission:

    • Selecting the permissions:

    • Adding the permissions:

    Once the permissions are assigned, you'll need to create a new client secret for this application by going to the Certificates & secrets section. Copy the client secret value shown on the page because it won't be displayed again. Later, in the admin center, use the application ID from this app as the client ID, the secret from this app as the client secret, and the application ID of the first app as the resource ID.
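
To show how the three OAuth 2.0 values relate to each other, the following Python sketch builds (but doesn't send) a client credentials token request against the Azure AD v1 token endpoint, which accepts a `resource` parameter. Whether the connector uses this exact endpoint is an assumption, and `tenant_id` is a value this article doesn't discuss. The sketch can be useful for checking that a client ID, client secret, and resource ID form a valid combination before you enter them in the admin center.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret, resource_id):
    """Build a client credentials token request for the Azure AD v1 endpoint.

    client_id and client_secret identify the app that accesses the website.
    resource_id is the application ID of the app that protects the website;
    it's the same app in scenario 1 and the first app in scenario 2.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": resource_id,
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; a JSON response containing an `access_token` field indicates that the values are accepted.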

Windows authentication is available only in Agent mode. It requires a username, domain, and password. Enter the username and domain in the Username field in either of the following formats: domain\username or username@domain. Enter the password in the Password field. The username provided must also be an administrator on the server where the agent is installed.
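
The two accepted username formats can be checked with a short Python sketch like the one below. The function name `parse_windows_username` is hypothetical, and the sketch only validates the shape of the value; it doesn't verify that the account exists or that it's an administrator.

```python
def parse_windows_username(value: str) -> tuple[str, str]:
    """Return (domain, username) from 'domain\\username' or 'username@domain'."""
    if "\\" in value:
        domain, _, username = value.partition("\\")
    elif "@" in value:
        username, _, domain = value.rpartition("@")
    else:
        raise ValueError("expected domain\\username or username@domain")
    if not domain or not username:
        raise ValueError("both a domain and a username are required")
    return domain, username


if __name__ == "__main__":
    print(parse_windows_username("CONTOSO\\alice"))
    print(parse_windows_username("alice@contoso.com"))
```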

Step 3a: Meta tag settings

The connector fetches any meta tags your root URLs may have and shows them. You can select which tags to include for crawling.

Meta tag settings with author, locale, and other tags selected.

Selected meta tags will also show up on the Schema page, where you can manage them further (Queryable, Searchable, Retrievable, Refinable).
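
To preview which meta tags a page exposes before you configure the connection, you could use a sketch like the following, built on Python's standard `html.parser` module. It collects `<meta name="..." content="...">` pairs and optionally keeps only a selected subset. The class and function names are hypothetical, and the connector's own extraction logic may differ, for example in how it treats `property` attributes.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.tags[name.lower()] = content


def extract_meta_tags(html, selected=None):
    """Return the page's meta tags, limited to `selected` names if given."""
    collector = MetaTagCollector()
    collector.feed(html)
    if selected is None:
        return collector.tags
    wanted = {name.lower() for name in selected}
    return {name: value for name, value in collector.tags.items() if name in wanted}
```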

Step 3b: Add URLs to exclude (Optional crawl restrictions)

There are two ways to prevent pages from being crawled: disallow them in your robots.txt file or add them to the Exclusion list.

Support for robots.txt

The connector checks to see if there's a robots.txt file for your root site and, if one exists, it will follow and respect the directions found within that file. If you don't want the connector to crawl certain pages or directories on your site, you can call out those pages or directories in the "Disallow" declarations in your robots.txt file.
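
To check in advance how a set of "Disallow" declarations would apply to a given page, you can use Python's standard `urllib.robotparser` module, as in the minimal sketch below. The user agent string that the connector sends isn't stated in this article, so the sketch defaults to the wildcard agent `*`; treat that as an assumption.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://www.contoso.com/private/plan.html"))
    print(allowed_by_robots(rules, "https://www.contoso.com/news/index.html"))
```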

Add URLs to exclude

You can optionally create an Exclusion list to exclude some URLs from getting crawled if that content is sensitive or not worth crawling. To create an exclusion list, browse through the root URL. You can add the excluded URLs to the list during the configuration process.
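
The following Python sketch illustrates one way an exclusion list could be applied. It assumes that an entry excludes the URL itself and everything beneath it in the path hierarchy; this article doesn't define the matching rule, so verify the actual behavior against your crawl results. The function name `is_excluded` is hypothetical.

```python
def is_excluded(url: str, exclusion_list: list[str]) -> bool:
    """Return True if the URL equals an excluded entry or sits beneath one."""
    normalized = url.rstrip("/")
    for entry in exclusion_list:
        prefix = entry.rstrip("/")
        # Assumption: exclusion is path-prefix based, not substring based.
        if normalized == prefix or normalized.startswith(prefix + "/"):
            return True
    return False


if __name__ == "__main__":
    excluded = ["https://www.contoso.com/hr/salaries"]
    print(is_excluded("https://www.contoso.com/hr/salaries/2023.html", excluded))
    print(is_excluded("https://www.contoso.com/hr/benefits.html", excluded))
```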

Step 4: Assign property labels

You can assign a source property to each label by choosing from a menu of options. While this step isn't mandatory, having some property labels will improve the search relevance and ensure more accurate search results for end users.

Step 5: Manage schema

On the Manage Schema screen, you can change the schema attributes (the options are Query, Search, Retrieve, and Refine) associated with the properties, add optional aliases, and choose the Content property.

Step 6: Manage search permissions

The Enterprise websites connector only supports search permissions visible to Everyone. Indexed data appears in the search results and is visible to all users in the organization.

Step 7: Set the refresh schedule

The Enterprise websites connector only supports a full refresh. This means that the connector recrawls all the website's content during every refresh. To make sure the connector gets enough time to crawl the content, we recommend that you set a long refresh interval, between one and two weeks.

Step 8: Review connection

Follow the general setup instructions.

Troubleshooting

When reading the website's content, the crawl may encounter some source errors, which are represented by the detailed error codes below. To get more information on the types of errors, go to the error details page after selecting the connection. Select the error code to see more detailed errors. Also refer to Monitor your connections to learn more.

Detailed error code   Error message
6001                  The site being indexed isn't reachable.
6005                  The source page being indexed is blocked by the robots.txt configuration.
6008                  Unable to resolve the DNS.
6009                  For all client-side errors (except HTTP 404 and 408), refer to the HTTP 4xx error codes for details.
6013                  The source page being indexed couldn't be found. (HTTP 404 error)
6018                  The source page isn't responding, and the request has timed out. (HTTP 408 error)
6021                  The source page being indexed has no textual content.
6023                  The source page being indexed is unsupported (not an HTML page).
6024                  The source page being indexed has unsupported content.
  • Errors 6001-6013 occur when the data source isn't reachable due to a network issue or when the data source itself is deleted, moved, or renamed. Check if the data source details provided are still valid.
  • Errors 6021-6024 occur when the data source contains non-textual content on the page or when the page isn't an HTML page. Check the data source, and either add the page to the exclusion list or ignore the error.
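
If you triage crawl errors in a script, the table and guidance above can be encoded as a small lookup. The Python sketch below maps an HTTP status to the detailed error code that the table associates with it and returns the suggested action for a code range. The function names are hypothetical, and the sketch covers only the codes and ranges listed in this article.

```python
def error_code_for_http_status(status: int):
    """Map an HTTP status to the detailed error code listed in the table."""
    if status == 404:
        return 6013  # page not found
    if status == 408:
        return 6018  # request timed out
    if 400 <= status < 500:
        return 6009  # all other client-side errors
    return None  # the table doesn't list a code for other statuses


def suggested_action(code: int):
    """Return the guidance given in this article for a detailed error code."""
    if 6001 <= code <= 6013:
        return "Check whether the data source details provided are still valid."
    if 6021 <= code <= 6024:
        return "Add the page to the exclusion list or ignore the error."
    return None  # no specific guidance is given for other codes


if __name__ == "__main__":
    code = error_code_for_http_status(404)
    print(code, suggested_action(code))
```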