Enterprise websites connector

With the Enterprise websites connector, your organization can index articles and content from its internal-facing websites. After you configure the connector and sync content from the website, end users can search for that content from any Microsoft Search client.

This article is for Microsoft 365 administrators or anyone who configures, runs, and monitors an Enterprise websites connector. It explains how to configure the connector and describes its capabilities, limitations, and troubleshooting techniques.

Connect to a data source

To connect to your data source, you need to fill in the root URL of the website and the type of authentication you'd like to use: None, Basic Authentication, or OAuth 2.0 with Azure Active Directory (Azure AD).

URL

Use the URL field to specify the root of the website that you'd like to crawl. The Enterprise websites connector uses this URL as the starting point of its crawl and follows all the links it finds from there.
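The crawl behavior described above can be pictured with a minimal sketch. This is not the connector's actual implementation: the names crawl, extract_links, and fetch are hypothetical, and the rule that only links under the root URL are followed is an assumption made to keep the example bounded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute, fragment-free links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        links.append(absolute)
    return links


def crawl(root_url, fetch):
    """Breadth-first crawl from root_url.

    fetch(url) is a caller-supplied function (hypothetical) that returns
    the page HTML as a string, or None if the page cannot be read.
    Assumption: only links that start with root_url are followed.
    """
    seen = [root_url]
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link.startswith(root_url) and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen
```

Passing fetch as a parameter keeps the sketch independent of any HTTP library, so it can be tried against an in-memory dictionary of pages before pointing it at a real site.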

Authentication

Basic Authentication requires a username and password. Create this bot account by using the Microsoft 365 admin center.
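To check that the account works against your website before you configure the connector, it can help to know what a Basic Authentication request carries. The sketch below builds the standard HTTP Authorization header from a username and password; the function name basic_auth_header is hypothetical, and how the connector itself sends credentials is not documented here.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header for the given credentials."""
    # The scheme is "Basic " followed by base64("username:password").
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because the credentials are only encoded, not encrypted, this scheme is normally used over HTTPS.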

OAuth 2.0 with Azure AD requires a resource ID, client ID, and client secret.

For more information, see Authorize access to Azure Active Directory web applications using OAuth 2.0 code grant flow. Register with the following values:

Name: Microsoft Search
Redirect_URI: https://gcs.office.com/v1.0/admin/oauth/callback

To get the values for the resource, client_id, and client_secret, go to Use the authorization code to request an access token on the redirect URL webpage.

For more information, see Quickstart: Register an application with the Microsoft identity platform.
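As a rough illustration of how the resource, client_id, and client_secret values fit together, the sketch below assembles the form body for an authorization code token request. It assumes the Azure AD v1.0 token endpoint shape (https://login.microsoftonline.com/{tenant}/oauth2/token); confirm the endpoint and parameters against the linked articles. The function name build_token_request and all argument values are hypothetical, and the request is only built, never sent.

```python
import urllib.parse

# Redirect URI taken from the registration values in this article.
REDIRECT_URI = "https://gcs.office.com/v1.0/admin/oauth/callback"


def build_token_request(tenant, client_id, client_secret, resource, code):
    """Return (url, body) for an authorization code token request.

    Assumption: the Azure AD v1.0 token endpoint is used. The body is
    application/x-www-form-urlencoded and would be sent with HTTP POST.
    """
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/token"
    body = urllib.parse.urlencode(
        {
            "grant_type": "authorization_code",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": resource,
            "code": code,
            "redirect_uri": REDIRECT_URI,
        }
    )
    return url, body
```

The redirect_uri in the token request has to match the one used when the application was registered, which is why it is kept as a single constant here.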

Add URLs to exclude

You can optionally create an exclusion list to keep some URLs from being crawled, for example if their content is sensitive or not worth crawling. To create an exclusion list, browse through the root URL. You can add URLs to the list during the configuration process.
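When planning an exclusion list, it can be useful to test which URLs a given list would cover. The sketch below assumes that an excluded URL also excludes everything beneath it (prefix matching on path boundaries); the connector's actual matching rules are not described in this article, and the name is_excluded is hypothetical.

```python
def is_excluded(url, exclusion_list):
    """Return True if url equals an excluded URL or sits beneath one.

    Assumption: exclusion works by prefix on path boundaries, so excluding
    ".../hr/" covers ".../hr/payroll" but not ".../hrnews".
    """
    for excluded in exclusion_list:
        base = excluded.rstrip("/")
        if url.rstrip("/") == base or url.startswith(base + "/"):
            return True
    return False
```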

Manage search permissions

The Enterprise websites connector supports only search permissions that are visible to Everyone. Indexed data appears in the search results and is visible to all users in the organization.

Assign property labels

You can assign a source property to each label by choosing from a menu of options. While this step is not mandatory, having some property labels will improve the search relevance and ensure more accurate search results for end users.

Manage schema

On the Manage Schema screen, you have the option to change the schema attributes (queryable, searchable, retrievable, and refinable) associated with the properties, add optional aliases, and choose the Content property.

Set the refresh schedule

The Enterprise websites connector supports only a full refresh. This means that the connector recrawls all the website's content during every refresh. To give the connector enough time to crawl the content, we recommend a large refresh interval, between one and two weeks.

Troubleshooting

When reading the website's content, the crawl may encounter source errors, which are represented by the detailed error codes below. To get more information on the types of errors, select the connection and go to the error details page. Select an error code to see more detailed errors. To learn more, see Manage your connector.

Detailed error code    Error message
6001                   The site being indexed is not reachable.
6005                   The source page being indexed is blocked by the robots.txt configuration.
6008                   Unable to resolve the DNS.
6009                   For all client-side errors (except HTTP 404 and 408), refer to the HTTP 4xx error codes for details.
6013                   The source page being indexed could not be found. (HTTP 404 error)
6018                   The source page is not responding, and the request has timed out. (HTTP 408 error)
6021                   The source page being indexed has no textual content.
6023                   The source page being indexed is unsupported (not an HTML page).
6024                   The source page being indexed has unsupported content.
  • Errors 6001-6013 occur when the data source is not reachable due to a network issue or when the data source itself is deleted, moved, or renamed. Check if the data source details provided are still valid.
  • Errors 6021-6024 occur when the page contains non-textual content or when the page is not an HTML page. Check the data source, and then either add the page to the exclusion list or ignore the error.
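Some of these errors can be reproduced outside the connector while troubleshooting. The sketch below does two things: it maps an HTTP status code to the detailed error code listed in the table above (404 to 6013, 408 to 6018, other 4xx to 6009), and it checks whether a robots.txt file would block a URL, which corresponds to error 6005. The function names are hypothetical, and the user agent that the connector actually sends is not documented here, so "*" is used as a placeholder.

```python
from urllib.robotparser import RobotFileParser


def detailed_error_for_status(status):
    """Map an HTTP status code to the detailed error code from the table.

    Returns None for statuses the table does not cover.
    """
    if status == 404:
        return 6013
    if status == 408:
        return 6018
    if 400 <= status <= 499:
        return 6009
    return None


def blocked_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows url.

    Assumption: user_agent "*" stands in for the connector's real
    user agent, which this article does not specify.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

If blocked_by_robots returns True for a page you expect to be indexed, review the Disallow rules in the site's robots.txt file.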

Limitations

The Enterprise websites connector doesn't support searching data on dynamic webpages. Examples include pages served by content management systems such as Confluence and Unily, or by databases that store website content.