Quickstart: Translate text and recognize entities using the Import data wizard

Learn how AI enrichment in Azure Cognitive Search adds language detection, text translation, and entity recognition to create searchable content in a search index.

In this quickstart, you'll run the Import data wizard to analyze French and Spanish descriptions of several national museums located in Spain. Output is a searchable index containing translated text and entities, queryable in the portal using Search explorer.

To prepare, you'll create a few resources and upload sample files before running the wizard.

Prefer to start with code? Try the .NET tutorial, Python tutorial, or REST tutorial instead.

Prerequisites

Before you begin, have the following prerequisites in place:

Note

This quickstart also uses Cognitive Services for the AI. Because the workload is so small, Cognitive Services is tapped behind the scenes for free processing for up to 20 transactions. This means that you can complete this exercise without having to create an additional Cognitive Services resource.

Set up your data

In the following steps, set up a blob container in Azure Storage to store heterogeneous content files.

  1. Download sample data from GitHub. There are multiple data sets. Use the files in the spanish-museums folder for this quickstart.

  2. Upload the sample data to a blob container.

    1. Sign in to the Azure portal and find your storage account.
    2. In the left navigation pane, select Containers.
    3. Create a container named "spanish-museums". Use the default public access level.
    4. In the "spanish-museums" container, select Upload to upload the files from your local spanish-museums folder.

You should have 10 files containing French and Spanish descriptions of national museums located in Spain.

List of docx files in a blob container

You are now ready to move on to the Import data wizard.

Run the Import data wizard

  1. Sign in to the Azure portal with your Azure account.

  2. Find your search service and on the Overview page, click Import data on the command bar to set up cognitive enrichment in four steps.

    Screenshot of the Import data command

Step 1 - Create a data source

  1. In Connect to your data, choose Azure Blob Storage. Choose an existing connection to the storage account and container you created. Give the data source a name, and use default values for the rest.

    Azure blob configuration

Step 2 - Add cognitive skills

Next, configure AI enrichment to invoke language detection, text translation, and entity recognition.

  1. For this quickstart, we are using the Free Cognitive Services resource. The sample data consists of 10 files, so the daily, per-indexer allotment of 20 free transactions on Cognitive Services is sufficient for this quickstart.

    Attach free Cognitive Services processing

  2. In the same page, expand Add enrichments and make five selections:

    Choose entity recognition (people, organizations, locations)

    Choose language detection and text translation

    Attach Cognitive Services select services for skillset

    In blobs, the "Content" field contains the content of the file. In the sample data, the content is multiple paragraphs about a given museum, in either French or Spanish. The "Granularity" is the field itself. Some skills work better on smaller chunks of text, but for the skills in this quickstart, field granularity is sufficient.
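The selections above map onto a skillset definition that the wizard creates for you. The following is a minimal sketch of what such a definition might look like if you built it yourself as a REST request body. The skillset name is hypothetical, the output target names are assumed to match the field names used in the query later in this quickstart, and the property names follow the skillset schema as best understood here; check them against the REST reference before relying on them.

```python
import json


def build_skillset(name: str) -> dict:
    """Build a skillset body with language detection, translation, and entity recognition.

    Every skill uses the whole "content" field as input, which corresponds
    to the field-level granularity described in this step.
    """
    content_input = {"name": "text", "source": "/document/content"}
    return {
        "name": name,
        "description": "Detect language, translate to English, extract entities",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
                "context": "/document",
                "inputs": [content_input],
                "outputs": [{"name": "languageCode", "targetName": "language"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
                "context": "/document",
                "defaultToLanguageCode": "en",
                "inputs": [content_input],
                "outputs": [{"name": "translatedText", "targetName": "translated_text"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
                "context": "/document",
                "categories": ["Person", "Organization", "Location"],
                "inputs": [content_input],
                "outputs": [
                    {"name": "persons", "targetName": "people"},
                    {"name": "organizations", "targetName": "organizations"},
                    {"name": "locations", "targetName": "locations"},
                ],
            },
        ],
    }


if __name__ == "__main__":
    # "spanish-museums-skillset" is a made-up name for illustration only.
    print(json.dumps(build_skillset("spanish-museums-skillset"), indent=2))
```

The sketch only builds and prints the JSON. It does not call the service, so it is safe to run without credentials.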

Step 3 - Configure the index

An index contains your searchable content, and the Import data wizard can usually infer the schema for you by sampling the data. In this step, review the generated schema and revise any settings as needed. Below is the default schema created for the demo data set.

For this quickstart, the wizard does a good job setting reasonable defaults:

  • Default fields are based on properties for existing blobs plus new fields to contain enrichment output (for example, people, organizations, locations). Data types are inferred from metadata and by data sampling.

  • Default document key is metadata_storage_path (selected because the field contains unique values).

  • Default attributes are Retrievable and Searchable. Searchable allows full text search of a field. Retrievable means field values can be returned in results. The wizard assumes you want these fields to be retrievable and searchable because you created them via a skillset.

  • Select the Filterable checkbox for "Language". The wizard won't set this attribute for you, but the ability to filter by language is useful in this demo given that there are multiple languages.

    Index fields

Marking a field as Retrievable does not mean that the field must be present in the search results. You can precisely control the composition of search results by using the $select query parameter to specify which fields to include. For text-heavy fields like content, $select lets you keep results manageable for the human users of your application, while the Retrievable attribute ensures client code can still access all the information it needs.
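To make the attribute choices in this step concrete, here is a minimal sketch of how a few of the fields could be expressed as index field definitions. The `field` helper is hypothetical, and the data types shown are assumptions about what the wizard infers for this data set; only the attribute names (searchable, retrievable, filterable, key) come from the index schema itself.

```python
import json


def field(name, edm_type, *, key=False, searchable=True,
          retrievable=True, filterable=False):
    """Return one index field definition with the attributes discussed above."""
    return {
        "name": name,
        "type": edm_type,
        "key": key,
        "searchable": searchable,
        "retrievable": retrievable,
        "filterable": filterable,
    }


# Assumed types: single strings for text, string collections for entities.
fields = [
    field("metadata_storage_path", "Edm.String", key=True),
    field("content", "Edm.String"),
    field("people", "Collection(Edm.String)"),
    field("organizations", "Collection(Edm.String)"),
    field("locations", "Collection(Edm.String)"),
    field("translated_text", "Edm.String"),
    # Filterable is the one attribute this step asks you to set by hand.
    field("language", "Edm.String", filterable=True),
]

if __name__ == "__main__":
    print(json.dumps(fields, indent=2))
```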

Step 4 - Configure the indexer

The indexer is a high-level resource that drives the indexing process. It specifies the data source name, a target index, and frequency of execution. The Import data wizard creates several objects, and one of them is always an indexer that you can run repeatedly.

  1. In the Indexer page, you can accept the default name and click the Once schedule option to run it immediately.

    Indexer definition

  2. Click Submit to create and simultaneously run the indexer.
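As a rough illustration of what an indexer specifies, the sketch below builds an indexer definition as a REST-style request body. All resource names are hypothetical. Choosing Once is modeled here by omitting the schedule, which is an assumption about how a one-time run is represented; a recurring run would carry an ISO 8601 interval such as "PT2H".

```python
import json
from typing import Optional


def build_indexer(name: str, data_source: str, index: str, skillset: str,
                  interval: Optional[str] = None) -> dict:
    """Build an indexer body linking a data source, a skillset, and a target index."""
    body = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
    }
    # No schedule means the indexer runs once on creation and then on demand.
    if interval is not None:
        body["schedule"] = {"interval": interval}
    return body


if __name__ == "__main__":
    # Made-up names for illustration only.
    run_once = build_indexer("museums-indexer", "museums-ds",
                             "museums-index", "museums-skillset")
    print(json.dumps(run_once, indent=2))
```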

Monitor status

Cognitive skills indexing takes longer to complete than typical text-based indexing. To monitor progress, go to the Overview page and select the Indexers tab in the middle of the page.

Indexer status

To check details about execution status, select an indexer from the list.
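The same execution details are available outside the portal through the indexer status endpoint of the REST API. The sketch below does not call the service. It only shows one way to condense a status response into a single line, assuming the response carries a lastResult object with status, itemsProcessed, itemsFailed, and warnings. The sample payload is invented for illustration and is not real output.

```python
def summarize_status(payload: dict) -> str:
    """Condense an indexer status response into a one-line summary."""
    last = payload.get("lastResult") or {}
    status = last.get("status", "unknown")
    processed = last.get("itemsProcessed", 0)
    failed = last.get("itemsFailed", 0)
    warnings = len(last.get("warnings") or [])
    return f"{status}: {processed} processed, {failed} failed, {warnings} warnings"


if __name__ == "__main__":
    # Hypothetical payload shaped like a status response for the 10 sample files.
    sample = {
        "status": "running",
        "lastResult": {
            "status": "success",
            "itemsProcessed": 10,
            "itemsFailed": 0,
            "warnings": [],
        },
    }
    print(summarize_status(sample))
```

An indexer that has never run has no last result, which the function reports as "unknown" with zero counts.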

Query in Search explorer

After an index is created, you can run queries to return results. In the portal, use Search explorer for this task.

  1. On the search service dashboard page, click Search explorer on the command bar.

  2. Select Change Index at the top to select the index you created.

  3. In Query string, enter a search string to query the index, such as search="picasso museum" &$select=people,organizations,locations,language,translated_text &$count=true &$filter=language eq 'fr', and then select Search.

    Query string in search explorer
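The query string in the last step can also be sent to the service as a GET request. The sketch below only assembles the URL using the standard library. The service and index names are hypothetical, the api-version value is an assumption you should replace with the version you target, and an actual request would also need an api-key header, which is omitted here.

```python
from urllib.parse import quote, urlencode


def build_query_url(service: str, index: str, params: dict,
                    api_version: str = "2020-06-30") -> str:
    """Assemble a search URL, percent-encoding spaces and quotes in the values."""
    query = urlencode(
        {"api-version": api_version, **params},
        quote_via=quote,  # encode spaces as %20 instead of '+'
        safe="$,",        # keep '$' in parameter names and ',' in field lists readable
    )
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


if __name__ == "__main__":
    # Same parameters as the Search explorer query string above.
    url = build_query_url(
        "my-search-service",  # hypothetical service name
        "my-museums-index",   # hypothetical index name
        {
            "search": '"picasso museum"',
            "$select": "people,organizations,locations,language,translated_text",
            "$count": "true",
            "$filter": "language eq 'fr'",
        },
    )
    print(url)
```

Building the parameters as a dictionary keeps each one on its own line, which makes it easier to spot a misspelled field name than in a single long string.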

Results are returned as JSON, which can be verbose and hard to read, especially in large documents originating from Azure blobs. Some tips for searching in this tool include the following techniques:

  • Append $select to specify which fields to include in results.

  • Use CTRL-F to search within the JSON for specific properties or terms.

    Search explorer example

Query strings are case-sensitive, so if you get an "unknown field" message, check Fields or Index Definition (JSON) to verify name and case.
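Because field names must match exactly, a small check against the index's field list can catch case mistakes before you run a query. This is a hypothetical helper, not part of any SDK: it compares the names you plan to put in $select with the names defined in the index and suggests the correctly cased name when only the case differs.

```python
def check_field_names(requested, index_fields):
    """Map each unknown requested name to the correctly cased name, or None."""
    known = set(index_fields)
    by_lower = {name.lower(): name for name in index_fields}
    problems = {}
    for name in requested:
        if name not in known:
            # A case-only mismatch gets a suggestion; anything else gets None.
            problems[name] = by_lower.get(name.lower())
    return problems


if __name__ == "__main__":
    # Field names as used in this quickstart's query.
    index_fields = ["people", "organizations", "locations",
                    "language", "translated_text"]
    for bad, suggestion in check_field_names(
            ["People", "language", "city"], index_fields).items():
        hint = f"did you mean '{suggestion}'?" if suggestion else "no such field"
        print(f"{bad}: {hint}")
```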

Clean up resources

When you're working in your own subscription, it's a good idea at the end of a project to identify whether you still need the resources you created. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left-navigation pane.

If you are using a free service, remember that you are limited to three indexes, indexers, and data sources. You can delete individual items in the portal to stay under the limit.

Next steps

Cognitive Search has other built-in skills that can be exercised in the Import data wizard. As a next step, try the OCR and image analysis skills to create text-searchable content from image files.