What is "cognitive search" in Azure Search?
Cognitive search is an AI feature in Azure Search that extracts text from images, blobs, and other unstructured data sources, enriching the content to make it more searchable in an Azure Search index. Extraction and enrichment are implemented through cognitive skills attached to an indexing pipeline. AI enrichments are supported in the following ways:
- Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, and sentiment detection. With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
- Image processing skills include Optical Character Recognition (OCR) and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks), or attributes like colors or image orientation. You can create text representations of image content, searchable using all the query capabilities of Azure Search.
Natural language and image processing is applied during the data ingestion phase, with results becoming part of a document's composition in a searchable index in Azure Search. Data is sourced as an Azure data set and then pushed through an indexing pipeline using whichever built-in skills you need. The architecture is extensible so if the built-in skills are not sufficient, you can create and attach custom skills to integrate custom processing. Examples might be a custom entity module or document classifier targeting a specific domain such as finance, scientific publications, or medicine.
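As a rough illustration of the custom-skill idea, the sketch below shows the kind of record-in, record-out function a domain classifier might expose. It assumes the custom skill contract is a JSON payload with a `values` array whose items carry a `recordId` and a `data` object; that shape, the `FINANCE_TERMS` list, and the `domain` output name are assumptions made for this example and should be checked against the custom skill reference.

```python
import json

# Hypothetical keyword list standing in for a real domain classifier.
FINANCE_TERMS = ("dividend", "equity", "bond", "interest rate")


def classify_records(payload: dict) -> dict:
    """Label each incoming record as 'finance' or 'other'.

    Assumes a request shaped like {"values": [{"recordId": ..., "data": {"text": ...}}]}
    and returns one result per record, keyed by the same recordId.
    """
    results = []
    for record in payload.get("values", []):
        text = record.get("data", {}).get("text", "").lower()
        label = "finance" if any(term in text for term in FINANCE_TERMS) else "other"
        results.append(
            {
                "recordId": record["recordId"],
                "data": {"domain": label},  # hypothetical output name
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


request_body = {
    "values": [
        {"recordId": "1", "data": {"text": "The bond pays a fixed interest rate."}},
        {"recordId": "2", "data": {"text": "The enzyme binds to the receptor."}},
    ]
}
response_body = classify_records(request_body)
print(json.dumps(response_body, indent=2))
```

A function like this would typically sit behind an HTTP endpoint that the skillset calls during indexing; hosting and authentication are out of scope for the sketch.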
Starting December 21, 2018, you can attach a Cognitive Services resource to an Azure Search skillset. This allows us to start charging for skillset execution. On this date, we also began charging for image extraction as part of the document-cracking stage. Text extraction from documents continues to be offered at no additional cost.
Execution of built-in skills is a Cognitive Services charge, billed at the existing pay-as-you-go price. Image extraction pricing is an Azure Search charge, currently billed at preview pricing as described on the Azure Search pricing page.
Components of cognitive search
The cognitive search pipeline is based on Azure Search indexers that crawl data sources and provide end-to-end index processing. Skills are now attached to indexers, intercepting and enriching documents according to the skillset you define. Once indexed, you can access content via search requests through all query types supported by Azure Search. If you are new to indexers, this section walks you through the steps.
Step 1: Connection and document cracking phase
At the start of the pipeline, you have unstructured text or non-text content (such as image and scanned document JPEG files). Data must exist in an Azure data storage service that can be accessed by an indexer. Indexers can "crack" source documents to extract text from source data.
Supported sources include Azure blob storage, Azure table storage, Azure SQL Database, and Azure Cosmos DB. Text-based content can be extracted from the following file types: PDF, Word, PowerPoint, and CSV files. For the full list, see Supported formats.
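To make the connection step concrete, here is a minimal sketch of a data source definition for blob storage, built as a Python dictionary and serialized to JSON. The resource name, container name, and placeholder connection string are hypothetical, and the property names (`type`, `credentials`, `container`) are written as assumptions to verify against the Create Data Source reference.

```python
import json


def build_data_source(name: str, connection_string: str, container: str) -> dict:
    """Return a request body describing a blob storage data source."""
    return {
        "name": name,
        "type": "azureblob",  # assumed type identifier for Azure blob storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


# Hypothetical names; substitute your own storage account details.
data_source = build_data_source(
    name="demo-blob-ds",
    connection_string="<your-storage-connection-string>",
    container="demo-container",
)
print(json.dumps(data_source, indent=2))
```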
Step 2: Cognitive skills and enrichment phase
Enrichment is performed by cognitive skills that carry out atomic operations. For example, once you have text content from a PDF, you can apply entity recognition, language detection, or key phrase extraction to produce new fields in your index that are not available natively in the source. Altogether, the collection of skills used in your pipeline is called a skillset.
A skillset is based on predefined cognitive skills or custom skills you provide and connect to the skillset. A skillset can be minimal or highly complex, and determines not only the type of processing, but also the order of operations. A skillset plus the field mappings defined as part of an indexer fully specifies the enrichment pipeline. For more information about pulling all of these pieces together, see Define a skillset.
Internally, the pipeline generates a collection of enriched documents. You can decide which parts of the enriched documents should be mapped to indexable fields in your search index. For example, if you applied the key phrase extraction and entity recognition skills, their outputs would become new fields in the enriched document, and you can map them to fields in your index. See Annotations to learn more about input/output formations.
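The following sketch shows how a small skillset might be assembled and how its outputs become nodes of the enriched document. The `@odata.type` identifiers, the input and output names, and the `/document/...` paths are written from memory of the preview skill reference and are assumptions to confirm in Define a skillset; the `skill` helper is hypothetical.

```python
import json


def skill(odata_type: str, inputs: dict, outputs: dict, **extra) -> dict:
    """Build one skill definition from name->source and name->targetName maps."""
    return {
        "@odata.type": odata_type,
        "inputs": [{"name": n, "source": s} for n, s in inputs.items()],
        "outputs": [{"name": n, "targetName": t} for n, t in outputs.items()],
        **extra,
    }


# Order matters: key phrase extraction consumes the detected language code.
skillset = {
    "description": "Detect language, then extract key phrases and organizations",
    "skills": [
        skill(
            "#Microsoft.Skills.Text.LanguageDetectionSkill",
            {"text": "/document/content"},
            {"languageCode": "languageCode"},
        ),
        skill(
            "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            {"text": "/document/content", "languageCode": "/document/languageCode"},
            {"keyPhrases": "keyPhrases"},
        ),
        skill(
            "#Microsoft.Skills.Text.NamedEntityRecognitionSkill",
            {"text": "/document/content"},
            {"organizations": "organizations"},
            categories=["Organization"],
        ),
    ],
}


def enrichment_paths(definition: dict) -> list:
    """List the enriched-document paths produced by every skill output."""
    return [
        "/document/" + output["targetName"]
        for s in definition["skills"]
        for output in s["outputs"]
    ]


print(json.dumps(skillset, indent=2))
print(enrichment_paths(skillset))
```

The paths returned by `enrichment_paths` are the candidates you would later map to index fields in the indexer definition.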
Step 3: Search index and query-based access
When processing is finished, you have a search corpus consisting of enriched documents, fully text-searchable in Azure Search. Querying the index is how developers and users access the enriched content generated by the pipeline.
The index is like any other you might create for Azure Search: you can supplement with custom analyzers, invoke fuzzy search queries, add filtered search, or experiment with scoring profiles to reshape the search results.
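As an illustration of querying enriched content, the sketch below builds a search URL that combines a fuzzy term with a filter. The service name, index name, and the `languageCode` field are hypothetical, and the sketch assumes that fuzzy matching uses the full Lucene query syntax (`queryType=full`) and that the filtered field was marked filterable in the index.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(service, index, search, filter_expr=None,
                    api_version="2017-11-11-Preview"):
    """Return a GET URL for a search request against one index."""
    params = {"api-version": api_version, "search": search, "queryType": "full"}
    if filter_expr:
        params["$filter"] = filter_expr
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{urlencode(params)}"


# "microsft~1" asks for terms within one edit of the misspelled word.
url = build_query_url(
    "demo-service", "demo-index", "microsft~1", filter_expr="languageCode eq 'en'"
)
print(url)

# Decode the query string again to confirm what would be sent.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Sending the request also requires an `api-key` header, which is omitted here because the sketch only constructs the URL.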
Indexes are generated from an index schema that defines the fields, attributes, and other constructs attached to a specific index, such as scoring profiles and synonym maps. Once an index is defined and populated, you can index incrementally to pick up new and updated source documents. Certain modifications require a full rebuild. You should use a small data set until the schema design is stable. For more information, see How to rebuild an index.
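A minimal index schema for an enrichment pipeline might look like the sketch below: fields that come straight from the source plus fields stubbed out for generated values. The index and field names are hypothetical, and the attribute names are assumptions to check against the Create Index reference. The small helper verifies that the schema declares exactly one key field, which is a common mistake to catch before sending the request.

```python
import json

index_schema = {
    "name": "demo-index",  # hypothetical index name
    "fields": [
        # Fields populated from source data.
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
        {"name": "content", "type": "Edm.String", "searchable": True},
        # Fields stubbed out to hold values generated during enrichment.
        {"name": "languageCode", "type": "Edm.String", "searchable": True, "filterable": True},
        {"name": "keyPhrases", "type": "Collection(Edm.String)", "searchable": True, "filterable": True},
        {"name": "organizations", "type": "Collection(Edm.String)", "searchable": True, "filterable": True},
    ],
}


def key_fields(schema: dict) -> list:
    """Return the names of all fields flagged as the document key."""
    return [f["name"] for f in schema["fields"] if f.get("key")]


keys = key_fields(index_schema)
if len(keys) != 1:
    raise ValueError(f"expected exactly one key field, found {keys}")
print(json.dumps(index_schema, indent=2))
```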
Key features and concepts
|Concept|Description|Links|
|---|---|---|
|Skillset|A top-level named resource containing a collection of skills. A skillset is the enrichment pipeline. It is invoked during indexing by an indexer.|Define a skillset|
|Cognitive skill|An atomic transformation in an enrichment pipeline. Often, it is a component that extracts or infers structure, and therefore augments your understanding of the input data. Almost always, the output is text-based and the processing is natural language processing or image processing that extracts or generates text from image inputs. Output from a skill can be mapped to a field in an index, or used as an input for a downstream enrichment. A skill is either predefined and provided by Microsoft, or custom: created and deployed by you.|Predefined skills|
|Data extraction|Covers a broad range of processing, but pertaining to cognitive search, the named entity recognition skill is most typically used to extract data (an entity) from a source that doesn't provide that information natively.|Named Entity Recognition Skill|
|Image processing|Infers text from an image, such as the ability to recognize a landmark, or extracts text from an image. Common examples include OCR for lifting characters from a scanned document (JPEG) file, or recognizing a street name in a photograph containing a street sign.|Image Analysis Skill or OCR Skill|
|Natural language processing|Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing.|Key Phrase Extraction Skill, Language Detection Skill, Sentiment Analysis Skill|
|Document cracking|The process of extracting or creating text content from non-text sources during indexing. Optical character recognition (OCR) is an example, but generally it refers to core indexer functionality as the indexer extracts content from application files. The data source providing source file location, and the indexer definition providing field mappings, are both key factors in document cracking.|See Indexers|
|Shaping|Consolidate text fragments into a larger structure, or conversely break down larger text chunks into a manageable size for further downstream processing.|Shaper Skill, Text Merger Skill, Text Split Skill|
|Enriched documents|A transitory internal structure, not directly accessible in code. Enriched documents are generated during processing, but only final outputs are persisted in a search index. Field mappings determine which data elements are added to the index.|See Accessing enriched documents.|
|Indexer|A crawler that extracts searchable data and metadata from an external data source and populates an index based on field-to-field mappings between the index and your data source for document cracking. For cognitive search enrichments, the indexer invokes a skillset, and contains the field mappings associating enrichment output to target fields in the index. The indexer definition contains all of the instructions and references for pipeline operations, and the pipeline is invoked when you run the indexer.|Indexers|
|Data Source|An object used by an indexer to connect to an external data source of supported types on Azure.|See Indexers|
|Index|A persisted search corpus in Azure Search, built from an index schema that defines field structure and usage.|Indexes in Azure Search|
Where do I start?
Step 1: Create an Azure Search resource in a region providing the APIs
- West Central US
- South Central US
- East US
- East US 2
- West US 2
- Canada Central
- West Europe
- UK South
- North Europe
- Brazil South
- Southeast Asia
- Central India
- Australia East
Step 2: Hands-on experience to master the workflow
Step 3: Review the API (REST only)
Currently, only REST APIs are provided. Use api-version=2017-11-11-Preview on all requests. Use the following APIs to build a cognitive search solution. Only two APIs are added or extended for cognitive search. Other APIs have the same syntax as the generally available versions.
|REST API|Description|
|---|---|
|Create Data Source|A resource identifying an external data source providing source data used to create enriched documents.|
|Create Skillset (api-version=2017-11-11-Preview)|A resource coordinating the use of predefined skills and custom cognitive skills used in an enrichment pipeline during indexing.|
|Create Index|A schema expressing an Azure Search index. Fields in the index map to fields in source data or to fields manufactured during the enrichment phase (for example, a field for organization names created by entity recognition).|
|Create Indexer (api-version=2017-11-11-Preview)|A resource defining components used during indexing: including a data source, a skillset, field associations from source and intermediary data structures to target index, and the index itself. Running the indexer is the trigger for data ingestion and enrichment. The output is a search corpus based on the index schema, populated with source data, enriched through skillsets.|
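The four resources above share a common request pattern, which the sketch below captures with Python's standard library. It only builds the request objects and does not send them. The service name, resource name, and key placeholder are hypothetical, and the sketch assumes the PUT form that addresses a resource by name; the POST form sends the same definition to the collection URL instead.

```python
import json
import urllib.request

API_VERSION = "2017-11-11-Preview"

# URL path segment for each resource type (assumed from the REST conventions).
COLLECTIONS = {
    "data source": "datasources",
    "skillset": "skillsets",
    "index": "indexes",
    "indexer": "indexers",
}


def resource_url(service: str, kind: str, name: str) -> str:
    """Return the URL of a named resource, including the preview api-version."""
    return (
        f"https://{service}.search.windows.net/"
        f"{COLLECTIONS[kind]}/{name}?api-version={API_VERSION}"
    )


def build_put_request(service, kind, name, body, api_key):
    """Build (but do not send) a create-or-update request for one resource."""
    return urllib.request.Request(
        resource_url(service, kind, name),
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


request = build_put_request(
    "demo-service", "skillset", "demo-skillset",
    {"description": "placeholder", "skills": []},
    api_key="<your-admin-api-key>",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```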
Checklist: A typical workflow
1. Subset your Azure source data into a representative sample. Indexing takes time, so start with a small, representative data set and then build it up incrementally as your solution matures.
2. Create a data source object in Azure Search to provide a connection string for data retrieval.
3. Create a skillset with enrichment steps.
4. Define the index schema. The Fields collection includes fields from source data. You should also stub out additional fields to hold generated values for content created during enrichment.
5. Define the indexer referencing the data source, skillset, and index.
6. Within the indexer, add outputFieldMappings. This section maps output from the skillset (in step 3) to the input fields in the index schema (in step 4).
7. Send the Create Indexer request (a POST request with the indexer definition in the request body) to create the indexer in Azure Search. This step is how you run the indexer, invoking the pipeline.
8. Run queries to evaluate results and modify code to update skillsets, schema, or indexer configuration.
9. Reset the indexer before rebuilding the pipeline.
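The indexer ties the checklist together, so a sketch of its definition may help. The example below references a data source, skillset, and index by hypothetical names and maps skillset outputs to index fields through outputFieldMappings. The property names and the reset URL are assumptions to verify against the Create Indexer and Reset Indexer references. The helper flags mappings whose target field is missing from the index schema, which is worth checking before running the pipeline.

```python
import json

API_VERSION = "2017-11-11-Preview"

# Hypothetical resource names used only for illustration.
indexer = {
    "name": "demo-indexer",
    "dataSourceName": "demo-blob-ds",
    "skillsetName": "demo-skillset",
    "targetIndexName": "demo-index",
    # Map enriched-document nodes (skillset output) to fields in the index.
    "outputFieldMappings": [
        {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
        {"sourceFieldName": "/document/keyPhrases", "targetFieldName": "keyPhrases"},
    ],
}


def unmapped_targets(definition: dict, index_field_names: set) -> list:
    """Return mapping targets that do not exist as fields in the index."""
    return [
        m["targetFieldName"]
        for m in definition.get("outputFieldMappings", [])
        if m["targetFieldName"] not in index_field_names
    ]


def reset_url(service: str, indexer_name: str) -> str:
    """URL for the POST request that resets an indexer before a rebuild."""
    return (
        f"https://{service}.search.windows.net/indexers/"
        f"{indexer_name}/reset?api-version={API_VERSION}"
    )


index_fields = {"id", "content", "organizations", "keyPhrases"}
missing = unmapped_targets(indexer, index_fields)
print("Missing index fields:", missing)
print(json.dumps(indexer, indent=2))
print(reset_url("demo-service", indexer["name"]))
```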
For more information about specific questions or problems, see Troubleshooting tips.