Large, unstructured datasets like the JFK Files, which contain over 34,000 pages of documents about the CIA investigation of the 1963 JFK assassination, include typewritten and handwritten notes, photos and diagrams, and other unstructured data that standard search solutions can't parse.
AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable text from images, blobs, and other unstructured data sources like the JFK Files by using pre-trained machine learning skillsets from the Cognitive Services Computer Vision and Text Analytics APIs. You can also create and attach custom skills to add special processing for domain-specific data like CIA Cryptonyms. Azure Cognitive Search can then index and search the context.
Image processing skills like Optical Character Recognition (OCR), Read, and image analysis include object and face detection, tag and caption generation, and celebrity and landmark identification. These skills create text representations of image content, which are searchable using the query capabilities of Azure Cognitive Search.
This example solution uses Azure Cognitive Search AI enrichment to extract meaning from the original complex, unstructured JFK Files dataset. You can work through the project, watch the process in action in an online video, or explore the JFK Files with an online demo.
Potential use cases
- Increase the value and utility of unstructured text and image content in search and data science apps.
- Use custom skills to integrate open-source, third-party, or first-party code into indexing pipelines.
- Make scanned BMP or JPG documents full-text searchable.
- Produce better outcomes than standard PDF text extraction for PDFs with combined image and text.
- Create new information from inherently meaningful raw content or context that's hidden in larger unstructured or semi-structured documents.
This diagram illustrates the process of passing unstructured data through the Cognitive Search skills pipeline to produce structured, indexable data.
- Blob storage provides unstructured document and image data to Cognitive Search.
- Cognitive Search applies pre-built cognitive skillsets to the data, including OCR, text and handwriting recognition, image analysis, entity recognition, and full-text search.
- The Cognitive Search extensibility mechanism uses an Azure Function to apply the CIA Cryptonyms custom skill to the data.
- The pre-built and custom skillsets deliver structured knowledge that Azure Cognitive Search can index.
Azure Cognitive Search works with other Azure components to provide this solution.
Azure Blob Storage
Azure Blob Storage is REST-based object storage for data that you can access from anywhere in the world via HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Blob storage is ideal for large amounts of unstructured data like text or graphics.
Azure Cognitive Search
Cognitive Search indexes the content and powers the user experience. You use Cognitive Search capabilities to apply pre-built cognitive skills to the content, and use the extensibility mechanism to add custom skills.
The Computer Vision API uses text recognition APIs to extract and recognize text information from images. Read uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images. OCR isn't optimized for large documents, but supports more languages. The current example uses OCR to produce data in the hOCR format.
custom skills extend Cognitive Search to apply specific enrichment transformations to content. The current example creates a custom skill to apply CIA Cryptonyms, which decode uppercased code names in CIA documents. For example, the CIA assigned the cryptonym
GPFLOORto Lee Harvey Oswald, so the custom CIA Cryptonym skill links any JFK files containing that cryptonym with Oswald.
Azure Functions is a serverless compute service that lets you run small pieces of event-triggered code without having to explicitly provision or manage infrastructure. This example uses an Azure Function method to apply the CIA Cryptonyms list to the JFK Files as a custom skill.
Azure App Service
The example solution also builds a standalone web app in Azure App Service for testing, demonstrating, searching the index, and exploring connections in the enriched and indexed documents.
Issues and considerations
- The code project and demo showcase a particular Cognitive Search use case. This example isn't intended to be a framework or scalable architecture for all scenarios, but to provide a general guideline and example.
- OCR results vary greatly depending on scan and image quality. Read uses the latest recognition models, but has less language support than OCR.
- Some scanned and native PDF formats may not parse correctly in Cognitive Search.
- The JFK Files sample project and demo create a public website and publicly readable storage container for extracted images, so don't use this solution with non-public data.