Large, unstructured datasets like the JFK Files, which contain over 34,000 pages of documents about the CIA investigation of the 1963 JFK assassination, include typewritten and handwritten notes, photos and diagrams, and other unstructured data that standard search solutions can't parse.
AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable text from images, blobs, and other unstructured data sources like the JFK Files by using pre-trained machine learning skillsets from the Cognitive Services Computer Vision and Text Analytics APIs. You can also create and attach custom skills to add special processing for domain-specific data like CIA Cryptonyms. Azure Cognitive Search can then index and search the context.
The Azure Cognitive Search skills in this example solution fall into the following categories:
Built-in image-processing skills cover optical character recognition (OCR), print extraction, and image analysis, including object and face detection, tag and caption generation, and celebrity and landmark identification. These skills create text representations of image content, which are searchable using the query capabilities of Azure Cognitive Search. Document cracking is the process of extracting or creating text content from non-text sources.
Natural language processing built-in skills like entity recognition, language detection, key phrase extraction, and text recognition map unstructured text to searchable and filterable fields in an index.
Custom skills that capture domain-specific data. These skills are built with the custom skills interface.
This example solution uses Azure Cognitive Search AI enrichment to extract meaning from the original complex, unstructured JFK Files dataset. You can work through the project, watch the process in action in an online video, or explore the JFK Files with an online demo.
Potential use cases
- Increase the value and utility of unstructured text and image content in search and data science apps.
- Use custom skills to integrate open-source, third-party, or first-party code into indexing pipelines.
- Make scanned JPG, PNG, or bitmap documents full-text searchable.
- Produce better outcomes than standard PDF text extraction for PDFs with combined image and text.
- Create new information from inherently meaningful raw content or context that's hidden in larger unstructured or semi-structured documents.
Architecture converting unstructured data to structured data
This diagram illustrates the process of passing unstructured data through the Cognitive Search skills pipeline to produce structured, indexable data.
- Blob storage provides unstructured document and image data to Cognitive Search.
- Cognitive Search applies pre-built cognitive skillsets to the data, including OCR, text and handwriting recognition, image analysis, entity recognition, and full-text search.
- The Cognitive Search extensibility mechanism uses an Azure Function to apply the CIA Cryptonyms custom skill to the data.
- The pre-built and custom skillsets deliver structured knowledge that Azure Cognitive Search can index.
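The pipeline described in the steps above can be sketched as a skillset definition. The following Python snippet is a minimal illustration, not the JFK Files project's actual configuration: it builds a dictionary in the general shape of a Cognitive Search skillset, chaining an OCR skill, a key phrase extraction skill, and a custom Web API skill. The skillset name, the `ocrText` and `cryptonyms` target names, and the function URL are assumptions made for this example. Check the `@odata.type` values and input/output names against the current product documentation before relying on them.

```python
import json


def build_skillset(custom_skill_uri: str) -> dict:
    """Return a sketch of a skillset that chains OCR, key phrases, and a custom skill.

    All names here (skillset name, target names) are hypothetical placeholders.
    """
    image_context = "/document/normalized_images/*"
    ocr_text_path = image_context + "/ocrText"
    return {
        "name": "jfk-example-skillset",  # hypothetical name
        "description": "OCR, key phrases, and a custom cryptonym lookup",
        "skills": [
            {
                # Built-in skill: extract text from each normalized image.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": image_context,
                "inputs": [{"name": "image", "source": image_context}],
                "outputs": [{"name": "text", "targetName": "ocrText"}],
            },
            {
                # Built-in skill: pull key phrases out of the OCR text.
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "context": image_context,
                "inputs": [{"name": "text", "source": ocr_text_path}],
                "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
            },
            {
                # Custom skill: call an HTTP endpoint (for example, an Azure Function).
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "context": image_context,
                "uri": custom_skill_uri,
                "httpMethod": "POST",
                "inputs": [{"name": "text", "source": ocr_text_path}],
                "outputs": [{"name": "cryptonyms", "targetName": "cryptonyms"}],
            },
        ],
    }


if __name__ == "__main__":
    # The URL below is a placeholder, not a real endpoint.
    print(json.dumps(build_skillset("https://example.invalid/api/cryptonyms"), indent=2))
```

Each skill reads from the enrichment tree path named in `source` and writes its result under `targetName`, which is how the output of the OCR step becomes the input of the later steps.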
Azure Cognitive Search works with other Azure components to provide this solution.
Azure Blob Storage
Azure Blob Storage is REST-based object storage for data that you can access from anywhere in the world via HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Blob storage is ideal for large amounts of unstructured data like text or graphics.
Azure Cognitive Search
Cognitive Search indexes the content and powers the user experience. You use Cognitive Search capabilities to apply pre-built cognitive skills to the content, and use the extensibility mechanism to add custom skills.
The Computer Vision API uses text recognition APIs to extract and recognize text information from images. Read uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images. OCR isn't optimized for large documents, but supports more languages. The current example solution uses OCR to produce data in the hOCR format.
Custom skills extend Cognitive Search to apply specific enrichment transformations to content. The current example solution creates a custom skill to apply CIA Cryptonyms, which decode uppercase code names in CIA documents. For example, the CIA assigned the cryptonym GPFLOOR to Lee Harvey Oswald, so the custom CIA Cryptonym skill links any JFK files containing that cryptonym with Oswald.
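To make the cryptonym lookup concrete, here is a minimal Python sketch of the logic a custom skill might run. It assumes the custom skill request and response shape of a `values` array whose records carry a `recordId` and a `data` object. The function name, the `text` input, the `cryptonyms` output, and the one-entry lookup table are hypothetical. The only mapping included is the GPFLOOR example given above, and this is not the code used by the JFK Files project.

```python
import re

# Illustrative lookup table. Only the GPFLOOR entry comes from the article;
# a real deployment would load a much longer list.
CRYPTONYMS = {"GPFLOOR": "Lee Harvey Oswald"}

# Cryptonyms are uppercase code names, so only all-caps tokens are candidates.
UPPERCASE_TOKEN = re.compile(r"\b[A-Z]{4,}\b")


def run_cryptonym_skill(request: dict) -> dict:
    """Map each input record's text to the cryptonyms found in it."""
    results = []
    for record in request.get("values", []):
        text = record.get("data", {}).get("text") or ""
        found = {}
        for token in UPPERCASE_TOKEN.findall(text):
            if token in CRYPTONYMS:
                found[token] = CRYPTONYMS[token]  # dict keys de-duplicate repeats
        results.append(
            {
                # Echo the recordId so the caller can match results to inputs.
                "recordId": record["recordId"],
                "data": {
                    "cryptonyms": [
                        {"value": name, "description": meaning}
                        for name, meaning in found.items()
                    ]
                },
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample = {"values": [{"recordId": "1", "data": {"text": "Memo re GPFLOOR travel."}}]}
    print(run_cryptonym_skill(sample))
```

A function like this could sit behind an HTTP-triggered Azure Function, which would parse the JSON request body, call `run_cryptonym_skill`, and return the result as JSON.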
Azure Functions is a serverless compute service that lets you run small pieces of event-triggered code without having to explicitly provision or manage infrastructure. This example solution uses an Azure Function method to apply the CIA Cryptonyms list to the JFK Files as a custom skill.
Azure App Service
This example solution also builds a standalone web app in Azure App Service for testing, demonstrating, searching the index, and exploring connections in the enriched and indexed documents.
- The code project and demo showcase a particular Cognitive Search use case. This example solution isn't intended to be a framework or scalable architecture for all scenarios, but to provide a general guideline and example.
- OCR results vary greatly depending on scan and image quality. The Computer Vision Read API uses the latest recognition models, but has less language support than OCR.
- Some scanned and native PDF formats may not parse correctly in Cognitive Search.
- The JFK Files sample project and demo create a public website and publicly readable storage container for extracted images, so don't use this solution with non-public data.
Explore the JFK dataset:
- Explore the JFK Files project on GitHub.
- Watch the process in action in an online video.
- Explore the JFK Files online demo.
Read product documentation:
- Azure Cognitive Search
- Get started with AI enrichment
- Computer Vision
- Text Analytics
- Recognize printed and handwritten text
- How to use Named Entity Recognition in Text Analytics
- Azure Blob storage
- Azure Functions
Try the Microsoft Learn path: