How to use an indexer with imageAction (generateNormalizedImagePerPage) in a search index?

Rasul 0 Reputation points
2024-04-25T16:50:06.5133333+00:00

I'm using this script https://github.com/microsoft/sample-app-aoai-chatGPT/blob/main/scripts/data_preparation.py and I want to get the page number. I found out it can be done using an indexer with the imageAction=generateNormalizedImagePerPage configuration, but I couldn't work out how to do it.


1 answer

  1. Grmacjon-MSFT 16,186 Reputation points
    2024-04-25T23:28:09.4266667+00:00

    Hello @Rasul, thanks for the question.

    The script you linked (https://github.com/microsoft/sample-app-aoai-chatGPT/blob/main/scripts/data_preparation.py) prepares data for Azure AI Search (formerly Azure Cognitive Search) and, by default, does not capture the page numbers of your documents.

    While the documentation mentions the imageAction parameter with the generateNormalizedImagePerPage value, it's intended for a specific scenario: rendering each page of a PDF as one normalized image so that skills such as OCR or image analysis can process it. On its own, it won't add page numbers to the text extracted from your documents.
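For reference, imageAction lives in the parameters.configuration block of an indexer definition. Here is a minimal sketch of such a definition as a Python dictionary, assuming the JSON shape used by the Create Indexer REST API. The function name and the indexer, data source, and index names are hypothetical placeholders.

```python
import json


def build_indexer_definition(name, data_source_name, target_index_name):
    """Return an indexer definition that renders each PDF page as a normalized image."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {
            "configuration": {
                # imageAction only takes effect together with these two settings
                "dataToExtract": "contentAndMetadata",
                "parsingMode": "default",
                "imageAction": "generateNormalizedImagePerPage",
            }
        },
    }


# Hypothetical resource names; replace them with your own.
definition = build_indexer_definition("pdf-indexer", "pdf-datasource", "pdf-index")
print(json.dumps(definition, indent=2))
```

As far as I know, each normalized image produced this way also carries a pageNumber value, but it only reaches the index if a skillset and its mappings copy it into an index field.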

    To get the page number from files ingested for Retrieval-Augmented Generation (RAG) with Azure OpenAI, you need to modify the indexing configuration and the way you process the model's output.

    Here's how you can achieve this:

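    1. Modify the indexing configuration: imageAction is an indexer setting, not an index setting, so it belongs in an indexer's parameters.configuration rather than in the index definition that create_search_index builds. In data_preparation.py, add a function next to create_search_index that defines an indexer with this setting, so that the Azure AI Search indexer generates a normalized image for each page of the ingested documents. The example below uses the azure-search-documents Python SDK and assumes the data source connection and the index already exist.
    
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexerClient
    from azure.search.documents.indexes.models import (
        IndexingParameters, IndexingParametersConfiguration, SearchIndexer)
    
    def create_search_indexer(endpoint, key, indexer_name, data_source_name, index_name):
        # imageAction is an indexer setting (parameters.configuration), not an index setting
        configuration = IndexingParametersConfiguration(
            data_to_extract="contentAndMetadata",
            image_action="generateNormalizedImagePerPage",
            query_timeout=None,  # SQL-only setting; clear it for blob data sources
        )
        indexer = SearchIndexer(
            name=indexer_name,
            data_source_name=data_source_name,  # an existing data source connection
            target_index_name=index_name,
            parameters=IndexingParameters(configuration=configuration),
        )
        client = SearchIndexerClient(endpoint, AzureKeyCredential(key))
        # A new indexer runs once on creation; use client.run_indexer(indexer_name) to rerun it
        client.create_or_update_indexer(indexer)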
    
    
    2. Access the page number in the model's output: The page number shows up in the RAG output only if it is part of the context text retrieved from the index. Modify the parse_result function to extract the page number from that context.
    
    import re
    
    def parse_result(result):
        # Split the result into the generated answer and the context;
        # partition() tolerates a missing "Context:" marker
        answer, _, context = result.partition("\n\nContext:")
        # Extract the page number from the context
        return answer, extract_page_number(context)
    
    def extract_page_number(context):
        # This depends on the format of the context returned with the answer.
        # For example, if the context looks like "Page 5: ...", you can use:
        match = re.match(r"\s*Page (\d+):", context)
        return int(match.group(1)) if match else None
    
    
    3. Modify the main script: Finally, modify the main script (app.py) to handle the page number returned by the parse_result function.
    
    def main():
        # ...
        result = deployment(query)
        answer, page_number = parse_result(result)
        print(f"Answer: {answer}")
        # Compare with None so the check does not depend on truthiness
        if page_number is not None:
            print(f"Page Number: {page_number}")
    
    

    With these changes, your application should now print the page number along with the answer, if the page number information is available in the context provided by the RAG model.

    Please note that the exact implementation details may vary depending on the format of the context string provided by the RAG model. You may need to adjust the extract_page_number function accordingly.
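To illustrate that kind of adjustment, here is a sketch of a more tolerant extractor. It assumes, purely for illustration, that the context contains one or more "Page N:" markers; the marker format and the extract_page_numbers name are hypothetical and not something the sample app is known to produce.

```python
import re

# Assumed format: the context contains markers such as "Page 5:" (case-insensitive).
PAGE_MARKER = re.compile(r"\bPage\s+(\d+)\s*:", re.IGNORECASE)


def extract_page_numbers(context):
    """Return every page number found in the context, in order, without duplicates."""
    pages = []
    for match in PAGE_MARKER.finditer(context):
        page = int(match.group(1))
        if page not in pages:
            pages.append(page)
    return pages


context = (
    "Page 5: Indexers crawl data sources.\n\n"
    "page 12 : Skillsets enrich content.\n\n"
    "Page 5: More text from the same page."
)
print(extract_page_numbers(context))
```

Returning a list rather than a single number keeps the answer useful when several retrieved chunks come from different pages.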

    Hope that helps.

    -Grace
