Hello @Rasul, thanks for the question.
The script you linked (https://github.com/Azure-Samples/openai-chat-app-quickstart) focuses on data preparation for Azure Cognitive Search, and by default, it doesn't capture page numbers within your documents.
While the documentation mentions the imageAction parameter with the generateNormalizedImagePerPage value, it's intended for a specific scenario: generating image thumbnails for each page when processing image documents. It won't directly extract page numbers from your text documents.
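For reference, in the REST form of an indexer definition this setting sits under parameters.configuration. The snippet below is a minimal sketch of that payload built as a plain Python dictionary. The helper name build_indexer_definition and the three resource names are placeholders I made up for illustration, so substitute your own.

```python
import json


def build_indexer_definition(name: str, data_source_name: str, target_index_name: str) -> dict:
    """Build an indexer definition that asks for one normalized image per page."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {
            "configuration": {
                # Image extraction works alongside content extraction
                "dataToExtract": "contentAndMetadata",
                "imageAction": "generateNormalizedImagePerPage",
            }
        },
    }


if __name__ == "__main__":
    # All three names are placeholders, not values from the sample repository
    definition = build_indexer_definition("docs-indexer", "docs-datasource", "docs-index")
    print(json.dumps(definition, indent=2))
```

The printed JSON is what you would send as the body when creating or updating the indexer, next to whatever other settings your indexer already has.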
To get the page number from the files ingested using the Azure OpenAI Retrieval-Augmented Generation (RAG) model, you need to modify the indexing configuration and the way you process the model's output.
Here's how you can achieve this:
- Modify the Indexing Configuration: In the data_preparation.py script, you need to modify the create_search_index function to include the imageAction configuration. This will ensure that the Azure Cognitive Search indexer generates normalized images for each page of the ingested documents.
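    from azure.search.documents.indexes.models import (
        IndexingParameters, IndexingParametersConfiguration, SearchIndex, SearchIndexer,
    )

    def create_search_index(index_name, index_definition):
        # Create or update the Azure Cognitive Search index
        index = SearchIndex(...)
        # Define the indexer
        indexer = SearchIndexer(...)
        # imageAction belongs to the indexer's parameters.configuration
        indexer.parameters = IndexingParameters(
            configuration=IndexingParametersConfiguration(
                image_action="generateNormalizedImagePerPage",
                # Other indexer configurations
            )
        )
        # Run the indexer (indexer_client is assumed to be an existing SearchIndexerClient)
        indexer_client.create_or_update_indexer(indexer)
        indexer_client.run_indexer(indexer.name)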
- Access Page Number in the Model's Output: In the RAG model's output, the page number information will be available as part of the context. You need to modify the parse_result function to extract the page number from the context.
    def parse_result(result):
        # Split the result into the generated answer and the context.
        # partition() avoids a ValueError when the "Context:" marker is missing.
        answer, _, context = result.partition("\n\nContext:")
        # Extract the page number from the context
        page_number = extract_page_number(context)
        return answer, page_number

    def extract_page_number(context):
        # Implement logic to extract the page number from the context string
        # This will depend on the format of the context provided by the RAG model
        # For example, if the context is "Page 5: ..." you can use:
        context = context.strip()
        if context.startswith("Page "):
            try:
                return int(context.split("Page ")[1].split(":")[0])
            except ValueError:
                # The text after "Page " was not a number
                return None
        else:
            return None
- Modify the Main Script: Finally, you need to modify the main script (app.py) to handle the page number returned by the parse_result function.
    def main():
        # ...
        result = deployment(query)
        answer, page_number = parse_result(result)
        print(f"Answer: {answer}")
        # Compare against None so that a page number of 0 is still printed
        if page_number is not None:
            print(f"Page Number: {page_number}")
With these changes, your application should now print the page number along with the answer, if the page number information is available in the context provided by the RAG model.
Please note that the exact implementation details may vary depending on the format of the context string provided by the RAG model. You may need to adjust the extract_page_number function accordingly.
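As one example of such an adjustment, here is a rough sketch of an extract_page_number variant that uses a regular expression, so the page marker can appear anywhere in the context, not only at the start. The formats it accepts ("Page 5:", "page 12", "Page: 7") are guesses on my part. Check what your deployment actually returns and change the pattern to match.

```python
import re
from typing import Optional

# Hypothetical formats only: "Page 5: ...", "... page 12", "Page: 7", "page #3"
_PAGE_PATTERN = re.compile(r"\bpage\s*[:#]?\s*(\d+)\b", re.IGNORECASE)


def extract_page_number(context: str) -> Optional[int]:
    """Return the first page number found in the context, or None if there is none."""
    match = _PAGE_PATTERN.search(context)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    samples = (
        "Page 5: Refund policy",
        "Source: handbook.pdf, page 12",
        "No page marker here",
    )
    for sample in samples:
        print(repr(sample), "->", extract_page_number(sample))
```

The leading word boundary keeps words such as "homepage 3" from being read as a page marker. If a context can cite several pages, re.findall with the same pattern would return all of them.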
Hope that helps.
-Grace