FormsRecognizer how to create ocr.json files programmatically for custom classification model

Tim Bates 20 Reputation points

To automate the creation of a new custom classification model, it appears we need to create an ocr.json file for each document before calling the BuildDocumentClassifierAsync method on the DocumentModelAdministrationClient. These files appear to contain an object with some outer details and an embedded AnalyzeResult. The documentation describes running prebuilt-layout against each file in the set before training the model. However, when we run a prebuilt-layout extraction programmatically, the AnalyzeResult that is returned and serialized has a slightly different shape from the one in the ocr files Studio creates. Some fields have different names (polygon vs boundingpolygon etc.) and some object fields also differ, even though the API version returned is the same.
The analyzeResult is also embedded in the JSON as though an outer object has been serialized, but it is not clear what that object might be.
If the shape does not match the expected JSON structure (obtained from Studio when clicking "Train"), the subsequent call to BuildDocumentClassifierAsync fails.
Is it possible to generate the same ocr.json shape programmatically? If so, how do we do this, please?

dupammi 6,815 Reputation points Microsoft Vendor

2023-10-03T13:16:31.2433333+00:00

Hi @Tim Bates ,

Thank you for reaching out to the Microsoft Q&A forum. I will be happy to assist you regarding this.

The differences you are seeing in the "content", "bounding regions" and "polygon" fields between the Studio-generated and SDK-generated JSON arise because the SDK returns information based on its automated analysis of the document's layout, whereas Studio reflects the specific regions it identified during the labelling step. Studio identifies the words, lines and tables (if any), the number of rows in each table, and each cell of the table with its content. For a table, it encloses the "polygon" data within the "bounding region". All of this is built into Studio.

To automate the ocr.json generation, you would need to implement custom logic that walks all the tables and their cells and then builds the JSON that encloses the "polygon" data within the "bounding region" entries, and so on. Please refer to the document

The Azure Document Intelligence documentation also mentions the Document Intelligence Sample Labeling tool website. Please explore that as well.

Please see below reference documents for more details:

The OCR Form Labeling Tool:

OCR Form Labeling Tool

Document Intelligence Sample Labeling tool website

Setup the sample labelling tool:

How-to: Analyze documents, Label forms, train a model, and analyze forms with Document Intelligence (formerly Form Recognizer) - Azure AI services | Microsoft Learn

Connect to sample labelling tool:

How-to: Analyze documents, Label forms, train a model, and analyze forms with Document Intelligence (formerly Form Recognizer) - Azure AI services | Microsoft Learn

Python Quickstart for labelled data:

cognitive-services-quickstart-code/python/FormRecognizer/rest/python-labeled-data.md at master · Azure-Samples/cognitive-services-quickstart-code (github.com)

dupammi 6,815 Reputation points Microsoft Vendor

2023-10-04T11:20:52.6566667+00:00

Hi @Tim Bates ,

We haven’t heard back from you on the last response and are just checking whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.

Please also check below Python code samples (versions 3.2.0 & later) to see if you can reuse those scripts as per your requirement.

Samples for Azure Form Recognizer client library for Python - Code Samples | Microsoft Learn

Tim Bates 20 Reputation points

2023-10-04T14:20:52.96+00:00

Hi @dupammi,

Thanks very much for your replies to my question, much appreciated.
My question relates to issues we were having trying to automate the creation of a new classifier model.
To add some context around the question :

Assume we have a forms recognizer and custom classification project and associated blob storage with some documents in that blob storage.
In Studio, we can then see the available documents, and choose a classification for each document.
If we do that, as files are manually classified via Studio, a file per classification is generated and added to the blob storage, so if we have classifications C1, C2, C3, and we classified a number of documents in each, we'd then see a C1.jsonl file, a C2.jsonl file, C3.jsonl being added to the blob storage.
Assuming we have met the minimum criteria of 5 docs per classification, when we click "Train", and add a model name, the model will be generated. As part of that process we see files being generated in the storage, one per blob item in the storage, so a document C1Document1.pdf, would gain a C1Document1.pdf.ocr.json file in the storage, similar for other documents.
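The naming convention described above can be sketched in Python (the language used later in this thread). This is only a minimal sketch, assuming the `<document name>.ocr.json` suffix observed on the files Studio writes; the helper names `ocr_blob_name` and `documents_missing_ocr` are hypothetical and not part of any SDK.

```python
OCR_SUFFIX = ".ocr.json"  # suffix observed on the files Studio writes (assumption)


def ocr_blob_name(document_name: str) -> str:
    """Return the name of the OCR sidecar blob for a document blob."""
    return document_name + OCR_SUFFIX


def documents_missing_ocr(blob_names):
    """List document blobs that have no matching .ocr.json sidecar yet."""
    names = set(blob_names)
    return sorted(
        name
        for name in names
        # Skip the sidecars themselves and the per-classification .jsonl lists.
        if not name.endswith((OCR_SUFFIX, ".jsonl"))
        and ocr_blob_name(name) not in names
    )
```

Given a listing of the container, `documents_missing_ocr` returns the documents that would still need a layout extraction before training.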

We are trying to automate this process and are using DocumentModelAdministrationClient to do so.
That client library offers methods for Training a new model.
It requires a dictionary object to describe the blobs to include in each classification.
However, when that runs it does not create new ocr.json files for new files added to the storage, nor does it amend the jsonl files. And this means even though the build/train method call will succeed, it won't have picked up any changes in the storage in the new model created.
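To make the "dictionary describing the blobs in each classification" more concrete, here is a rough Python sketch of the equivalent request body. The field names `classifierId`, `docTypes`, `azureBlobFileListSource`, `containerUrl` and `fileList` are my assumption of the REST shape for the classifier build call and should be checked against the API reference; the function name `classifier_build_body` is hypothetical.

```python
def classifier_build_body(classifier_id, container_url, categories):
    """Build a request body that points each classification at its .jsonl file list."""
    doc_types = {
        category: {
            "azureBlobFileListSource": {
                "containerUrl": container_url,
                # One file list per classification, e.g. C1.jsonl (assumed naming).
                "fileList": f"{category}.jsonl",
            }
        }
        for category in categories
    }
    return {"classifierId": classifier_id, "docTypes": doc_types}
```

The point of the sketch is that the build call only references the file lists; it does not create or refresh the ocr.json or jsonl files itself, which matches the behaviour described above.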

From researching, this seems to be because the ocr.json files are produced by Studio when it runs the "prebuilt-layout" extraction model for any files in the storage that don't have an ocr.json already. This happens behind the scenes in Studio (I think).

So, in our code we are trying to mimic this: for each new file we run an extraction using "prebuilt-layout" via the DocumentAnalysisClient client library, then write the layout data to a corresponding ocr.json file and upload it to the blob storage.

We are then also recreating the jsonl files for each category, so each one will contain the json to identify each blob within each category. We are using a folder name that each blob is uploaded to, where the folder name denotes the category, so we can then know what the classification should be.
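Recreating the jsonl files from the folder layout could look roughly like the following Python sketch. It assumes each line of a file list is a JSON object of the form `{"file": "<blob path>"}` and that the first path segment is the category; both are assumptions to verify against a file Studio generated, and `build_file_lists` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def build_file_lists(blob_names):
    """Map '<category>.jsonl' to its JSONL text, one {"file": ...} object per line."""
    by_category = defaultdict(list)
    for name in sorted(blob_names):
        folder, sep, _ = name.partition("/")
        # Ignore blobs outside a category folder, OCR sidecars and existing lists.
        if not sep or name.endswith((".ocr.json", ".jsonl")):
            continue
        by_category[folder].append(name)
    return {
        f"{category}.jsonl": "".join(json.dumps({"file": n}) + "\n" for n in names)
        for category, names in by_category.items()
    }
```

Each returned value can then be uploaded as the blob named by its key, replacing the previous file list for that category.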

So, a lot of background there :-)

The problem I had was getting JSON representing the layout extraction in the same shape as the JSON Studio produces automatically when it creates the ocr.json files. If the shape differs from what is expected, training the model fails with an error, whether via Studio or programmatically.

The client library returns an operation object when running an extraction, and the value of that is an AnalyzeResult.
The ocr.json file contains an analyzeResult object but embedded within some other object which we don't know.
If I serialize the AnalyzeResult object returned from the client library call, the format is different to that produced by Studio, even if the underlying data is the same.

Trying to fabricate the same shape was difficult, but possible, although very fragile if anything were to change, so not really a way forwards.

However, since I posted my question I have found a solution to this.
The operation object returned has a GetRawResponse() method.
Calling that, and then taking the value of the raw response ".Content" gives you a BinaryData object representing the layout json in the same shape as Studio produces in the ocr.json files.
If I upload that BinaryData as a new blob file with the right name, I can then programmatically build the model successfully (or in Studio).
(Note it's also important to (re)create the jsonl files to represent the classifications, otherwise Studio gets very confused when showing documents and their classifications). These files are far simpler to create though.
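The overall flow can be sketched independently of any SDK. In the Python sketch below, `analyze_raw` and `upload` are hypothetical callables supplied by the caller: the first is assumed to return the raw response body of a "prebuilt-layout" analysis for a document (the equivalent of `GetRawResponse().Content` in C#), and the second is assumed to write bytes to a named blob. Neither is a real library function.

```python
def publish_ocr_files(document_names, analyze_raw, upload):
    """Upload the raw layout JSON of each document as '<name>.ocr.json'."""
    uploaded = []
    for name in document_names:
        # Use the raw response body as-is; do not re-serialize an AnalyzeResult.
        raw_json = analyze_raw(name)
        target = name + ".ocr.json"
        upload(target, raw_json)
        uploaded.append(target)
    return uploaded
```

Keeping the service calls behind two callables means the same loop can be exercised with fakes in a test and with real clients in production.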

Hopefully, the above helps to clarify the question and the solution to it. Thanks again for your help.

dupammi 6,815 Reputation points Microsoft Vendor

2023-10-04T14:46:50.83+00:00

Hi @Tim Bates ,

Glad to know that your issue has been resolved. And thanks for sharing the detailed question and working solution, which might be beneficial to other community members reading this thread.

Marcus Denny 0 Reputation points

2023-10-26T04:02:54.9166667+00:00

Thanks so much Tim for the clear explanation - it's a bit of a pain that there are so many hoops to jump through to do this programmatically, but this approach works beautifully for me too.

Accepted answer

dupammi 6,815 Reputation points Microsoft Vendor

2023-10-05T07:10:04.2866667+00:00

Hi @Tim Bates ,

I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others ", I'll repost your solution in case you'd like to accept the answer.

Question: Is there any way to create ocr.json files programmatically for a custom classification model using FormsRecognizer?

Solution: Using the SDK, the operation object returned has a GetRawResponse() method. Calling that, and then taking the value of the raw response's ".Content", gives a BinaryData object representing the layout JSON in the same shape as Studio produces in the ocr.json files.
Once that BinaryData is uploaded as a new blob file with the right name, the model can be built programmatically using the SDK (or in Studio).

If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

I hope this helps!

If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.

Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

2 people found this answer helpful.

William Hawkins 25 Reputation points

2024-04-10T22:29:16.6566667+00:00

Hey everyone,

I am so happy I found this, my team and I have been trying to work on a solution of the same concept for the last 6 months with no success. I've currently written the following Python function:
```python
import json

from azure.ai.formrecognizer import DocumentAnalysisClient, DocumentModelAdministrationClient
from azure.core.credentials import AzureKeyCredential
from azure.core.pipeline.transport import HttpResponse
from azure.storage.blob import BlobServiceClient

endpoint = "https://hatchetai-uat.cognitiveservices.azure.com/"
key = "<redacted>"
blob_service_connection_string = 'DefaultEndpointsProtocol=https;AccountName=hatchetai;AccountKey=<redacted>;EndpointSuffix=core.windows.net'
blob_client_container_name = 'will-self-learning-ai-test-3'
blob_client_blob_name = 'clean-bill-001.pdf.ocr.json'
blob_sas_url = "https://hatchetai.blob.core.windows.net/will-self-learning-ai-test-3/clean-bill-001.pdf?sp=racwdyti&st=2024-04-10T22:03:38Z&se=2024-04-11T06:03:38Z&spr=https&sv=2022-11-02&sr=b&sig=<redacted>"
training_data_container_sas_url = 'https://hatchetai.blob.core.windows.net/will-self-learning-ai-test-3?sp=racwdli&st=2024-04-10T21:36:43Z&se=2024-04-11T05:36:43Z&spr=https&sv=2022-11-02&sr=c&sig=<redacted>'
model_id = 'emma-model-xxx'


def build_model(blob_sas_url, model_id, training_data_container_sas_url):
    # Create a client for the Form Recognizer service
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Analyze the layout of the document
    poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-layout", blob_sas_url)

    # Get the raw response
    raw_response = poller.get_raw_response()

    # Get the BinaryData object representing the layout json
    layout_binary_data = raw_response.http_response.content

    # Create a new blob with the right name
    blob_service_client = BlobServiceClient.from_connection_string(blob_service_connection_string)
    blob_client = blob_service_client.get_blob_client(blob_client_container_name, blob_client_blob_name)

    # Upload the BinaryData as a new blob file
    blob_client.upload_blob(layout_binary_data)

    # Create a client for the Form Recognizer service
    document_model_admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Now you can programmatically build the model successfully using SDK
    poller = document_model_admin_client.begin_build_document_model(build_mode='neural', blob_container_url=training_data_container_sas_url, model_id=model_id)
    model = poller.result()

    print(f"Model ID: {model.model_id}")
    print(f"Status: {model.status}")
    print(f"Created on: {model.created_on}")
    print(f"Last modified: {model.last_modified}")


build_model(blob_sas_url, model_id, training_data_container_sas_url)
```

I wrote it to try to use the layout API response to generate the ocr.json file for the training request, but I am receiving this error:
"The above exception was the direct cause of the following exception:
HttpResponseError Traceback (most recent call last)
Cell In[43], line 56
49 print(f"Last modified: {model.last_modified}")
51 #print(f"Model ID: {model.model_id}")
52 #print(f"Status: {model.status}")
53 #print(f"Created on: {model.created_on}")
54 #print(f"Last modified: {model.last_modified}")
---> 56 build_model(blob_sas_url, model_id, training_data_container_sas_url)
...
Code: InvalidArgument
Message: Invalid argument.
Exception Details: (InvalidContentSourceFormat) Invalid content source: Could not read build content.
Code: InvalidContentSourceFormat
Message: Invalid content source: Could not read build content."

I know it is pretty generic error message but seems to have something to do with the file I uploaded under the name file.pdf.ocr.json so if anyone has any insight on how to proceed that would be greatly appreciated!!!

Thanks,

Will

Tim Bates 20 Reputation points

2024-04-11T09:27:52.0966667+00:00

Hi Will,

I've not used Python, I'm afraid, and the API and the way polling is done for long-running operations are different when done from C#.
But one thing I've noticed in your code example is:

"

# Analyze the layout of the document
poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-layout", blob_sas_url)
# Get the raw response
raw_response = poller.get_raw_response()

"

I think you need to call poller.result() before you call poller.get_raw_response().
Otherwise you are getting the raw response from the last polling attempt the poller made, but not necessarily when it has completed.

Something like the following (which came out of a question I asked chatgpt about this :-))...

"

# Assume poller is an instance of LROPoller
result = poller.result()  # Wait for operation to complete
# Now that the operation is completed, retrieve the raw HTTP response
raw_response = poller.get_raw_response()

"

Not sure if this is your issue, or the fix, but worth a try.

Tim.

dupammi 6,815 Reputation points Microsoft Vendor

2024-04-11T09:58:16.9966667+00:00

Hi @William Hawkins

I would like to bring your attention to the latest thread that has arisen from and refers to this present thread.

I hope latest response from @Tim Bates and the above latest thread would help you in investigating and arriving at a solution.

Thank you.

William Hawkins 25 Reputation points

2024-04-11T15:25:34.4366667+00:00

Hey Tim and dupammi,

Thank you sincerely for the updates and suggestions. Tim, I tried your suggestion and still had issues because of the differences in the Python SDK. I was, however, able to create a workaround.
This function can successfully create the layout JSONs and upload them to the blob storage:
```python
def build_model(blob_sas_url, model_id, training_data_container_sas_url):
    # Create a client for the Form Recognizer service
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Analyze the layout of the document
    poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-layout", blob_sas_url)

    # Assume poller is an instance of LROPoller
    result = poller.result()

    # Serialize the analysis result
    # Note: Adjust the serialization based on the specific data you need from the result
    # Serialize the analysis result with the correct attribute for bounding information
    analysis_result_data = {
        "pages": [
            {
                "pageNumber": page.page_number,
                "lines": [
                    {
                        "text": line.content,
                        # Use 'bounding_polygon' or adjust based on your SDK version
                        "boundingBox": [point.__dict__ for point in line.bounding_polygon] if hasattr(line, 'bounding_polygon') else None
                    } for line in page.lines
                ]
            } for page in result.pages
        ]
    }
    result_json = json.dumps(analysis_result_data, ensure_ascii=False, indent=4)

    # Upload the JSON data to Azure Blob Storage
    blob_service_client = BlobServiceClient.from_connection_string(blob_service_connection_string)
    blob_client = blob_service_client.get_blob_client(container=blob_client_container_name, blob=blob_client_blob_name)

    # Note: This uploads or overwrites the blob with the JSON data
    blob_client.upload_blob(result_json, overwrite=True)
    print(f"Uploaded the analysis result to: {blob_client.url}")
```

However, when I try to train a model on the 5 PDFs in the container that I've done this for, using this code:
```python
def train_model(blob_sas_url, model_id, training_data_container_sas_url):
    # Create a client for the Form Recognizer service
    document_model_admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Now you can programmatically build the model successfully using SDK
    poller = document_model_admin_client.begin_build_document_model(build_mode='neural', blob_container_url=training_data_container_sas_url, model_id=model_id)
    model = poller.result()

    print(f"Model ID: {model.model_id}")
    print(f"Status: {model.status}")
    print(f"Created on: {model.created_on}")
    print(f"Last modified: {model.last_modified}")


train_model(blob_sas_url, model_id, training_data_container_sas_url)
```

And got this error again:
```
---------------------------------------------------------------------------
OperationFailed                           Traceback (most recent call last)
File c:\Users\willh\Documents\PythonProjects\willenv\Lib\site-packages\azure\core\polling\base_polling.py:757, in LROBasePolling.run(self)
    756 try:
--> 757     self._poll()
    759 except BadStatus as err:

File c:\Users\willh\Documents\PythonProjects\willenv\Lib\site-packages\azure\core\polling\base_polling.py:789, in LROBasePolling._poll(self)
    788 if _failed(self.status()):
--> 789     raise OperationFailed("Operation failed or canceled")
    791 final_get_url = self._operation.get_final_get_url(self._pipeline_response)

OperationFailed: Operation failed or canceled

The above exception was the direct cause of the following exception:

HttpResponseError                         Traceback (most recent call last)
Cell In[61], line 16
     11 print(f"Created on: {model.created_on}")
     12 print(f"Last modified: {model.last_modified}")
---> 16 train_model(blob_sas_url, model_id, training_data_container_sas_url)

Cell In[61], line 7
      5 # Now you can programmatically build the model successfully using SDK
      6 poller = document_model_admin_client.begin_build_document_model(build_mode='neural',blob_container_url=training_data_container_sas_url, model_id=model_id)
...
Code: InvalidArgument
Message: Invalid argument.
Exception Details: (InvalidContentSourceFormat) Invalid content source: Could not read build content.
    Code: InvalidContentSourceFormat
    Message: Invalid content source: Could not read build content.
```

Not sure if there's any additional suggestions you can recommend or anything that sticks out to you as to why the build content would be invalid but I appreciate any guidance you can offer:)

Thank you and all the best,
Will

dupammi 6,815 Reputation points Microsoft Vendor

2024-04-12T05:56:14.0366667+00:00

Hi @William Hawkins

The error message indicates that the operation failed or was canceled, which caused an OperationFailed exception. This exception was the direct cause of the HttpResponseError exception that occurred later. The InvalidArgument error message suggests that there is an issue with the content source, and the build content could not be read.

To resolve this issue, you may want to check the following:

Ensure that the blob_sas_url and training_data_container_sas_url parameters are correct and valid.

Verify that the content source is in the correct format and can be read by the Form Recognizer service.

Check if there are any issues with the training data, such as missing or invalid files.
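One way to check the second point locally is a small structural test on each uploaded ocr.json file. The Python sketch below assumes the file should be the raw analyze response, that is, an outer object with a `status` and an `analyzeResult` that carries `apiVersion`, `modelId`, `content` and `pages`. That key list is an assumption drawn from the raw layout response, not an official schema, and `check_ocr_json` is a hypothetical helper.

```python
import json


def check_ocr_json(text):
    """Return a list of problems found in an .ocr.json payload (empty if none)."""
    try:
        doc = json.loads(text)
    except ValueError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]
    problems = []
    if doc.get("status") != "succeeded":
        problems.append('missing or unexpected "status"')
    result = doc.get("analyzeResult")
    if not isinstance(result, dict):
        problems.append('missing "analyzeResult" object')
    else:
        # Keys assumed to be present in a raw prebuilt-layout response.
        for key in ("apiVersion", "modelId", "content", "pages"):
            if key not in result:
                problems.append(f'analyzeResult lacks "{key}"')
    return problems
```

A hand-built payload that contains only a top-level `pages` array, for example, would be reported as missing both the status and the `analyzeResult` wrapper.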

I hope this helps you troubleshoot and resolve your issue.

Thank you.

William Hawkins 25 Reputation points

2024-04-12T21:10:20.85+00:00

Hi dupammi,

Thank you for your reply.

To resolve this issue, you may want to check the following:

Ensure that the blob_sas_url and training_data_container_sas_url parameters are correct and valid. - blob_sas_url is correct because I am able to successfully generate the OCR files and I just re-tried the container SAS url with an updated SAS URL and got the same error

Verify that the content source is in the correct format and can be read by the Form Recognizer service. - I'm not sure what you mean by this, how can I verify that the format can be read by the FR service? Isn't the error indicating the file can't be read?

Check if there are any issues with the training data, such as missing or invalid files. - All files needed to train are in the container:

Unless there is an unknown issue with the labels jsons I've generated programmatically but I have been working on this for 6 months with MS support and that has not been mentioned as a probable cause of the errors I've been experiencing.

Thank you in advance for any further guidance or suggestions you can provide:)

Thanks,

Will

dupammi 6,815 Reputation points Microsoft Vendor

2024-04-13T02:07:12.1933333+00:00

Hi @William Hawkins

Thank you for providing additional information about the issue you are facing.

The error message "InvalidContentSourceFormat: Invalid content source: Could not read build content" indicates that there is an issue with the content source access you are trying to use to train the model.

As per Point 2 in this github code, while generating the SAS URL to blob storage container give these permissions:

Read

List
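The permissions granted by a SAS URL can be read from its `sp` query parameter, so a quick check is possible before starting a build. This is a minimal Python sketch, assuming the usual single-letter codes (`r` for Read, `l` for List); `missing_sas_permissions` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def missing_sas_permissions(sas_url, required="rl"):
    """Return the required permission letters that the SAS URL does not grant."""
    query = parse_qs(urlsplit(sas_url).query)
    granted = query.get("sp", [""])[0]  # e.g. "racwdli"
    return [permission for permission in required if permission not in granted]
```

An empty list means the token includes both Read and List; it says nothing about expiry or whether the signature is valid.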

Regarding the second point in my previous response: verifying that the content source is in the correct format and can be read by the Form Recognizer service means ensuring that the input files are in a format that is supported by the service and can be processed successfully. You may want to check that the input files are not corrupted or damaged.

Since you have already confirmed that the input files are in the correct format and the blob_sas_url is correct, it is possible that there is an issue with the training data or the labels JSON files. You may want to review the labels JSON files and ensure that they are in the correct format and contain the required information. As part of debugging, you may also try using the pre-built models provided by Form Recognizer to see if the issue persists.

Please also have a look at a similar thread, which might assist you in debugging your code.

If the issue persists, you may want to contact Azure support again and provide them with the details of the issue and the steps you have taken so far to troubleshoot it. They may be able to provide additional guidance and assistance in resolving the issue.

I hope this helps.

William Hawkins 25 Reputation points

2024-04-26T19:58:53.9366667+00:00

Hey dupammi,

Thank you for the help and apologies for the delay. This thread helped me resolve the issue generating the OCR JSON files, so thank you, that's great: https://github.com/Azure/azure-sdk-for-python/issues/34370

I am still not able to train a model, however, so I think this is because of my labels files. I've attached clean-bill-001.pdf.labels copy.txt as an example of the labels JSON files I've generated. Do you see anything that could cause the model to fail to train?

Thanks in advance for your support:)

All the best,

Will

dupammi 6,815 Reputation points Microsoft Vendor

2024-04-26T22:57:10.6333333+00:00

Hi @William Hawkins

Thank you for providing the details.

To address your question:

Firstly, ensure that both the labels.json and ocr.json files are generated for each training document. Additionally, you should find a file named fields.json, which consolidates all the fields created for your model. These files are crucial for the training process and should be present and correctly structured.

Secondly, you have the flexibility to edit the extracted files from the backend or directly manipulate them. Any modifications made either through the tool interface or the files themselves prompt an option to retrain. This action initiates the creation of a new model, incorporating the updated data, which can then be seamlessly integrated into your application for data extraction as needed.

As part of troubleshooting the training process, consider these steps:

Double-check the structure and format of your training data, ensuring that both the labels JSON files, associated OCR results and fields.json are correctly formatted and contain the necessary information.

Verify that the paths specified for the training data, including both labels, fields and OCR results, are accurate and accessible.

Experiment with different API versions to determine if one version provides more reliable performance for your specific use case.
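The presence check for these files can be scripted. The Python sketch below assumes the naming `<document>.labels.json` and `<document>.ocr.json` next to each training document, plus a single `fields.json` in the same listing; the helper name `missing_training_files` and the default `.pdf` filter are hypothetical choices for illustration.

```python
def missing_training_files(blob_names, doc_suffixes=(".pdf",)):
    """List the companion files that are absent for a set of training blobs."""
    names = set(blob_names)
    missing = [] if "fields.json" in names else ["fields.json"]
    for name in sorted(names):
        if name.lower().endswith(doc_suffixes):
            # Every training document is expected to have both sidecars.
            for suffix in (".labels.json", ".ocr.json"):
                if name + suffix not in names:
                    missing.append(name + suffix)
    return missing
```

This only confirms that the files exist; their contents still need to be checked separately.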

If the issue persists, you may want to contact Azure support again and provide them with the details of the model training issue and the steps you have taken so far to troubleshoot it. They may be able to provide additional guidance and assistance in resolving the issue.

I hope this helps.