JakubLubowicki-1610 asked

Azure Form Recognizer duplicating text extracted from PDF

When extracting values with Azure Form Recognizer, many values come back duplicated.

I have trained a custom model, labelling the appropriate key values. I find that the OCR duplicates the boxes, so when I label with the sample labeling tool I often get one box inside another. I need to pick one and deselect the other to avoid the value showing up duplicated.

When I run the model on a new PDF, I also get duplicated values for many keys.

Furthermore, on inspecting the result JSON I can see that many Lines have nested or overlapping bounding boxes. Normally you would expect a Line to have a bounding box and associated text, and to contain "Words" whose bounding boxes sit inside the Line's bounding box.

Just to clarify: in the JSON I am seeing Lines whose bounding boxes, and therefore their text, overlap or are nested inside one another.
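To pin down which Lines are affected, a small script along these lines could list every pair of Lines on a page whose boxes overlap heavily. This is only a sketch under assumptions: it assumes the result JSON follows the v2.x layout (`analyzeResult` → `readResults` → `lines`, each with `text` and an 8-number `boundingBox`), and the function names and the 0.8 threshold are made up for illustration. It only reports the overlaps; it does not explain why they occur.

```python
from itertools import combinations


def to_rect(bounding_box):
    """Reduce an 8-number polygon [x1, y1, ..., x4, y4] to (left, top, right, bottom)."""
    xs = bounding_box[0::2]
    ys = bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle (0.0 to 1.0)."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        return 0.0  # the rectangles do not intersect
    inter = (right - left) * (bottom - top)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller > 0 else 0.0


def find_overlapping_lines(result, threshold=0.8):
    """Return (page, text_a, text_b) for each pair of Lines that mostly overlap.

    Assumed layout: result["analyzeResult"]["readResults"][i]["lines"][j]
    with "text" and "boundingBox" keys. The threshold is a hypothetical default.
    """
    hits = []
    for page in result.get("analyzeResult", {}).get("readResults", []):
        rects = [
            (line.get("text", ""), to_rect(line["boundingBox"]))
            for line in page.get("lines", [])
        ]
        for (text_a, rect_a), (text_b, rect_b) in combinations(rects, 2):
            if overlap_ratio(rect_a, rect_b) >= threshold:
                hits.append((page.get("page"), text_a, text_b))
    return hits
```

Loading the saved response with `json.load` and passing it to `find_overlapping_lines` gives a list of suspect pairs that can be compared against the duplicated values. Dividing by the smaller box's area means a fully nested box scores 1.0 regardless of how large the outer box is.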

Any clues as to why this happens?

azure-form-recognizer

@JakubLubowicki-1610 Thanks for the question. Could you please share the sample input document so we can check this? Please also share a screenshot and the JSON response that you are getting.
Please follow the document to Train a custom model using the sample labeling tool.



1 Answer

JakubLubowicki-1610 answered

data-00000004.pdf (98.2 KiB)
json.png (32.0 KiB)