BenediktSchmaler-0060 asked NetaHaiby-1731 commented

Form Recognizer: PDFs with own temporary fonts are not recognized correctly

It seems that Form Recognizer does not correctly recognize PDF files created with custom temporary fonts.

For example, I have a file that was created with a custom font. In the PDF file, the text looks like this:
99417-grafik.png

But the recognition returns this result:
?hZkd]’ Jej[dj_WbWki]b[Y^ kdZ iedij][ MY^kjpcWydW^c[d

I get the same result when I copy this text from the PDF file and paste it into a text editor.
As far as I can tell, the recognition in this case does not run OCR on the rendered page but uses the plain text embedded in the PDF file, which, because of the font, is not decoded correctly.

Do you know if there are any plans for future releases to recognize text in unknown fonts when those fonts are embedded in the PDF file?

azure-form-recognizer

@BenediktSchmaler-0060 Thanks for reporting. This is an interesting observation; there appears to be no documentation about a limitation on custom fonts in Form Recognizer.
Does it work if you change the font to a standard font? Would it be possible to share this document so that we can pass it on to the team for review? Thanks!

Unfortunately, the file belongs to our client and I can't share it here publicly, but I could make it available in a closed area.
If I use other files with a standard font, the service works fine.

Can you please try copying the text from the PDF in a PDF viewer and pasting it into a text editor? You will probably get the same garbled text.

We usually see this type of PDF in one of two situations: either the originator produced the PDF incorrectly, so the important information about the font's character mapping is missing from the PDF, or, in most cases, the originator did this deliberately to obfuscate the text as a protection mechanism that prevents a reader from copying and pasting the text data.

Copy and paste results in the same outcome.

But in this particular case, when the PDF file contains temporary fonts, wouldn't it be better to convert the pages to an image format, for example, and then run the recognition on the images? That would prevent garbled text from being returned as the recognition result.

Form Recognizer adheres to the PDF restrictions and originator settings. You can convert these PDFs to images and then send the images to Form Recognizer, but when it processes a PDF directly, Form Recognizer will adhere to the PDF's security settings and restrictions.
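The image workaround described above could be sketched roughly as follows. This is not an official recipe: it assumes the third-party PyMuPDF package (`fitz`) for rendering pages and the `azure-ai-formrecognizer` package (version 3.2 or later) with the `prebuilt-read` model. The helper names, the endpoint, and the key are placeholders made up for illustration.

```python
from typing import Callable, Iterable, List


def render_pages_with_pymupdf(pdf_path: str, dpi: int = 300) -> Iterable[bytes]:
    """Render each PDF page to PNG bytes (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF, third-party; imported lazily

    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.get_pixmap(dpi=dpi).tobytes("png")


def make_azure_analyzer(endpoint: str, key: str) -> Callable[[bytes], str]:
    """Build a function that sends one image to the service and returns its text."""
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

    def analyze(png_bytes: bytes) -> str:
        # Images carry no embedded text layer, so the text must come from OCR.
        poller = client.begin_analyze_document("prebuilt-read", document=png_bytes)
        return poller.result().content

    return analyze


def ocr_pdf_via_images(
    pages: Iterable[bytes], analyze: Callable[[bytes], str]
) -> List[str]:
    """Run the analyzer on every rendered page and collect the text per page."""
    return [analyze(png) for png in pages]
```

A call might look like `ocr_pdf_via_images(render_pages_with_pymupdf("input.pdf"), make_azure_analyzer(endpoint, key))`, where `input.pdf`, `endpoint`, and `key` are your own values. Keeping the renderer and the analyzer as separate arguments makes it easy to swap in another rasterizer or to test the flow without calling the service.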

0 Answers