HaiquanLi-0222 asked:

Text content missing from the returned PageResult in the Java SDK

Hi, I am testing the Computer Vision Java SDK with two similar PDF files (one English, one French). Some strings that are visible in the PDF are not reported in PageResult for one version of the file, although they are read properly from the other version. My assumption is that every visible string should be reported in the PageResult object.

For example, at the end of each page there is a version string, 3885A (11/20), in both versions of the PDF. The Computer Vision Java SDK returns this string when testing with v1.pdf, but not with v2.pdf.

Could someone help with this issue and find out why some strings are missing even though they are visible in the PDF? v1.pdf and v2.pdf are attached for reference.

Thanks,
Jonathan

Attachments:
78822-v1.pdf: /answers/storage/attachments/78822-v1.pdf
78823-v2.pdf: /answers/storage/attachments/78823-v2.pdf

azure-computer-vision
v1.pdf (335.4 KiB)
v2.pdf (360.8 KiB)

Hello,

I hope you have solved the problem. Please let us know if you have more questions. Thanks.

Regards,
Yutong


1 Answer

YutongTie-MSFT answered:

Thanks for reaching out to us. Unfortunately, I cannot open your PDF files. Could you please upload them again?

In the meantime, here are two products I would recommend if you are trying to extract text from PDFs:

  1. Form recognizer https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/

  2. Read API https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#read-api
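
Whichever service is used, it can help to pin down exactly which strings are dropped by diffing the text lines returned for the two files. The following is a minimal sketch, assuming the lines from each result have already been collected into plain lists of strings. The class and method names (`MissingLineDiff`, `missingFrom`, `normalize`) are hypothetical, no SDK calls are shown, and the "Page 1" entries are placeholder data. Only the string 3885A (11/20) comes from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of any Azure SDK.
class MissingLineDiff {

    // Trim and collapse whitespace runs so layout differences are not reported as mismatches.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Returns the distinct, non-empty lines of `reference` (in original order)
    // that have no normalized match in `candidate`.
    static List<String> missingFrom(List<String> reference, List<String> candidate) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : candidate) {
            seen.add(normalize(line));
        }
        Set<String> missing = new LinkedHashSet<>();
        for (String line : reference) {
            String normalized = normalize(line);
            if (!normalized.isEmpty() && !seen.contains(normalized)) {
                missing.add(normalized);
            }
        }
        return new ArrayList<>(missing);
    }

    public static void main(String[] args) {
        // Placeholder data standing in for the lines extracted from v1.pdf and v2.pdf.
        List<String> v1Lines = Arrays.asList("Page 1", "3885A (11/20)");
        List<String> v2Lines = Arrays.asList("Page 1");
        System.out.println(missingFrom(v1Lines, v2Lines)); // prints [3885A (11/20)]
    }
}
```

Posting the resulting list of missing strings alongside the files makes it easier for others to reproduce the problem.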

Thanks.

Regards,
Yutong
