JuliaSizova-4527 asked ramr-msft commented

Duplicated words returned from computer vision API

Hi, I'm using the Read API to extract typed and handwritten text from PDFs. When the PDF is a plain scan, everything works as expected. However, if the PDF has already been OCRed, the JSON response contains duplicated words and phrases, and some of the duplicates contain typos (example attached). The duplicates appear on the same line. If I convert such a PDF to an image first, the problem doesn't occur. Is there a way to avoid this conversion step, for example by passing an additional argument to the API? We can't control the type of PDF that is sent to us.

Attached is an example screenshot of the output with duplications: 95564-output-example.jpg
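One way to locate the duplicates programmatically is to look for words on the same line whose bounding boxes overlap heavily. The sketch below is only a diagnostic aid and rests on an assumption the thread does not confirm: that a duplicated word occupies roughly the same region of the page as the word it duplicates. The helper names (`rect`, `overlap_ratio`, `suspected_duplicates`) and the 0.5 threshold are hypothetical; the `words`, `text`, and `boundingBox` fields follow the Read API v3.0 response layout, where `boundingBox` is a list of eight numbers (four x, y corner pairs).

```python
def rect(bounding_box):
    """Reduce an 8-number quadrilateral to an axis-aligned (x0, y0, x1, y1) box."""
    xs, ys = bounding_box[0::2], bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def overlap_ratio(box_a, box_b):
    """Intersection area divided by the area of the smaller box (0.0 if disjoint)."""
    ax0, ay0, ax1, ay1 = rect(box_a)
    bx0, by0, bx1, by1 = rect(box_b)
    width = min(ax1, bx1) - max(ax0, bx0)
    height = min(ay1, by1) - max(ay0, by0)
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min((ax1 - ax0) * (ay1 - ay0), (bx1 - bx0) * (by1 - by0))
    return (width * height) / smaller


def suspected_duplicates(line, threshold=0.5):
    """Return pairs of word texts on one line whose boxes overlap at least `threshold`."""
    words = line.get("words", [])
    pairs = []
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            ratio = overlap_ratio(first["boundingBox"], second["boundingBox"])
            if ratio >= threshold:
                pairs.append((first["text"], second["text"]))
    return pairs
```

Running `suspected_duplicates` over every entry in a page's `lines` list would give a quick count of affected lines, which may be useful detail to include in a support request.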


azure-computer-vision
output-example.jpg (357.6 KiB)

1 Answer

ramr-msft answered ramr-msft commented

@JuliaSizova-4527 Thanks for the question. Could you please share a sample of the already-OCRed PDF you are working with, and add more details about which Read API and OCR API versions you are using?
The Computer Vision Read API is Azure's latest OCR technology (learn what's new) that extracts printed text (in several languages), handwritten text (English only), digits, and currency symbols from images and multi-page PDF documents. It's optimized to extract text from text-heavy images and multi-page PDF documents with mixed languages. It supports detecting both printed and handwritten text in the same image or document.

Please follow the Read API v3.2 reference: https://centraluseuap.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/5d986960601faab4bf452005



Unfortunately, I can't share this document because it contains sensitive information. I've tried several ways to reproduce the problem with a redacted document, but didn't succeed. If I black out the sensitive information in the original PDF, it is no longer visible in the PDF itself but is still returned by the Azure API. I've also tried converting the document to an image, removing the sensitive information, and running OCR again, but then the duplication problem disappears. A newly made-up example document doesn't reproduce the problem either.

The API endpoints we are using are 'https://{name}/vision/v3.0/read/analyze?language=en' to submit the request and 'https://{name}/vision/v3.0/read/analyzeResults/' to get the response.
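For readers unfamiliar with the two-step flow behind these endpoints, here is a minimal sketch of it using only the Python standard library. It is not the asker's actual code. It assumes the PDF is sent as raw bytes with `Content-Type: application/octet-stream`, that authentication uses the `Ocp-Apim-Subscription-Key` header, and that the polling URL is read from the `Operation-Location` response header. The function names, polling interval, and retry limit are illustrative choices.

```python
import json
import time
import urllib.request


def build_submit_request(endpoint, key, pdf_bytes, language="en"):
    """Build the POST request that submits a document to the v3.0 Read API."""
    url = f"{endpoint.rstrip('/')}/vision/v3.0/read/analyze?language={language}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")


def submit(endpoint, key, pdf_bytes):
    """Submit the document and return the polling URL from Operation-Location."""
    request = build_submit_request(endpoint, key, pdf_bytes)
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def poll(operation_url, key, interval=1.0, max_tries=60):
    """Poll the analyzeResults URL until the operation succeeds or fails."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_tries):
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval)  # still "notStarted" or "running"
    raise TimeoutError("Read operation did not finish in time")


def extract_lines(result):
    """Flatten the recognized line texts from every page of a finished result."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

A typical call sequence would be `url = submit("https://{name}", key, pdf_bytes)`, then `result = poll(url, key)`, then `extract_lines(result)`. Comparing the output of `extract_lines` for the original PDF and for its image conversion is one way to document the duplication for a support engineer.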


@JuliaSizova-4527 Thanks for the update. We recommend raising an Azure support ticket from the Help + support blade in the Azure portal for your resource, if you have a support plan for your subscription. This will let you share the details securely and work with an engineer who can provide more insight into the issue and check whether it can be replicated.
