Azure Document Intelligence crashing when trying to read unknown characters

PrivaC 0

Hi, I'm trying to parse data from an image that's in Bengali, the native language of Bangladesh. While using Document Intelligence to parse the information, I'm getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 76-84: character maps to <undefined>

which made me believe that it's not finding the Bengali characters, so I tried a fully English image and it worked fine. The document I want to read has both Bengali and Latin characters. I was wondering if there is a way to ignore unknown Bengali characters while parsing. Thank you in advance.

dupammi 7,140 Reputation points Microsoft Vendor

2024-05-08T10:07:51.6233333+00:00

Hi @PrivaC

Thank you for the question.

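Based on the error message you provided, a Unicode encoding error is occurring when the Bengali characters from your image are handled. This may be caused by Azure Document Intelligence not recognizing the Bengali characters properly during parsing. Note, however, that `UnicodeEncodeError` with the `'charmap'` codec is raised by Python itself when text is encoded for output (for example, printed to a Windows console or written to a file that is not opened as UTF-8), so the failure may also happen after recognition, at the point where the result is printed or saved.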
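To check whether the error comes from the output encoding rather than from the service, here is a minimal sketch that needs no Azure call. The `sample` string and the `can_encode` helper are hypothetical stand-ins for the text returned by the service, and `cp1252` is assumed as the console or file encoding because it is a common Windows default that reports itself as `'charmap'` in this error.

```python
import io

# Hypothetical OCR output mixing Latin and Bengali characters.
sample = "Name: করিম"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 has no Bengali characters, so encoding fails; UTF-8 covers all of Unicode.
print(can_encode(sample, "cp1252"))
print(can_encode(sample, "utf-8"))

# Writing through a stream that is explicitly UTF-8 avoids the error.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(sample)
writer.flush()
round_tripped = buffer.getvalue().decode("utf-8")
```

If this matches your situation, opening output files with `open(path, "w", encoding="utf-8")` or calling `sys.stdout.reconfigure(encoding="utf-8")` before printing (Python 3.7+) should keep the Bengali text intact instead of dropping it.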

One solution you can try is to explicitly specify the language of the document as Bengali. Additionally, you can ignore unknown characters by setting up custom preprocessing steps before parsing. However, Azure Document Intelligence may not have built-in functionality to ignore specific characters from unsupported languages during parsing.
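If you really do want to discard the Bengali characters, one option is to filter the extracted text after parsing, before printing or saving it. The sketch below is plain Python and not a Document Intelligence feature; `keep_latin`, `drop_bengali`, and the `sample` string are hypothetical names used only for illustration.

```python
def keep_latin(text: str) -> str:
    """Drop every character that has no ASCII representation."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def drop_bengali(text: str) -> str:
    """Remove only code points in the Unicode Bengali block (U+0980 to U+09FF)."""
    return "".join(ch for ch in text if not "\u0980" <= ch <= "\u09ff")


# Hypothetical extracted text with both scripts.
sample = "Name: করিম, ID 42"

latin_only = keep_latin(sample)
without_bengali = drop_bengali(sample)
```

`keep_latin` is the bluntest choice, since it also strips accented Latin letters and typographic punctuation; `drop_bengali` leaves every other script untouched. Passing `errors="replace"` instead of `errors="ignore"` would substitute `?` for each dropped character, which keeps the gaps visible.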

Another solution you can try is using Azure Cognitive Services Computer Vision to extract text from the image and then use Azure Cognitive Services Text Analytics to analyze the extracted text.

I hope this helps.

dupammi 7,140 Reputation points Microsoft Vendor

2024-05-09T03:16:03.24+00:00

Hi @PrivaC

We haven’t heard from you since the last response and are just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please send the image you were using so that I can get in touch with the internal team to check and provide their analysis.
