Document Intelligence Custom Extraction Model Merging Cells and Unexpected Behavior

Syed Umair Hasan 90 Reputation points
2024-04-30T15:22:35.8566667+00:00

When processing certain files, Document Intelligence behaves unexpectedly. I've labeled tables from PDFs and trained a custom model, but it produces incorrect output on some files, even though they were included in the training data. Specifically, Document Intelligence merges two cells in some PDF files. Whenever it merges cells, a "\n" delimiter appears in the content field of that cell, which suggests the model recognizes that the values belong in two separate cells. This delimiter appears only when the problem occurs; otherwise there is no "\n" in the content, even when a bounding box spans multiple lines.

I tried relabeling duplicates of the problematic files, but there was no improvement: the output remains the same for those specific files. Similar files are extracted correctly when tested, so the behavior seems to occur randomly on only some files.

Additionally, the error disappeared for some files when I tested yesterday, but it persists for others.

I'm using API version 2024-02-29 (Preview) and a dynamic table with 7 columns. I labeled each table correctly and completely using the shift-select method. The training dataset consists of more than 90 files.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. VasaviLankipalle-MSFT 14,831 Reputation points
    2024-04-30T18:46:29.42+00:00

    Hello @Syed Umair Hasan , Thanks for using Microsoft Q&A Platform.

    I have seen a similar issue. The product team is aware of this behavior.

    This is a current limitation of the model. There is a certain probability of incorrect cell merging in tables with a large aspect ratio and small character spacing.

    Currently, we do not have any ETA for this fix.

    Some temporary workarounds to consider: use the latest API version, try a custom neural model if that fits your use case, and label the data in Document Intelligence Studio, which might help. You can also apply your own post-processing to the extracted output to get the desired result.

    If you require further assistance, please raise a support ticket in the Azure portal.

    I hope this helps.

    Regards,

    Vasavi
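
As an illustration of the post-processing workaround mentioned in the answer, here is a minimal sketch in Python. It is not an official fix and makes several assumptions: each extracted table row has already been converted to a plain dictionary mapping column name to cell content, the column names in `COLUMNS` are placeholders for your own labels, and a merged cell leaves the column to its right empty. The helper `split_merged_cells` is hypothetical and not part of any SDK.

```python
from typing import Dict, List, Optional

# Hypothetical column names; replace with the labels used in your own model.
COLUMNS = ["Date", "Description", "Amount"]

Row = Dict[str, Optional[str]]


def split_merged_cells(row: Row, columns: List[str]) -> Row:
    """Return a copy of `row` in which a cell containing "\\n" is split.

    Assumption: when two cells were merged, the combined text sits in the
    left column, separated by a newline, and the right neighbour is empty.
    Cells whose right neighbour already has content are left untouched.
    """
    fixed: Row = {name: row.get(name) for name in columns}
    for i in range(len(columns) - 1):
        left, right = columns[i], columns[i + 1]
        content = fixed[left]
        if content and "\n" in content and not fixed[right]:
            # Split only on the first newline; the remainder moves right.
            first, rest = content.split("\n", 1)
            fixed[left] = first.strip()
            fixed[right] = rest.strip()
    return fixed


merged = {"Date": "2024-04-30\nInvoice 17", "Description": None, "Amount": "12.50"}
repaired = split_merged_cells(merged, COLUMNS)
```

Because the loop moves left to right, a remainder that still contains a newline is split again if the following column is also empty. If your merged cells do not leave an empty neighbour, this heuristic will not apply and a different rule (for example, one based on bounding regions) would be needed.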