Extracting text and interleaved figures from a scanned PDF

Ken Forbus 1 Reputation point
2021-04-27T18:50:13.777+00:00

I'm using material from a scanned book in an experiment (with publisher permission, of course). The book predates ebooks, so what I have is high-quality scans of every page. The book is novel in that (a) there is on average at least one image per page, often several, and (b) the images are not delimited by boxes, nor do they have figure numbers. It's a popular science book, so, for example, textual labels in the images are handwritten. I'm trying to figure out a good way of extracting both the text and the images from each page, ideally into something civilized like JSON or XML, with the rough sequential ordering on the page preserved. Does anyone know of a good method for this? Thanks.

Azure AI Language
An Azure service that provides natural language capabilities including sentiment analysis, entity extraction, and automated question answering.

1 answer

  1. YutongTie-MSFT 46,996 Reputation points
    2021-04-28T04:27:02.79+00:00

    Hello,

    The Computer Vision Read API is Azure's latest OCR technology. It extracts printed text (in several languages), handwritten text (English only), digits, and currency symbols from images and multi-page PDF documents. It is optimized for text-heavy images and multi-page PDF documents with mixed languages, and it can detect both printed and handwritten text in the same image or document.

    The Read API includes the following features.

    Print text extraction in 73 languages
    Handwritten text extraction in English
    Text lines and words with location and confidence scores
    No language identification required
    Support for mixed languages, mixed mode (print and handwritten)
    Select pages and page ranges from large, multi-page documents
    Natural reading order for text lines
    Handwriting classification for text lines
    Available as a Distroless Docker container for on-premises deployment
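
As a rough illustration of how the line-level output listed above might be turned into the JSON the question asks for, here is a minimal Python sketch. It makes no network calls. It assumes a result dictionary shaped like the Read v3.2 response as I understand it (`analyzeResult.readResults[].lines[]` with `boundingBox`, `text`, and `appearance.style.name`); please check those field names against the linked documentation. The `sample` data and the `flatten_read_result` helper are hypothetical and exist only for illustration.

```python
import json


def flatten_read_result(result):
    """Flatten a Read-style result dict into an ordered list of line records."""
    records = []
    # Assumed layout: result["analyzeResult"]["readResults"] is a list of pages.
    pages = result.get("analyzeResult", {}).get("readResults", [])
    for page in pages:
        # Keep the order in which the lines were returned for each page.
        for order, line in enumerate(page.get("lines", [])):
            # Assumed: boundingBox is [x1, y1, x2, y2, x3, y3, x4, y4].
            box = line.get("boundingBox", [])
            xs, ys = box[0::2], box[1::2]
            style = line.get("appearance", {}).get("style", {}).get("name")
            records.append({
                "page": page.get("page"),
                "order": order,
                "text": line.get("text", ""),
                # Axis-aligned box: [left, top, right, bottom].
                "bbox": [min(xs), min(ys), max(xs), max(ys)] if box else None,
                "handwritten": style == "handwriting",
            })
    return records


# Hypothetical sample data, not real service output.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "lines": [
                    {
                        "boundingBox": [10, 20, 200, 20, 200, 40, 10, 40],
                        "text": "The water cycle",
                        "appearance": {"style": {"name": "other", "confidence": 0.9}},
                    },
                    {
                        "boundingBox": [50, 300, 120, 298, 121, 320, 51, 322],
                        "text": "cloud",
                        "appearance": {"style": {"name": "handwriting", "confidence": 0.8}},
                    },
                ],
            }
        ]
    },
}

records = flatten_read_result(sample)
print(json.dumps(records, indent=2))
```

Each record keeps the page number, the position of the line in the returned order, and a simple bounding box, so the rough sequence on the page is preserved. The Read API returns text only, so the figures themselves would still have to be located and cropped by some other means.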

    I think this is a good fit for a novel, since a novel is a text-heavy document.

    https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview-ocr#read-api

    Regards,
    Yutong
