Form Recognizer Read OCR model

Form Recognizer v3.0 preview includes the new Read Optical Character Recognition (OCR) model. The Read OCR model extracts printed and handwritten text, including mixed languages, from documents. It detects lines, words, locations, and languages, and it is the core of all other Form Recognizer models: layout, general document, custom, and prebuilt models all use the Read OCR model as a foundation for extracting text from documents.

Supported document types

Model | Images | PDF | TIFF | Word | Excel | PowerPoint | HTML
Read | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Data extraction

Read model | Text | Language detection
prebuilt-read | ✓ | ✓

Development options

The following resources are supported by Form Recognizer v3.0:

Feature Resources Model ID
Read model prebuilt-read

Try Form Recognizer

Try extracting text from forms and documents using the Form Recognizer Studio. You'll need the following assets:

  • An Azure subscription. You can create one for free.

  • A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.

Screenshot: keys and endpoint location in the Azure portal.

Form Recognizer Studio (preview)

Note

Currently, Form Recognizer Studio doesn't support Microsoft Word, Excel, PowerPoint, or HTML file formats in the Read preview.

Sample form processed with Form Recognizer Studio

Screenshot: Read processing in Form Recognizer Studio.

  1. On the Form Recognizer Studio home page, select Read.

  2. You can analyze the sample document or select the + Add button to upload your own sample.

  3. Select the Analyze button:

    Screenshot: analyze read menu.

Input requirements

  • Supported file formats: JPEG/JPG, PNG, BMP, TIFF, and PDF (text-embedded or scanned). Additionally, the newest API version, 2022-06-30-preview, supports Microsoft Word (DOCX), Excel (XLS), PowerPoint (PPT), and HTML files.
  • For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).
  • The file size must be less than 500 MB for the paid (S0) tier and less than 4 MB for the free (F0) tier.
  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 image. This height corresponds to about 8-point text at 150 DPI.
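The limits above can be checked locally before a file is uploaded. The following is a minimal sketch, not part of the service or its SDK: the function name check_input, the choice of binary megabytes, and the inclusive reading of the image dimension bounds are all assumptions made for illustration.

```python
from typing import List, Optional

MB = 1024 * 1024  # assumption: the documented limits are read as binary megabytes

# Limits copied from the input requirements list (F0 = free tier, S0 = paid tier).
MAX_FILE_BYTES = {"F0": 4 * MB, "S0": 500 * MB}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10_000  # assumption: bounds are inclusive
MAX_PAGES = 2000  # applies to PDF and TIFF


def check_input(tier: str, size_bytes: int,
                width: Optional[int] = None, height: Optional[int] = None,
                page_count: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means no documented limit was hit."""
    if tier not in MAX_FILE_BYTES:
        raise ValueError(f"unknown tier: {tier!r}")

    problems = []
    # The file size must be strictly less than the tier limit.
    if size_bytes >= MAX_FILE_BYTES[tier]:
        problems.append(f"file is not smaller than {MAX_FILE_BYTES[tier] // MB} MB")
    # Image dimensions are only checked when both sides are known.
    if width is not None and height is not None:
        if not all(MIN_SIDE_PX <= side <= MAX_SIDE_PX for side in (width, height)):
            problems.append("image dimensions are outside 50 x 50 to 10,000 x 10,000 pixels")
    # Page count is only meaningful for multi-page formats such as PDF and TIFF.
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"more than {MAX_PAGES} pages")
    return problems
```

For example, check_input("F0", 5 * MB) reports one problem because the file exceeds the free tier size limit. The sketch does not model the free tier behavior of processing only the first two pages, since that is a truncation rather than a rejection.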

Supported languages and locales

The Form Recognizer preview version supports several languages for the Read model. See our Language Support for a complete list of supported handwritten and printed languages.

Data detection and extraction

Pages

With the added support for Microsoft Word, Excel, PowerPoint, and HTML files, the page units in the model output are computed as shown:

File format | Computed page unit | Total pages
Images | Each image = 1 page unit | Total images
PDF | Each page in the PDF = 1 page unit | Total pages in the PDF
Word | Up to 3,000 characters = 1 page unit; each embedded image = 1 page unit | Total pages of up to 3,000 characters each + total embedded images
Excel | Each worksheet = 1 page unit; each embedded image = 1 page unit | Total worksheets + total images
PowerPoint | Each slide = 1 page unit; each embedded image = 1 page unit | Total slides + total images
HTML | Up to 3,000 characters = 1 page unit; embedded or linked images not supported | Total pages of up to 3,000 characters each
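The page unit rules in the table can be written out as a small calculation. This is an illustrative sketch only: the function page_units and its parameter names are hypothetical, and rounding a partial block of characters up to a full page unit is an assumption based on the "up to 3,000 characters" wording.

```python
import math

CHARS_PER_PAGE_UNIT = 3000  # from the table: up to 3,000 characters = 1 page unit


def page_units(file_format: str, *, images: int = 0, pdf_pages: int = 0,
               characters: int = 0, worksheets: int = 0, slides: int = 0,
               embedded_images: int = 0) -> int:
    """Estimate the page units for one document, following the table row by row."""
    fmt = file_format.lower()
    # Assumption: a partial block of characters still counts as one page unit.
    text_units = math.ceil(characters / CHARS_PER_PAGE_UNIT)

    if fmt == "image":
        return images
    if fmt == "pdf":
        return pdf_pages
    if fmt == "word":
        return text_units + embedded_images
    if fmt == "excel":
        return worksheets + embedded_images
    if fmt == "powerpoint":
        return slides + embedded_images
    if fmt == "html":
        # Embedded or linked images are not supported, so they add nothing.
        return text_units
    raise ValueError(f"unsupported file format: {file_format!r}")
```

Under these assumptions, a Word document with 7,000 characters and two embedded images comes to three text page units plus two image page units.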

Text lines and words

Read extracts printed and handwritten text as lines and words. The model outputs bounding polygon coordinates and a confidence score for each extracted word. If handwriting is detected, the styles collection includes a handwritten style entry for the affected lines, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.

For Microsoft Word, Excel, PowerPoint, and HTML file formats, Read will extract all embedded text as is. For any embedded images, it will run OCR on the images to extract text and append the text from each image as an added entry to the pages collection. These added entries will include the extracted text lines and words, their bounding polygons, confidences, and the spans pointing to the associated text.
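To make the relationship between words, confidences, styles, and spans concrete, here is a minimal sketch that walks a hand-written sample shaped like an analyzeResult. The field names used (content, pages, words, confidence, styles, isHandwritten, spans, offset, length) are assumptions about the preview response shape and should be checked against the REST API reference. The sample data is invented, and span offsets are treated as plain Python string indices, which may not hold for text outside the basic character range.

```python
# Invented sample, shaped like the analyzeResult described above (assumed field names).
sample = {
    "content": "Hello world\nSigned: Ana",
    "pages": [{
        "pageNumber": 1,
        "words": [
            {"content": "Hello", "confidence": 0.99, "span": {"offset": 0, "length": 5}},
            {"content": "world", "confidence": 0.98, "span": {"offset": 6, "length": 5}},
            {"content": "Signed:", "confidence": 0.97, "span": {"offset": 12, "length": 7}},
            {"content": "Ana", "confidence": 0.62, "span": {"offset": 20, "length": 3}},
        ],
    }],
    "styles": [
        {"isHandwritten": True, "confidence": 0.9, "spans": [{"offset": 20, "length": 3}]},
    ],
}


def low_confidence_words(result: dict, threshold: float) -> list:
    """List (page number, word, confidence) for words scored below the threshold."""
    return [
        (page["pageNumber"], word["content"], word["confidence"])
        for page in result.get("pages", [])
        for word in page.get("words", [])
        if word["confidence"] < threshold
    ]


def handwritten_text(result: dict) -> list:
    """Resolve the spans of handwritten styles back to substrings of the content."""
    content = result["content"]
    snippets = []
    for style in result.get("styles", []):
        if style.get("isHandwritten"):
            for span in style["spans"]:
                start = span["offset"]
                snippets.append(content[start:start + span["length"]])
    return snippets
```

A span is just an offset and a length into the top-level content string, so the same slicing approach applies wherever the output points back at text.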

Language detection

Read adds language detection as a new feature for text lines. Read predicts the language of each text line and reports every detected language, along with its confidence, in the languages collection under analyzeResult.
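As a rough illustration of reading the languages collection, the sketch below groups text by detected language. It assumes each entry carries a locale, a confidence, and a list of spans into the top-level content string; these field names are assumptions about the preview response and the sample data is invented.

```python
from collections import defaultdict

# Invented sample with two lines in different languages (assumed field names).
sample = {
    "content": "Hello world\nBonjour le monde",
    "languages": [
        {"locale": "en", "confidence": 0.95, "spans": [{"offset": 0, "length": 11}]},
        {"locale": "fr", "confidence": 0.93, "spans": [{"offset": 12, "length": 16}]},
    ],
}


def text_by_language(result: dict, min_confidence: float = 0.0) -> dict:
    """Map each detected locale to the text snippets its spans point at."""
    content = result["content"]
    grouped = defaultdict(list)
    for language in result.get("languages", []):
        # Skip predictions the caller considers too uncertain.
        if language["confidence"] < min_confidence:
            continue
        for span in language["spans"]:
            start = span["offset"]
            grouped[language["locale"]].append(content[start:start + span["length"]])
    return dict(grouped)
```

Filtering on confidence is a design choice for the caller; the threshold used here is arbitrary and not a service recommendation.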

Select page(s) for text extraction

For large multi-page PDF documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.

Note

For Microsoft Word, Excel, PowerPoint, and HTML file formats, the Read API ignores the pages parameter and extracts all pages by default.
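The following sketch shows how the pages query parameter might be attached to an analyze request URL. The URL route, the helper name build_analyze_url, and the validation pattern are assumptions made for illustration; only the model ID prebuilt-read, the API version 2022-06-30-preview, and the pages parameter name come from this article. Confirm the exact route and accepted syntax in the REST API reference before relying on it.

```python
import re
from typing import Optional
from urllib.parse import urlencode

API_VERSION = "2022-06-30-preview"
MODEL_ID = "prebuilt-read"

# Assumed syntax: comma-separated page numbers or ranges, for example "1-3,5".
PAGES_PATTERN = re.compile(r"^\d+(-\d+)?(,\d+(-\d+)?)*$")


def build_analyze_url(endpoint: str, pages: Optional[str] = None) -> str:
    """Build an analyze URL for the Read model, optionally limited to some pages."""
    params = {"api-version": API_VERSION}
    if pages is not None:
        if not PAGES_PATTERN.match(pages):
            raise ValueError(f"invalid pages value: {pages!r}")
        params["pages"] = pages
    # Assumed route for the v3.0 preview; keep commas readable in the query string.
    base = f"{endpoint.rstrip('/')}/formrecognizer/documentModels/{MODEL_ID}:analyze"
    return f"{base}?{urlencode(params, safe=',')}"
```

For Word, Excel, PowerPoint, and HTML inputs the parameter can simply be left out, since the service ignores it for those formats.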

Next steps

Complete a Form Recognizer quickstart:

Explore our REST API: