Form Recognizer Read OCR model
Form Recognizer v3.0 preview includes the new Read Optical Character Recognition (OCR) model. The Read OCR model extracts printed and handwritten text, including mixed languages, from documents. It detects lines, words, their locations, and languages, and it is the core of all other Form Recognizer models: the layout, general document, custom, and prebuilt models all use the Read OCR model as the foundation for extracting text from documents.
Supported document types
|Read model||Text||Language detection|
The following resources are supported by Form Recognizer v3.0:
Try Form Recognizer
Try extracting text from forms and documents using the Form Recognizer Studio. You'll need the following assets:
- An Azure subscription. You can create one for free.
- A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.
Form Recognizer Studio (preview)
Currently, Form Recognizer Studio doesn't support the Microsoft Word, Excel, PowerPoint, or HTML file formats in the Read preview.
Sample form processed with Form Recognizer Studio
On the Form Recognizer Studio home page, select Read.
You can analyze the sample document or select the + Add button to upload your own sample.
Select the Analyze button.
Input requirements
- Supported file formats: JPEG/JPG, PNG, BMP, TIFF, and PDF (text-embedded or scanned). Additionally, the newest API version, 2022-06-30-preview, supports Microsoft Word (DOCX), Excel (XLS), PowerPoint (PPT), and HTML files.
- For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).
- The file size must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.
- Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
- The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 image. This height corresponds to about 8-point text at 150 DPI.
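The limits in this list can be checked locally before a file is uploaded. The following is a minimal sketch, not part of the service or its SDK: the function name `check_input`, the extension-to-format mapping, and the reading of 1 MB as 1024 x 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Limits taken from the requirements list; the extension mapping is an assumption.
SUPPORTED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".pdf"}
PREVIEW_EXTENSIONS = {".docx", ".xls", ".ppt", ".html"}  # 2022-06-30-preview only
MB = 1024 * 1024  # assumption: the service may count megabytes differently
MAX_BYTES = {"S0": 500 * MB, "F0": 4 * MB}
MIN_SIDE, MAX_SIDE = 50, 10_000  # pixels


def check_input(name, size_bytes, tier="F0", dimensions=None, preview_api=False):
    """Return a list of problems; an empty list means no limit was violated."""
    problems = []
    allowed = SUPPORTED_EXTENSIONS | (PREVIEW_EXTENSIONS if preview_api else set())
    ext = Path(name).suffix.lower()
    if ext not in allowed:
        problems.append(f"unsupported file format: {ext or '(none)'}")
    # The file size must be strictly less than the tier limit.
    if size_bytes >= MAX_BYTES[tier]:
        problems.append(f"file too large for tier {tier}")
    # Dimensions only apply to images; pass (width, height) when known.
    if dimensions is not None:
        width, height = dimensions
        if not (MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE):
            problems.append("image dimensions out of range")
    return problems
```

For example, `check_input("deck.ppt", 1000)` reports an unsupported format unless `preview_api=True` is passed, which mirrors the API-version condition in the first bullet.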
Supported languages and locales
The Form Recognizer preview version supports several languages for the Read model. See our Language Support for a complete list of supported handwritten and printed languages.
Data detection and extraction
With the added support for Microsoft Word, Excel, PowerPoint, and HTML files, the page units in the model output are computed as shown:
|File format||Computed page unit||Total pages|
|Images||Each image = 1 page unit||Total images|
|PDF||Each page in the PDF = 1 page unit||Total pages in the PDF|
|Word||Up to 3,000 characters = 1 page unit, Each embedded image = 1 page unit||Total pages of up to 3,000 characters each + Total embedded images|
|Excel||Each worksheet = 1 page unit, Each embedded image = 1 page unit||Total worksheets + Total images|
|PowerPoint||Each slide = 1 page unit, Each embedded image = 1 page unit||Total slides + Total images|
|HTML||Up to 3,000 characters = 1 page unit, embedded or linked images not supported||Total pages of up to 3,000 characters each|
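The rules in the table can be expressed as a small calculation. This is a sketch under stated assumptions: the function `page_units` and its parameters are hypothetical, and rounding a partial block of characters up to a full page unit is one reading of "up to 3,000 characters = 1 page unit", not confirmed service behavior.

```python
import math

CHARS_PER_PAGE_UNIT = 3000  # from the table


def page_units(file_format, *, pages=0, characters=0, embedded_images=0):
    """Estimate page units for one file, following the table row by row."""
    fmt = file_format.lower()
    if fmt == "image":
        return 1  # each image is one page unit
    if fmt == "pdf":
        return pages  # one unit per PDF page
    if fmt == "word":
        # Assumption: a partial block of characters still counts as one unit.
        return math.ceil(characters / CHARS_PER_PAGE_UNIT) + embedded_images
    if fmt in ("excel", "powerpoint"):
        # Here `pages` stands for worksheets or slides.
        return pages + embedded_images
    if fmt == "html":
        # Embedded or linked images are not supported, so they are ignored.
        return math.ceil(characters / CHARS_PER_PAGE_UNIT)
    raise ValueError(f"unknown file format: {file_format}")
```

Under these assumptions, a Word file with 7,000 characters and two embedded images comes to three text units plus two image units.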
Text lines and words
Read extracts print and handwritten style text as lines and words. The model outputs bounding polygon coordinates and confidence for the extracted words. The styles collection includes any handwritten style for lines, if detected, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.
For Microsoft Word, Excel, PowerPoint, and HTML file formats, Read extracts all embedded text as is. For any embedded images, it runs OCR on the images to extract text and appends the text from each image as an added entry to the pages collection. These added entries include the extracted text lines and words, their bounding polygons, confidences, and the spans pointing to the associated text.
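To illustrate how the words, confidences, and style spans described above relate to each other, here is a minimal sketch that walks a result held as a Python dictionary. The field names (`content`, `pageNumber`, `span`, `isHandwritten`, `offset`, `length`) and the sample data are assumptions for illustration; check them against the actual response of the API version you call. The sketch also assumes span offsets count Python string characters.

```python
# Hypothetical, hand-written result; not real service output.
sample_result = {
    "content": "Hello world",
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Hello", "polygon": [0, 0, 50, 0, 50, 20, 0, 20],
                 "confidence": 0.99, "span": {"offset": 0, "length": 5}},
                {"content": "world", "polygon": [60, 0, 110, 0, 110, 20, 60, 20],
                 "confidence": 0.42, "span": {"offset": 6, "length": 5}},
            ],
        }
    ],
    "styles": [
        {"isHandwritten": True, "confidence": 0.9,
         "spans": [{"offset": 6, "length": 5}]},
    ],
}


def low_confidence_words(result, threshold=0.8):
    """List (page number, word, confidence) for words below the threshold."""
    return [
        (page["pageNumber"], word["content"], word["confidence"])
        for page in result.get("pages", [])
        for word in page.get("words", [])
        if word["confidence"] < threshold
    ]


def handwritten_text(result):
    """Resolve handwritten style spans back to slices of the full content."""
    content = result.get("content", "")
    pieces = []
    for style in result.get("styles", []):
        if style.get("isHandwritten"):
            for span in style.get("spans", []):
                start = span["offset"]
                pieces.append(content[start:start + span["length"]])
    return pieces
```

Resolving spans against the top-level content string, rather than re-joining words, keeps the handwritten text aligned with whatever the model reported as a single unit.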
Read adds language detection as a new feature for text lines. Read predicts all detected languages for text lines, along with the confidence of each prediction, in the languages collection.
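As a rough illustration of consuming the languages collection, the sketch below groups text by detected language. It assumes each entry carries a language identifier (named `locale` here), a `confidence`, and a list of `spans` with `offset` and `length` into a top-level `content` string. These field names and the sample data are assumptions and may differ between preview versions.

```python
from collections import defaultdict


def text_by_language(result, min_confidence=0.5):
    """Group text slices by detected language, skipping weak predictions."""
    content = result.get("content", "")
    grouped = defaultdict(list)
    for entry in result.get("languages", []):
        if entry.get("confidence", 0.0) < min_confidence:
            continue
        for span in entry.get("spans", []):
            start = span["offset"]
            grouped[entry["locale"]].append(content[start:start + span["length"]])
    return dict(grouped)


# Hypothetical, hand-written result; not real service output.
sample = {
    "content": "Good morning\nBonjour",
    "languages": [
        {"locale": "en", "confidence": 0.95, "spans": [{"offset": 0, "length": 12}]},
        {"locale": "fr", "confidence": 0.90, "spans": [{"offset": 13, "length": 7}]},
    ],
}
```

The `min_confidence` cutoff is an arbitrary illustrative default; a suitable value depends on how mixed the languages in your documents are.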
Select page(s) for text extraction
For large multi-page PDF documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.
For Microsoft Word, Excel, PowerPoint, and HTML file formats, the Read API ignores the pages parameter and extracts all pages by default.
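The following sketch shows one way to assemble a request URL that carries the pages query parameter. Several details are assumptions rather than facts from this article: the route `/formrecognizer/documentModels/prebuilt-read:analyze`, the comma-and-hyphen syntax for the page specification (such as `1-3,5`), and the helper names `format_pages` and `build_analyze_url`. Verify them against the REST API reference before use.

```python
from urllib.parse import urlencode


def format_pages(pages):
    """Collapse page numbers into a compact spec, for example '1-3,5'."""
    ranges = []
    for n in sorted(set(pages)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run of consecutive pages
        else:
            ranges.append([n, n])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def build_analyze_url(endpoint, pages=None, api_version="2022-06-30-preview"):
    """Build the analyze URL; the route below is an assumed path."""
    query = {"api-version": api_version}
    if pages:
        query["pages"] = format_pages(pages)
    base = endpoint.rstrip("/") + "/formrecognizer/documentModels/prebuilt-read:analyze"
    # Keep commas readable in the query string instead of percent-encoding them.
    return base + "?" + urlencode(query, safe=",")
```

Because the Read API ignores the pages parameter for Word, Excel, PowerPoint, and HTML input, a caller can pass `pages=None` for those formats and omit the parameter entirely.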
Complete a Form Recognizer quickstart:
Explore our REST API: