Form Recognizer Read OCR model
Form Recognizer v3.0 preview includes the new Read Optical Character Recognition (OCR) model. The Read OCR model extracts printed and handwritten text, including mixed languages, from documents. It detects lines, words, locations, and languages, and is the core of all other Form Recognizer models: layout, general document, custom, and prebuilt models all use the Read OCR model as the foundation for extracting text from documents.
Supported document types
| Model | Images | PDF | TIFF | Word | Excel | PowerPoint | HTML |
|---|---|---|---|---|---|---|---|
| Read | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Data extraction
| Read model | Text | Language detection |
|---|---|---|
| prebuilt-read | ✓ | ✓ |
Development options
The following resources are supported by Form Recognizer v3.0:
| Feature | Resources | Model ID |
|---|---|---|
| Read model | | prebuilt-read |
Try Form Recognizer
Try extracting text from forms and documents using the Form Recognizer Studio. You'll need the following assets:
- An Azure subscription. You can create one for free.
- A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.
Form Recognizer Studio (preview)
Note
Currently, Form Recognizer Studio doesn't support Microsoft Word, Excel, PowerPoint, or HTML file formats in the Read preview.
Sample form processed with Form Recognizer Studio
On the Form Recognizer Studio home page, select Read.
You can analyze the sample document or select the + Add button to upload your own sample.
Select the Analyze button.
Input requirements
- For best results, provide one clear photo or high-quality scan per document.
- Supported file formats: JPEG/JPG, PNG, BMP, TIFF, and PDF (text-embedded or scanned). Text-embedded PDFs are best to eliminate the possibility of error in character extraction and location. Additionally, the newest API version, 2022-06-30-preview, supports Microsoft Word (DOCX), Excel (XLS), PowerPoint (PPT), and HTML files in the Read model.
- For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).
- The file size for analyzing documents must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.
- Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
- PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.
- If your PDFs are password-locked, you must remove the lock before submission.
- The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).
- For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.
- For custom model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.
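The size, format, and dimension limits above lend themselves to a quick pre-flight check before a file is submitted. The following is a minimal sketch, not part of the service or its SDK: the function name `check_input` and its parameters are hypothetical, 1 MB is assumed to mean 1024 x 1024 bytes, and only the image and PDF formats are covered (the Office and HTML formats are left out because they depend on the preview API version).

```python
from pathlib import Path

# Limits taken from the list above. Treating 1 MB as 1024 * 1024 bytes is an assumption.
MAX_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}
SUPPORTED_EXTENSIONS = IMAGE_EXTENSIONS | {".pdf"}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10_000


def check_input(filename, size_bytes, tier="S0", width_px=None, height_px=None):
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    # The text says the file must be *less than* the tier limit.
    if size_bytes >= MAX_BYTES[tier]:
        problems.append(f"file too large for tier {tier}: {size_bytes} bytes")
    # Pixel limits are only checked for images, and only when dimensions are supplied.
    if ext in IMAGE_EXTENSIONS and None not in (width_px, height_px):
        if not all(MIN_SIDE_PX <= side <= MAX_SIDE_PX for side in (width_px, height_px)):
            problems.append(f"image dimensions out of range: {width_px} x {height_px}")
    return problems


# A 5 MB PNG is too large for the free tier, so one problem is reported.
print(check_input("scan.png", 5 * 1024 * 1024, tier="F0", width_px=800, height_px=600))
```

The check is deliberately conservative: it cannot detect password-locked PDFs or text that is too small, so a passing result only means that none of the listed numeric limits was violated.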
Supported languages and locales
The Form Recognizer preview version supports several languages for the Read model. See our Language Support for a complete list of supported handwritten and printed languages.
Data detection and extraction
Pages
With the added support for Microsoft Word, Excel, PowerPoint, and HTML files, the page units in the model output are computed as shown:
| File format | Computed page unit | Total pages |
|---|---|---|
| Images | Each image = 1 page unit | Total images |
| PDF | Each page in the PDF = 1 page unit | Total pages in the PDF |
| Word | Up to 3,000 characters = 1 page unit, Each embedded image = 1 page unit | Total pages of up to 3,000 characters each + Total embedded images |
| Excel | Each worksheet = 1 page unit, Each embedded image = 1 page unit | Total worksheets + Total images |
| PowerPoint | Each slide = 1 page unit, Each embedded image = 1 page unit | Total slides + Total images |
| HTML | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
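The page-unit rules in the table can be expressed as a small calculation, which may help when estimating how many units a document will produce. This is a sketch under stated assumptions: the function `page_units` and its keyword arguments are hypothetical, and "up to 3,000 characters = 1 page unit" is interpreted as rounding the character count up to the next multiple of 3,000.

```python
import math

CHARS_PER_PAGE_UNIT = 3000  # from the Word and HTML rows of the table


def page_units(file_format, *, pages=0, characters=0, worksheets=0,
               slides=0, embedded_images=0):
    """Hypothetical helper: estimate page units for one file, per the table above."""
    fmt = file_format.lower()
    if fmt == "image":
        return 1
    if fmt == "pdf":
        return pages
    if fmt == "word":
        return math.ceil(characters / CHARS_PER_PAGE_UNIT) + embedded_images
    if fmt == "excel":
        return worksheets + embedded_images
    if fmt == "powerpoint":
        return slides + embedded_images
    if fmt == "html":
        # Embedded or linked images are not supported for HTML, so they are ignored.
        return math.ceil(characters / CHARS_PER_PAGE_UNIT)
    raise ValueError(f"unknown file format: {file_format}")


# A 7,000-character Word file with 2 embedded images: 3 text units + 2 image units.
print(page_units("word", characters=7000, embedded_images=2))
```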
Text lines and words
Read extracts print and handwritten style text as lines and words. The model outputs bounding polygon coordinates and confidence for the extracted words. The styles collection includes any handwritten style for lines if detected along with the spans pointing to the associated text. This feature applies to supported handwritten languages.
For Microsoft Word, Excel, PowerPoint, and HTML file formats, Read will extract all embedded text as is. For any embedded images, it will run OCR on the images to extract text and append the text from each image as an added entry to the pages collection. These added entries will include the extracted text lines and words, their bounding polygons, confidences, and the spans pointing to the associated text.
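Because each extracted word carries a confidence value, one common use is to flag words that may need manual review. The sketch below works on an already-parsed result held as a Python dictionary. The field names (`pages`, `pageNumber`, `words`, `content`, `confidence`, `polygon`) and the sample data are assumptions made for illustration, and should be checked against the actual response of the API version in use.

```python
def low_confidence_words(analyze_result, threshold=0.8):
    """Hypothetical helper: list (page number, word, confidence) below a threshold."""
    flagged = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            confidence = word.get("confidence")
            # Words without a confidence value are skipped rather than guessed at.
            if confidence is not None and confidence < threshold:
                flagged.append((page.get("pageNumber"), word.get("content"), confidence))
    return flagged


# Made-up sample in the assumed shape; it is not real service output.
sample = {
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Invoice", "confidence": 0.99,
                 "polygon": [10, 10, 90, 10, 90, 30, 10, 30]},
                {"content": "t0tal", "confidence": 0.42,
                 "polygon": [10, 40, 60, 40, 60, 60, 10, 60]},
            ],
        }
    ]
}

print(low_confidence_words(sample))
```

The threshold of 0.8 is arbitrary; a suitable value depends on the document quality and on how costly a missed error is.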
Language detection
Read adds language detection as a new feature for text lines. Read will predict all detected languages for text lines along with the confidence in the languages collection under analyzeResult.
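As an illustration of reading the languages collection, the sketch below collects the highest confidence seen for each detected language. It assumes the result has already been parsed into a dictionary. The key that holds the language code is an assumption: the code accepts either `locale` or `languageCode`, since the name may differ between API versions, and the sample data is made up.

```python
def detected_languages(analyze_result):
    """Hypothetical helper: map each language code to its highest confidence."""
    best = {}
    for entry in analyze_result.get("languages", []):
        # The key name is an assumption; both candidates are tried.
        code = entry.get("locale") or entry.get("languageCode")
        confidence = entry.get("confidence")
        if code is None or confidence is None:
            continue
        best[code] = max(confidence, best.get(code, 0.0))
    return best


# Made-up sample in the assumed shape; it is not real service output.
sample_result = {
    "languages": [
        {"locale": "en", "confidence": 0.95, "spans": [{"offset": 0, "length": 40}]},
        {"locale": "fr", "confidence": 0.70, "spans": [{"offset": 41, "length": 25}]},
        {"locale": "en", "confidence": 0.80, "spans": [{"offset": 67, "length": 12}]},
    ]
}

print(detected_languages(sample_result))
```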
Select page(s) for text extraction
For large multi-page PDF documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.
Note
For Microsoft Word, Excel, PowerPoint, and HTML file formats, the Read API ignores the pages parameter and extracts all pages by default.
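To show where the pages query parameter goes, the following sketch builds an analyze request URL without sending it. The URL path layout, the helper name `build_analyze_url`, and the example endpoint are assumptions for illustration. Only the model ID `prebuilt-read`, the API version `2022-06-30-preview`, and the `pages` parameter come from the text above, so the exact request format should be confirmed in the REST API reference.

```python
from urllib.parse import urlencode


def build_analyze_url(endpoint, pages=None, api_version="2022-06-30-preview",
                      model_id="prebuilt-read"):
    """Hypothetical helper: compose an analyze URL with an optional pages filter."""
    query = {"api-version": api_version}
    if pages:
        # Example value: "1-3,5" selects pages 1 through 3 and page 5.
        query["pages"] = pages
    base = endpoint.rstrip("/")
    # safe="," keeps the commas in the page list readable instead of encoding them.
    return f"{base}/formrecognizer/documentModels/{model_id}:analyze?{urlencode(query, safe=',')}"


# The endpoint below is a placeholder, not a real resource.
print(build_analyze_url("https://example.cognitiveservices.azure.com/", pages="1-3,5"))
```

As the note above says, the parameter only matters for formats such as PDF; for Word, Excel, PowerPoint, and HTML input the service ignores it.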
Next steps
Complete a Form Recognizer quickstart:
Explore our REST API: