Build a training data set for a custom model
When you use the Form Recognizer custom model, you provide your own training data to the Train Custom Model operation so that the model can learn your industry-specific forms. Follow this guide to learn how to collect and prepare data to train the model effectively.
You need at least five filled-in forms of the same type.
The same minimum applies if you want to use manually labeled training data. You can still use unlabeled forms in addition to the required data set.
Custom model input requirements
First, make sure your training data set follows the input requirements for Form Recognizer.
- For best results, provide one clear photo or high-quality scan per document.
- Supported file formats: JPEG/JPG, PNG, BMP, TIFF, and PDF (text-embedded or scanned). Text-embedded PDFs are best to eliminate the possibility of error in character extraction and location.
- For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).
- The file size must be less than 500 MB for the paid (S0) tier and less than 4 MB for the free (F0) tier.
- Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
- PDF dimensions must be 17 x 17 inches or smaller, which corresponds to Legal or A3 paper size.
- The total size of the training data must be 500 pages or less.
- If your PDFs are password-locked, you must remove the lock before submission.
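Some of the requirements above can be checked locally before you upload anything. The following is a minimal sketch, not part of Form Recognizer: the function names `check_file` and `check_folder` are hypothetical, the `.tif` extension is assumed to be an accepted spelling of TIFF, and 1 MB is assumed to mean 1024 x 1024 bytes. It checks only file format and file size; page counts, image dimensions, and PDF password locks are not checked.

```python
from pathlib import Path

# Limits copied from the requirements list above.
# Assumption: ".tif" is accepted as TIFF, and 1 MB = 1024 * 1024 bytes.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".pdf"}
MAX_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}


def check_file(path, tier="S0"):
    """Return a list of problems found for one training file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file format: {path.suffix or '(none)'}")
    # The requirement is "less than", so a file exactly at the limit fails.
    if path.stat().st_size >= MAX_BYTES[tier]:
        problems.append(f"file too large for {tier} tier")
    return problems


def check_folder(folder, tier="S0"):
    """Map each file under a local folder (recursively) to its problems."""
    root = Path(folder)
    results = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            results[str(p.relative_to(root))] = check_file(p, tier)
    return results
```

Calling `check_folder("training_data", tier="F0")` on a local folder returns a dictionary in which every file with an empty list passed both checks.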
Training data tips
Follow these additional tips to further optimize your data set for training.
- If possible, use text-based PDF documents instead of image-based documents. Scanned PDFs are handled as images.
- For filled-in forms, use examples that have all of their fields filled in.
- Use forms with different values in each field.
- If your form images are of lower quality, use a larger data set (10-15 images, for example).
Upload your training data
When you've put together the set of form documents that you'll use for training, you need to upload it to an Azure blob storage container. If you don't know how to create an Azure storage account with a container, follow the Azure Storage quickstart for Azure portal. Use the standard performance tier.
If you want to use manually labeled data, you'll also have to upload the .labels.json and .ocr.json files that correspond to your training documents. You can use the Sample Labeling tool (or your own UI) to generate these files.
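Before uploading labeled data, it can help to confirm that every training document has both companion files. The sketch below is illustrative only. It assumes the companions are named by appending `.labels.json` and `.ocr.json` to the full document file name (for example, `form1.pdf.labels.json`) and that they sit in the same local folder as the document; verify that convention against the files your labeling tool actually produces. The function name `missing_label_files` is hypothetical.

```python
from pathlib import Path

# Document formats taken from the input requirements (".tif" is an assumption).
DOCUMENT_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".pdf"}


def missing_label_files(folder):
    """List the companion files that are absent for each document in a folder.

    Assumes companions are named "<document file name>.labels.json" and
    "<document file name>.ocr.json" and are stored next to the document.
    """
    missing = {}
    for doc in sorted(Path(folder).iterdir()):
        # Skip folders and anything that is not a training document,
        # including the .json companion files themselves.
        if not doc.is_file() or doc.suffix.lower() not in DOCUMENT_EXTENSIONS:
            continue
        absent = [
            doc.name + suffix
            for suffix in (".labels.json", ".ocr.json")
            if not (doc.parent / (doc.name + suffix)).exists()
        ]
        if absent:
            missing[doc.name] = absent
    return missing
```

An empty dictionary means every document in the folder has both companion files under the assumed naming scheme.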
Organize your data in subfolders (optional)
By default, the Train Custom Model API uses only the form documents located at the root of your storage container. However, you can train with data in subfolders if you specify it in the API call. Normally, the body of the Train Custom Model call has the following format, where <SAS URL> is the shared access signature (SAS) URL of your container:
{
  "source": "<SAS URL>"
}
If you add the following content to the request body, with "includeSubFolders" set to true, the API will train with documents located in subfolders. The "prefix" field is optional and limits the training data set to files whose paths begin with the given string. A value of "Test", for example, causes the API to look only at files or folders whose names begin with "Test".
{
  "source": "<SAS URL>",
  "sourceFilter": {
    "prefix": "<prefix string>",
    "includeSubFolders": true
  },
  "useLabelFile": false
}
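If you assemble the request body in code, a small helper can keep the two formats above consistent. The following is a sketch only: `build_train_request` and `matches_prefix` are hypothetical names, not part of any Form Recognizer SDK, and the sketch builds the JSON body without sending it. `matches_prefix` is a local approximation of the "begins with" behavior described above, assuming a case-sensitive comparison; the service's own filtering is authoritative.

```python
import json


def build_train_request(sas_url, prefix=None, include_subfolders=False,
                        use_label_file=None):
    """Build the body of a Train Custom Model call as a dictionary.

    With no options, the result matches the minimal {"source": ...} format.
    """
    body = {"source": sas_url}
    source_filter = {}
    if prefix is not None:
        source_filter["prefix"] = prefix
    if include_subfolders:
        source_filter["includeSubFolders"] = True
    if source_filter:
        body["sourceFilter"] = source_filter
    if use_label_file is not None:
        body["useLabelFile"] = use_label_file
    return body


def matches_prefix(blob_path, prefix):
    """Approximate the prefix filter: does the path begin with the string?"""
    return blob_path.startswith(prefix)


if __name__ == "__main__":
    body = build_train_request("<SAS URL>", prefix="Test",
                               include_subfolders=True, use_label_file=False)
    print(json.dumps(body, indent=2))
```

Under this approximation, both `Test/form1.pdf` and `Testing.pdf` match the prefix `Test`, while `Other/Test.pdf` does not, because the comparison applies to the start of the whole path.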
Next steps
Now that you've learned how to build a training data set, follow a quickstart to train a custom Form Recognizer model and start using it on your forms.
- Train a model and extract form data using the client library or REST API
- Train with labels using the sample labeling tool