Things we tried.
We are currently training the model on sample PDF files that contain multiple tables and key-value pairs.
1) When we train the model to 100% accuracy and then parse a new file that was not part of the training set, we get low confidence scores on the data from the new file. Some attributes aren't even mapped.
2) When we have similar-format files with different data, the model's accuracy isn't 100%. We get better confidence scores, but still not in the 90s as we expect; they range between 20% and 80%.
3) We then suspected that the way we were tagging the data was holding the confidence scores down. We renamed the tags to table1, table2, etc., hoping for better accuracy when reading data back from the model, but we still received low confidence scores.
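For context, this is roughly how we would like to consume the results once scores are usable: accept fields above a confidence threshold and route the rest to manual review. The result shape, field names, and the 0.90 threshold below are purely illustrative (not from any specific SDK):

```python
# Hypothetical extraction result: field name -> (value, confidence in [0, 1]).
# Shape and names are illustrative only, not tied to a particular service.
result = {
    "invoice_number": ("INV-1042", 0.78),
    "vendor": ("Microsoft", 0.35),
    "total": ("1,250.00", 0.92),
}

THRESHOLD = 0.90  # assumed target; today we see scores in the 20-80% range

def split_by_confidence(fields, threshold=THRESHOLD):
    """Partition fields into auto-accepted ones and ones needing review."""
    accepted, review = {}, {}
    for name, (value, conf) in fields.items():
        (accepted if conf >= threshold else review)[name] = (value, conf)
    return accepted, review

accepted, review = split_by_confidence(result)
```

With scores stuck in the 20-80% band, nearly everything lands in the review bucket, which defeats the point of automating the extraction.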
There are a few points I'd like to understand.
1) When we tag the data, does the model base future scans on the position of the data in the PDF, or on the actual content itself?
For example, if I tag the word 'Microsoft' in a document, will it look for the word 'Microsoft' on all pages of the PDF, or will it look only in the location learned from the files it was trained on?
2) What can we do to improve the confidence scores of files scanned in the future?
3) The tables we have are not always at the same coordinates; they vary based on the data in each file. We found it difficult to select the entire table, as a couple of columns would get skipped, so we tagged each value in the table individually. If a new file contains more rows than I tagged in the model, will I be able to get the data from those additional rows?
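To make the row question concrete: ideally the extraction would hand back the table as a list of rows so that files with more rows than we tagged are still fully captured. A minimal sketch of how we would consume such output, with a hypothetical row-dict shape and made-up column names:

```python
# Hypothetical table output: one dict per extracted row, so files with more
# rows than were tagged during training are still fully represented.
# Column names ("item", "qty", "price") are illustrative only.
table_rows = [
    {"item": "Widget A", "qty": "3", "price": "10.00"},
    {"item": "Widget B", "qty": "1", "price": "25.00"},
    {"item": "Widget C", "qty": "7", "price": "4.50"},  # row beyond the tagged examples
]

def rows_to_records(rows, columns=("item", "qty", "price")):
    """Normalize every extracted row into a fixed-width tuple,
    filling missing cells with an empty string."""
    return [tuple(row.get(col, "") for col in columns) for row in rows]

records = rows_to_records(table_rows)
```

If instead each cell must be tagged as its own fixed field, any rows beyond the tagged ones would simply have no field to land in, which is the behavior we are hoping to avoid.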