I am using tesseract to get text from an image, and I am only interested in numbers. I have trained tesseract and created a new language for the exact font in the image, and the training data only included numbers. In the training data I also included every possible value that could appear in an image (1-5000, to be specific) and also created a wordlist of those same values. However, it still mixes up 1 and 7, and sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some processing on the image before giving it to tesseract?
1. Make sure there are at least 20 instances of every character in the training texts you provide to tesseract. I give at least 6 pages of the same font to have a decent training sample size.
2. Tesseract's text recognition also depends on the image quality. Check out the possible preprocessing algorithms you can use: Improve Quality of Tesseract (a rough sketch is given below).
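For reference, a rough preprocessing sketch (not from the original answers) assuming Python with OpenCV and pytesseract; the file name is hypothetical. Binarizing the crop and whitelisting digits often cuts down confusions like 1/7 and 3/8:

    import cv2
    import pytesseract

    # Hypothetical input crop containing only the number field
    img = cv2.imread("number_crop.png", cv2.IMREAD_GRAYSCALE)

    # Upscale and binarize: tesseract tends to do better on large, high-contrast glyphs
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 11)

    # Whitelist digits only and treat the crop as a single text line,
    # so confusions with non-digit classes cannot occur at all
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    print(pytesseract.image_to_string(img, config=config))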
Take a look at the number_dawg file; modifying it can help with recognising digits.
I have some questions about making tiff/box files for tesseract 4.
In the TrainingTesseract 4.00 document it is written:
"Making Box Files: As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example)."
But it does not explain how to train with pre-existing images.
I want to train for the Persian language in tesseract 4 (LSTM). I have some images from ancient manuscripts and want to train with images and texts instead of a font, so I can't use the text2image command. I know that the old-format box files will not work for LSTM training.
1. How can I make tif/box files for Tesseract 4 LSTM, how do I label them, and how do the tesseract commands change?
2. Should I use other tools for generating the box files (given that Persian is written right to left)?
3. Should I use fine-tuning or train from scratch?
I was struggling just like you, until I found this github repository:
https://github.com/OCR-D/ocrd-train
It will make your life much easier. All you need to do is put your images in tif format, with the ground-truth text for each image in a file of the same name with the extension .gt.txt; it takes care of all the rest for you (you might need to update the Makefile for your local machine).
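As an illustration (the file names are hypothetical and the exact directory name depends on your Makefile), the pairing it expects looks like this, one .gt.txt transcription per .tif line image with the same base name:

    ground-truth/
        line-0001.tif
        line-0001.gt.txt
        line-0002.tif
        line-0002.gt.txt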
Whether to train from scratch or fine-tune depends on your language, your data and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance but want to build on it.
All the useful details you might need can be found in this answer.
1) Use the command below to make an lstmbox:
tesseract test.tif test-lstmbox -l eng --psm 6 lstmbox
It will make an lstmbox for you, but you have to correct the characters in the box file.
2) Training from scratch requires a lot of data, so I suggest that fine-tuning is the better option.
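If you have many training images, a small wrapper (a hypothetical Python sketch around the exact command shown above) saves typing; the folder name is an assumption:

    import subprocess
    from pathlib import Path

    # Run the same lstmbox command for every .tif in an assumed input folder;
    # the output base name is derived from each image's file name.
    for tif in sorted(Path("train_images").glob("*.tif")):
        out_base = str(tif.with_suffix("")) + "-lstmbox"
        subprocess.run(["tesseract", str(tif), out_base,
                        "-l", "eng", "--psm", "6", "lstmbox"], check=True)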
This question already has an answer here: What is the typical method to separate connected letters in a word using OCR (1 answer).
I made a basic OCR system in MATLAB using correlation. (It's not a professional project, only an exercise, and I am not using MATLAB's ocr() function.) My code works almost correctly for clean text images. But if I make the job a little harder (taking the photo of the text from the side, at an angle), my code does not give good results. I use Principal Component Analysis to correct the text alignment, but when the photo is taken at an angle the characters end up very close together and I can't separate them for the recognition step.
[Image: the original image and the result after preprocessing (adaptive thresholding, adjustment, PCA)]
How can I separate the characters correctly?
An alternative to what Yves suggests is to erode the image, which is implemented as imerode in MATLAB. Perhaps scale the image up first (though it is not needed here).
e.g. with this code
ocr(imerode(I,strel('disk',3)))
where I is your black-and-white "BOOLEAN" image, I get:
ocrText with properties:
Text: 'BOOLEAN↵↵'
CharacterBoundingBoxes: [9×4 double]
CharacterConfidences: [9×1 single]
Words: {'BOOLEAN'}
WordBoundingBoxes: [14 36 208 43]
WordConfidences: 0.5477
Splitting characters is a pretty difficult problem.
Unless the character widths are constant (which is the case for this image but might not be true with other letters), methods based on projection analysis (the vertical extent of the characters as a function of the abscissa) will fail.
Actually, for a method to be effective, it must be font-aware, i.e. know in advance what the alphabet looks like. In other words, you can't separate segmentation from recognition.
A possibility is to attempt to decompose the blob assumed to be made of touching characters (possibly based on projections or known character sizes), perform the recognition, and check the recognition results. Preferably, try several decompositions and keep the best.
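For reference, a rough numpy sketch of the projection-analysis baseline discussed above, i.e. splitting at empty columns; this is exactly the approach that breaks down once glyphs touch, which is why any decomposition has to be checked against the recognition results:

    import numpy as np

    def split_by_projection(binary_line):
        # binary_line: 2-D array of a single text line, text pixels == 1.
        # Returns (start, end) column ranges of connected ink runs; touching
        # characters end up in one run, which is the failure mode noted above.
        profile = binary_line.sum(axis=0) > 0   # True where a column contains ink
        segments, start = [], None
        for x, has_ink in enumerate(profile):
            if has_ink and start is None:
                start = x
            elif not has_ink and start is not None:
                segments.append((start, x))
                start = None
        if start is not None:
            segments.append((start, binary_line.shape[1]))
        return segments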
I am trying to use NVIDIA DIGITS to perform image segmentation (by following this article). After training the model, I was able to perform the segmentation on individual images.
Now I would like to give a list of test images and their labelled images to DIGITS and have it calculate the accuracy for those input images. Is this possible with DIGITS and, if yes, what are the steps (options to select in the framework while training/testing the model, format of the input file, folder structure, etc.)?
Also, if DIGITS can't do it, are there any other frameworks? (I read this post, but it would be better if I could start with a plug-and-play framework.)
Thanks!
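If you end up computing the metric yourself outside DIGITS, a standard choice is per-class intersection-over-union; a minimal numpy sketch, assuming the predictions and ground truth are integer label maps of the same size:

    import numpy as np

    def mean_iou(pred, gt, num_classes):
        # Mean intersection-over-union between two integer label maps.
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:               # ignore classes absent from both maps
                ious.append(inter / union)
        return float(np.mean(ious))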
I am using Caffe to extract features with the MATLAB wrapper. I have 5011 images as the test data set. I chopped off all the layers after 'relu7' in 'deploy.prototxt'. I found that if you feed the same image to matcaffe_demo.m and matcaffe_batch.m, you get different 4096-dim features.
Could someone tell me why?
What is the difference between extracting features from these images one by one with matcaffe_demo.m and extracting them by listing all the images for matcaffe_batch.m?
You can find the answer to this question on the caffe GitHub.
Basically, matcaffe_demo is used for classification and averages the results of 10 crops of the input image, while matcaffe_batch uses only a single input.
Moreover, note that these m-files are no longer available in recent caffe versions.
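To make the 10-crop difference concrete, here is a rough numpy sketch (not caffe's actual code) of the oversampling that matcaffe_demo performs: four corner crops plus the centre crop, each also mirrored, with the resulting feature vectors averaged; a single-view pipeline like matcaffe_batch skips this, hence the differing 4096-dim features.

    import numpy as np

    def ten_crops(img, crop):
        # img: HxWxC array; returns the 4 corner crops, the centre crop,
        # and the horizontal mirror of each (10 views in total).
        h, w = img.shape[:2]
        offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
                   ((h - crop) // 2, (w - crop) // 2)]
        views = [img[y:y + crop, x:x + crop] for y, x in offsets]
        views += [v[:, ::-1] for v in views]
        return np.stack(views)

    # feats = np.mean([extract_features(v) for v in ten_crops(image, 224)], axis=0)
    # (extract_features is a placeholder for the network forward pass)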
I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from photos of bill invoices.
Using a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.) the results are finally fairly accurate, but there are still some cases I want to improve.
If the user takes the photo in low-light conditions and there is some noise or artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read, e.g., a numeric amount of "32,15" EUR as "5432,15" EUR, and this is not at all good for the end user's confidence in the product.
I assume that, if there is an internal read-error value associated with each recognized character, it will be higher for the "54" digits of my previous example, since they are recognized from small noise pixels; if I had access to these values I would be able to easily discard the erroneous digits.
Do you know of any method to get a reading error magnitude (or any "accuracy factor" value) for each individual character returned from tesseract OCR engine?
It is called the "confidence" value in Tesseract terminology. Searching for that term in the tesseract-ocr group turns up many answers that mention a TesseractExtractResult method.
The hOCR output also contains this value.
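For anyone doing this from Python rather than the iOS wrapper, a minimal sketch with pytesseract (an assumption on my part, not the OP's setup) that reads the word-level confidence values; per-character confidences require iterating the result at symbol level through Tesseract's own API, which pytesseract does not expose directly:

    import pytesseract
    from pytesseract import Output
    from PIL import Image

    img = Image.open("invoice_crop.png")            # hypothetical input
    data = pytesseract.image_to_data(img, config="--psm 7", output_type=Output.DICT)

    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if text.strip() and conf >= 0:               # -1 marks non-text regions
            flag = "suspect" if conf < 60 else "ok"  # 60 is an arbitrary cut-off
            print(f"{text!r}: confidence {conf:.0f} ({flag})")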