Training new font in tesseract 3.04 - tesseract

I'm trying to train new belarussian font in tesseract 3.04. I edited box files, made some traineddata files with dictionaries. Now I have traineddata, which do some correct recognition results, but not 80% quality at least. How can I improve recognition quality? I have multi tiff with 29 pages. Now I use this traineddata like "-l bel+bel1"

Related

What to put in txt file when training tesseract

I am trying to train tesseract to recognize a new font and was wondering how many words I would need to put into the txt file to increase its accuracy. For example, fo I put in just a couple of words but with different combinations or up to 10,000+ words?
See Prepare a text file section of Training Tesseract 3.03–3.05 page.

Training tesseract 4 with images instead of font

I have some questions about making tiff/box files for tesseract 4.
In TrainingTesseract 4.00 document written:
Making Box Files As with base Tesseract, there is a choice between
rendering synthetic training data from fonts, or labeling some
pre-existing images (like ancient manuscripts for example).
But it did not explain how to train with pre-existing images.
I want to train for the Persian language in tesseract 4 (lstm). I have some images from ancient manuscripts and want to train with images and texts instead of font. So I can’t use text2image command. I know that the old format box files will not work for LSTM training.
How can I make tif/box for tessearct 4 lstm then label them and
how to change tesseract commands?
Should I use other tools for generating box files (Given that Persian
language is right to left )?
Should I use fine tuning or train from Scratch?
I was struggling just like you, until I found this github repository:
https://github.com/OCR-D/ocrd-train
It will make your life super easy. All you need to do is to put your images in tif format and your text should have the same image name with extension .gt.txt. It will take care of all the rest for you. (you might need to update the Makefile according to your local machine)
Whether to train from scratch or fine-tune depends on your own language, data and the problem you are trying to solve. For me the fine tunining is what I need cause I am happy with the current performance but need to add upon it.
All the useful details you might need can be found in this answer
1) Use below command to make lstmbox:
tesseract test.tif test-lstmbox -l eng --psm 6 lstmbox
It will make a lstmbox for you but you have to correct the character in box file.
2) You require enough data for training from Scratch So I suggest fine tuning is better option.

Tesseract mixing up "1" and "7" despite training on exact font

I am using tesseract to get text from an image, I am only interested in numbers. I have trained tesseract and created a new language that is the exact font in the image and the training data only included numbers. In the training data I also included every possible value that would be in an image, 1-5000 to be specific and also created a wordlist of these same values. However it still mixes up 1 and 7, as well as sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some processing on the image before giving it to tesseract?
Make sure there are at least 20 instances of every character in the training texts you provide to tesseract. I give at least 6 pages of the same font to have a decent training sample size.
2.Tesseract Text Recognition also depends on the image quality. Check out possible preprocessing algorithms you can use: Improve Quality of Tesseract
Take a look at the number_dawg file. Modifying it can help recognising digits.

Understanding webp encoder options

I'm currently experimenting with webp encoder (no wic) on windows 64 environment. My samples are 10 jpg stock photos depicting landscapes and houses, and the photos already optimized in jpegtran. I do this because my goal is to optimize the images of a whole website where the images have already been compressed with photoshop using the save for web command with various values on quality and then optimized with jpegtran.
I found out that using values smaller than -q 85 have a visual impact on the quality of the webp images. So I'm playing with values above 90 where the difference is smaller. I also concluded that I have to use -jpeg_like because without it the output is sometimes bigger in size than the original, which is not acceptable. I also use -m 6 -f 100 -strong because I really don't mind about the time the encoder needs to produce the output and trying to achieve the smoother results. I tried several values for these and concluded that -m 6 -f 100 -strong have the best output regarding quality and size.
I also tried the -preset photo avoiding any other parameter except -q but the size of the output gets bigger.
What I don't understand from https://developers.google.com/speed/webp/docs/cwebp#options are the options -sns , -segments which seem to have a great impact on the output size. Sometimes the output is bigger and sometimes smaller in size for the same options but I haven't concluded yet what is the reason for that and how to properly use them.
I also don't understand the -sharpness option which doesn't have an impact at the output size at least for me.
My approach is far less than a scientific approach and more like a trial and error method and If anybody knows how to use those options for the specific input and explain them for optimum results I would appreciate such a feedback.
-strong and -sharpness only change the strength of the filtering in the header of the compressed bitstream. They will be used at decoding time. That's why you don't see a change in file size for these.
-sns controls the choice of filtering strength and quantization values within each segments. A segment is just a group of macroblocks in the picture, that are believed to be sharing similar properties regarding complexity and compressibility. A complex photo should likely use the maximum allowed 4 segments (which is the default).

Is there any way to speed up extraction using tesseract OCR Engine, while tiff file is having 600-700 pages?

During processing of tiff files, which are having 600 - 700 pages from Tesseract OCR engine with hocr option, we monitored that files are taking around 40 - 50 minutes.
We monitored that it is so much time for processing large files.
Do we have any way to speed up the process?
Following command is using: -
<Drive>:\Tesseract-OCR>tesseract.exe "Source_Tiff_File" "Destination_File" hocr
You can split up the multi-page TIFF and run them in multiple processes.