What to put in txt file when training tesseract - tesseract

I am trying to train tesseract to recognize a new font and was wondering how many words I would need to put into the txt file to increase its accuracy. For example, fo I put in just a couple of words but with different combinations or up to 10,000+ words?

See Prepare a text file section of Training Tesseract 3.03–3.05 page.

Related

Training tesseract 4 with images instead of font

I have some questions about making tiff/box files for tesseract 4.
In TrainingTesseract 4.00 document written:
Making Box Files As with base Tesseract, there is a choice between
rendering synthetic training data from fonts, or labeling some
pre-existing images (like ancient manuscripts for example).
But it did not explain how to train with pre-existing images.
I want to train for the Persian language in tesseract 4 (lstm). I have some images from ancient manuscripts and want to train with images and texts instead of font. So I can’t use text2image command. I know that the old format box files will not work for LSTM training.
How can I make tif/box for tessearct 4 lstm then label them and
how to change tesseract commands?
Should I use other tools for generating box files (Given that Persian
language is right to left )?
Should I use fine tuning or train from Scratch?
I was struggling just like you, until I found this github repository:
https://github.com/OCR-D/ocrd-train
It will make your life super easy. All you need to do is to put your images in tif format and your text should have the same image name with extension .gt.txt. It will take care of all the rest for you. (you might need to update the Makefile according to your local machine)
Whether to train from scratch or fine-tune depends on your own language, data and the problem you are trying to solve. For me the fine tunining is what I need cause I am happy with the current performance but need to add upon it.
All the useful details you might need can be found in this answer
1) Use below command to make lstmbox:
tesseract test.tif test-lstmbox -l eng --psm 6 lstmbox
It will make a lstmbox for you but you have to correct the character in box file.
2) You require enough data for training from Scratch So I suggest fine tuning is better option.

What is the meaning of the fifth column in tesseract box files?

During Tesseract box file training, I found the need to write a script to shift some of the boxes. I opened a box file to determine which column corresponds to X/Y/W/H, and discovered a fifth column. The Tesseract wiki doesn't offer any explanations, and the example given in the "Make Box Files" section only contains zeros in the fifth column. My trained file contains other symbols. For example, these are some of the symbols I found: [":,}'4.*<&\;\|]. What do these mean?
You probably meant the sixth or last column, which represents the page number (see Training wiki). It sounds like your box file was not correctly generated.
If I remember correctly, the fifth column is for a whitelist of characters. That way you can specify digits-only for one region, while another is for text.
Tesseract will recognize only symbols from the whitelist for a given region.

Tesseract mixing up "1" and "7" despite training on exact font

I am using tesseract to get text from an image, I am only interested in numbers. I have trained tesseract and created a new language that is the exact font in the image and the training data only included numbers. In the training data I also included every possible value that would be in an image, 1-5000 to be specific and also created a wordlist of these same values. However it still mixes up 1 and 7, as well as sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some processing on the image before giving it to tesseract?
Make sure there are at least 20 instances of every character in the training texts you provide to tesseract. I give at least 6 pages of the same font to have a decent training sample size.
2.Tesseract Text Recognition also depends on the image quality. Check out possible preprocessing algorithms you can use: Improve Quality of Tesseract
Take a look at the number_dawg file. Modifying it can help recognising digits.

Tesseract training with multipage tiff

How does the box file need to look like if I use a multipage tiff to train Tesseract?
More precisely: how do the Y-coordinates of a box file correspond to Y-coordinates within pages?
The last, 6th column in the box file represents zero-based page number.
https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files
Update:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
Each font should be put in a single multi-page tiff and the box file
can be modified to specify the page number for each character after
the coordinates. Thus an arbitrarily large amount of training data may
be created for any given font, allowing training for large
character-set languages.
Even if you can have as large training text as you want, it could potentially result in unnecessarily large image and hence slow down training.

How can I read a text file of image intensity values and convert to a cv::Mat?

I am working on a project that requires reading intensity values of several images from a text file that has 3 lines of file header, followed by each image. Each image consists of 15 lines of header followed by the intensity values that are arranged in 48 rows, where each row has 144 tab-delimited pixel values.
I have already created a .mat file to read these into Matlab and create a structure array for each image. I'd like to use OpenCV to track features in the image sequence.
Would it make more sense to create a .cpp file that will read the text file or use OpenCV and Matlab mex files in order to accomplish my goal?
I'd recommend writing C++ code to read the file directly independent of Matlab. This way, you don't have to mess with row major vs. column major ordering and all that jazz. Also, are there specs for the image format? If it turns out to be a reasonably common format, you may be able to find an off-the-self reader/library for it.
If you want the visualization and/or other image processing capabilities of Matlab, then mex file might be a reasonable approach.