Unable to train Tesseract in Ubuntu - tesseract

I am new to Tesseract and am trying to train it on Ubuntu, using JTessBoxEditor as the trainer. I have successfully generated the .tif and .box files, but I am unable to train with them: in the JTessBoxEditor Trainer, I don't know which Tesseract executable location to select for training.
Can someone please help? Thanks in advance.

You need to build the Tesseract training tools first. Their installation directory can then be found by running the command which tesseract.
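For anyone who prefers to do the lookup from a script, here is a minimal Python sketch of the same check. It assumes the tesseract binary is on PATH; the function name tesseract_location is made up for this example, and whether the trainer wants the file or its folder is something to confirm in JTessBoxEditor itself.

```python
import os
import shutil


def tesseract_location(name="tesseract"):
    """Return (executable_path, directory) for `name`, or (None, None) if it is not on PATH."""
    path = shutil.which(name)  # same lookup the shell's `which` performs
    if path is None:
        return None, None
    return path, os.path.dirname(path)


if __name__ == "__main__":
    exe, folder = tesseract_location()
    if exe is None:
        print("tesseract is not on PATH; build or install it first")
    else:
        print("executable:", exe)
        print("containing directory:", folder)
```

If the function returns (None, None), the tools are either not built yet or installed somewhere that is not on PATH.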

Related

Tesseract auxiliary commands

I installed Tesseract and its basic functionality is fine. But when I try to follow these instructions on language file generation, Tesseract-dependent commands like wordlist2dawg are "not found" by the shell.
Q: How do I install Tesseract with all these commands available? My understanding was that they should work once Tesseract is installed, but that isn't the case. I installed Tesseract via port install tesseract, so I might have missed something.
Q2: How do I actually train Tesseract? I know it's an opaque topic; most results I get online are 3 years old at best, and it's difficult to figure out the exact training mechanism.
You'll need to build the training tools and then follow the instructions on this page:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#building-the-training-tools
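To see which commands are still missing after a build, a small check like the one below may help. It is only a sketch: the tool names are the ones used in the Tesseract 3.x training workflow and may differ for other versions, and missing_tools is a helper name invented for this example.

```python
import shutil

# Tool names from the Tesseract 3.x training workflow; adjust for your version.
TRAINING_TOOLS = [
    "unicharset_extractor",
    "wordlist2dawg",
    "mftraining",
    "cntraining",
    "combine_tessdata",
]


def missing_tools(tools=TRAINING_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all listed training tools were found")
```

An empty result only means the executables are reachable; it does not verify that they match the installed tesseract version.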

running into issues training tesseract

I am new to tesseract and am a bit confused by the different directories on the github page.
The tesseract-ocr code base is what I installed. That installed a tessdata directory in /usr/local/share/tessdata/
So now, while training tesseract, I run the following command -
# tesseract img.tif img box.train
I get the following error
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Obviously it's not able to find the tessdata folder.
So I obtained the tessdata directory from github (https://github.com/tesseract-ocr/tessdata) and pointed TESSDATA_PREFIX to the downloaded tessdata. That did not help; now I get the following error -
Tesseract Open Source OCR Engine v3.03 with Leptonica
read_params_file: Can't open box.train
So my question is: what should TESSDATA_PREFIX point to, and where does tesseract obtain box.train from in the training command?
One of the most stupid things you can do as a novice is to try to train tesseract ;-)
Next: version 3.03 is not in the official github.com repo (btw: 3.03 was never officially released... it was only Ubuntu that made that release).
Next: if you installed tesseract correctly (from source), box.train is installed. If you installed from Ubuntu packages/repo (I do not think so, because in that case tesseract would not use /usr/local/...), then you should ask the packager how (s)he packaged tesseract.
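To narrow down which file is actually missing, a quick check along these lines may be useful. It is a sketch under two assumptions: that TESSDATA_PREFIX is the parent of the tessdata directory (as the 3.x error message above says), and that box.train is a config file under tessdata/configs, which is where a source install puts it. check_tessdata is a made-up helper name.

```python
import os


def check_tessdata(prefix, lang="eng"):
    """Report whether the files needed by the box.train run exist under `prefix`/tessdata."""
    tessdata = os.path.join(prefix, "tessdata")
    wanted = {
        "language data": os.path.join(tessdata, lang + ".traineddata"),
        # Assumption: the box.train config lives in tessdata/configs.
        "box.train config": os.path.join(tessdata, "configs", "box.train"),
    }
    return {label: (path, os.path.isfile(path)) for label, path in wanted.items()}


if __name__ == "__main__":
    # Assumption: fall back to /usr/local/share, the parent of the
    # tessdata directory mentioned in the question.
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share")
    for label, (path, found) in check_tessdata(prefix).items():
        print(("ok      " if found else "MISSING ") + label + ": " + path)
```

If the language data is found but the config is not, that would match the second error in the question (the language loads, but box.train cannot be opened).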

How to run text2image.cpp

I am working with tesseract and want to create a new training language for it. Can anyone tell me the specific steps for training a new language, and also how to run the text2image.cpp program? Thanks in advance.
To run text2image, first compile and link text2image.cpp using an appropriate C++ toolchain, then run the executable with the appropriate text file as input. Alternatively, you can download a Windows installer, which gives you an executable to use rather than the .cpp.
Instructions for building the tesseract toolchain are here, and instructions for training currently unsupported languages are here.
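Once the executable is built, a run might look roughly like the sketch below, which only assembles the command line and prints it. The file name, output base, font name and fonts folder are placeholders chosen for this example, and text2image_command is a hypothetical helper; check the options against the text2image build you actually have.

```python
import shlex


def text2image_command(text_file, output_base, font, fonts_dir):
    """Build the argument list for one text2image run."""
    return [
        "text2image",
        "--text=" + text_file,
        "--outputbase=" + output_base,
        "--font=" + font,
        "--fonts_dir=" + fonts_dir,
    ]


if __name__ == "__main__":
    # Hypothetical inputs: replace them with your own text, font and font folder.
    cmd = text2image_command(
        "training_text.txt", "eng.Arial.exp0", "Arial", "/usr/share/fonts"
    )
    # Print a copy-pasteable shell command instead of running it here.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command can be pasted into a shell, or the list can be handed to subprocess.run(cmd, check=True) once text2image is on PATH.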

building MATLAB jpeg toolbox with libjpeg8d

This question is related to another question I asked here:
Error reading image using jpeg_read from Matlab's jpeg toolbox
I've been trying to compile the jpeg toolbox under Windows 7 (using the commands Shai provided in the answer to the question I posted), but I get the following error:
jpeg_read.c(52) : fatal error C1083: Cannot open include file: 'jerror.h': No such file or directory
which I believe happened because I haven't built libjpeg. I tried to build libjpeg6b as jpegtoolbox's README says, but I couldn't find a clear guide on how to do it on Windows with Visual Studio 2010 (and libjpeg's install document doesn't help much), so I ended up building libjpeg8d.
My question is if there's any way to use libjpeg8d to compile the jpeg toolbox. I've tried running the command:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
with IJGPATH being my libjpeg8d installation folder and LIBJPEG being the path to the jpeg.lib file, inside IJGPATH/Release, but I still get the same missing jerror.h error as before.
Thanks in advance.
Jpeg toolbox contains Matlab routines for manipulating JPEG files. While Matlab's built-in IMREAD and IMWRITE functions provide basic conversion between JPEG files and image arrays, they do not provide access to the details of the JPEG image, such as the JPEG coefficients or the quantization tables.
The routines in this package provide additional functionality for directly accessing the contents of JPEG files from Matlab, including the Discrete Cosine Transform (DCT) coefficients, quantization tables, Huffman coding tables, color space information, and comment markers. The toolbox can be added to Matlab to use the functions.
First check whether the following are installed on your system.
1. Microsoft Windows SDK 7
2. Microsoft Visual C++ 2010 Express
If they are not installed, download and install them in the order specified.
Note: Before installing the Windows SDK, uninstall any redistributable packages of Visual C++ 2010. Otherwise the Windows SDK runs into problems during installation and the install fails. During installation, don't forget to check 'x64 Libraries' for a 64-bit OS and 'x86 Libraries' for a 32-bit OS, under Windows Headers and Libraries. This makes the compiler tools for your architecture available. Once that is done, install Visual C++. These tools are free and available on the Microsoft website. Both online and offline installers are available.
Now comes the real integration process.
Step 1: Download the jpeg toolbox and extract it to a separate folder (e.g. jpeg)
Download jpeg toolbox
Step 2: Download the jpeg source files and extract them to a separate folder (e.g. jpegsrc). In the jpeg-6b folder inside jpegsrc, rename jconfig.vc to jconfig.h and makefile.vc to makefile
Download jpegsrc file
Step 3: From Start -> All Programs -> Microsoft Windows SDK, open the command prompt. This opens the 'windows x64 debug environment'. Navigate to the jpeg-6b folder inside the jpegsrc folder extracted in step 2 and run the command 'nmake clean all' without quotes. This creates the libjpeg.lib file in the same folder.
Step 4: Now open MATLAB and set the working folder to jpeg, created in step 1.
Then run the following commands in the command window:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
mex -I<IJGPATH> jpeg_write.c <LIBJPEG>
Replace <IJGPATH> with the path to the IJG jpeg-6b directory created in step 2, and <LIBJPEG> with the full path to the IJG code library file (libjpeg.lib).
To use the jpeg_read and jpeg_write functions, copy the jpeg_read.mexw64 and jpeg_write.mexw64 files created above to your work directory. Don't uninstall the Visual C++ compiler or it won't work.
If you are trying to work in Windows, you need to rename jerror.vc to jerror.h
Also, when you mex the files, you need to edit jpeg_read.c and jpeg_write.c, changing #include <jerror.h> to #include "jerror.h"
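Since the C1083 error means the compiler could not find jerror.h in any include directory, it may be worth confirming that the folder passed to -I really contains the headers before running mex. The sketch below is written in Python rather than MATLAB purely as a quick pre-check; the header list is my assumption about what jpeg_read.c pulls in (jerror.h comes from the error message, the others are standard IJG headers), the helper names are invented, and the example paths are placeholders.

```python
import os

# Headers shipped with the IJG library; jerror.h is the one named in the error message.
REQUIRED_HEADERS = ["jpeglib.h", "jconfig.h", "jmorecfg.h", "jerror.h"]


def missing_headers(ijg_path, headers=REQUIRED_HEADERS):
    """Return the headers that are not present directly inside `ijg_path`."""
    return [h for h in headers if not os.path.isfile(os.path.join(ijg_path, h))]


def check_mex_inputs(ijg_path, lib_path):
    """Collect human-readable problems with the two paths passed to mex."""
    problems = ["missing header: " + h for h in missing_headers(ijg_path)]
    if not os.path.isfile(lib_path):
        problems.append("missing library: " + lib_path)
    return problems


if __name__ == "__main__":
    # Placeholder paths: substitute your own IJG folder and library file.
    for problem in check_mex_inputs(r"C:\jpegsrc\jpeg-6b", r"C:\jpegsrc\jpeg-6b\libjpeg.lib"):
        print(problem)
```

If jconfig.h shows up as missing, that would point back to the rename in step 2 not having been done in that folder.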

Tesseract and Tess4J

I have a question regarding the tesseract training.
I am currently using Tess4J in order to integrate tesseract within my java program.
According to the tesseract wiki page on training (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3), one can train tesseract using training images with various combinations and fonts.
Is it possible to have just a "normal" tesseract 3.02 installation (Windows or Unix) construct these lang.traineddata files, and afterwards simply include them in the tessdata folder used by my Tess4J wrapper from my Java program? Or is Tess4J limited to the included language data for English and the sample images that are bundled with the program?
If so, is it possible to include these in to my Tess4J build in some other way?
Since Tess4J is just a wrapper around the Tesseract OCR engine, it accepts any standard or custom traineddata files. You can find the standard traineddata at https://github.com/tesseract-ocr/tessdata.