Running into issues training Tesseract

I am new to Tesseract and am a bit confused by the different directories on the GitHub page.
The tesseract-ocr code base is what I installed. That installed a tessdata directory in /usr/local/share/tessdata/
So now while training tesseract I run the following command -
# tesseract img.tif img box.train
I get the following error
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Obviously it's not able to find the tessdata folder.
So I obtained the tessdata directory from GitHub (https://github.com/tesseract-ocr/tessdata) and pointed TESSDATA_PREFIX to the downloaded tessdata. That does not fix the problem; now I get the following error -
Tesseract Open Source OCR Engine v3.03 with Leptonica
read_params_file: Can't open box.train
So my question is what should the tessdata be pointed to? Where does tesseract obtain the box.train from in the training command?

One of the most stupid things you can do as a novice is to try to train Tesseract ;-)
Next: version 3.03 is not in the official github.com repo (btw: 3.03 was never officially released... it was just Ubuntu that made that release.)
Next: if you installed Tesseract correctly (from source), box.train is installed. If you installed from the Ubuntu packages/repo (I do not think so, because in that case Tesseract would not use /usr/local/...), then you should ask the packager how (s)he packaged Tesseract.
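To make the lookup concrete, here is a rough sketch of a check you could run before the training command. It assumes the Tesseract 3.x layout, where the engine reads <prefix>/tessdata/<lang>.traineddata and resolves a config name such as box.train against <prefix>/tessdata/configs/. The function name check_tessdata is made up for this example; it is not part of Tesseract.

```shell
# check_tessdata PREFIX [LANG]
# Hypothetical helper: reports whether PREFIX (the value you would give
# TESSDATA_PREFIX on 3.x) contains the files a box.train run needs.
check_tessdata() {
    prefix="${1%/}"      # drop one trailing slash, if any
    lang="${2:-eng}"     # default language is eng
    status=0
    for f in "tessdata/$lang.traineddata" "tessdata/configs/box.train"; do
        if [ -f "$prefix/$f" ]; then
            echo "found:   $prefix/$f"
        else
            echo "missing: $prefix/$f"
            status=1
        fi
    done
    return "$status"
}

# Example (not run here):
#   check_tessdata /usr/local/share eng
```

If box.train is reported as missing, the directory you pointed TESSDATA_PREFIX at probably holds only the language files and not the configs folder that a source install puts next to them.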

Related

Where do tesseract OCR files save to?

I have a question which might be obvious, but I can’t find it in the FAQ or forums. I am using Tesseract OCR on Mac Mojave. I am a new user. Using the following command and seeing the following in the terminal:
tesseract /users/SamBell/Desktop/handwriting-test2.jpg out
Warning: Invalid Resolution 0 dpi. Using 70 instead.
Estimating resolution as 553
My question is - where is the output file saving to?
out.txt is saved in the current directory, i.e. the directory you ran the command from. Run the pwd command in Terminal to display its path.
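As a small illustration of that rule, the sketch below works out where the text file should land for a given output base. It assumes the default text output, where tesseract appends .txt to the output base; ocr_output_path is a hypothetical helper name, not a Tesseract command.

```shell
# ocr_output_path BASE
# Prints the path of the .txt file that `tesseract IMAGE BASE` is expected
# to write (default text output assumed).
ocr_output_path() {
    case "$1" in
        /*) printf '%s.txt\n' "$1" ;;            # absolute base: used as given
        *)  printf '%s/%s.txt\n' "$PWD" "$1" ;;  # relative base: resolved against the current directory
    esac
}

# Example (not run here):
#   ocr_output_path out
#   tesseract /users/SamBell/Desktop/handwriting-test2.jpg /users/SamBell/Desktop/out
```

Passing an absolute output base, as in the last comment line, is one way to make the file land in a known place regardless of where the command is run.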

I can't run VLFeat on matlab [duplicate]

I downloaded the VLFeat lib from its git repository! I followed the instruction in the installation page. But when I ran the vl_setup command I got this warning:
Warning: Name is nonexistent or not a directory: ..\Adv. 3D
Computer Vision\vlfeat\toolbox\mex\mexw32
I followed some steps from the MathWorks website, but they did not solve the problem. I traced through the vl_setup.m file, and according to the warning it cannot find the mexw32 folder, but there was no such folder when I downloaded the library.
I'm using Windows 7, Matlab 2013a
Did you compile the mex files first through vl_compile.m? Once you compile the code, the mex directory should appear with the MEX files associated with your OS. Those setup instructions assume you have the binary distribution but you downloaded the source from github.
Consult VLFeat's compilation instructions for Windows here: http://www.vlfeat.org/compiling-windows.html

Unable to train Tesseract in Ubuntu

I am new to Tesseract and am trying to train it on Ubuntu using jTessBoxEditor as the trainer. I have successfully generated the .tif and .box files, but I am unable to train with them: in the jTessBoxEditor Trainer, I don't know which Tesseract executable location I need to select for training.
Can someone please help here? Thanks in advance.
You'd need to build the Tesseract training tools first. You can find their installation directory by running the command which tesseract.
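One way to see what is installed, sketched under the assumption that the 3.x training programs (unicharset_extractor, mftraining, cntraining, combine_tessdata) are what the trainer needs next to tesseract itself. report_tools is a made-up helper name.

```shell
# report_tools NAME...
# Hypothetical helper: prints where each named program is found on PATH
# and returns non-zero if any of them is missing.
report_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "$tool -> $path"
        else
            echo "$tool -> not found on PATH"
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here):
#   report_tools tesseract unicharset_extractor mftraining cntraining combine_tessdata
```

The directory printed for tesseract is the one to try as the executable location in the trainer; if the other programs are reported as not found, the training tools have probably not been built or installed yet.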

building MATLAB jpeg toolbox with libjpeg8d

This question is related to another question I asked here:
Error reading image using jpeg_read from Matlab's jpeg toolbox
I've been trying to compile the jpeg toolbox under Windows 7 (using the commands Shai provided in the answer to the question I posted), but I get the following error:
jpeg_read.c(52) : fatal error C1083: Cannot open include file: 'jerror.h': No such file or directory
which I believe happened because I haven't built libjpeg. I tried to build libjpeg6b like jpegtoolbox's README says, but I couldn't find a clear guide on how to do it on Windows with visual studio 2010 (and libjpeg's install document doesn't help much), so I ended building libjpeg8d.
My question is if there's any way to use libjpeg8d to compile the jpeg toolbox. I've tried running the command:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
with IJGPATH being my libjpeg8d installation folder and LIBJPEG being the path to the jpeg.lib file, inside IJGPATH/Release, but I still get the same missing jerror.h error as before.
Thanks in advance.
Jpeg toolbox contains Matlab routines for manipulating JPEG files. While Matlab's built-in IMREAD and IMWRITE functions provide basic conversion between JPEG files and image arrays, they do not provide access to the details of the JPEG image, such as the JPEG coefficients or the quantization tables.
The routines in this package provide additional functionality for directly accessing the contents of JPEG files from Matlab, including the Discrete Cosine Transform (DCT) coefficients, quantization tables, Huffman coding tables, color space information, and comment markers. The toolbox can be added to Matlab to use the functions.
First check whether the following are installed in your system.
1. Microsoft Windows SDK 7
2. Microsoft Visual C++ 2010 Express
If they are not installed, download and install them in the order specified.
Note: Before installing the Windows SDK, uninstall any redistributable packages of Visual C++ 2010. Otherwise the Windows SDK runs into problems and the installation fails. During installation, don't forget to check 'x64 Libraries' for a 64-bit OS and 'x86 Libraries' for a 32-bit OS, under Windows Headers and Libraries. This lets you use the compiler tools for 64-bit operating systems. Once that finishes, install Visual C++. These tools are free and available on the Microsoft website. Both online and offline installers are available.
Now comes the real integration process
Step 1: Download the jpeg toolbox and extract it to a separate folder (eg. jpeg)
Download jpeg toolbox
Step 2: Download jpeg source files and extract it to a separate folder (eg.jpegsrc). In the folder jpeg-6b which is in jpegsrc, rename jconfig.vc to jconfig.h and makefile.vc to makefile
Download jpegsrc file
Step 3: From Start -> All Programs -> Microsoft Windows SDK, open the command prompt. This opens the 'Windows x64 debug environment'. Navigate to the jpeg-6b folder inside the jpegsrc folder extracted in step 2. Run the command 'nmake clean all' without quotes. This creates the libjpeg.lib file in the same folder.
Step 4: Now open MATLAB and set the working folder to jpeg, created in step 1.
Then run the following commands in the Command Window:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
mex -I<IJGPATH> jpeg_write.c <LIBJPEG>
Replace <IJGPATH> with the path to the IJG jpeg-6b directory created in step 2, and <LIBJPEG> with the full path to the IJG code library file (libjpeg.lib).
To use the jpeg_read and jpeg_write functions, copy the jpeg_read.mexw64 and jpeg_write.mexw64 files created above to your work directory. Don't uninstall the Visual C++ compiler, or it won't work.
If you are trying to work in Windows, you need to rename jerror.vc to jerror.h
Also, when you mex the files, you need to edit jpeg_read.c and jpeg_write.c, changing #include <jerror.h> to #include "jerror.h"

Tesseract and Tess4J

I have a question regarding the tesseract training.
I am currently using Tess4J in order to integrate tesseract within my java program.
Reading on the tesseract wiki page on tesseract training (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3), one is able to train tesseract using training images with various combinations and fonts.
Is it possible for me to have just a "normal" Tesseract 3.02 (Windows or Unix) installation construct these lang.traineddata files, and afterwards just include them in the tessdata folder used by my Tess4J wrapper from my Java program? Or is Tess4J limited to the included language data for English and the sample images that are bundled with the program?
If so, is it possible to include these into my Tess4J build in some other way?
As it is just a wrapper of Tesseract OCR engine, it accepts any standard issue or custom traineddata files. You can find the standard traineddata at https://github.com/tesseract-ocr/tessdata.
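A minimal sketch of the "just include it" step, assuming your Tess4J code reads language data from a tessdata folder you control. install_traineddata and the paths in the example are hypothetical; the only convention relied on is that the file is named <lang>.traineddata.

```shell
# install_traineddata FILE DEST_DIR
# Hypothetical helper: copies a custom <lang>.traineddata file into the
# tessdata folder used by the application and prints the language code.
install_traineddata() {
    src="$1"
    dest_dir="$2"
    case "$src" in
        *.traineddata) ;;   # expected file name pattern
        *) echo "not a .traineddata file: $src" >&2; return 1 ;;
    esac
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/" || return 1
    # The language code is the file name without the .traineddata suffix.
    basename "$src" .traineddata
}

# Example (not run here, paths are placeholders):
#   install_traineddata ./mylang.traineddata ./myapp/tessdata
```

The printed name (mylang in the example) is the language code you would then select in Tess4J in place of eng.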