Tesseract and Tess4J - tesseract

I have a question regarding the tesseract training.
I am currently using Tess4J in order to integrate tesseract within my java program.
According to the Tesseract wiki page on training (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3), one can train Tesseract using training images with various combinations and fonts.
Is it possible to use a "normal" Tesseract 3.02 installation (Windows or Unix) to build these lang.traineddata files, and afterwards just place them in the tessdata folder used by the Tess4J wrapper in my Java program? Or is Tess4J limited to the included English language data and the sample images bundled with the program?
If so, is it possible to include these in my Tess4J build in some other way?

Since Tess4J is just a wrapper around the Tesseract OCR engine, it accepts any traineddata file, whether standard or custom. You can find the standard traineddata files at https://github.com/tesseract-ocr/tessdata.
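As a rough sketch of the file layout only: a custom file built by a stand-alone Tesseract installation just has to sit in the tessdata folder that Tess4J is pointed at, and the language code is the file name without the .traineddata suffix. The helper below is hypothetical (its name and the paths are assumptions, not part of Tess4J) and is written in Python purely to illustrate the copy step.

```python
import shutil
from pathlib import Path


def install_traineddata(src_file, tessdata_dir):
    """Copy a custom .traineddata file into a tessdata folder.

    Returns the language code, i.e. the file name without the
    ".traineddata" suffix. Hypothetical helper for illustration.
    """
    src = Path(src_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create tessdata if missing
    shutil.copy2(src, dest_dir / src.name)
    return src.stem
```

On the Java side, the returned code would be the value given to Tess4J's setLanguage, with setDatapath pointing at the folder that holds the copied file.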

Related

Unable to train Tesseract in Ubuntu

I am new to Tesseract and am trying to train it on Ubuntu, using JTessBoxEditor as the trainer. I have successfully generated the .tif and .box files, but I am unable to train with them: in the JTessBoxEditor Trainer, I don't know which Tesseract executable location to select for the training data.
Can someone please help? Thanks in advance.
You need to build the Tesseract training tools first. You can then find their installation directory by running the which tesseract command.

Tesseract auxiliary commands

I installed Tesseract and its basic functionality is fine. But when I try following this instruction on language file generation, tesseract-dependent commands like wordlist2dawg are "not found" by the shell.
Q: How do I install Tesseract with all these commands available? It's my understanding that they should work once I installed Tesseract, but it isn't the case. I installed Tesseract via port install tesseract, might be that I missed something.
Q2: How do I actually train Tesseract? I know it's an opaque topic; most results I get online are 3 years old at best, and it's difficult to figure out the exact training mechanism.
You'll need to build the training tools and then follow the instructions on this page:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#building-the-training-tools
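Before following the training instructions, it can help to confirm which training tools the shell can actually see. The sketch below does the same lookup as the which command for a list of tool names. Only wordlist2dawg comes from the question; the other names in the default list are an assumption about what a typical training walkthrough calls, so adjust them to your instructions.

```python
import shutil

# Only wordlist2dawg is named in the question; the rest of this list is an
# assumption and should be adapted to the training instructions in use.
TRAINING_TOOLS = (
    "wordlist2dawg",
    "unicharset_extractor",
    "mftraining",
    "cntraining",
    "combine_tessdata",
)


def find_training_tools(tools=TRAINING_TOOLS, search_path=None):
    """Map each tool name to its full path, or None if it is not found.

    search_path defaults to the PATH environment variable.
    """
    return {name: shutil.which(name, path=search_path) for name in tools}


def missing_tools(tools=TRAINING_TOOLS, search_path=None):
    """Return the tool names that cannot be found on the search path."""
    found = find_training_tools(tools, search_path)
    return [name for name, location in found.items() if location is None]
```

If missing_tools() returns a non-empty list after a package install, the package most likely shipped only the OCR binary and the training tools still need to be built.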

Ephesoft error with learning tiff documents that have been converted from PDF

I am using the Ephesoft Community edition on Windows Server 2003 on an AWS instance. I am having issues with Ephesoft reading certain TIFF documents: I have about 100 different TIFF documents, and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using ImageMagick from Ephesoft. We use the following options with Ghostscript:
-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH
With ImageMagick we use the following option:
-compress group4
When learning one of the tiff files that isn't working we are getting the following error in the log files
Drop Box Link to Stack Trace
And this is one of the TIFF documents we are trying to have Ephesoft learn:
Drop Box Link to Tiff Document
Is there something that I can do with ghostscript, imagemagick or any other software to fix this; or do I need to modify ephesoft in some way?
I found the solution by doing some more research.
The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the creation of the hOCR file. When Tesseract creates the hOCR file, it resolves the value of Texas as Te>. The community edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.
The solution was to set a Tesseract property that blacklists the < and > symbols, so that Tesseract never outputs them. My PDFs seem to be working correctly now and I am able to process them.
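For readers running Tesseract directly rather than through Ephesoft, the same idea can be sketched as a command line. This assumes a Tesseract build whose CLI accepts -c name=value and ships the hocr config file; tessedit_char_blacklist is the Tesseract variable for characters that should never be recognized. The answer does not say which Ephesoft property was changed, so this is not that setting, only a stand-alone equivalent.

```python
def hocr_command(image_path, output_base, blacklist="<>"):
    """Build the argument list for a tesseract run that writes hOCR output
    while blacklisting the given characters (a sketch, see assumptions above).
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-c",
        f"tessedit_char_blacklist={blacklist}",
        "hocr",  # config file that switches the output format to hOCR
    ]
```

Passing this list to subprocess.run without shell=True keeps < and > from being interpreted as shell redirections, which is the main reason for building a list instead of a single command string.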

How to run text2image.cpp

I am working with Tesseract and want to create a new training language for it. Can anyone tell me the specific steps for training a new language, and also how to run the text2image.cpp program? Thanks in advance.
To run text2image, first compile and link text2image.cpp using an appropriate C++ toolchain, then run the executable with the appropriate text file as input. Alternatively, you can download a Windows installer, which gives you an executable to use rather than the .cpp file.
Instructions on building the tesseract tool chain are here and on how to train for currently unsupported languages is here.
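Once the executable exists, a run might look like the sketch below. It builds the argument list from the --text, --outputbase, --font and --fonts_dir options of text2image; the file names and the font in the usage note are placeholders, not values from the question.

```python
def text2image_command(text_file, output_base, font, fonts_dir=None):
    """Build the argument list for rendering a training text to an image.

    text2image writes its image and box files next to output_base.
    """
    args = [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
    ]
    if fonts_dir is not None:
        # Only needed when the font is not in the default font directory.
        args.append(f"--fonts_dir={fonts_dir}")
    return args
```

For example, text2image_command("training_text.txt", "mylang.Arial.exp0", "Arial") gives a list that can be handed to subprocess.run; the names here are hypothetical.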

building MATLAB jpeg toolbox with libjpeg8d

This question is related to another question I asked here:
Error reading image using jpeg_read from Matlab's jpeg toolbox
I've been trying to compile the jpeg toolbox under Windows 7 (using the commands Shai provided in the answer to the question I posted), but I get the following error:
jpeg_read.c(52) : fatal error C1083: Cannot open include file: 'jerror.h': No such file or directory
which I believe happened because I haven't built libjpeg. I tried to build libjpeg6b as the jpegtoolbox README says, but I couldn't find a clear guide on how to do it on Windows with Visual Studio 2010 (and libjpeg's install document doesn't help much), so I ended up building libjpeg8d.
My question is if there's any way to use libjpeg8d to compile the jpeg toolbox. I've tried running the command:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
with IJGPATH being my libjpeg8d installation folder and LIBJPEG being the path to the jpeg.lib file, inside IJGPATH/Release, but I still get the same missing jerror.h error as before.
Thanks in advance.
Jpeg toolbox contains Matlab routines for manipulating JPEG files. While Matlab's built-in IMREAD and IMWRITE functions provide basic conversion between JPEG files and image arrays, they do not provide access to the details of the JPEG image, such as the JPEG coefficients or the quantization tables.
The routines in this package provide additional functionality for directly accessing the contents of JPEG files from Matlab, including the Discrete Cosine Transform (DCT) coefficients, quantization tables, Huffman coding tables, color space information, and comment markers. The toolbox can be added to Matlab to use the functions.
First check whether the following are installed on your system:
1. Microsoft Windows SDK 7
2. Microsoft Visual C++ 2010 Express
If they are not installed, download and install them in the order specified.
Note: Before installing the Windows SDK, uninstall any redistributable packages of Visual C++ 2010; otherwise the SDK installation runs into problems and fails. During installation, don't forget to check 'x64 Libraries' for a 64-bit OS or 'x86 Libraries' for a 32-bit OS under Windows Headers and Libraries. This allows the compiler tools to be used for 64-bit operating systems. Once the SDK installation is finished, install Visual C++. These tools are free and available on the Microsoft website, with both online and offline installers.
Now comes the real integration process
Step 1: Download the jpeg toolbox and extract it to a separate folder (eg. jpeg)
Download jpeg toolbox
Step 2: Download the jpeg source files and extract them to a separate folder (e.g. jpegsrc). In the jpeg-6b folder inside jpegsrc, rename jconfig.vc to jconfig.h and makefile.vc to makefile.
Download jpegsrc file
Step 3: From Start -> All Programs -> Microsoft Windows SDK, open the command prompt. This opens the 'Windows x64 debug environment'. Navigate to the jpeg-6b folder inside the jpegsrc folder extracted in step 2 and run the command 'nmake clean all' without quotes. This creates the libjpeg.lib file in the same folder.
Step 4: Now open MATLAB and set the working folder to jpeg, the folder created in step 1.
Then run the following commands in the command window:
mex -I<IJGPATH> jpeg_read.c <LIBJPEG>
mex -I<IJGPATH> jpeg_write.c <LIBJPEG>
Replace <IJGPATH> with the path to the IJG jpeg-6b directory created in step 2, and
<LIBJPEG> with the full path to the IJG code library file (libjpeg.lib).
To use the jpeg_read and jpeg_write functions, copy the jpeg_read.mexw64 and jpeg_write.mexw64 files created above to your work directory. Don't uninstall the Visual C++ compiler, or it won't work.
If you are working on Windows, you need to rename jerror.vc to jerror.h.
Also, before you mex the files, you need to edit jpeg_read.c and jpeg_write.c, changing #include <jerror.h> to #include "jerror.h".
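The C1083 error in the question means the compiler did not find jerror.h in any include directory, so before running mex it can help to confirm that the folder passed with -I really contains the headers. The check below is a small sketch in Python. The header list is an assumption: jerror.h comes from the error message, jconfig.h is the file created by the rename in step 2, and jpeglib.h is libjpeg's main header.

```python
from pathlib import Path

# Assumed list of headers the toolbox sources need; extend it if the
# compiler reports another missing include file.
REQUIRED_HEADERS = ("jerror.h", "jpeglib.h", "jconfig.h")


def missing_headers(include_dir, required=REQUIRED_HEADERS):
    """Return the required header names that are absent from include_dir."""
    root = Path(include_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty result for the directory used as IJGPATH suggests the include path itself is fine and the problem lies elsewhere, for example in how the path is quoted on the mex command line.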