Where do Tesseract OCR files save to?

I have a question that might be obvious, but I can’t find the answer in the FAQ or forums. I am a new user running Tesseract OCR on macOS Mojave. I run the following command and see the following in the terminal:
tesseract /users/SamBell/Desktop/handwriting-test2.jpg out
Warning: Invalid Resolution 0 dpi. Using 70 instead.
Estimating resolution as 553
My question is: where is the output file saved?

Tesseract saves out.txt in the current working directory, that is, the directory the Terminal was in when you ran the command. Run pwd in the Terminal to display the path to that directory.
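As a rough illustration of that rule, the sketch below works out where the text file should land for a given output base. It assumes Tesseract's default plain-text output, which appends .txt to the output base. The helper name expected_output_path is made up for this example, and it only does path arithmetic without running Tesseract.

```python
from pathlib import Path


def expected_output_path(output_base, cwd=None):
    """Return the path where `tesseract <image> <output_base>` should write its text.

    Hypothetical helper for illustration: it never runs Tesseract.
    """
    base_dir = Path(cwd) if cwd is not None else Path.cwd()
    # A relative output base is resolved against the working directory;
    # an absolute one is used as given (Path's "/" operator handles both cases).
    resolved = base_dir / output_base
    # Plain-text output gets ".txt" appended to the output base.
    return Path(str(resolved) + ".txt")


if __name__ == "__main__":
    # With the command from the question, the output base is "out".
    print(expected_output_path("out"))
```

Passing an absolute output base (for example a path on the Desktop) is one way to avoid having to guess the working directory.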

Related

How to run system commands in Octave for OPM flow reservoir simulator?

I do research in oil simulation. I normally use a simulator called Eclipse from Schlumberger, and I was able to run it from my Matlab scripts using the following command.
% file name 'ICFM.DATA';
system(['eclrun', ' eclipse ', 'C:Path\ICFM.DATA']); % Command to run ECLIPSE
I have now installed a new free simulator (OPM.org) on Linux and I am using Octave for programming, but I cannot work out how to run the simulator from Octave.
The simulator can be run simply by writing
flow ICFM.DATA
and the results can be retrieved with the command
ecl_summary ICFM.DATA
I want to be able to run the simulator and get the results from within Octave, but I have not been able to do so as I could in Matlab.
Any suggestions?
Assuming both the flow and ecl_summary commands are on your system path (i.e. the "linux" path, not in octave), then it should simply be a matter of:
system('flow /my/path/to/ICFM.DATA');
system('ecl_summary /my/path/to/ICFM.DATA');
(Replace /my/path/to with whatever path your data file is in.)
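The precondition in this answer (both tools being on the system path) can be checked from a script before anything is launched. The sketch below uses Python rather than Octave, purely as an illustration of the idea. The function names are invented for this example, and it assumes that flow and ecl_summary each take the data file as their only argument, as in the question.

```python
import shutil
import subprocess


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the system PATH."""
    return [name for name in names if shutil.which(name) is None]


def build_commands(data_file):
    """Build the argument lists for the simulation run and the summary step."""
    return [["flow", data_file], ["ecl_summary", data_file]]


def run_simulation(data_file):
    """Run both tools in order, stopping at the first one that fails."""
    missing = missing_tools(["flow", "ecl_summary"])
    if missing:
        raise FileNotFoundError("not on PATH: " + ", ".join(missing))
    for command in build_commands(data_file):
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(command, check=True)
```

If missing_tools reports either name, the system(...) calls above would most likely fail for the same reason.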
I found that I am able to run the simulation using the syntax
unix('flow ICFM.DATA')
This is in Matlab R2017b on Ubuntu 16.04.
Initially I got an error, with the output reporting:
....'GLIBCXX_3.4.21' not found
The whole problem was due to a Matlab issue, which I solved using the answer at
https://askubuntu.com/questions/719028/version-glibcxx-3-4-21-not-found
which was to type:
LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libstdc++.so.6" matlab
in the terminal.

Unable to train Tesseract in Ubuntu

I am new to Tesseract and am trying to train it on Ubuntu, using JTessBoxEditor as the trainer. I have successfully generated the .tif and .box files, but I am unable to train on them: in the JTessBoxEditor Trainer, I don't know which Tesseract executable location to select for training.
Can someone please help? Thanks in advance.
You'd need to build the Tesseract training tools first. You can then find their installation directory by running the which tesseract command.
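For reference, the same lookup can be done from a script. This is a minimal sketch, assuming the training tools are installed next to the tesseract executable; the helper name executable_dir is invented for this example.

```python
import shutil
from pathlib import Path


def executable_dir(name, search_path=None):
    """Return the directory that holds `name`, or None if it is not found.

    `search_path` is an optional PATH-style string; by default the real PATH is used.
    """
    found = shutil.which(name, path=search_path)
    return Path(found).parent if found else None


if __name__ == "__main__":
    # Prints the directory to select in the trainer, or None if tesseract is not on PATH.
    print(executable_dir("tesseract"))
```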

running into issues training tesseract

I am new to tesseract and am a bit confused by the different directories on the github page.
The tesseract-ocr code base is what I installed. That installed a tessdata directory in /usr/local/share/tessdata/
So now, while training tesseract, I run the following command -
# tesseract img.tif img box.train
I get the following error
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Obviously it's not able to find the tessdata folder.
So I obtained the tessdata directory from github (https://github.com/tesseract-ocr/tessdata) and pointed TESSDATA_PREFIX to the downloaded tessdata. That does not help. I now get the following error -
Tesseract Open Source OCR Engine v3.03 with Leptonica
read_params_file: Can't open box.train
So my question is: what should the tessdata path point to? Where does tesseract obtain box.train from in the training command?
One of the most stupid things you can do as a novice is to try to train tesseract ;-)
Next: version 3.03 is not in the official github.com repo (btw: 3.03 was never officially released... it was only Ubuntu that made that release.)
Next: if you installed tesseract correctly (from source), box.train is installed. If you installed from Ubuntu packages/repo (I do not think so, because in that case tesseract would not use /usr/local/... ), then you should ask the packager how (s)he packaged tesseract.
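To see which piece is missing, a small check along the following lines may help. It is only a sketch under two assumptions: that TESSDATA_PREFIX is the parent of the tessdata directory, as the error message above says, and that config files named on the command line (such as box.train) live in a configs subdirectory of tessdata, which is how the Tesseract source tree lays them out. The function name check_tessdata is made up for this example.

```python
from pathlib import Path


def check_tessdata(prefix, lang="eng", config="box.train"):
    """Report whether the files a `box.train` run needs exist under `prefix`.

    Assumes `prefix` is the parent directory of "tessdata".
    """
    tessdata = Path(prefix) / "tessdata"
    wanted = {
        "traineddata": tessdata / (lang + ".traineddata"),
        # Assumed location of config files named on the command line.
        "config": tessdata / "configs" / config,
    }
    return {label: path.is_file() for label, path in wanted.items()}


if __name__ == "__main__":
    # /usr/local/share is the parent of the tessdata directory from the question.
    print(check_tessdata("/usr/local/share"))
```

A result with "traineddata" missing would match the first error in the question, and one with "config" missing would match the second.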

Error in start up MATLAB 2014a (32bit)

After installing the program, I get a fatal error when starting MATLAB, as follows:
Failed to start the Java Virtual machine - JNI error: -3
I want to know where the error comes from and how to solve it. Please help me, thanks!
PS: My OS is Windows 7 (32-bit), and JDK 1.8 (32-bit) is installed successfully (PATH and CLASSPATH are set correctly).
When you launch the application, does that error display in a dialog box, or within the MATLAB console? Does MATLAB launch if you use the '-nojvm' flag (in your shortcut path or from within a Command Prompt window)? If not, does it launch if you use a '-nodisplay' flag?
Definitely make sure you don't have a 'matlab_java' environment variable set, and if you are able to launch MATLAB, you should run the following commands:
>> rehash toolboxcache;
>> restoredefaultpath;
>> savepath;
After running the commands, restart MATLAB. When MATLAB launches, do you still see the error?
If you aren't able to get MATLAB to run, can you look within your %temp% directory for an error log? I don't recall the specific naming convention, but you should be able to find it based on its timestamp.
Hope this helps, and if it doesn't, try calling the MathWorks Installation and Licensing team. As long as you have a real license, they'll help you.
Disclaimer: I'm a former member of the recommended team above, but have no current affiliation with MathWorks.

Ephesoft error with learning tiff documents that have been converted from PDF

I am using the Ephesoft Community edition on Windows Server 2003 on an AWS instance. I am having issues with Ephesoft reading certain TIFF documents. I have about 100 different TIFF documents and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using ImageMagick from Ephesoft. We are using the following options with Ghostscript
-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH
and with ImageMagick we are using the following option
-compress group4
When learning one of the TIFF files that isn't working, we get the following error in the log files
Drop Box Link to Stack Trace
And this is one of the TIFF documents we are trying to have Ephesoft learn
Drop Box Link to Tiff Document
Is there something that I can do with Ghostscript, ImageMagick or any other software to fix this, or do I need to modify Ephesoft in some way?
I found the solution by doing some more research.
The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the creation of the hOCR file. When Tesseract creates the hOCR file, it resolves the value of Texas as Te>. The community edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.
The solution was to set a Tesseract property that blacklists the < and > symbols, so that Tesseract does not include them or resolve anything to them. My PDFs seem to be working correctly now and I am able to process them.
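For anyone calling Tesseract directly rather than through Ephesoft's own configuration, the blacklist can be sketched as below. This assumes a Tesseract build that honors the tessedit_char_blacklist variable and accepts it through the -c option. The function name and file names are placeholders, and this is not necessarily how Ephesoft passes the setting.

```python
def build_hocr_command(image, output_base, blacklist="<>"):
    """Build an argument list asking Tesseract for hOCR output without `blacklist` characters."""
    return [
        "tesseract",
        image,
        output_base,
        # -c sets a single config variable for this run.
        "-c",
        "tessedit_char_blacklist=" + blacklist,
        # The trailing "hocr" config selects hOCR output.
        "hocr",
    ]


if __name__ == "__main__":
    # Printed rather than executed, so the sketch runs without Tesseract installed.
    print(build_hocr_command("page.tif", "page"))
```

Keeping the command as an argument list (for example for subprocess.run) rather than a single shell string avoids the shell treating < and > as redirections.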