Why doesn't Tesseract recognize a simple word?

I am experimenting with Tesseract, and it already failed on my second attempt.
Here is the image file:
The result is always an empty string. The code looks as follows:
from PIL import Image  # Image.open comes from Pillow
from pytesseract import image_to_string
image_file = Image.open('image.png')
print(image_to_string(image_file))
I also tried it directly from the terminal:
tesseract image.png out
Again, no success.
Is there something wrong with this image or am I doing something wrong?
I am using Ubuntu 14.04 with Tesseract installed with apt-get as well as pytesseract installed using pip.
Python version : 3.4

After applying a grayscale or monochrome filter, it produced "DDownload!".

In this document I found an interesting link to this advice, which should be helpful. Look at section "4 Prepare Images" on the advice page.
A more advanced OCR program would do this itself. No doubt Tesseract will improve.
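
The grayscale/monochrome step mentioned above could be sketched roughly as follows, assuming Pillow and pytesseract are installed. The helper names `to_monochrome` and `ocr_after_cleanup` and the default threshold of 128 are illustrative assumptions, not part of the original answer; a suitable threshold depends on the image.

```python
from PIL import Image


def to_monochrome(image, threshold=128):
    """Convert an image to grayscale, then to pure black and white."""
    gray = image.convert("L")  # 8-bit grayscale
    # Pixels at or above the threshold become white (255), the rest black (0).
    return gray.point(lambda value: 255 if value >= threshold else 0)


def ocr_after_cleanup(path, threshold=128):
    """Run Tesseract on a cleaned-up copy of the image at `path`."""
    import pytesseract  # imported lazily so to_monochrome works without it

    with Image.open(path) as image:
        return pytesseract.image_to_string(to_monochrome(image, threshold))
```

Calling `ocr_after_cleanup('image.png')` would then replace the direct `image_to_string` call from the question. Whether it recovers the text still depends on the image itself.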

Related

Preserve interword spaces in Pytesseract

I'm trying to get pytesseract to preserve interword spacing on an image. This is especially important in scanning poetry.
from PIL import Image
import pytesseract
img1 = Image.open(file)
custom_config = r'-c preserve_interword_spaces=1 --psm 4'
str4 = pytesseract.image_to_string(img1, config=custom_config)
I have also tried all types of psm configurations and other config options. I'm also using the most up-to-date version of pytesseract, which is 0.3.7.
This question has already been asked many times. Most notably here:
Preserving Spaces in Tesseract
However, the solution is not satisfactory. It is recommended to see the following page:
https://github.com/tesseract-ocr/tesseract/issues/781
But at that page they assert that the problem has been solved here
https://github.com/tesseract-ocr/tesseract/commit/e62e8f5f802c0d8f3dd67da993327cdafaee9763
But on that page it seems that you have to upgrade to tesseract 5.0 and I can't figure out how to do that on a mac, since brew install only installs tesseract 4.0.
I think if I could install tesseract 5.0 then that might solve the problem.
##################
UPDATE
Ok, I have confirmation on another site that I do have to upgrade to Tesseract 5.0. brew install does not enable that on a mac. So I guess I have to learn how to pull tesseract 5.0 straight from github which I'm not very good at doing.

You probably will have to clone the repository and build it.
https://github.com/tesseract-ocr/tesseract
https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos
Btw, preserve_interword_spaces works in Tesseract 4.1.1 also, if you can install that version.
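
Before rebuilding anything, it may help to check which Tesseract version is actually on the PATH. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the banner format (`tesseract 4.1.1` on the first line) is an assumption based on common builds.

```python
import re
import subprocess


def parse_tesseract_version(banner):
    """Extract (major, minor, patch) from `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_tesseract_version():
    """Ask the `tesseract` binary on PATH for its version, or return None."""
    try:
        result = subprocess.run(
            ["tesseract", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some older releases print to stderr
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    return parse_tesseract_version(result.stdout)
```

If `installed_tesseract_version()` returns a tuple, it can be compared directly, for example `version >= (4, 1, 1)`. A result of `None` means the binary was not found or the banner was not recognized.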

Tesseract auxiliary commands

I installed Tesseract and its basic functionality is fine. But when I try following this instruction on language file generation, tesseract-dependent commands like wordlist2dawg are "not found" by the shell.
Q: How do I install Tesseract with all these commands available? It's my understanding that they should work once I installed Tesseract, but it isn't the case. I installed Tesseract via port install tesseract, might be that I missed something.
Q2: How do I actually train Tesseract? I know it's an opaque topic; most results I get online are 3 years old at best, and it's difficult to figure out the exact training mechanism.

You'll need to build the training tools and then follow the instructions in the page.
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#building-the-training-tools
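
After building, a quick way to confirm that the training tools are visible to the shell is to look them up on the PATH. This is only a sketch: `wordlist2dawg` comes from the question, while the other names in the tuple are an illustrative subset of Tesseract's training tools, and `find_tools` is a hypothetical helper.

```python
import shutil

# wordlist2dawg is named in the question; the rest is an illustrative subset
# of the training tools and may differ between Tesseract versions.
TRAINING_TOOLS = (
    "wordlist2dawg",
    "unicharset_extractor",
    "mftraining",
    "cntraining",
    "combine_tessdata",
)


def find_tools(names=TRAINING_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(name, "->", path or "not found")
```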

running into issues training tesseract

I am new to tesseract and am a bit confused by the different directories on the github page.
The tesseract-ocr code base is what I installed. That installed a tessdata directory in /usr/local/share/tessdata/
So now while training tesseract I run the following command -
# tesseract img.tif img box.train
I get the following error
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Obviously it's not able to find the tessdata folder.
So I obtained the tessdata directory from github (https://github.com/tesseract-ocr/tessdata) and pointed TESSDATA_PREFIX to the downloaded tessdata. That does not fix it; now I get the following error -
Tesseract Open Source OCR Engine v3.03 with Leptonica
read_params_file: Can't open box.train
So my question is what should the tessdata be pointed to? Where does tesseract obtain the box.train from in the training command?

One of the most stupid things you can do as a novice is to try to train tesseract ;-)
Next: version 3.03 is not in the official github.com repo (btw: 3.03 was never officially released; it was just Ubuntu that made that release).
Next: if you installed tesseract correctly (from source), box.train is installed. If you installed from Ubuntu packages/repo (I do not think so, because in that case tesseract would not use /usr/local/...), then you should ask the packager how (s)he packaged tesseract.
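
To see which of the two files from the error messages is actually missing, one could check a candidate TESSDATA_PREFIX directly. This sketch assumes the 3.x layout implied by the error message, where the prefix is the parent of `tessdata`, and it further assumes that config files such as `box.train` live in `tessdata/configs`. The helper name `check_tessdata` is hypothetical.

```python
import os


def check_tessdata(prefix, lang="eng", config="box.train"):
    """Report whether the language data and the config file exist under prefix.

    Assumes <prefix>/tessdata/<lang>.traineddata and
    <prefix>/tessdata/configs/<config> (3.x-style layout, an assumption).
    """
    tessdata = os.path.join(prefix, "tessdata")
    return {
        "traineddata": os.path.isfile(
            os.path.join(tessdata, lang + ".traineddata")
        ),
        "config": os.path.isfile(os.path.join(tessdata, "configs", config)),
    }


if __name__ == "__main__":
    # /usr/local/share is the parent of the tessdata directory in the question.
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share")
    print(prefix, check_tessdata(prefix))
```

A result with `traineddata` true but `config` false would match the second error in the question, where the language loads but `box.train` cannot be opened.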

Ephesoft error with learning tiff documents that have been converted from PDF

I am using the Ephesoft Community edition on a Windows Server 2003 AWS instance. I am having issues with Ephesoft reading certain TIFF documents. I have about 100 different TIFF documents and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using ImageMagick from Ephesoft. We are using the following options with Ghostscript
-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH
with imagemagick we are doing the following command
-compress group4
When learning one of the tiff files that isn't working we are getting the following error in the log files
Drop Box Link to Stack Trace
And this is one of the Tiff document we are trying to have ephesoft learn
Drop Box Link to Tiff Document
Is there something that I can do with ghostscript, imagemagick or any other software to fix this; or do I need to modify ephesoft in some way?

I found the solution by doing some more research.
The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the creation of the hOCR file. When Tesseract creates the hOCR file, it resolves the value of Texas as Te>. The community edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.
The solution was to set a Tesseract property that blacklists the <> symbols so that Tesseract would not include them or resolve to them. My PDFs seem to be working correctly now and I am able to process them.
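
The answer does not say how the property was set inside Ephesoft. Outside Ephesoft, the same idea can be sketched with pytesseract and Tesseract's `tessedit_char_blacklist` parameter. The helper names below are hypothetical, and whether the blacklist is honored can depend on the Tesseract version and engine mode.

```python
def blacklist_config(chars):
    """Build a pytesseract `config` string that blacklists `chars`."""
    if any(ch.isspace() for ch in chars):
        raise ValueError("whitespace cannot be blacklisted this way")
    return "-c tessedit_char_blacklist=" + chars


def image_to_hocr(path, blacklist="<>"):
    """Return hOCR output (bytes) for the image at `path`, without `blacklist`."""
    import pytesseract  # imported lazily; requires a Tesseract install

    return pytesseract.image_to_pdf_or_hocr(
        path, extension="hocr", config=blacklist_config(blacklist)
    )
```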

QR Code generation in shell / mac terminal

I want to create QR codes for a project I'm working on in AppleScript; the resulting QR code will be placed in an InDesign document. I have found that there is a plugin for InDesign, but I suspect that it requires user interaction.
So I've been searching for how to generate the QR code using a shell command. I've found things related to PHP and Rails and even ColdFusion, but none of those fit the bill. I need to generate them using a shell command, so Image Events or Perl, basically anything I can run from the command line that comes with Mac OS.
thanks for your help.
Another thought: I wonder if I could call a URL using curl or something to get one?

For doing something similar, we use libqrencode.
It's a C library for generating QR codes, but it comes with a command line utility (called qrencode) which lets you generate QR codes from a string, e.g.:
./qrencode -o /tmp/foo.png "This is the input string"
It supports most options you'd probably want (e.g. error correction level, image size, etc.).
We've used it in production for a year or two, with no problems.
I've only run it on Linux systems, but there's no reason you shouldn't be able to compile it on Mac OS, assuming you have a compiler and build tools installed (and any libraries it depends on, of course).

As Riccardo Cossu mentioned, please use Homebrew:
brew install qrencode
qrencode -o so.png "http://stackoverflow.com"
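
If the call needs to be scripted rather than typed, the same command can be wrapped in a small helper. This is a sketch that assumes `qrencode` is installed and on the PATH; the function names are hypothetical, and `-l` (error correction level) and `-s` (module size in pixels) are the qrencode options for the settings mentioned above.

```python
import subprocess


def qrencode_command(text, output_path, level=None, module_size=None):
    """Build the argument list for a qrencode call."""
    command = ["qrencode", "-o", output_path]
    if level is not None:
        command += ["-l", level]  # error correction: L, M, Q or H
    if module_size is not None:
        command += ["-s", str(module_size)]  # size of one module in pixels
    command.append(text)
    return command


def make_qr(text, output_path, **options):
    """Run qrencode and raise CalledProcessError if it fails."""
    subprocess.run(qrencode_command(text, output_path, **options), check=True)
```

For example, `make_qr("http://stackoverflow.com", "so.png", level="H")` would run the same command as above with the highest error correction level.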
