Preserve interword spaces in Pytesseract - tesseract

I'm trying to get pytesseract to preserve interword spacing on an image. This is especially important in scanning poetry.
from PIL import Image
import pytesseract
img1 = Image.open(file)
custom_config = r'-c preserve_interword_spaces=1 --psm 4'
str4 = pytesseract.image_to_string(img1, config=custom_config)
I have also tried all types of psm configurations and other config options. I'm also using the most up-to-date version of pytesseract, which is 0.3.7.
This question has already been asked many times. Most notably here:
Preserving Spaces in Tesseract
However, the solution is not satisfactory. It is recommended to see the following page:
https://github.com/tesseract-ocr/tesseract/issues/781
But at that page they assert that the problem has been solved here
https://github.com/tesseract-ocr/tesseract/commit/e62e8f5f802c0d8f3dd67da993327cdafaee9763
But on that page it seems that you have to upgrade to tesseract 5.0 and I can't figure out how to do that on a mac, since brew install only installs tesseract 4.0.
I think if I could install tesseract 5.0 then that might solve the problem.
##################
UPDATE
Ok, I have confirmation on another site that I do have to upgrade to Tesseract 5.0. brew install does not enable that on a mac. So I guess I have to learn how to pull tesseract 5.0 straight from github which I'm not very good at doing.

You probably will have to clone the repository and build it.
https://github.com/tesseract-ocr/tesseract
https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos
Btw, preserve_interword_spaces works in Tesseract 4.1.1 also, if you can install that version.
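Before building from source, it may help to confirm which Tesseract binary pytesseract will actually call. The following is a minimal sketch, not an official recipe: the helper names parse_tesseract_version and installed_tesseract_version are made up for illustration, and the 4.1.1 threshold comes from the remark above rather than from the Tesseract changelog.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(banner):
    """Extract (major, minor, patch) from the text printed by `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def installed_tesseract_version():
    """Return the version of the `tesseract` binary on PATH, or None if it is absent."""
    if shutil.which("tesseract") is None:
        return None
    # Some builds print the banner on stderr, so merge both streams.
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    version = installed_tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    elif version >= (4, 1, 1):
        # Threshold assumed from the answer above, not verified independently.
        print("Found", version, "- try: -c preserve_interword_spaces=1 --psm 4")
    else:
        print("Found", version, "- older than 4.1.1, consider upgrading")
```

If the reported version is older than expected, the binary on PATH is probably not the one that was just built or installed.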

Related

Why doesn't Tesseract recognize a simple word?

I am experimenting with Tesseract and failed already on the second attempt.
Here is the image file:
The result is always an empty string. The code looks as follows:
from PIL import Image
from pytesseract import image_to_string
image_file = Image.open('image.png')
print(image_to_string(image_file))
I tried also directly from terminal
tesseract image.png out
again with no success.
Is there something wrong with this image or am I doing something wrong?
I am using Ubuntu 14.04 with Tesseract installed with apt-get as well as pytesseract installed using pip.
Python version : 3.4
After applying a grayscale or monochrome filter, it produced "DDownload!".
In this document I found an interesting link to some advice that should be helpful. Look at section "4 Prepare Images" on the advice page.
A more advanced OCR program would do this itself. No doubt Tesseract
will improve.
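To make the grayscale and monochrome step concrete, here is a rough sketch of what such a filter computes, written in plain Python so it runs without any imaging library. The pixel values and the threshold of 128 are invented for illustration and say nothing about the image in the question. The weights are the ITU-R 601-2 luma weights that Pillow documents for convert("L"); Pillow's own rounding may differ slightly.

```python
def to_gray(pixel):
    """Convert one (R, G, B) pixel to a 0-255 gray level using ITU-R 601-2 luma weights."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def binarize(rows, threshold=128):
    """Map every pixel to 0 (black) or 255 (white) depending on the threshold."""
    return [[255 if to_gray(p) >= threshold else 0 for p in row] for row in rows]


# A tiny 2x3 "image": near-white pixels mixed with saturated blue ones (made-up values).
image = [
    [(30, 60, 200), (250, 250, 250), (30, 60, 200)],
    [(250, 250, 250), (30, 60, 200), (250, 250, 250)],
]
print(binarize(image))
```

In practice the same two steps are usually done with Pillow, using Image.convert("L") for the grayscale pass and Image.point for the threshold, before handing the result to image_to_string.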

Tesseract auxiliary commands

I installed Tesseract and its basic functionality is fine. But when I try following this instruction on language file generation, tesseract-dependent commands like wordlist2dawg are "not found" by the shell.
Q: How do I install Tesseract with all these commands available? It's my understanding that they should work once I installed Tesseract, but it isn't the case. I installed Tesseract via port install tesseract, might be that I missed something.
Q2: How do I actually train Tesseract? I know it's an opaque topic; most results I get online are 3 years old at best, and it's difficult to figure out the exact training mechanism.
You'll need to build the training tools and then follow the instructions in the page.
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#building-the-training-tools
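After building the training tools, a quick way to see whether the shell can find them is to look each one up on PATH. This is only a sketch: missing_tools is a hypothetical helper, and the list contains just the tool named in the question plus tesseract itself, so extend it with whatever the training instructions call for.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # wordlist2dawg is the tool named in the question; add others as needed.
    wanted = ["tesseract", "wordlist2dawg"]
    absent = missing_tools(wanted)
    if absent:
        print("Not on PATH:", ", ".join(absent))
    else:
        print("All requested tools are on PATH")
```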

Python defaults to wrong version; can't find scipy

I am new to python, hoping to use it for scientific computation, data acquisition, etc. My ignorance is near total.
I am using a macbook pro, running OSX 10.9.5.
I first installed python 2.7, numpy, and matplotlib; can't remember where they came from. They seem to sit in /Library/Frameworks/Python.framework....
All was OK, until I realized I need scipy also. So, I installed the entire scipy stack from scipy.org, using 'sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose', after having first installed xcode and the developer tools.
This new installation is located in /opt/local/var/macports/software...
Here's the question: When I run python in a terminal, it always defaults to the original installation. scipy, in particular, cannot be found. I suppose this is a path problem, but I am out of my depth here. Can someone help?
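One way to confirm that this is a path problem is to ask the interpreter itself where it lives and which packages it can see. The sketch below is an illustration only: the function names are invented, and it uses importlib.util.find_spec, which needs Python 3.4 or later. On the Python 2.7 described in the question, a plain import scipy inside try/except would serve the same purpose.

```python
import importlib.util
import sys


def is_importable(module_name):
    """True if the running interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


def interpreter_report(modules=("numpy", "scipy", "matplotlib")):
    """Collect the interpreter path, its version, and which modules it can see."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "modules": {name: is_importable(name) for name in modules},
    }


if __name__ == "__main__":
    report = interpreter_report()
    print("Running:", report["executable"], "(Python", report["version"] + ")")
    for name, found in report["modules"].items():
        print(name, "found" if found else "NOT found")
```

Running the script once with each installed interpreter shows which installation owns which packages; whichever python appears first on PATH is the one the terminal starts.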

enthought mahotas.imread cannot find freeimage

I'm new to python and it was recommended that I use Canopy. I'm trying to follow along with this tutorial, but I get stuck at the mahotas.imread line. I get an error that ends with this:
Full error was: mahotas.freeimage: could not find libFreeImage in any
of the following directories:
'/Users/RJD/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/mahotas',
'/lib', '/usr/lib', '/usr/local/lib', '/opt/local/lib'
I've added the mahotas package via the package manager to no avail. Also tried the steps here, with no different result.
I am actually able to find 'freeimage.py' and 'freeimage.pyc' in '/Users/RJD/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/mahotas'. How do I go about telling Canopy that it is there?!
Any help would be very much appreciated.
Cheers,
R
Author of mahotas here:
Mahotas itself does not have the functionality to read in images. imread is just a wrapper around one of 3 backends:
mahotas-imread (i.e., https://pypi.python.org/pypi/imread)
FreeImage (this was the original version and if you have such an old version [0.7.1 is from Jan '12], it might still only support FreeImage)
matplotlib (which only supports PNG & JPEG)
Thus, you need to install one of the packages above.
To be clear, there is no "enthought mahotas". Mahotas is not in the Enthought package repository but in our "Community" (PyPI mirror) repo of 11,000 untested ("as is") packages, as you can see by the "PyPI" logo in the Package Manager (sorry, that's not at all obvious, we'll fix this!). We will be updating this repo later this year. The version of mahotas in that PyPI repo is 0.7.1, whereas the current version of mahotas on PyPI is 1.0.2. So that avenue is not useful for now.
When you say that you tried the steps in the cmu.edu document, was that after uninstalling the old PyPI version just mentioned and going through each step mentioned in that document?
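To see which of the three backends listed above is actually visible from a given interpreter, something like the following may help. It is a sketch under assumptions: available_imread_backends is an invented name, the lookup of "freeimage" through ctypes.util.find_library may not search the same directories that mahotas does, and the code targets Python 3 rather than the Python 2.7 shipped with Canopy.

```python
import ctypes.util
import importlib.util


def available_imread_backends():
    """Report which of the three backends named above this interpreter can see."""
    return {
        # mahotas-imread is imported under the module name `imread`.
        "imread": importlib.util.find_spec("imread") is not None,
        # FreeImage is a shared library, so ask the system's library search.
        "freeimage": ctypes.util.find_library("freeimage") is not None,
        "matplotlib": importlib.util.find_spec("matplotlib") is not None,
    }


if __name__ == "__main__":
    for backend, found in available_imread_backends().items():
        print(backend, "available" if found else "missing")
```

Note that finding freeimage.py in site-packages only means the Python wrapper is present; the error in the question is about the native libFreeImage library that the wrapper tries to load.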

Django OS X Wrong JPEG library version: library is 80, caller expects 62 sorl.thumbnail

I'm using sorl.thumbnail for Django locally on my Mac and have been having trouble with PIL, but today I finally managed to get it installed; there was some trouble with libjpeg.
I can now upload and use images, but I can't resize them using sorl.thumbnail.
When I try, I get the following error:
Wrong JPEG library version: library is 80, caller expects 62
Does anyone know a good solution for this?
I don't know whether whatever sorl uses requires an earlier version of libjpeg, or whether there is some ghost install of something still left behind from all of my attempts with various methods.
I have:
PIL 1.1.7
libjpeg 8
Does anyone know an approach?
For the benefit of the people from the future who are encountering this error and don't know why, I'd like to post my findings. I hope to give a general understanding of what's gone wrong since the exact commands to fix it may be different on your machine than on my OSX Lion install.
First, since it's easy to get lost in the potential solutions, it's important to understand that the error message is correct when it says Wrong JPEG library version: library is 80, caller expects 62 or some other combination of 62, 70, and 80. These numbers correspond to the different incompatible versions of libjpeg. There are two moving pieces here: the dynamically loaded jpeg library and the PIL (or Pillow) install. What the error message is saying is that your PIL install was compiled with headers from libjpeg version 6b (reported as 62), but when it goes to load the actual shared library, it is being linked to version 8 (reported as 80).
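As a small illustration of how to read the message, the sketch below pulls the two numbers out of the error text and names each side of the mismatch. The function names are invented, and the labels in KNOWN_VERSIONS reflect my reading of the JPEG_LIB_VERSION values (62 for release 6b, 70 for 7, 80 for 8), so treat them as an assumption rather than an exhaustive table.

```python
import re

# Assumption: these codes are the JPEG_LIB_VERSION values of the named releases.
KNOWN_VERSIONS = {62: "libjpeg 6b", 70: "libjpeg 7", 80: "libjpeg 8"}


def parse_jpeg_mismatch(message):
    """Return (library_version, caller_version) from the error text, or None."""
    match = re.search(r"library is (\d+), caller expects (\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def explain(message):
    """Turn the raw error into a sentence naming both sides of the mismatch."""
    parsed = parse_jpeg_mismatch(message)
    if parsed is None:
        return "Not a libjpeg version mismatch."
    library, caller = parsed
    loaded = KNOWN_VERSIONS.get(library, "version code %d" % library)
    built = KNOWN_VERSIONS.get(caller, "version code %d" % caller)
    return "Loaded at runtime: %s; headers at build time: %s." % (loaded, built)


print(explain("Wrong JPEG library version: library is 80, caller expects 62"))
```

The "library" number describes the shared library found at runtime, and the "caller" number describes the headers that PIL was compiled against, so the two have to be brought back into agreement by rebuilding one side.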
The fix is to download, build, and install the libjpeg version you want (any will do, though the later versions build more easily on OSX Lion):
wget http://www.ijg.org/files/jpegsrc.v8d.tar.gz
tar xzf jpegsrc*
cd jpeg-*
./configure
make
sudo make install
This should drop two files of note under /usr/local/, namely /usr/local/lib/libjpeg.8.dylib and /usr/local/include/jpeglib.h. Now we just have to get PIL (or Pillow) to use these two files at install time, and we're home free. I know there's a better way to do this, but the hack (as recommended by the PIL docs) is to edit the setup.py file of the PIL distribution before you install it. You may get away with just setting JPEG_ROOT = libinclude('/usr/local') near the top of setup.py, though further directory manipulation may be necessary elsewhere in the file.
As you fiddle with the paths, you have to make sure PIL does a full rebuild before you test out whether it linked up to the right library or not. I used a command like rm -rf build && python setup.py install to make sure the library was always freshly linked to the current path I was testing.
I'm sorry this is a rambling answer, but it was very disheartening to have tried every other copy & paste solution out there and have none of them work. Hopefully this answer keeps at least a few folks from wasting numerous hours in search of a simplistic solution.
Good Luck!
If you have MacPorts installed, run:
$ sudo port selfupdate
$ sudo port install py27-pil
It's easier than the easy_install method, since MacPorts installs the right dependencies.
I had a slightly different problem than the OP, but I wanted to share my solution here to help someone in the future.
OS: OSX El Capitan
I installed libjpeg-turbo from the precompiled binaries on their website. However, I did not know that I already had a different version of libjpeg installed on my Mac. I was building my C file like this: gcc myfile.c -o myfile.out -L /opt/libjpeg-turbo/lib -ljpeg. This got the library from the correct location, but the compiler was picking up the header file jpeglib.h from the pre-installed location. I changed my build command to this: gcc myfile.c -o myfile.out -I/opt/libjpeg-turbo/include/ -L /opt/libjpeg-turbo/lib -ljpeg and it worked. No more library is 80, caller expects 62!
Like a previous answer, I had a slightly different problem than the OP, but I wanted to share my solution here to help someone in the future.
The only thing that worked for me was forcing pip to build Pillow from source after installing the dev versions of the needed libraries (my code was editing a JPG and adding a label using a custom font). This was on an ARM-based embedded device running Ubuntu Linux with Python 3.7.3.
apt-get install -y libjpeg-dev libfreetype6-dev
pip3 install pillow --global-option="build_ext" --global-option="--enable-jpeg" --global-option="--enable-freetype"