Forcing Tesseract to give some answer

I am trying to recognize one line of handwritten digits. Currently I do some preprocessing with Python and OpenCV, split the image into connected components and feed these components to Tesseract with PSM=10 (page segmentation mode, 10 is "treat the image like a single character") and character whitelist restricted to "0123456789". I expect Tesseract to return garbage where my connected component segmentation fails and to return exactly one digit when my segmentation succeeds. Tesseract often returns nothing at all.
I have tried both pytesseract and python-tesseract as Python interfaces to Tesseract. Pytesseract works by locating the tesseract.exe executable, running it from the shell with suitable parameters, and collecting the answer; this is how I found out about my problem. After that, I tried python-tesseract, which wraps the full C API. Naturally, the result was the same.
Below is a sample of 5 images I fed into Tesseract separately (I've also uploaded the same images as separate files here):
I get 1,*,4,*,* on these images, * meaning that Tesseract returned only whitespace.
With other page segmentation modes, I get the following:
PSM_SINGLE_CHAR: 1*4**
PSM_SINGLE_BLOCK_VERT_TEXT: **43*
PSM_CIRCLE_WORD: 11***
PSM_SINGLE_LINE: 11491
PSM_AUTO: *****
PSM_SPARSE_TEXT: *****
PSM_SINGLE_WORD: 11499
PSM_AUTO_ONLY: *****
PSM_SINGLE_COLUMN: *****
PSM_SPARSE_TEXT_OS: *****
PSM_SINGLE_BLOCK: 11499
PSM_OSD_ONLY: *****
PSM_AUTO_OSD: *****
PSM_COUNT: 11499
Weirdly, when I run tesseract image.png image -l eng -psm 10 digits-only against these images, it returns *,*,4,9,*. (digits-only is a config file containing tessedit_char_whitelist 0123456789.)
How do I force Tesseract to give me some answer instead of nothing at all?
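For reference, here is a minimal pytesseract sketch of the call described in the question; component.png is a placeholder for one pre-cropped connected component, and the -c whitelist syntax shown is the one accepted by recent Tesseract versions:

from PIL import Image
import pytesseract

# Hypothetical single-component image produced by the OpenCV segmentation step.
img = Image.open('component.png')

# PSM 10: treat the image as a single character; restrict output to digits.
config = '--psm 10 -c tessedit_char_whitelist=0123456789'
print(repr(pytesseract.image_to_string(img, config=config)))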

Related

Can Tesseract be used for Sinhala handwritten text recognition?

I wish to restore damaged Sinhala handwritten documents. Please let me know: can Tesseract be used for the Sinhala language as well?
Check out the tessdata folder from the tesseract-ocr GitHub repository:
There's sin.traineddata for the actual Sinhala language, and
there's script/Sinhala.traineddata for the Sinhala script.
Copy one of them (or both) to your tessdata folder, maybe located at C:\tesseract\tessdata on some Windows machine.
For example, running Tesseract from the command line, you can then use
tesseract myimage.png output -l sin
or
tesseract myimage.png output -l Sinhala
I took a screenshot of the Sinhala script Wikipedia page, and cropped the following part:
Both above commands result in the following output:
සිංහල අක්ෂර මාලාව
That seems fine to me, but I don't claim to be able to read or understand any Sinhala script or language!
So, in general: yes, it seems you can OCR Sinhala texts!
BUT: as for any script, and maybe even more so for non-Latin scripts, you probably won't get good results on handwritten texts. OCR of handwriting is a field of research in its own right.
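For completeness, the same lookup can be driven from Python with pytesseract; a minimal sketch, assuming sin.traineddata (or script/Sinhala.traineddata) is already in your tessdata folder and that myimage.png stands in for the cropped screenshot:

from PIL import Image
import pytesseract

# lang='sin' selects the Sinhala language model; use lang='Sinhala' for the script-based model.
print(pytesseract.image_to_string(Image.open('myimage.png'), lang='sin'))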

pytesseract results different from tesseract command line results

I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs way better than the command line) and I am unable to understand why. I looked at the default values for the parameters and tried altering some of them on the command line (like psm), but I could not reproduce the pytesseract result. Because pytesseract lacks proper documentation, I cannot figure out which default parameter values it uses.
Here is my pytesseract code
print(pytesseract.image_to_string(Image.open('test.tiff')))
Looking at the source code of pytesseract, it seems the image is always converted into a .bmp file.
Working with a .bmp file and a psm of 6 at the command line gives the same result as pytesseract.
Also, tesseract can work with uncompressed bmp files only. Hence, if ImageMagick is used to convert a .pdf to .bmp, the following will work:
convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
tesseract mypdf.bmp mypdf -psm 6 txt
In Tesseract v5.3.0+:
pytesseract does not convert images to BMP. You can verify this by commenting out cleanup(f.name) in the save context manager in the source file pytesseract/pytesseract.py; you will also need to retrieve the temp file's name (pytesseract saves the files in the user's temp directory, i.e. "[path-to-user]\AppData\Local[file-name]"). I found that what pytesseract actually does happens in its prepare function.
Basically, taking that temp file and running the tesseract command on it directly will yield the same results.
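To compare the two paths directly, here is a small sketch that runs the same image through pytesseract and through the command line with the same page segmentation mode; test.tiff is the questioner's file, and 'stdout' makes the CLI print its result instead of writing an output file:

import subprocess
from PIL import Image
import pytesseract

img_path = 'test.tiff'

# pytesseract call with an explicit page segmentation mode.
py_result = pytesseract.image_to_string(Image.open(img_path), config='--psm 6')

# Equivalent command-line call (Tesseract 4+ flag syntax).
cli_result = subprocess.run(
    ['tesseract', img_path, 'stdout', '--psm', '6'],
    capture_output=True, text=True,
).stdout

print(py_result.strip() == cli_result.strip())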

Tesseract Recognition of dates

Hello, I'm trying to use Tesseract to recognize dates on receipts. This code works well to extract the total on the receipt, but for the dates it prints nothing.
What am I missing here to get it to work?
Here is my code:
from PIL import Image
import pytesseract
img = Image.open('Rec.jpg')
print(pytesseract.image_to_string(img, config='-psm 6'))
I tried the tesseract command line with the psm set to 12 and got the correct date, 08/21/2017.
--psm 12 sets the segmentation mode to "Sparse text with OSD".
You can run tesseract --help to see the --psm values supported by tesseract v4.00.00alpha, which was used in this test.
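If you want the same behaviour from Python, here is a minimal sketch, assuming the same Rec.jpg and a Tesseract build that accepts the --psm flag:

from PIL import Image
import pytesseract

# --psm 12: sparse text with OSD, which picked up the date in the test above.
img = Image.open('Rec.jpg')
print(pytesseract.image_to_string(img, config='--psm 12'))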
Hope this helps.

IBM Watson Visual Recognition: {"code":400,"error":"Cannot execute learning task. : no classifier name given"}

When I try to train a classifier with two positive classes and with the API key (each class contains around 1200 images) in Watson Visual Recognition, it returns "no classifier name given" - but I have already provided one. This is the code:
$ curl -X POST -F "blank_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\no_ocd\no_ocd.zip" -F "OCD_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\ocd\ocd.zip" -F "name=disease" "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers?api_key={X}&version=2016-05-20"
{"code":400,"error":"Cannot execute learning task. : no classifier name given"}
What I have done so far:
Removed all special characters in the file names, as I thought that might be the problem.
Tried other names for the classifier, e.g. "name=ocd".
I also tried training on a smaller dataset, about 40 images per positive class, and then it actually works fine, so maybe the size of the dataset is the problem. However, according to the Watson training guidelines (https://www.ibm.com/watson/developercloud/doc/visual-recognition/customizing.html), I comply with the size limits. I have a free subscription.
Does anyone have any recommendations for how to solve this classifier training problem?
This can occur when there's a problem processing the zip files. I would try simplifying your training files: for instance, use just 100 examples per class; you can add more via retraining later. It's always good to train, then measure performance, and then add more training samples.
@Rasmus, you should check that the images are named cleanly, meaning no special symbols, spaces, etc. in the image file names; the error appears to be related to special characters in the input. The API accepts only letters and numbers in classifier names, and it requires that the images in your zip files end with an image file extension such as .jpg, .jpeg, .gif or .png.
So, after you rename the images, check that they all have formats supported by Visual Recognition, such as .jpg and .png.
Replace {api-key} with the service credentials you copied in the first step.
Modify the location of the {class}_positive_examples to point to where you saved the .zip files.
And, use your cURL like:
curl -X POST \
  -F "blank_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\no_ocd\no_ocd.zip" \
  -F "OCD_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\ocd\ocd.zip" \
  -F "name=disease" \
  "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers?api_key={api-key}&version=2016-05-20"
Note: it could also be a different problem; see the other question about this classifier name error.
Here is an example that works on my PC:
curl -X POST -F "dog_positive_examples=c:\Dogs.zip" -F "negative_examples=c:\Cats.zip" -F "name=dogs" "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers?api_key={API KEY}&version=2016-05-20"
See the official reference here.
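For anyone calling this endpoint from Python instead of cURL, here is a rough sketch using the requests library; the zip paths are placeholders, and note that this v3 endpoint with an api_key query parameter has since been retired by IBM:

import requests

url = 'https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers'
params = {'api_key': '{api-key}', 'version': '2016-05-20'}

# Multipart upload: one *_positive_examples part per class, plus the classifier name.
with open('no_ocd.zip', 'rb') as blank_zip, open('ocd.zip', 'rb') as ocd_zip:
    files = {
        'blank_positive_examples': blank_zip,
        'OCD_positive_examples': ocd_zip,
    }
    resp = requests.post(url, params=params, files=files, data={'name': 'disease'})

print(resp.status_code, resp.json())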

Files (that exist) not found when using Sun Grid Engine

I am using MATLAB to do some image processing on a cluster that uses Sun Grid Engine. On my personal laptop the code runs fine, but when I run it on the cluster I get several errors about files that cannot be found. For example, a .nii (NIfTI) file that exists (I can read it when I run MATLAB interactively in the shell) is not found. An excerpt from the output log:
Error using load_nii_ext (line 97)
Cannot find file
"/path/imageFile.nii".
And I also get errors for an XML-structured file (which needs to have a .mps extension to be readable by a postprocessing toolbox; all of this worked fine on my own laptop). Another excerpt from the output log:
/path/pointSetFile.mps exists
Error using readpointsmps (line 24)
Failed to read XML file
/path/pointSetFile.mps.
In this second error message, the first line is the output of the following check I included in the script:
if exist(strcat(folder, fileName), 'file') == 2
disp([strcat(folder, fileName) ' exists'])
end
So it's weird: 1) I can see the files, 2) I can open them manually in MATLAB, and 3) according to MATLAB's exist() they do indeed exist, yet when xmlread() and read_niigz() try to open them they suddenly can't be found.
As extra information: I run the scripts with the flags -nodisplay -nodesktop -nosplash, and I currently run them as 2 tasks with SGE. Memory should be fine: I allocate 5 GB and all my images combined are about 1.5 GB.
I'm using absolute paths starting at the root /, I have read the paths letter by letter about 200 times now, and I have no clue anymore what's going on.
I have solved the problems now.
@Xiangrui Li pointed out in the comments that the missing .nii files were due to interference between the unzipping, reading and deletion of the .nii and .nii.gz files. That was indeed the problem. Thanks!
I found that the second problem was due to umlauts in the filenames. Apparently the system, MATLAB, and the other processes involved encoded the filenames differently. Removing the characters with umlauts solved the problem.
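As a postscript, here is a small Python sketch of the kind of cleanup that fixed the second problem: renaming files whose names contain non-ASCII characters (e.g. umlauts) to an ASCII-only form, so every tool on the cluster agrees on the filename. The root path is a placeholder:

import os
import unicodedata

root = '/path'  # hypothetical data directory from the question

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        # Decompose accented characters, then drop the non-ASCII combining marks.
        ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode()
        if ascii_name != name:
            os.rename(os.path.join(dirpath, name), os.path.join(dirpath, ascii_name))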