Tesseract Recognition of dates - tesseract

Hello, I'm trying to use Tesseract to recognize dates from receipts. This code works well for extracting the total on the receipt, but it doesn't seem to work for the dates: the output comes back empty.
What am I missing here to get it to work?
Here is my code:
from PIL import Image
import pytesseract
img = Image.open('Rec.jpg')
print(pytesseract.image_to_string(img, config='-psm 6'))

I ran the tesseract command line with the page segmentation mode set to 12 and got the correct date, 08/21/2017.
--psm 12 selects the "Sparse text with OSD" segmentation mode.
You can run tesseract --help to list the --psm values your version supports; I tested with tesseract v4.00.00alpha.
Hope this helps.
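A minimal Python sketch of the same idea, assuming Tesseract 4 or later (where the flag is spelled --psm) and assuming the receipt prints the date in a numeric form such as 08/21/2017. The helper names find_dates and ocr_receipt are illustrative and are not part of pytesseract:

```python
import re

# Matches numeric dates such as 08/21/2017 or 8-21-17 (an assumed receipt format).
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")


def find_dates(text):
    """Return every numeric date substring found in OCR output."""
    return DATE_PATTERN.findall(text)


def ocr_receipt(path, psm=12):
    """OCR an image with the given page segmentation mode (12 = sparse text with OSD)."""
    # Imported here so find_dates() stays usable without Tesseract installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), config=f"--psm {psm}")


if __name__ == "__main__":
    print(find_dates(ocr_receipt("Rec.jpg")))
```

Filtering the OCR text with a pattern keeps the date lookup independent of where sparse-text mode happens to place the line in the output.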

Related

Does Tesseract correct spelling (UB Mannheim, Windows installation)?

I'm using Tesseract to perform ocr tasks.
Can someone confirm if tesseract has an LSTM module for auto-post-ocr spell correction? And if yes, what did I do wrong so it's not triggered?
It appears it does have this module (e.g. see slide 4 in first link):
https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
However, many example outputs give me misspelled words such as ambuiatory/ambutatory for ambulatory, and sometimes informat for information. It seems that Tesseract picked the characters with the highest likelihood and assembled them, but didn't perform any post-OCR processing.
Here's how I'm using tesseract
I followed the UB Mannheim installation:
https://github.com/UB-Mannheim/tesseract/wiki
and pytesseract through pip install:
https://pypi.org/project/pytesseract/
Here's my sample code (sorry, my image is internal so it cannot be posted):
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\username\AppData\Local\Tesseract-OCR\tesseract'
pth_test = r'C:\Users\username\image_path_continues'
print( pytesseract.image_to_string(pth_test, lang = 'eng') )

pytesseract results different from tesseract command line results

I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs far better than the tesseract command line) and I cannot understand why. I looked at the default parameter values and tried altering some of them on the command line (such as psm), but I cannot get the same result as pytesseract. Because pytesseract is sparsely documented, I cannot work out which default parameter values it uses.
Here is my pytesseract code
print(pytesseract.image_to_string(Image.open('test.tiff')))
Looking at the source code of pytesseract, it seems the image is always converted into a .bmp file.
Working with a .bmp file and psm of 6 at the command line with Tesseract gives same result as pytesseract.
Also, tesseract can work with uncompressed bmp files only. Hence, if ImageMagick is used to convert .pdf to .bmp, the following will work
convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
tesseract mypdf.bmp -psm 6 mypdf txt
In Tesseract v5 3.0+
pytesseract does not convert images to BMP. You can verify this by commenting out cleanup(f.name) in the save context manager, which is found in the source file /pytesseract/pytesseract.py. The file name of the temp file will also need to be retrieved (pytesseract was saving files in the user's temp files directory, i.e. "[path-to-user]\AppData\Local[file-name]"). I found that what pytesseract actually does is in the prepare function.
Basically, taking the temp file and using that same file with the tesseract command directly will yield the same results.
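To compare the two on equal terms, one option is to run the tesseract executable yourself on the exact same file, with the same language and config string you would pass to pytesseract. A minimal sketch: build_tesseract_cmd and run_cli are hypothetical helpers, and the code assumes tesseract is on PATH and writes its text output to <output_base>.txt:

```python
import shlex
import subprocess


def build_tesseract_cmd(image_path, output_base, lang=None, config=""):
    """Build an argument list of the form: tesseract IMAGE OUTPUTBASE [options]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    # Split the config string the way a shell would, e.g. "--psm 6" -> ["--psm", "6"].
    cmd += shlex.split(config)
    return cmd


def run_cli(image_path, output_base, lang=None, config=""):
    """Run tesseract on image_path and return the recognized text."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang, config), check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Feeding both paths the same input file and the same options removes image conversion and segmentation mode as sources of difference.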

tesseract output is in single line instead of multiple lines

I tried to use Tesseract for OCR and the recognition is fine.
I want to recognize addresses from letters. When I read one in, the following happens:
input:
Name Name
Street
Code City
output:
Name NameStreetCode City
I tried all the -psm varieties with no effect. After googling, I think -psm 4 would be the right one, but I get an error:
`set_count == gridheight():Error:Assert failed:in file ..\..\textord\colfind.cpp, on line 648`
This happens only on Windows; on my MacBook the lines are correct.
Can anybody help me?
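Use unix2dos to convert the output file to Windows line endings. The likely cause is that the lines are separated by a bare LF, which some Windows editors do not display as a line break, so the text appears to be on a single line.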
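If unix2dos is not available, the same conversion can be sketched in a few lines of Python. This assumes the only problem is LF versus CRLF line endings; to_crlf and convert_file are illustrative names:

```python
def to_crlf(data: bytes) -> bytes:
    """Return data with every line ending normalized to CRLF."""
    # Collapse existing CRLF pairs first so they are not doubled into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(path: str) -> None:
    """Rewrite the file at path in place with CRLF line endings."""
    # Binary mode prevents Python from translating newlines on its own.
    with open(path, "rb") as fh:
        data = fh.read()
    with open(path, "wb") as fh:
        fh.write(to_crlf(data))
```

Working on bytes rather than text leaves the character encoding of the OCR output untouched.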

ImageMagick crop with row/column in file name only saving last image

I'm attempting to crop an image using ImageMagick and via PowerShell. I can crop the image fine with the following command, and it creates the 2000+ images:
convert -crop 16x16 .\original.png tileOut%d.png
However, I would like to take advantage of ImageMagick's ability to dynamically set the file name.
According to a post on their forums I should be able to run something like the following via a batch file:
convert ^
bigimage.jpg ^
-crop 256x256 ^
-set filename:tile "%%[fx:page.x/256+1]_%%[fx:page.y/256+1]" ^
+repage +adjoin ^
tiled_%%[filename:tile].gif
I shouldn't need to escape the % since I'm running this in PowerShell directly, so I used the following:
convert -crop 16x16 .\original.png -set filename:tile "%[fx:page.x/16+1]_%[fx:page.y/16+1]" +repage +adjoin directory\tiled_%[filename:tile].png
However, when I run this command I end up with one file called tiled_%[filename and another called tiled_45_47.png.
So while it does seem to create the last file, it only creates that one. The first file is 0 bytes in size, but takes up over 8 MB of space on disk, according to the file's properties.
Trying to run the command in a batch file results in the same behavior, which makes me think PowerShell itself isn't the issue, but rather the command is.
According to the documentation +adjoin is required since I want different images. +repage doesn't make much sense to me, but I've kept it in the command since the original had it, and excluding it doesn't seem to change the output. -set filename seems pretty straightforward.
The large size of the first file leads me to believe that all the previous images might be getting added to it. However, the file name also suggests it's getting hung up on the :, but that doesn't appear to be a special character in PowerShell. It's also creating an image for the very last crop. Baffling.
So what am I doing wrong?
Thanks in advance!
EDIT:
PowerShell 5.0.10586.0, on Windows 10.
ImageMagick 6.9.2 Q16 (64-bit)
From the comments, I'm thinking the issue might be with the ImageMagick command.
I'm not using Powershell, but I think you will have more success by specifying your image first, then the crop, then setting the filename:
convert original.png -crop 16x16 -set filename:tile "%[fx:page.x/16+1]_%[fx:page.y/16+1]" +repage "tiled_%[filename:tile].png"
So in the past I was using the following command to crop images, with the %d being automatically converted to a number based upon the sequence.
convert -crop 16x16 .\original.png directory\tileOut%d.png
That works perfectly fine. However, the example provided on that forum had the original file name listed as the first argument to the convert command. Changing my command so that it was listed first results in the expected behavior.
convert .\original.png -crop '16x16' -set 'filename:tile' '%[fx:page.x/16+1]_%[fx:page.y/16+1]' +repage +adjoin 'directory\tiled_%[filename:tile].png'
The use of single quotes in so many locations may not be required, but it works.
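For reference, the naming scheme produced by "%[fx:page.x/16+1]_%[fx:page.y/16+1]" can be sketched in Python. This only illustrates the arithmetic (tile offset divided by tile size, plus one); tile_names is a hypothetical helper, and the 720x752 size in the usage note is an assumed example chosen to be consistent with a last tile named tiled_45_47.png:

```python
def tile_names(width, height, tile=16, prefix="tiled_", ext=".png"):
    """Yield (crop_box, file_name) for each tile, left to right, top to bottom."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Box is (left, upper, right, lower); edge tiles are clipped to the image.
            box = (x, y, min(x + tile, width), min(y + tile, height))
            # Same arithmetic as the fx expression: offset / tile size + 1.
            name = f"{prefix}{x // tile + 1}_{y // tile + 1}{ext}"
            yield box, name
```

For an assumed 720x752 image, list(tile_names(720, 752)) starts at tiled_1_1.png and ends at tiled_45_47.png. The boxes use the (left, upper, right, lower) order that Pillow's Image.crop expects, should you want to do the cropping in Python instead.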

Forcing Tesseract to give some answer

I am trying to recognize one line of handwritten digits. Currently I do some preprocessing with Python and OpenCV, split the image into connected components and feed these components to Tesseract with PSM=10 (page segmentation mode, 10 is "treat the image like a single character") and character whitelist restricted to "0123456789". I expect Tesseract to return garbage where my connected component segmentation fails and to return exactly one digit when my segmentation succeeds. Tesseract often returns nothing at all.
I have tried both pytesseract and python-tesseract as a Tesseract interface for Python. Pytesseract works by locating the executable tesseract.exe, running it with suitable parameters from the shell and collecting the answer. This is how I found out about my problem. After that, I tried python-tesseract, which implements a full-blown C API. Naturally, the result was the same.
Below is a sample of 5 images I fed into Tesseract separately (I've also uploaded the same images as separate files here):
I get 1,*,4,*,* on these images, * meaning that Tesseract returned only whitespace.
With other page segmentation modes, I get the following:
PSM_SINGLE_CHAR: 1*4**
PSM_SINGLE_BLOCK_VERT_TEXT: **43*
PSM_CIRCLE_WORD: 11***
PSM_SINGLE_LINE: 11491
PSM_AUTO: *****
PSM_SPARSE_TEXT: *****
PSM_SINGLE_WORD: 11499
PSM_AUTO_ONLY: *****
PSM_SINGLE_COLUMN: *****
PSM_SPARSE_TEXT_OS: *****
PSM_SINGLE_BLOCK: 11499
PSM_OSD_ONLY: *****
PSM_AUTO_OSD: *****
PSM_COUNT: 11499
Weirdly, when I run tesseract image.png image -l eng -psm 10 digits-only against these images, it returns *,*,4,9,*. (digits-only is tessedit_char_whitelist 0123456789)
How do I force Tesseract to give me some answer instead of nothing at all?
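The setup described above, plus one possible fallback, can be sketched as follows. This is an assumption-laden illustration rather than a known fix: the mode table above suggests that single-word, single-line and single-block modes return digits where single-character mode returns nothing, so the sketch retries with those modes. digit_config and recognize_with_fallback are hypothetical names, and the config string uses the Tesseract 4+ spelling --psm (Tesseract 3.x, as in the question, spells it -psm):

```python
def digit_config(psm, whitelist="0123456789"):
    """Build a config string restricting output to the whitelist characters."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def recognize_with_fallback(recognize, psm_order=(10, 8, 7, 6)):
    """Try each page segmentation mode in turn until one yields non-blank text.

    recognize is any callable taking a config string and returning OCR text.
    Modes: 10 = single character, 8 = single word, 7 = single line, 6 = single block.
    Returns (psm, text), or (None, "") if every mode returned only whitespace.
    """
    for psm in psm_order:
        text = recognize(digit_config(psm)).strip()
        if text:
            return psm, text
    return None, ""
```

With pytesseract, recognize could be a lambda that calls pytesseract.image_to_string(img, config=cfg) on one connected component. A result from a fallback mode may contain more than one character, so the caller still has to decide how to treat it.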