Tesseract trained data

I am trying to extract data from receipts and bills using Tesseract (version 3.02).
I am using only English data, yet the output accuracy is only about 60%.
Is there any trained data available that I can simply drop into the tessdata folder?

This is the image nicky provided as a "typical example file":
Looking at it, I'd clearly say: "Forget it, nicky! You cannot train Tesseract to recognize 100% of the text from this type of image!"
However, you could train yourself to take better photos with your iPhone 3GS (the device used for the example pictures) of this type of receipt. Here are a few tips:
Don't use a dark background. Use white instead.
Don't let the receipt paper crumble. Straighten it out.
Don't place the receipt loosely on an uneven surface. Fix it to a flat one:
Either place it on a white sheet of paper and put a glass plate over it.
Or use some glue to stick it flat onto a white sheet of paper, without any bent-up edges or corners.
Don't use a low resolution like just 640x480 pixels (as the example picture has). Use a higher one, such as 1280x960 pixels instead.
Don't use standard exposure. Set the camera to use extremely high contrast. You want the letters to be black and the white background to be really white (you don't need the grays in the picture...)
Try to make it so that any character of a 10-12 pt font uses about 24-30 pixels in height (that is, make the image to be about 300 dpi for 100% zoom).
That said, something like the following ImageMagick command will probably increase Tesseract's recognition rate by some degree:
convert \
http://i.stack.imgur.com/q3Ad4.jpg \
-colorspace gray \
-rotate 90 \
-crop 260x540+110+75 +repage \
-scale 166% \
-normalize \
-colors 32 \
out1.png
It produces the following output:
You could even add something like -threshold 30% as the last command-line option to the above command to get this:
(Play a bit with variations of the 30% value to tweak the result... I don't have the time for this.)
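For reference, here is a sketch of the full command with the threshold step appended (the 30% value is only a starting point, and out2.png is just an example output name):
convert \
http://i.stack.imgur.com/q3Ad4.jpg \
-colorspace gray \
-rotate 90 \
-crop 260x540+110+75 +repage \
-scale 166% \
-normalize \
-colors 32 \
-threshold 30% \
out2.png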

Extracting accurate information from a receipt is not impossible with Tesseract. You will need to add image filtering and some other tools, such as OpenCV, NumPy, and ImageMagick, alongside Tesseract. There was a presentation at PyCon 2013 by Franck Chastagnol in which he describes how his company did it.
Here is the link:
http://pyvideo.org/video/1702/building-an-image-processing-pipeline-with-python

You can get a much cleaner post-processed image before using Tesseract to OCR the text. Try using the Background Surface Thresholding (BST) technique rather than other simple thresholding methods. You can find a white paper on the subject here.
There is an implementation of BST for OpenCV that works pretty well https://stackoverflow.com/a/22127181/3475075
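If you just want a quick, shell-only approximation to try before wiring up OpenCV, ImageMagick's local adaptive threshold operator (-lat) often copes reasonably well with unevenly lit receipts. This is not BST itself, only a related adaptive-thresholding approach, and the file names below are placeholders:
convert receipt.jpg -colorspace gray -lat 25x25+10% receipt_bw.png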

I needed exactly the same thing and tried some image optimisations to improve the output.
You can find my experiment with Tesseract here:
https://github.com/aryansbtloe/ExperimentWithTesseract

Alter a person's race from the command line.

A disability project I'm building (linked here for your interest) uses a set of copyright-free icons. Here are some examples of them in use.
It's bothering us that all of the icons are of white people. We'd like to make them more diverse and representative.
The icons are PNGs. Here's an example. I feel like there must be a command-line way of replacing one colour with another. If I had that, I could do a lot of things semi-automatically, but I have no idea whether such a thing is possible.
Is it?
You can replace a specific color with another using ImageMagick or GraphicsMagick. In your screenshot, the "flesh" is color #FFEFC6. Use -fill newcolor -opaque oldcolor to replace this color with another (and add -fuzz 10% to also replace all pixels whose color is "close" to oldcolor):
magick pdzSe.png -fill black -fuzz 10% -opaque "#FFEFC6" black.png
magick pdzSe.png -fill orange -fuzz 10% -opaque "#FFEFC6" orange.png
magick pdzSe.png -fill brown -fuzz 10% -opaque "#FFEFC6" brown.png
If you want to use GraphicsMagick, replace magick with gm convert, and if you want to use ImageMagick version 6, replace magick with convert. If you're running on Windows, you'll need to replace the % with %%.
This should be enough to get you started, but it's a crude solution that doesn't account for blended pixels such as those appearing in your sample "thumbs-up" icon. Below are the results of running the same commands against "thumbs-up", enlarged 4x so you can see the "halo" effect around the edges of the "flesh". Getting rid of the "halos" will require more work, specific to your icon set. Ideally you'd have the original vector artwork to update, but I suppose that's not the case.
In fact, the original artwork is available at straight-street.com (search for "good") and working with the original SVG files is much simpler; there is no "halo" issue. Here I just used a text editor to change the color FFEFC6 to FF8888:
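If you have many SVG files to update, a quick search-and-replace from the shell does the same job as the text editor. A minimal sketch, assuming GNU sed and a file named icon.svg (on macOS/BSD sed, use -i '' instead of -i), and noting that the colour may appear with or without a leading #:
sed -i 's/FFEFC6/FF8888/g' icon.svg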

Connect pixels in matlab

Please suggest how to connect the dotted pixels in an image like below:
Original Image
I want to apply OCR to this image. I have tried some morphological operations, such as thickening and bridging, but am not getting the expected output (NH5343320).
The original image is also uploaded. Applying horizontal edge detection to the original image gave me the dotted image above. Are there any other methods available for applying OCR to this kind of image?
I would crop out and fill in a template for each of the available letters. Presumably, that would be the letters [A-Z] and the digits [0-9], like this:
0.png
3.png
Now I would do a sub-image search for each of them in your original image. I am doing this at the command line with ImageMagick, but you could use MATLAB, OpenCV, or CImg, or the Python, Perl, PHP, C, or C++ bindings of ImageMagick.
So, I look for the 3 first:
compare -metric rmse -dissimilarity-threshold 1 -subimage-search plate.png 3.png result.png
25607.9 (0.390752) @ 498,46
So, the 3 is found at coordinates 498,46. There will be two output files: result-0.png, which looks like this:
and result-1.png, in which you can see the brightest areas showing where the match is best:
Likewise with the 0:
compare -metric rmse -dissimilarity-threshold 1 -subimage-search plate.png 0.png result.png
31452.6 (0.479936) @ 664,44
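To check every template in one pass, you could wrap the same compare call in a small shell loop. A minimal sketch, assuming the templates live in a hypothetical templates/ directory as A.png ... Z.png and 0.png ... 9.png (the null: output discards the result images; compare prints the metric and best-match offset on stderr):
for t in templates/*.png; do
    printf '%s: ' "$(basename "$t" .png)"
    compare -metric rmse -dissimilarity-threshold 1 -subimage-search plate.png "$t" null: 2>&1
    printf '\n'
done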

ImageMagick: Looking for a fast way to blur an image

I am searching for a faster way to blur an image than using GaussianBlur.
The solution I am looking for can be a command-line solution, but I would prefer code in Perl notation.
Currently, we use the Perl ImageMagick API to blur images:
# $image is our Perl object holding an ImageMagick image
# level is a natural number between 1 and 10
$image->GaussianBlur('x' . $level);
This works fine, but as the level increases, the time it consumes seems to grow exponentially.
Question: How can I reduce the time used for the blurring operation?
Is there another faster approach to blur images?
I found that the suggested method of resizing the image to imitate a blur makes the output look very pixelated for very large values of sigma, such as 25 or more. So I finally came to the idea of downscale-blur-enlarge, which gives a very nice result (almost indistinguishable from a simple blur with a large sigma):
# plain slow blur
convert sample.jpg -blur 0x25 blurred_slow.jpg
# much faster
convert sample.jpg -scale 10% -blur 0x2.5 -resize 1000% blurred_fast.jpg
On my i5 2.7 GHz it shows up to a 10x speed-up.
The documentation speaks of the difference between Blur and GaussianBlur.
There has been some confusion as to which operator, "-blur" or
"-gaussian-blur", is better for blurring images. First of all, "-blur"
is faster, but it does this using a two-stage technique: first in one
axis, then in the other. The "-gaussian-blur" operator, on the other
hand, is more mathematically correct, as it blurs in all directions
simultaneously. The speed cost between the two can be enormous, by a
factor of 10 or more, depending on the amount of blurring involved.
[...]
In summary, the two operators are slightly different, but only
minimally. As "-blur" is much faster, use it. I do in just about all
the examples involving blurring.
That would simply be:
$image->Blur( 'x' . $level );
But the Perl ImageMagick documentation has the same text for both Blur and GaussianBlur (emphasis mine). I can't try it now; you would have to benchmark it yourself.
Blur: reduce image noise and reduce detail levels with a Gaussian operator of the given radius and standard deviation (sigma).
GaussianBlur: reduce image noise and reduce detail levels with a Gaussian operator of the given radius and standard deviation (sigma).
An alternative that the documentation also lists is resizing the image to be very tiny and then enlarging it again.
Using large sigma values for image blurring is very slow. But one
technique can be used to speed up this process. This however is only a
rough method and could use some mathematical rigor to improve
results. Essentially, the reason large blurs are slow is that you
need a large window or 'kernel' to merge lots of pixels together, for
each and every pixel in the image. However, resizing (making the image
smaller) does the same thing but generates fewer pixels in the
process. The technique is basically to shrink the image, then enlarge it
again to generate the heavily blurred result. The Gaussian filter is
especially useful for this, as you can directly specify the Gaussian
sigma via a define (filter:sigma).
The example command line code is this:
convert rose: -blur 0x5 rose_blur_5.png
convert rose: -filter Gaussian -resize 50% \
-define filter:sigma=2.5 -resize 200% rose_resize_5.png
Not sure if this still helps the OP, but I recently tried the same thing for a blurred screen-lock picture.
I found that omitting the -blur part saves even more computation time and still delivers great results for a 4K picture:
convert in.png -scale 2.5% -resize 4000% out.png
# real: 0.174s user: 0.144s size: 1.2MiB
convert in.png -scale 10% -blur 0x2.5 -resize 1000% out.png
# real: 0.136s user: 2.117s size: 1.2MiB
convert in.png -blur 0x25 out.png
# real: 2.425s user: 21.408s size: 1KiB
However, I couldn't go lower than 2.5% with 3840x2160; below that, the output no longer comes back at the original dimensions. I guess this minimum value differs for pictures of other sizes.
It should also be noted that the resulting file sizes differ noticeably!

Libreoffice Draw Export Resolution makes no sense

I am attempting to make a very simple label using LibreOffice Draw v4.0.2.2. The label has not much more to it than regularly spaced lines of centered text.
This image will be printed, and I have a fixed size/ppi requirement to ensure appropriate print quality.
I set the page size to my specs and lay out the text as I desire. The print shop takes several image formats, including .tiff and .png. When I export the image, a dialog pops up that asks for the image size/resolution. The given ppi is very low (~40), and I require a minimum of 180 ppi. When I enter this, the image size adjusts itself and results in an image that is far too small.
The only solution that appears viable is to blow up the page size and the drawing text size so that everything gets shrunk upon export. This is a very imprecise and illogical feature (bug?) of the program, and I really hope it is just a result of my ignorance.
I found a thread in the mailing list which describes this issue exactly. The only answer that is given is essentially "yes, this is ridiculous and doesn't help anybody".
Can anyone give some advice to this? Or at least shed some light on who might need this "feature"?
There is something off about the Export tool of LibreOffice in general; it has been broken for years. Taking a screenshot is an alternative, but obviously you cannot control the resolution.
So a better workaround is to export to SVG and then convert the SVG to PNG with Inkscape. Once Inkscape is installed, convert the file with the following command:
inkscape -z -e out.png -w 1024 in.svg
If you are in Windows (x64), you will need to indicate the full path:
"C:/Program Files/Inkscape/inkscape.exe" -z -e out.png -w 1024 in.svg
If you install the 32 bit version, this should work:
"C:\Program Files (x86)/Inkscape/inkscape.exe" -z -e out.png -w 1024 in.svg
This can be done from inside LibreOffice; there is no need to use any external tool. The Export dialog is very confusing, yes; you have to realize that size and resolution can be set independently.
Select File -> Export -> choose the desired format. The export dialog should appear.
Take note of the Width and Height values. Set the desired resolution and notice how Width and Height change. Don't worry: restore Width and Height to your saved values, and that's it. You get a high-resolution image with the desired size and DPI.
LibreOffice Draw (the one I'm using, anyway) is a vector drawing app. Have you asked the print shop whether they can use vector formats like EPS or PDF? Most should be able to, in my experience. Then resolution becomes irrelevant.
-Terry

Tesseract and tiff format - spp not in set {1,3}

While trying to run this command:
tesseract bond111.tif bond111 batch.nochop makebox
I get the following error:
Error in pixReadFromTiffStream: spp not in set {1,3}
Error in pixReadStreamTiff: pix not read
Error in pixReadTiff: pix not read
Assuming that "spp not in set" is the main error here, what does it mean?
At first it had trouble because the bpp was higher than 24, so I reduced it using GIMP, but that did not resolve the issue.
It probably means your TIFF image has an alpha channel, which the underlying Leptonica library used by Tesseract doesn't support. If you're using ImageMagick, be aware that operations such as -draw can cause alpha channels to be added. If you're using convert in your workflow and want to remove the channel again immediately, flatten the image before writing by adding -background white -flatten +matte before the output filename, e.g.:
convert input.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.
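For example, a one-off conversion to an alpha-free PNG might look like the following sketch (file names are placeholders; -background white -flatten merges any transparency into a white background and -alpha off drops the remaining alpha channel):
convert bond111.tif -background white -flatten -alpha off bond111.png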
Sources: magick-users mailing list posting; tesseract-ocr mailing list posting
Thanks for your post, ZakW, you pointed me in the right direction.
However, I also needed to set '-depth 8', and the quality was still not good enough for OCR, whatever I tried.
What worked for me is this solution:
ghostscript -o document.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw document.pdf
tesseract document.tiff document -l deu
vim document.txt
This way I got perfect text with umlauts in German.
Adjusting the conversion to the following command helped me:
convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff
Note that the other answers did not work for me since they use the deprecated +matte flag instead of -alpha Off.
You can try using the 'tiffinfo' command provided by libtiff_tools to verify the TIFF format of your source image. A number of TIFF variants exist, with different values for bits per pixel (bpp) and samples per pixel (spp).
Error in pixReadFromTiffStream: spp not in set {1,3,4}
An 'spp' value of 2 is invalid for TIFF.
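For instance, a quick check of the samples-per-pixel value of the source image could look like this (the file name is only an example):
tiffinfo bond111.tif | grep -i 'samples/pixel'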
I solved the problem by saving directly to TIFF format from GIMP, instead of converting from .png to .tif using ImageMagick's 'convert'.
See also: TIFF format