Tesseract and tiff format - spp not in set {1,3}

While trying to run this command:
tesseract bond111.tif bond111 batch.nochop makebox
I get the following error:
Error in pixReadFromTiffStream: spp not in set {1,3}
Error in pixReadStreamTiff: pix not read
Error in pixReadTiff: pix not read
Assuming that spp not in set is the main error here, what does it mean?
At first it had trouble because the bpp was higher than 24, so I reduced it using Gimp, but that did not resolve the issue.

It probably means your TIFF image has an alpha channel, which the underlying Leptonica library used by Tesseract doesn't support. If you're using ImageMagick, be aware that operations such as -draw can cause an alpha channel to be added. If you're using convert in your workflow and want to remove the channel again immediately, flatten the image before writing by adding -background white -flatten +matte before the output filename, e.g.:
convert input.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.
Sources: magick-users mailing list posting; tesseract-ocr mailing list posting

Thanks for your post, ZakW, you pointed me in the right direction.
However, I also needed to set '-depth 8', and even then the quality was not good enough for OCR, whatever I tried.
What worked for me is this solution:
ghostscript -o document.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw document.pdf
tesseract document.tiff document -l deu
vim document.txt
This way I got perfect text, including umlauts, in German.
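
The -g value in the Ghostscript command above is the page size in device pixels, i.e. the page size in inches multiplied by the -r resolution. A minimal Python sketch of that arithmetic, assuming a US Letter page (8.5 x 11 in), which is what 6120x7920 at 720 dpi works out to; the function name is made up for illustration:

```python
def device_pixels(width_in, height_in, dpi):
    """Page size in device pixels for a given size in inches and resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter at 720 dpi, matching -r720x720 -g6120x7920
print(device_pixels(8.5, 11, 720))  # (6120, 7920)
```

For a different page size or resolution, -r and -g would need to be changed together.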

Changing the conversion to the following line worked for me:
convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff
Note that the other answers did not work for me since they use the deprecated +matte flag instead of -alpha Off.

You can use the 'tiffinfo' command provided by libtiff-tools to check the TIFF format of your source image. A number of TIFF variants exist, with different values for bits per pixel (bpp) and samples per pixel (spp).
Error in pixReadFromTiffStream: spp not in set {1,3,4}
An 'spp' value of 2 (typically grayscale plus an alpha channel) falls outside that set, so Leptonica rejects the file.
I solved the problem by saving directly to TIFF format from Gimp, instead of converting from .png to .tif using ImageMagick's 'convert'.
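
If tiffinfo is not at hand, the spp value can also be read straight from the file header. Below is a rough Python sketch, not part of Leptonica or libtiff: it assumes a classic (non-BigTIFF) file, looks only at the first image directory, and the function name is hypothetical. It reads tag 277 (SamplesPerPixel), which should be the same number tiffinfo reports as Samples/Pixel.

```python
import struct

SAMPLES_PER_PIXEL = 277  # TIFF tag number for SamplesPerPixel


def samples_per_pixel(data: bytes) -> int:
    """Return SamplesPerPixel from the first IFD of a classic TIFF."""
    # The first two bytes give the byte order: II = little, MM = big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        (tag,) = struct.unpack_from(order + "H", data, entry)
        if tag == SAMPLES_PER_PIXEL:
            # A single SHORT value is stored inline in the value field.
            (value,) = struct.unpack_from(order + "H", data, entry + 8)
            return value
    return 1  # the TIFF default when the tag is absent


# Usage with a real file (the file name is taken from the question):
# with open("bond111.tif", "rb") as f:
#     print(samples_per_pixel(f.read()))
```

A result of 2 or 4 would suggest an extra (alpha) channel that needs to be flattened away before handing the file to Tesseract.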
See also: TIFF format

Related

tesseract cannot read clear single line

So I have this png image here
and when I try to read it with tesseract on the command line, I get some random character
❯ tesseract Selection_002.png stdout --psm 7
Warning. Invalid resolution 0 dpi. Using 70 instead.
ale PR Me)
I'm running tesseract version 4.0.0-beta.1-370-g8b64 on ubuntu.
I would have guessed that this image would be easy to read for tesseract?
I've tried resizing the image and "cleaning" it up, but there's not much noise to clean in it. What am I doing wrong?

Please try inverting the image colors, so that you get black text on a white background.
I tried with your image as well as inverting the color. Both gave successful results.

JPEG to PNG conversion with 300 DPI

I am unable to convert a JPEG image into a 300 DPI PNG image using ImageMagick.
After conversion the PNG image is 72 DPI only. I'm using ImageMagick 6.9.0-0 Q16 x86 and Ghostscript v9.15.
Below is the line I use in my Perl script:
system("\"$imagemagick\" -set units PixelsPerInch -density 300 \"$jpg\" \"$png\"");

Adjusting the units and density will not alter the underlying image data; it only updates the metadata used by rendering libraries. That matters for vector-to-raster conversion, but is not very useful for raster to raster. To adjust the DPI of an image, use the -resample operation.
convert source.jpg -resample 300 out.png
You can verify the DPI resolution with the following:
identify -format "%[resolution.x] %[resolution.y]\n" out.png
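
Unlike -density, -resample changes the pixel dimensions so that the physical size stays the same at the new resolution. A small Python sketch of the expected size, assuming a hypothetical 720x480 source tagged at 72 dpi; ImageMagick's own rounding may differ by a pixel:

```python
def resampled_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions that keep the same physical size at target_dpi."""
    return (round(width_px * target_dpi / source_dpi),
            round(height_px * target_dpi / source_dpi))


# A 10 x 6.67 inch image at 72 dpi, resampled to 300 dpi
print(resampled_size(720, 480, 72, 300))  # (3000, 2000)
```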

I'm wondering where the 72dpi is coming from. Assuming you are using X and some kind of Unix, ImageMagick defaults to using the screen resolution (72 dpi). I'm not sure what it does under OSX/XQuartz but it's likely similar. Is your screen resolution set to 72dpi (!?).
I'm with @emcconville and @ikegami: just do this straight from ImageMagick on the command line, passing the right options to be sure.
There are image manipulation modules that you can use from perl without having to resort to system commands as well such as Imager::Transformations, Image::Magick, and GD. Here's how to convert with GD.
perl -MGD -E 'my $imgjpg = GD::Image->newFromJpeg("img.jpg");
open my $imgpng, ">", "img.png" or die; print $imgpng $imgjpg->png();'
With most image manipulation packages the original resolution should be maintained during conversion, though some (including GD) will default to lower color depths (8 bit) unless passed a truecolor flag.
e.g. GD::Image->newFromJpeg("img.jpg", 1);

Libreoffice Draw Export Resolution makes no sense

I am attempting to make a very simple label using LibreOffice Draw v4.0.2.2. The label consists of little more than regularly spaced lines of centered text.
This image will be printed, and I have a fixed size/ppi requirement to ensure appropriate print quality.
I set the page size to my specs and lay out the text as I desire. The print shop takes several image formats, including .tiff and .png. When I export the image, a dialog pops up that asks for the image size and resolution. The given ppi is very low (~40) and I require a minimum of 180 ppi. When I enter this, the image size adjusts itself and results in an image that is far too small.
The only solution that appears to be viable is to explode the page size and the drawing text size so it gets shrunk upon export. This is a very imprecise and illogical feature (bug?) of the program that I really wish is a result of my ignorance.
I found a thread in the mailing list which describes this issue exactly. The only answer that is given is essentially "yes, this is ridiculous and doesn't help anybody".
Can anyone give some advice to this? Or at least shed some light on who might need this "feature"?

There is something off about the Export tool of LibreOffice in general. It has been broken for years. Taking a screenshot is an alternative, but obviously you cannot control the resolution.
So a better workaround is to export to SVG and then convert the SVG to PNG with Inkscape. Once Inkscape is installed, convert the file with the following command:
inkscape -z -e out.png -w 1024 in.svg
If you are on Windows (x64), you will need to give the full path:
"C:/Program Files/Inkscape/inkscape.exe" -z -e out.png -w 1024 in.svg
If you installed the 32-bit version, this should work:
"C:/Program Files (x86)/Inkscape/inkscape.exe" -z -e out.png -w 1024 in.svg

This can be done from inside LibreOffice; there is no need to use any external tool. The Export dialog is very confusing, yes; you have to realize that both size and resolution can be set independently.
Select File -> Export -> choose the desired format. The export dialog should appear.
TAKE NOTE of Width and Height. Set the desired resolution; notice how Width and Height change (?). Don't worry, restore Width and Height to your saved values. And that's it. You get a high resolution image with the desired size and DPI.

Libre Draw (the one I'm using anyway) is a vector drawing app - have you asked the print shop if they can use vector formats like eps, pdf? Most should be able to in my experience. Then resolution becomes irrelevant.
-Terry

Tesseract Trained data

I am trying to extract data from receipts and bills using Tesseract, version 3.02.
I am using only the English data, and the output accuracy is still only about 60%.
Is there any trained data available that I can simply drop into the tessdata folder?

This is the image nicky provided as a "typical example file":
Looking at it I'd clearly say: "Forget it, nicky! You cannot train Tesseract to recognize 100% of text from this type of image!"
However, you could train yourself to make better photos with your iPhone 3GS (that's the device which was used for the example pictures) from such type of receipts. Here are a few tips:
Don't use a dark background. Use white instead.
Don't let the receipt paper crumple. Straighten it out.
Don't place the receipt loosely on an uneven surface. Fix it to a flat surface:
Either place it on a white sheet of paper and put a glass plate over it.
Or use some glue and glue it flat on a white sheet of paper without any bent-up edges or corners.
Don't use a low resolution like just 640x480 pixels (as the example picture has). Use a higher one, such as 1280x960 pixels instead.
Don't use standard exposure. Set the camera to use extremely high contrast. You want the letters to be black and the white background to be really white (you don't need the grays in the picture...)
Try to make it so that any character of a 10-12 pt font uses about 24-30 pixels in height (that is, make the image to be about 300 dpi for 100% zoom).
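
As a rough check on the last tip: a point is 1/72 inch, so the full font size (em height) in pixels is pt * dpi / 72. The Python sketch below shows that arithmetic; the 0.6 factor for the height of an individual character relative to the em is only an illustrative assumption, since it varies by typeface.

```python
def em_height_px(point_size, dpi):
    """Full font size (em height) in pixels at the given resolution."""
    return point_size * dpi / 72  # 72 points per inch


GLYPH_FRACTION = 0.6  # assumed share of the em a typical character fills

for pt in (10, 12):
    em = em_height_px(pt, 300)
    print(pt, "pt:", round(em), "px em,", round(em * GLYPH_FRACTION), "px glyph")
```

Under that assumption, 10-12 pt text at 300 dpi lands in roughly the 24-30 pixel range mentioned above.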
That said, something like the following ImageMagick command will probably increase Tesseract's recognition rate by some degree:
convert \
http://i.stack.imgur.com/q3Ad4.jpg \
-colorspace gray \
-rotate 90 \
-crop 260x540+110+75 +repage \
-scale 166% \
-normalize \
-colors 32 \
out1.png
It produces the following output:
You could even add something like -threshold 30% as the last command-line option to the above command to get this:
(You should experiment with variations of the 30% value to tweak the result. I don't have the time for this.)
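
To make the effect of the threshold value concrete, here is a minimal Python sketch of what a simple global threshold does to 8-bit gray values: everything above the cutoff becomes white, everything else black. It illustrates the idea only and is not ImageMagick's implementation; the sample values are made up.

```python
def threshold(pixels, percent, max_value=255):
    """Map gray values to pure black or white around a percentage cutoff."""
    cutoff = max_value * percent / 100
    return [max_value if p > cutoff else 0 for p in pixels]


row = [0, 50, 76, 77, 200, 255]  # hypothetical gray values from one scanline
print(threshold(row, 30))  # [0, 0, 0, 255, 255, 255]
```

Raising the percentage turns more of the mid-grays black, which thickens faint strokes but also darkens shadows and creases in the paper.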

Extracting accurate information from a receipt is not impossible with Tesseract. You will need to add image filters and some other tools, such as OpenCV, NumPy and ImageMagick, alongside Tesseract. There was a presentation at PyCon 2013 by Franck Chastagnol in which he describes how his company did it.
Here is the link:
http://pyvideo.org/video/1702/building-an-image-processing-pipeline-with-python

You can get a much cleaner post-processed image before using Tesseract to OCR the text. Try using the Background Surface Thresholding (BST) technique rather than other simple thresholding methods. You can find a white paper on the subject here.
There is an implementation of BST for OpenCV that works pretty well: https://stackoverflow.com/a/22127181/3475075

I needed exactly the same thing, and I tried some image optimisations to improve the output.
You can find my experiment with Tesseract here:
https://github.com/aryansbtloe/ExperimentWithTesseract

Ghostscript converting Postscript to PNG is over-saturated

I'm trying to use Ghostscript and/or ImageMagick to convert each page of a Postscript document into PNG images. The problem is that both produce images that are way too saturated (I think that's the right terminology).
Here are the commands I'm trying:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -dGraphicsAlphaBits=4 -sOutputFile=page_%02d.png brochure.ps
convert brochure.ps im_page_%02d.png
This is the input Postscript file (brochure.ps from above)
Here's a couple of the output images I'm getting:
Page 1
Page 6
As you can see (especially on the page with the big green map of New Hampshire), the colors of the output PNGs are too bright/saturated. How can I prevent the colors from being changed so much and get a more accurate conversion?
Preview in OS X 10.6 automatically does a very accurate conversion to PNG when you open a Postscript file in it. This leads me to believe there is just something screwy with the way ghostscript converts ps->png (I'm fairly confident ImageMagick is just a wrapper for ghostscript for this operation). Is there a tool besides ghostscript I should be using instead?
Note: As pipitas points out below, the visible difference of colors varies by OS. It is very obvious in OS X 10.6, but apparently not very noticeable in Windows XP.

You are right to assume that ImageMagick is just a wrapper for Ghostscript when converting from PostScript or PDF to an image format.
I think this problem can only be solved to anybody's satisfaction once the efforts to add support for ICC profile handling and color management (currently underway) are completed for Ghostscript (design document as PDF). That point in time is close, however. If I understand recent commits to http://svn.ghostscript.com/trunk/ correctly, the next release (which will be dubbed 9.00 and is hopefully out in August) will include support for color management via LittleCMS. Yay!

OS X 10.4 and up provide sips (scriptable image processing system), which works well with the PDF format. Perhaps it can serve as a temporary solution until Ghostscript supports color management.
