PDF to tiff ImageMagick problem - command-line

I'm trying to convert PDFs to TIFF images for subsequent OCR. I use "-density 300x300 -depth 8" as parameters.
The first problem is that a 500 KB PDF file produces a 72 MB TIFF file.
The second problem is the poor quality of the resulting image, which causes OCR to fail.
You can see it for yourself here.
TIFF image generated (printed) by Adobe Acrobat Reader:
TIFF image generated by ImageMagick:
The difference is huge.
How can I get an image as good as the Adobe-generated one using ImageMagick?
It doesn't have to be TIFF; other formats would be fine too.
UPD: I've found the 'antialias' option. Now it's much better.
But the OCR result is still not as accurate as with the Adobe version.

My suggestion is: use a Ghostscript command line, because ImageMagick uses Ghostscript in the background anyway (in ImageMagick terms, Ghostscript is a "delegate" for some conversions, such as PDF->TIFF).
Here is a command line that should work well for letter-sized pages of a multi-page PDF file:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiffg4 ^
-r720x720 ^
-g6120x7920 ^
input.pdf
The -g... parameter sets the absolute width and height of the output pages in device pixels (6120x7920 at 720 dpi happens to be letter size, 8.5 x 11 inches).
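The pixel dimensions passed to -g follow directly from the page size and the resolution. A minimal sketch of that arithmetic in Python (the function name is illustrative and not part of Ghostscript):

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width, height) in device pixels for a page size given in inches."""
    return round(width_in * dpi), round(height_in * dpi)

# US letter is 8.5 x 11 inches.
letter_720 = page_pixels(8.5, 11, 720)
print("-g{}x{}".format(*letter_720))  # prints -g6120x7920
```

If you change -r, the -g value has to be recomputed the same way to keep the page letter-sized.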
These TIFF pages...
...will be black+white,
...will have a resolution of 720dpi,
...will be G4-compressed and
...will be much smaller than the uncompressed 300 dpi output from your IM command line.
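A rough back-of-the-envelope check of the size claim, as a Python sketch. It compares uncompressed raster sizes only and assumes the ImageMagick output was 24-bit RGB, which the question does not state; the function name is illustrative. G4 compression then shrinks the 1-bit raster further.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of one uncompressed page raster, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px

# Letter-sized page, 8 bits per channel RGB at 300 dpi (assumed IM output).
rgb_300 = uncompressed_bytes(8.5, 11, 300, 24)
# Letter-sized page, 1-bit black+white at 720 dpi (tiffg4 before compression).
bilevel_720 = uncompressed_bytes(8.5, 11, 720, 1)
print(rgb_300, bilevel_720)  # prints 25245000 6058800
```

So even before compression, the 1-bit page at 720 dpi is about a quarter of the size of a 24-bit page at 300 dpi.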
Your IM parameter -depth 8 isn't well suited to later OCR, since it creates shades of gray around the letters, which don't help recognition.
Your OCR results should now be much better than before.
If your OCR can't handle TIFF G4 format (which I doubt), then you could generate other TIFF subformats with the help of Ghostscript. For example:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiffgray ^
-r720x720 ^
-g6120x7920 ^
-sCompression=lzw ^
input.pdf
or:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiff24nc ^
-r720x720 ^
-g6120x7920 ^
-sCompression=lzw ^
input.pdf
The tiffgray device creates 8-bit gray output. The tiff24nc device creates 24-bit RGB color output (8 bits per channel). Both types of TIFF will of course be bigger than the tiffg4 output.

For the European A4 paper format on Unix/Linux, use:
gs -o output.tif -sDEVICE=tiffg4 -r720x720 -sPAPERSIZE=a4 input.pdf
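If the conversion needs to be scripted, the same command can be assembled and run from Python. This is a sketch under stated assumptions: it uses only the switches shown in the answers above, the function names are made up for illustration, and it assumes a gs binary is on the PATH (on Windows, pass gs_binary="gswin32c.exe" as in the first answer).

```python
import subprocess

def build_gs_command(input_pdf, output_pattern, device="tiffg4", dpi=720,
                     papersize=None, gs_binary="gs"):
    """Build a Ghostscript argument list from the switches used in the answers."""
    cmd = [gs_binary, "-o", output_pattern, f"-sDEVICE={device}", f"-r{dpi}x{dpi}"]
    if papersize is not None:
        cmd.append(f"-sPAPERSIZE={papersize}")
    cmd.append(input_pdf)
    return cmd

def convert(input_pdf, output_pattern, **options):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_command(input_pdf, output_pattern, **options), check=True)

# Only builds and prints the command; call convert(...) to actually run it.
cmd = build_gs_command("input.pdf", "page_%03d.tif", papersize="a4")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the % in the output pattern.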


FastText quantize documentation incorrect?

I'm unable to run FastText quantization as shown in the documentation. Specifically, as shown at the bottom of the cheat sheet page:
https://fasttext.cc/docs/en/cheatsheet.html
When I attempt to run quantization on my trained model "model.bin":
./fasttext quantize -output model
the following error is printed to the shell:
Empty input or output path.
I've reproduced this problem with builds from the latest code (September 14 2018) and older code (June 21 2018). Since the documented command syntax isn't working, I tried adding an input argument:
./fasttext quantize -input [file] -output model
where [file] is either my training data or trained model. Unfortunately both tries resulted in a segmentation fault with no error message from FastText.
What is the correct command syntax to quantize a FastText model? Also, is it possible to both train and quantize a model in a single run of FastText?
Solution in Python:
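import fasttext
# Load the trained model ("model.bin" from the question); train_data is the path to the training file
model = fasttext.load_model("model.bin")
# Quantize the model with retraining
model.quantize(input=train_data, qnorm=True, retrain=True, cutoff=200000)
# Save quantized model
model.save_model("model_quantized.bin")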
I tried this and it worked:
./fasttext quantize -input <training set> -output <model name (no suffix)> -[options]
This is the example included in quantization-example.sh:
./fasttext quantize -output "${RESULTDIR}/dbpedia" -input "${DATADIR}/dbpedia.train" -qnorm -retrain -epoch 1 -cuto$
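The working commands above all pass both -input and -output, which matches the "Empty input or output path." error seen when only -output is given. A small Python sketch that assembles such a command and refuses to build one with a missing path; the function name and defaults are hypothetical, and only flags that appear in the answers above are used:

```python
def quantize_command(model_prefix, train_file, qnorm=False, retrain=False,
                     epoch=None, binary="./fasttext"):
    """Build a 'fasttext quantize' argument list; both paths are mandatory."""
    if not model_prefix or not train_file:
        raise ValueError("both -input and -output are required")
    cmd = [binary, "quantize", "-output", model_prefix, "-input", train_file]
    if qnorm:
        cmd.append("-qnorm")
    if retrain:
        cmd.append("-retrain")
    if epoch is not None:
        cmd += ["-epoch", str(epoch)]
    return cmd

# "model" is the model name without suffix; "train.txt" is a placeholder path.
cmd = quantize_command("model", "train.txt", qnorm=True, retrain=True, epoch=1)
print(" ".join(cmd))
```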

How to convert PPM images to JPG in Matlab?

I have some PPM images (stereo) that I read with imread() and I want to save the same images in JPEG with different Quality factors.
Here is my code.
%Read PPM image
L = imread(filename_L);
%Create JPEG Q85 from PPM
filename_L85 = strcat(filename_L,'_ppm_to_jpeg.jpg');
imwrite(L,filename_L85,'JPEG','Quality',85);
And here is the error I get.
Error using imwrite>parse_inputs (line 528)
The colormap should have three columns.
Error in imwrite (line 418)
[data, map, filename, format, paramPairs] = parse_inputs(varargin{:});
Error in testFinale (line 75)
imwrite(L,filename_L85,'JPEG','Quality',85);
How can I write JPEG images previously read in PPM format?
Thanks
Could it just be the case of 'JPEG'? The imwrite documentation specifies the file type parameter in lowercase.
Apart from that, you might not even need it, as the file type is derived from the extension, which in this case is already set explicitly to .jpg.
So you might either go for:
imwrite(L,filename_L85,'jpeg','Quality',85);
or perhaps even easier:
imwrite(L,filename_L85,'Quality',85);

Error: The compressed pixel data is missing item delimiters.?

I am working with a few DICOM files, and when I try to use dicomread('filename.dcm') in MATLAB it gives the following error:
Error using dicomread>processOffsetTable (line 943)
The compressed pixel data is missing item delimiters.
Error in dicomread>processEncapsulatedPixels (line 858)
[offsetTable, offset] = processOffsetTable(metadata);
Error in dicomread>newDicomread (line 232)
X = processEncapsulatedPixels(metadata, frames);
Error in dicomread (line 86)
[X, map, alpha, overlays] = newDicomread(msgname, frames, useVRHeuristic);
I can view the same file in DICOM viewing software such as Onis, DICOM viewer, Sante DICOM, etc., but when I use dicomread I cannot see it and get this error.
I have many images in this same format and cannot start over from the beginning. Is there any way I can read this file and view it?
Refer to this online help.
It is common in the DICOM world that not all data sets fully comply with DICOM. Most applications (such as those you mentioned in your question) handle the non-compliant parts with assumptions and workarounds based on experience and imagination.
Try setting TF (the value of dicomread's 'UseVRHeuristic' option) to false to read these files.
Also note the list of supported transfer syntaxes:
Little-endian, implicit VR, uncompressed
Little-endian, explicit VR, uncompressed
Big-endian, explicit VR, uncompressed
JPEG (lossy or lossless)
JPEG2000 (lossy or lossless)
Run-length Encoding (RLE)
GE implicit VR, LE with uncompressed BE pixels (1.2.840.113619.5.2)
Check that your input image uses one of the transfer syntaxes above.
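One way to do that check is to compare the file's Transfer Syntax UID (stored in the meta header, tag 0002,0010) against the list. The Python sketch below is only an illustration: the uncompressed, RLE and GE UIDs are the standard ones, but which exact JPEG and JPEG 2000 UIDs MATLAB accepts is an assumption here, and the function name is made up.

```python
# Transfer Syntax UIDs mapped to the categories in the list above.
# The choice of JPEG / JPEG 2000 UIDs is an assumption, not taken from MATLAB's docs.
SUPPORTED_TRANSFER_SYNTAXES = {
    "1.2.840.10008.1.2": "Little-endian, implicit VR, uncompressed",
    "1.2.840.10008.1.2.1": "Little-endian, explicit VR, uncompressed",
    "1.2.840.10008.1.2.2": "Big-endian, explicit VR, uncompressed",
    "1.2.840.10008.1.2.4.50": "JPEG baseline (lossy)",
    "1.2.840.10008.1.2.4.51": "JPEG extended (lossy)",
    "1.2.840.10008.1.2.4.57": "JPEG lossless",
    "1.2.840.10008.1.2.4.70": "JPEG lossless, first-order prediction",
    "1.2.840.10008.1.2.4.90": "JPEG 2000 (lossless only)",
    "1.2.840.10008.1.2.4.91": "JPEG 2000",
    "1.2.840.10008.1.2.5": "Run-length encoding (RLE)",
    "1.2.840.113619.5.2": "GE implicit VR, LE with uncompressed BE pixels",
}

def describe_transfer_syntax(uid):
    """Return the category name for a UID, or None if it is not in the list."""
    # UID values may carry padding (spaces or a trailing NUL byte).
    return SUPPORTED_TRANSFER_SYNTAXES.get(uid.strip(" \x00"))

print(describe_transfer_syntax("1.2.840.10008.1.2.5"))
```

A UID that returns None (for example a JPEG-LS syntax) would point to a format problem rather than a corrupt file.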

How to train a specific font

I'd like to use tesseract to recognize digits on gas meters. I have an image:
If I try to recognize this image using "tesseract gas.png output", it gives me "Empty page!!" and the output is empty. I started with training using this tutorial: https://blog.cedric.ws/how-to-train-tesseract-301
I tried to take one letter and train tesseract with this letter:
I tried the command "tesseract eng.matrx60x40.exp0.png output -psm 10", which worked ("2" in output.txt). I followed the tutorial and got a final eng.traineddata.
If I now try to use this traineddata with command "tesseract gas.png output" on the original image, I get "Empty page!!".
Am I doing anything wrong, or is it not possible to train Tesseract letter by letter?

cannot import tif file into matlab

I am trying to import a .tif image into MATLAB with the following code:
>> aa = imread('house.tif');
I get the error:
Error using rtifc
TIFF library error: '_TIFFVSetField: C:\Users\user\Documents\MATLAB\house.tif: Null count
for "Tag 34022" (type 1, writecount -3, passcount 1).'.
Error in readtif (line 49)
[X, map, details] = rtifc(args);
Error in imread (line 434)
[X, map] = feval(fmt_s.read, filename, extraArgs{:});
As I am using MATLAB for the first time in my life, I really have no idea what this error means. Any help would be appreciated.
MATLAB R2012b has a bug and it cannot read TIFF files properly. More information can be found here: http://www.mathworks.com/matlabcentral/newsreader/view_thread/326232
Probably Matlab does not support the specific type of tif. In Matlab's defence, tif is not an easy file-format to read. It supports plenty of compression schemes, multiple pages and who knows what. I'd convert the tif to png and go with that.
Update: A quick Google search revealed that "rtifc" is a Matlab mex-wrapper around libtiff. Your error appears to come from libtiff. If the latter can't read it, your tif will probably be problematic for a lot of other applications too.
Another thing you could try is the tiffread implementation from François Nedelec's group at EMBL: http://www.embl.de/ExternalInfo/nedelec/misc/matlab/tiffread29.m. It's heavily used by biology folks all over the world. I've been using it for many years.
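Since the error names a specific tag, it can also help to look at the raw tag entries in the file without going through libtiff. A minimal Python sketch, assuming a classic (non-BigTIFF) file and reading only the first image file directory; the function name is illustrative:

```python
import struct

def list_tiff_tags(data):
    """Return (tag, field_type, count) tuples from the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = []
    for i in range(n_entries):
        # Each IFD entry is 12 bytes: tag, type, count, value/offset.
        entry_offset = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_offset)
        tags.append((tag, field_type, count))
    return tags

# Usage (file name taken from the question):
# with open("house.tif", "rb") as f:
#     for tag, field_type, count in list_tiff_tags(f.read()):
#         print(tag, field_type, count)
```

An entry for tag 34022 with a count of 0 would be consistent with the "Null count" wording in the error, though that reading is an inference from the message, not something confirmed in the thread.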