tesseract lower size of pdf output file - tesseract

After scanning the images, is there an option to output a PDF with low-resolution images plus text?
The images in the PDF are so large that the PDF can grow to 1 GB.
I am using a command like:
tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf

Tesseract embeds the provided image(s) in the PDF without modifying them, so if your input image is large, the PDF will be large too.
So you can:
Decrease the size of the input image (e.g. use TIFF with G4 compression, resize the image, ...)
Use Tesseract to produce an hOCR file and create the PDF with another tool (hocr2pdf, hocr-pdf, ...)
Use a PDF compression tool (there are online tools and offline ones such as pdfsizeopt)
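As a rough illustration of the first option, here is a minimal Python sketch that only builds the two command lines: an ImageMagick call that downsamples the scan, then the tesseract call from the question run on the smaller image. The 600 DPI scan resolution, the 300 DPI target, and the "-small.png" file name are assumptions for illustration, not values from the question. OCR accuracy usually drops at low resolutions, so test before going much lower.

```python
import shlex


def downscale_percent(scan_dpi: int, target_dpi: int) -> int:
    """Percentage for ImageMagick's -resize to go from scan_dpi to target_dpi."""
    if scan_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    # Never upscale: cap the result at 100%.
    return min(100, round(100 * target_dpi / scan_dpi))


def build_commands(src, out_base, scan_dpi=600, target_dpi=300, lang="eng"):
    """Return two argument lists: shrink the image, then OCR it to PDF."""
    small = f"{out_base}-small.png"  # hypothetical intermediate file name
    pct = downscale_percent(scan_dpi, target_dpi)
    shrink = ["convert", src, "-resize", f"{pct}%",
              # Record the new resolution so the page size stays the same.
              "-units", "PixelsPerInch", "-density", str(target_dpi), small]
    ocr = ["tesseract", small, out_base, "-l", lang, "pdf"]
    return [shrink, ocr]


if __name__ == "__main__":
    for cmd in build_commands("testing/eurotext.png", "testing/eurotext-eng"):
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

The commands are returned as lists so they can be passed to subprocess.run without shell quoting problems.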

Related

MATLAB: Create a high resolution PDF images

I am trying to convert a high-resolution image (30 in wide x 60 in high) to a PDF file in MATLAB. I tried print, exportgraphics, and a couple of scripts I found online, but I keep getting low-quality output. I also tried setting the resolution to 300 dpi, but it didn't work. If you have any suggestions, please share them and I will test them. Many thanks!
Image file used (renamed to map.png): https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Political_map_of_the_World_%28January_2015%29.svg/9444px-Political_map_of_the_World_%28January_2015%29.svg.png
MATLAB commands used:
world=imread('map.png');
imshow(world)
exportgraphics(gcf,'world.pdf','ContentType','vector','Resolution',300)
% Text in the picture is blurry
print -dpdf 'world.pdf'
% Text in the picture is still blurry
exportfig(gcf, 'world.pdf', 'format','pdf','Resolution', 300,'Renderer', 'painters');
% This is a script from the MATLAB File Exchange. Text is still blurry
I managed to do it with PDFBox (Java): I imported the image as a BufferedImage, created a document with pdmodel.PDDocument, added a page with a custom size taken from bufferedimage.getWidth (and the equivalent call for the height), then streamed the BufferedImage to the page and saved the document to a PDF file. The code is on my work PC; if anyone is interested, I will copy it here.

Image compression using Lossy Compression technique

I need to convert PNG to JPEG and JPEG 2000 using ImageMagick and MATLAB. I want to compress all the data with a given ratio (e.g. 10) and then specify a file size. Any idea or solution for reaching a specific file size? How can I do it? Thanks
ImageMagick can create a JPG of your desired file size. See http://www.imagemagick.org/Usage/formats/#jpg
-define jpeg:extent={size}
As of IM v6.5.8-2 you can specify a maximum output filesize for the JPEG image. The size is specified with a suffix. For example "400kb".
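A minimal Python sketch of how the ratio from the question could be turned into a jpeg:extent value. It assumes the ratio is measured against the PNG file size on disk (the question does not say), and the file names are placeholders. It covers JPEG only, since the quoted option is specific to JPEG output.

```python
import math
import shlex


def target_extent_kb(source_bytes: int, ratio: float) -> int:
    """Maximum output size in KB for a given source size and compression ratio."""
    if source_bytes <= 0 or ratio <= 0:
        raise ValueError("source size and ratio must be positive")
    # Assumption: "ratio 10" means the output is at most 1/10 of the source file.
    return max(1, math.ceil(source_bytes / ratio / 1024))


def jpeg_extent_command(src: str, dst: str, extent_kb: int) -> list:
    """Build the ImageMagick call that caps the JPEG file size."""
    return ["convert", src, "-define", f"jpeg:extent={extent_kb}kb", dst]


if __name__ == "__main__":
    # In real use, take the size from os.path.getsize("in.png").
    kb = target_extent_kb(4_096_000, 10)
    print(shlex.join(jpeg_extent_command("in.png", "out.jpg", kb)))
```

Since jpeg:extent is a maximum, the resulting file may come out smaller than the requested size.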

Convert from TIFF to PNG using Windows?

Is there a way to convert all TIFF images to PNG using the Windows console or some simple tool?
I renamed tags, but the problem now is file size. What are some ways to compress the files?
ImageMagick is a CLI tool for image manipulation, available for most major operating systems including Windows: http://www.imagemagick.org/script/download.php
It is very simple to use:
convert in.tiff out.png
To convert and scale by 50%:
convert in.tiff -resize 50% out.png
Here you can find the full list of general commands
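The commands above handle one file at a time. For "all TIFF images" in a folder, a small Python sketch like the following could generate one convert call per file. The function name and folder layout are made up for illustration. On Windows, "convert" may resolve to the system utility of the same name; ImageMagick 7 provides "magick" instead, so adjust the first element if needed.

```python
from pathlib import Path
import shlex


def tiff_to_png_commands(folder, resize_percent=None):
    """Build one ImageMagick command per .tif/.tiff file found in folder."""
    commands = []
    for src in sorted(Path(folder).iterdir()):
        if src.suffix.lower() not in {".tif", ".tiff"}:
            continue  # skip anything that is not a TIFF
        cmd = ["convert", str(src)]
        if resize_percent is not None:
            cmd += ["-resize", f"{resize_percent}%"]
        # Write the PNG next to the source, keeping the base name.
        cmd.append(str(src.with_suffix(".png")))
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in tiff_to_png_commands(".", resize_percent=50):
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```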
TinyPNG is great for compressing PNG files; you can try that.
www.tinypng.com

iText modifies TIFF images when creating PDF.

I know how to create PDF from TIFFs. My question is:
How can iText embed the original TIFFs without modifying them?
I used document.add(img) (where img is the TIFF) to create a PDF. However, the TIFF was modified to a smaller size: my original uncompressed b/w TIFF of 2.8 MB was compressed to a CCITT Group 4 TIFF.
Does iText have a way to leave the TIFF unmodified?
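Please consult ISO-32000-1. If you read this standard closely, you'll find references to TIFF in the context of the LZW and Flate filters, but you'll discover that TIFF is not one of the available filters in PDF. Table 6 lists the options: ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode and Crypt.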
As TIFF is not supported in PDF, iText has no other option than to convert it into a format that is accepted. In your case, that is CCITTFaxDecode.
If you really want to keep the TIFF as-is, you need to add it as an attachment. That's explained in my answer to this question: Attaching files to a PDF

Convert scanned pdf to .txt files using tesseract

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR converts only images to .txt, so I first need to extract the .tif images and then convert them. Can anyone help me with this?
Use ImageMagick:
convert -density 600 input.pdf output.tif
Density is in DPI; in my experience, 600 DPI works best.
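To cover the second half of the question (getting from the .tif to .txt), here is a minimal Python sketch that chains the convert call above with a tesseract call. It assumes tesseract writes plain text to out_base.txt when no output format is given and that it can read the multi-page TIFF directly. The "eng" language and the file names are placeholders. ImageMagick typically needs Ghostscript installed to read PDFs.

```python
import shlex


def pdf_to_text_commands(pdf_path, out_base, dpi=600, lang="eng"):
    """Return two argument lists: rasterize the PDF, then OCR the TIFF to text."""
    tif = f"{out_base}.tif"
    # -density comes before the input so the PDF is rendered at that DPI.
    rasterize = ["convert", "-density", str(dpi), pdf_path, tif]
    # With no format argument, tesseract is expected to write out_base.txt.
    ocr = ["tesseract", tif, out_base, "-l", lang]
    return [rasterize, ocr]


if __name__ == "__main__":
    for cmd in pdf_to_text_commands("input.pdf", "output"):
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```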