Tesseract 4 : is there a maximum resolution that an input image can have when page segmentation is enabled? - tesseract

My Java project deals with OCRing pdfs to index them. Each pdf page is converted into a png which is then piped to tesseract 4.
The pdf->png conversion uses renderImageWithDPI from PDFBox PdfRenderer :
buffImage = pdfRenderer.renderImageWithDPI(currentPage,
PNG_DENSITY,
ImageType.GRAY);
with PNG_DENSITY = 300 as advised on tesseract's wiki to get best results.
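For context, a minimal self-contained version of that conversion step might look like this (file names and the per-page loop are illustrative, not taken from the actual project):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToPng {
    static final int PNG_DENSITY = 300; // dpi, as advised on tesseract's wiki

    public static void main(String[] args) throws Exception {
        // Render each page to a grayscale PNG that is then piped to tesseract.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                BufferedImage buffImage = pdfRenderer.renderImageWithDPI(
                        page, PNG_DENSITY, ImageType.GRAY);
                ImageIO.write(buffImage, "png", new File("page-" + page + ".png"));
            }
        }
    }
}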
The tesseract command used is
tesseract input.png output -l fra --psm 1 --oem 1
I also tried --psm 2 and --psm 3, which also involve page segmentation, i.e.:
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
With a scanned PDF of 146 pages (producer/creator is Adobe Acrobat 7.0; it contains copyrighted content so I can't share it), tesseract never finishes: the process hangs indefinitely on a particular page (page 85).
As it took too long to test (i.e. to wait until page 85 gets OCRed), I generated an extract of this pdf with Evince's "print to file" feature.
Tesseract handles the pdf produced by Evince (producer/creator is cairo 1.14.8) successfully, i.e. the image gets OCRed.
The difference is the image resolution. The image that fails is 4991x3508 pixels whereas the one that succeeds is only 3507x2480 pixels.
Please note: tesseract in "Sparse text with OSD" mode (i.e. --psm 12) handles the page "successfully", although the text (laid out in 2 columns) is not understandable (the 2 columns get mixed together).
EDIT, after several rounds of trial and error:
It looks like the input image has to be strictly less than 4000 pixels wide to work with page segmentation. Looking at the Tesseract source code, in "pgedit" the canvas size seems limited to 4000 x 4000, since the constructor of a "ScrollView" (whatever purpose it serves) is:
ScrollView::ScrollView(const char* name, int x_pos, int y_pos, int x_size,
int y_size, int x_canvas_size, int y_canvas_size, bool y_axis_reversed)
So my question now is: why is there a limit of 4000 pixels in width/height when page segmentation is used, and what should I do if a pdf page converted to png at 300 dpi exceeds 4000 pixels (in width, height, or both)?
Any help appreciated,
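Purely as an illustrative workaround (the 4000 px figure comes only from the observation above, and MAX_DIM is a hypothetical constant, not an official tesseract setting), one could downscale oversized renders before piping them to tesseract; re-rendering the page at a lower DPI would be an alternative that avoids resampling an already rendered bitmap:

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class FitBelowLimit {
    // Hypothetical limit based on the observation above, not an official tesseract constant.
    static final int MAX_DIM = 4000;

    // Returns the image unchanged if it already fits, otherwise scales it down
    // uniformly so that both width and height stay strictly below MAX_DIM.
    static BufferedImage fit(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        if (w < MAX_DIM && h < MAX_DIM) {
            return src;
        }
        double scale = Math.min((MAX_DIM - 1) / (double) w, (MAX_DIM - 1) / (double) h);
        int nw = (int) Math.round(w * scale), nh = (int) Math.round(h * scale);
        // Pages are rendered as ImageType.GRAY above, hence the grayscale target type.
        BufferedImage dst = new BufferedImage(nw, nh, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, nw, nh, null);
        g.dispose();
        return dst;
    }
}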

Related

Tesseract-OCR only detects words but not single characters

As the title says, I've been trying to do live recognition of a single character from an image taken with a webcam (the image contains only that character, no other words or characters), but I couldn't get any results. The machine is an RPi3 with 1GB of RAM running Raspberry Pi OS 64-bit, so I can't use libraries such as easyocr due to their heavy load on RAM and CPU.
What I've tried:
I've always applied a threshold to the image (using OTSU and a binary mask) (example image containing all 3 characters I have to recognize), but no results. I also tried to pass the image to tesseract with the Page Segmentation Mode set to 10 in order to treat the image as a single character, still no results.
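For reference, a minimal sketch of driving that --psm 10 run through the tesseract CLI (file names are placeholders; written in Java only for consistency with the rest of this page):

public class SingleCharOcr {
    public static void main(String[] args) throws Exception {
        // --psm 10 tells tesseract to treat the image as a single character;
        // the recognized text ends up in out.txt.
        Process p = new ProcessBuilder("tesseract", "char.png", "out", "--psm", "10")
                .inheritIO()
                .start();
        p.waitFor();
    }
}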

Tesseract mixing up "1" and "7" despite training on exact font

I am using tesseract to get text from an image, and I am only interested in numbers. I have trained tesseract and created a new language for the exact font in the image, and the training data only included numbers. The training data also included every possible value that would appear in an image, 1-5000 to be specific, and I created a wordlist of these same values as well. However it still mixes up 1 and 7, as well as sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some preprocessing on the image before giving it to tesseract?
1. Make sure there are at least 20 instances of every character in the training texts you provide to tesseract. I give at least 6 pages of the same font to have a decent training sample size.
2. Tesseract text recognition also depends on the image quality. Check out possible preprocessing algorithms you can use: Improve Quality of Tesseract (see the sketch after this list).
3. Take a look at the number_dawg file. Modifying it can help with recognising digits.
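As a minimal illustration of the preprocessing mentioned in point 2 (file names and the fixed 128 threshold are arbitrary placeholders; Otsu or adaptive thresholding would typically do better):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class Binarize {
    public static void main(String[] args) throws Exception {
        BufferedImage src = ImageIO.read(new File("digits.png"));
        BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                // Average the RGB channels, then apply a global threshold.
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                dst.setRGB(x, y, gray < 128 ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        ImageIO.write(dst, "png", new File("digits-bw.png"));
    }
}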

Tesseract training with multipage tiff

How does the box file need to look like if I use a multipage tiff to train Tesseract?
More precisely: how do the Y-coordinates of a box file correspond to Y-coordinates within pages?
The last (6th) column in the box file represents the zero-based page number.
https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files
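For illustration, hypothetical lines from a box file covering a two-page tiff could look like this (columns: character, left, bottom, right, top, zero-based page number; the coordinates are relative to the page the character sits on, with the origin at the bottom-left):

a 25 1860 45 1885 0
b 52 1860 74 1885 0
a 25 1860 45 1885 1
c 52 1858 71 1886 1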
Update:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
Each font should be put in a single multi-page tiff and the box file can be modified to specify the page number for each character after the coordinates. Thus an arbitrarily large amount of training data may be created for any given font, allowing training for large character-set languages.
Even though you can create as much training text as you want, doing so could result in unnecessarily large images and hence slow down training.

Understanding webp encoder options

I'm currently experimenting with the webp encoder (no WIC) in a Windows 64 environment. My samples are 10 jpg stock photos depicting landscapes and houses, and the photos have already been optimized with jpegtran. I do this because my goal is to optimize the images of a whole website where the images have already been compressed in Photoshop using the Save for Web command with various quality values and then optimized with jpegtran.
I found out that using values smaller than -q 85 has a visible impact on the quality of the webp images, so I'm playing with values above 90 where the difference is smaller. I also concluded that I have to use -jpeg_like, because without it the output is sometimes bigger in size than the original, which is not acceptable. I also use -m 6 -f 100 -strong, because I really don't mind how long the encoder takes to produce the output and I'm trying to achieve the smoothest results. I tried several values for these and concluded that -m 6 -f 100 -strong gives the best output regarding quality and size.
I also tried the -preset photo option, avoiding any other parameter except -q, but the output gets bigger in size.
What I don't understand from https://developers.google.com/speed/webp/docs/cwebp#options are the options -sns and -segments, which seem to have a great impact on the output size. Sometimes the output is bigger and sometimes smaller for the same options, but I haven't figured out yet what the reason for that is or how to use them properly.
I also don't understand the -sharpness option, which doesn't have an impact on the output size, at least for me.
My approach is less a scientific method than trial and error. If anybody knows how to use those options for this kind of input and can explain them for optimum results, I would appreciate the feedback.
-strong and -sharpness only change the filtering strength recorded in the header of the compressed bitstream; the filtering is applied at decoding time. That's why you don't see a change in file size for these.
-sns controls the choice of filtering strength and quantization values within each segment. A segment is just a group of macroblocks in the picture that are believed to share similar properties regarding complexity and compressibility. A complex photo should likely use the maximum allowed 4 segments (which is the default).
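As an illustration of how one might isolate the effect of -sns and -segments on one of the sample photos (file names are placeholders, and the other flags are kept fixed so only the segmentation parameters vary):

cwebp -q 90 -m 6 -sns 100 -segments 4 sample.jpg -o sample-sns100.webp
cwebp -q 90 -m 6 -sns 0 -segments 1 sample.jpg -o sample-sns0.webp

Comparing those two outputs against a run with the flags from the question (-m 6 -f 100 -strong -jpeg_like) should show whether the size differences come from the segmentation settings or from the filtering options.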

What is the maximum size of JPEG metadata?

Is there a theoretical maximum to the amount of metadata (EXIF, etc) that can be incorporated in a JPEG file? I'd like to allocate a buffer that is assured to be sufficient to hold the metadata for any JPEG image without having to parse it myself.
There is no theoretical maximum, since certain APP markers can be used multiple times (e.g. APP1 is used for both the EXIF header and the XMP block). Also, there is nothing to prevent multiple comment blocks.
In practice, the one that much more commonly results in a large header is the APP2 marker being used to store the ICC color profile for the image. Since some complicated color profiles can be several megabytes, the profile actually gets split into many APP2 blocks (since each APP block has a 16-bit length limit).
Each APPN data area has a length field that is 2 bytes, so 65536 bytes would hold the biggest one. If you are only worried about the EXIF data, it would be a bit less.
http://www.fileformat.info/format/jpeg/egff.htm
There are at most 16 different APPN markers in a single file. I don't think they can be repeated, so 16*65K should be the theoretical max.
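To make that concrete, a small sketch (hypothetical file name) that walks the markers before the entropy-coded data and sums the APPn segment lengths, relying on the 2-byte big-endian length field described above (padding 0xFF fill bytes and other corner cases are ignored):

import java.io.DataInputStream;
import java.io.FileInputStream;

public class JpegAppSize {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream("photo.jpg"))) {
            if (in.readUnsignedShort() != 0xFFD8) {         // SOI marker, no length field
                System.err.println("Not a JPEG");
                return;
            }
            long appBytes = 0;
            while (true) {
                int marker = in.readUnsignedShort();        // 0xFFxx
                if (marker == 0xFFDA) break;                // SOS: compressed image data follows
                int length = in.readUnsignedShort();        // includes the 2 length bytes
                if (marker >= 0xFFE0 && marker <= 0xFFEF) { // APP0..APP15
                    appBytes += length;
                }
                in.skipBytes(length - 2);                   // skip the segment payload
            }
            System.out.println("Total APPn metadata bytes: " + appBytes);
        }
    }
}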
Wikipedia states:
Exif metadata are restricted in size to 64 kB in JPEG images because according to the specification this information must be contained within a single JPEG APP1 segment.