I would like tesseract to recognize numbers on the attached image:
Tesseract is able to recognize when the number starts with 7, but once there is an 8, it fails.
I am using something like this:
tesseract image.png output --oem 3 --psm 11 -c tessedit_char_whitelist=0123456789
I cycled through all options of oem and psm (1..20), but none was good. Am I missing something here to make it work?
I converted the image to b/w and inverted colors. Then tesseract started getting better results.
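The preprocessing described above (convert to black and white, then invert) can be sketched roughly as follows. This is a minimal, library-free illustration on a nested list of 8-bit grayscale values; the function name `binarize_and_invert`, the threshold of 128, and the assumption that the digits are light on a dark background are all hypothetical and not taken from the original post. With real images you would do the same thing through an imaging library.

```python
def binarize_and_invert(gray, threshold=128):
    """Threshold 8-bit grayscale rows to pure black/white, then invert.

    `gray` is a list of rows, each a list of ints in 0..255.
    The threshold value is an assumption; tune it for your images.
    """
    out = []
    for row in gray:
        # A pixel at or above the threshold would become white (255);
        # inverting turns it black (0), and dark pixels turn white (255).
        out.append([0 if px >= threshold else 255 for px in row])
    return out


# Light strokes (200, 220) on a dark background (10) become
# black strokes on a white background.
sample = [[10, 200, 10], [10, 220, 10]]
print(binarize_and_invert(sample))
```

Dark text on a light background is generally what tesseract handles best, which would explain why the results improved after inverting.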
I'm new to Tesseract and investigating how it works.
But in some cases it fails to recognise the simplest text ("0").
I've checked the processed image and it looks pretty clear to me.
Any suggestions what might be wrong?
Source image: and tessinput.tif:
$ ./tesseract.exe /c/dev/git/fifa/proclubs-stats/assist.png stdout conf.txt
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 450
Empty page!!
Estimating resolution as 450
Empty page!!
My Java project deals with OCRing pdfs to index them. Each pdf page is converted into a png which is then piped to tesseract 4.
The pdf->png conversion uses renderImageWithDPI from PDFBox's PDFRenderer:
buffImage = pdfRenderer.renderImageWithDPI(currentPage,
PNG_DENSITY,
ImageType.GRAY);
with PNG_DENSITY = 300 as advised on tesseract's wiki to get best results.
The command used for tesseract is
tesseract input.png output -l fra --psm 1 --oem 1
I also tried --psm 2 and 3, which also involve page segmentation, i.e.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
With a scanned PDF of 146 pages (producer/creator is Adobe Acrobat 7.0; it contains copyrighted content, so I can't share it), tesseract never finishes processing one particular page (page 85).
Because waiting until page 85 gets OCRed took too long for testing, I generated an extract of this pdf with Evince's "print to file" feature.
Tesseract handles the pdf produced by Evince (producer/creator is cairo 1.14.8) successfully (i.e. the image gets OCRed).
The difference is the image resolution. The image that fails is 4991x3508 pixels whereas the one that succeeds is only 3507x2480 pixels.
Please note: tesseract in "Sparse text with OSD" mode (i.e. --psm 12) handles the page "successfully", although the resulting text is not understandable because the page's 2 columns get mixed together.
EDIT after several trials and errors
It looks like the input image has to be strictly less than 4000 pixels wide to work with page segmentation. Looking at the Tesseract source code, in "pgedit", the canvas size seems to be limited to 4000 x 4000, judging by the constructor of "ScrollView" (I don't know what it is used for):
ScrollView::ScrollView(const char* name, int x_pos, int y_pos, int x_size,
int y_size, int x_canvas_size, int y_canvas_size, bool y_axis_reversed)
So my question now is: why is there a limit of 4000 pixels in width/height for page segmentation, and what should I do if a pdf page converted to png at 300 dpi exceeds 4000 pixels (in width, height, or both)?
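One possible workaround, assuming the 4000-pixel threshold observed above really is what matters (it is the asker's observation, not a documented limit), is to lower the render DPI just enough that the longest side stays under it. Rendered pixel size scales linearly with DPI, so the arithmetic is simple. The function name `dpi_to_fit` and the default `max_side=3999` are hypothetical names chosen for this sketch.

```python
import math


def dpi_to_fit(width_px, height_px, dpi, max_side=3999):
    """Return the largest integer DPI (<= dpi) keeping both sides <= max_side.

    width_px/height_px are the page's pixel dimensions when rendered at `dpi`.
    max_side=3999 encodes the assumed "strictly less than 4000" threshold.
    """
    longest = max(width_px, height_px)
    if longest <= max_side:
        return dpi  # already small enough, keep the requested DPI
    # Pixel size is proportional to DPI, so scale the DPI by the same ratio.
    return math.floor(dpi * max_side / longest)


# The failing page from the question: 4991x3508 pixels at 300 dpi.
print(dpi_to_fit(4991, 3508, 300))
# The page that worked: 3507x2480 pixels at 300 dpi.
print(dpi_to_fit(3507, 2480, 300))
```

The resulting value could then be passed to renderImageWithDPI in place of PNG_DENSITY for oversized pages only, at the cost of some resolution on those pages.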
Any help appreciated,
I have some questions about making tiff/box files for tesseract 4.
The TrainingTesseract 4.00 documentation says:
Making Box Files
As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example).
But it did not explain how to train with pre-existing images.
I want to train for the Persian language in tesseract 4 (lstm). I have some images from ancient manuscripts and want to train with images and texts instead of font. So I can’t use text2image command. I know that the old format box files will not work for LSTM training.
How can I make tif/box files for tesseract 4 LSTM training and then label them, and how do I need to change the tesseract commands?
Should I use other tools for generating box files (given that Persian is written right to left)?
Should I use fine tuning or train from scratch?
I was struggling just like you, until I found this github repository:
https://github.com/OCR-D/ocrd-train
It will make your life super easy. All you need to do is to put your images in tif format and your text should have the same image name with extension .gt.txt. It will take care of all the rest for you. (you might need to update the Makefile according to your local machine)
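Before running the training, it can help to check that every image really has a transcription next to it under the naming convention described above (`name.tif` paired with `name.gt.txt`). The following is a small sketch of such a check; the function name `missing_ground_truth` is hypothetical and is not part of ocrd-train.

```python
def missing_ground_truth(filenames):
    """Return the .tif images that have no matching <name>.gt.txt file.

    `filenames` is any iterable of file names from the training folder,
    for example the result of os.listdir() on that folder.
    """
    names = set(filenames)
    missing = []
    for name in sorted(names):
        if name.endswith(".tif"):
            # line1.tif is expected to be paired with line1.gt.txt
            expected = name[: -len(".tif")] + ".gt.txt"
            if expected not in names:
                missing.append(name)
    return missing


print(missing_ground_truth(["line1.tif", "line1.gt.txt", "line2.tif"]))
```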
Whether to train from scratch or fine-tune depends on your language, your data and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only need to build on it.
All the useful details you might need can be found in this answer
1) Use the command below to make an lstmbox file:
tesseract test.tif test-lstmbox -l eng --psm 6 lstmbox
It will make an lstmbox file for you, but you have to correct the characters in the box file by hand.
2) Training from scratch requires a lot of data, so I suggest that fine tuning is the better option.
I am using tesseract to get text from an image, I am only interested in numbers. I have trained tesseract and created a new language that is the exact font in the image and the training data only included numbers. In the training data I also included every possible value that would be in an image, 1-5000 to be specific and also created a wordlist of these same values. However it still mixes up 1 and 7, as well as sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some processing on the image before giving it to tesseract?
1. Make sure there are at least 20 instances of every character in the training texts you provide to tesseract. I give at least 6 pages of the same font to have a decent training sample size.
2. Tesseract text recognition also depends on the image quality. Check out possible preprocessing algorithms you can use: Improve Quality of Tesseract
3. Take a look at the number_dawg file. Modifying it can help with recognising digits.
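The "at least 20 instances of every character" rule from the first point is easy to check mechanically. Here is a minimal sketch; the function name `underrepresented` and its parameters are hypothetical, and the digit-only alphabet is an assumption based on the question.

```python
from collections import Counter


def underrepresented(text, alphabet="0123456789", minimum=20):
    """Return {char: count} for alphabet characters seen fewer than `minimum` times."""
    # Count only the characters we care about; everything else is ignored.
    counts = Counter(ch for ch in text if ch in alphabet)
    return {ch: counts[ch] for ch in alphabet if counts[ch] < minimum}


# Training text as described in the question: every value from 1 to 5000.
training_text = " ".join(str(n) for n in range(1, 5001))
print(underrepresented(training_text))
```

An empty result means every digit clears the threshold, in which case the 1/7 and 3/8 confusion is more likely an image quality issue than a lack of training samples.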
I'm using Computer Vision System Toolbox in Matlab (R2015a, Windows7) to mask frames in the video file and write them into a new video file. By masking, I replace about 80% of the image with 0s and 1s:
videoFileReader = vision.VideoFileReader(fin);
videoFileWriter=vision.VideoFileWriter(fout, ...
'FileFormat', 'MPEG4', 'FrameRate', videoFileReader.info.VideoFrameRate);
frame = step(videoFileReader);
frame_new=mask(frame); %user function
step(videoFileWriter, frame_new);
The size (1080 x 1920 x 3) and the format (single) of the original and modified frames remain the same. Yet the masked videos are much bigger than the original ones, e.g. a 1 GB original video turns into nearly 4 GB after masking. These large new files cannot be opened (Windows 7, VLC media). Handbrake does not recognize them as legitimate video files either.
When I mask only about 20% of the image, the masked videos still come out large (up to 2.5 GB), but I have no problem opening these.
I tried adding 'VideoCompressor', 'MJPEG Compressor', but this gives a warning.
videoFileWriter=vision.VideoFileWriter(fout, 'FileFormat', 'MPEG4', ...
'FrameRate', videoFileReader.info.VideoFrameRate, 'VideoCompressor', 'MJPEG Compressor');
<...>
Warning: The VideoCompressor property is not relevant in this configuration of the System object.
We have TBs of video data to deidentify, so any suggestion would be much appreciated.
Thanks!
Larissa,
The size of the output MPEG-4 file can be controlled by adjusting the Quality parameter of the System object. This is a value from 0-100 which controls the output bitrate, so the higher the quality, the larger the file. The default value is 75. The System object uses the Microsoft APIs to create MPEG-4 files.
Secondly, you need to call release(videoFileWriter) to complete writing the file. I just want to confirm that you are doing it and have just omitted it for the purposes of this code snippet.
The VideoCompressor property is not valid for MPEG-4 file format because the compressor to be used is fixed. You can choose that property only when you write out AVI files. However, you probably will not reach the same level of compression as MPEG-4.
Hope this helps.
Dinesh
Download ffmpeg here: https://git.ffmpeg.org/ffmpeg.git
For Windows, open a bash terminal and run:
cat <path to folder with images>/*.png | <path to ffmpeg bin folder>/ffmpeg.exe -f image2pipe -i - output.mkv
For Unix, do the same, but download the appropriate build of ffmpeg.
I tried on a 7.90GB folder and got a 6.4MB .mkv-file. Works like a charm!
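If no bash terminal is at hand, the same pipeline can be driven from a script. The sketch below mirrors the `cat ... | ffmpeg -f image2pipe -i - output.mkv` command from above; the function names `ffmpeg_pipe_args` and `encode_images` are hypothetical, and it assumes ffmpeg is on the PATH (pass the full path to the executable otherwise). The files are sorted before piping so the frame order matches what the shell glob would produce.

```python
import subprocess


def ffmpeg_pipe_args(ffmpeg="ffmpeg", output="output.mkv"):
    """Build the argument list equivalent to: ffmpeg -f image2pipe -i - output.mkv"""
    return [ffmpeg, "-f", "image2pipe", "-i", "-", output]


def encode_images(paths, ffmpeg="ffmpeg", output="output.mkv"):
    """Pipe the image files to ffmpeg's stdin, like `cat *.png | ffmpeg ...`."""
    proc = subprocess.Popen(ffmpeg_pipe_args(ffmpeg, output), stdin=subprocess.PIPE)
    for path in sorted(paths):  # keep frames in name order
        with open(path, "rb") as fh:
            proc.stdin.write(fh.read())
    proc.stdin.close()  # signals end of input so ffmpeg can finish the file
    return proc.wait()  # ffmpeg's exit code, 0 on success


print(ffmpeg_pipe_args())
```

A call such as `encode_images(glob.glob("frames/*.png"))` would then produce output.mkv in the current directory; the `frames` folder name is only an example.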