Tesseract: Advantage to Multi-Page Training File vs. Multiple Separate Files?

This SO answer suggests that training tesseract with .tif files has an advantage over .png files because a .tif file can contain multiple pages and thus a larger training sample. Yet this SO question discusses procedures for training with multiple images at once. Moreover, the man page for, e.g., mftraining suggests that it can accept multiple training files.
Is there any reason then not to train with multiple separate image files?

Using multiple images to train tesseract on a single font appears to work just fine. Below is a sketch of the workflow I employ:
# Convert each .pdf page to a .png image
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png
# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox
## correct boxes with jTessBoxEditor or another box editor ##
# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng nobatch box.train.stderr
# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box
echo "MyNewFont 0 0 0 0 0" >> font_properties
# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr
## rename files
mv inttemp eng1.inttemp
mv normproto eng1.normproto
mv pffmtable eng1.pffmtable
mv shapetable eng1.shapetable
combine_tessdata eng1. ## create .traineddata file.
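Once eng1.traineddata exists, it is worth a quick sanity check. A minimal sketch, assuming a hypothetical sample image test.png and the traineddata file in the current directory (--tessdata-dir requires tesseract 3.04 or later):
# Run OCR against the freshly built eng1.traineddata in the current directory
tesseract test.png out --tessdata-dir . -l eng1
cat out.txt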

You can certainly train with multiple image files; Tesseract would treat them as having different, separate fonts. There is also a limit of 64 images. If they share a common font, it is better to put them in a multi-page TIFF; according to its specification, a TIFF file is a container that can hold many images.
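For the shared-font case, a minimal sketch of the multi-page route, reusing the page images from the workflow above: combine them into one TIFF with ImageMagick, then run the box step once. Each line of the resulting box file carries a trailing page number, and a single .tr file covers all pages.
# Combine the page images into one multi-page TIFF, then make boxes once
convert eng1.MyNewFont.exp1.png eng1.MyNewFont.exp2.png eng1.MyNewFont.exp0.tif
tesseract eng1.MyNewFont.exp0.tif eng1.MyNewFont.exp0 -l eng batch.nochop makebox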
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
https://en.wikipedia.org/wiki/Tagged_Image_File_Format

Related

How do I set up configuration variables in Tesseract to better recognize code?

I want to use Tesseract to recognize code. It is said on their website that I can disable dictionaries by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
However, I haven't been able to do it correctly.
$ tesseract img.jpg output.txt --oem 0 -c load_system_dawg=0 load_freq_dawg=0
read_params_file: Can't open load_freq_dawg=0
Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Any ideas on the best way to handle it?
First of all, get an eng.traineddata that includes the legacy engine components, or use a different OCR engine mode (OEM) value.
Next, read the output of tesseract --help-extra carefully:
-c VAR=VALUE Set value for config variables.
Multiple -c arguments are allowed.
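Putting the two together: each variable needs its own -c flag, and the output base is given without an extension (tesseract appends .txt itself). A corrected call would look like this, assuming an eng.traineddata that actually contains the legacy components (the standard tessdata repository includes them; tessdata_fast does not):
tesseract img.jpg output --oem 0 -c load_system_dawg=false -c load_freq_dawg=false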

wget: Download image files from URL list that are >1000KB & strip URL parameters from filename

I have a text file C:\folder\filelist.txt containing a list of numbers, for example:
 
345651
342679
344000
349080
I want to build the full URL around each number as shown below, download only the files that are >1000KB, and strip the parameters after "-a1" from the filename. For example:
URL                                                          Size    Output File
https://some.thing.com/gab/abc-345651-def-a1?scl=1&fmt=jpeg  1024kb  C:\folder\abc-345651-def-a1.jpeg
https://some.thing.com/gab/abc-342679-def-a1?scl=1&fmt=jpeg  3201kb  C:\folder\abc-342679-def-a1.jpeg
https://some.thing.com/gab/abc-344000-def-a1?scl=1&fmt=jpeg  644kb   -  (skipped: <1000KB)
https://some.thing.com/gab/abc-349080-def-a1?scl=1&fmt=jpeg  2312kb  C:\folder\abc-349080-def-a1.jpeg
This is the code I currently have; it works for downloading the files and appending the .jpeg extension, given that the full URL is in the text file. It does not filter out the smaller images or strip the parameters following "-a1".
cd C:\folder\
wget --adjust-extension --content-disposition -i C:\folder\filelist.txt
I'm running Windows and I'm a beginner at writing batch scripts. The most important thing I'm trying to accomplish is to avoid downloading images <1000kb: it would be acceptable if I had to manually append the URL in the text file and rename the files after the fact. Is it possible to do what I'm trying to do? I've tried modifying the script by referencing the posts below, but I can't seem to get it to work. Thanks in advance!
Wget images larger than x kb
Downloading pdf files with wget. (characters after file extension?)
Spider a Website and Return URLs Only
#!/bin/bash
# change working directory
cd /c/folder/ || exit 1
# convert the input file list to unix line endings
dos2unix filelist.txt
for image in $(cat filelist.txt); do
  imageURL="https://some.thing.com/gab/abc-${image}-def-a1?scl=1&fmt=jpeg"
  # read Content-Length from wget's debug output (note: this streams the body too)
  size=$(wget -d -qO- "$imageURL" 2>&1 | grep 'Content-Length' | awk '{print $2}')
  # 1000KB = 1024000 bytes
  if [[ $size -gt 1024000 ]]; then
    imgname="/c/folder/abc-${image}-def-a1.jpeg"
    wget -O "$imgname" "$imageURL"
  fi
done
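One caveat with the script above: the size check streams the entire file just to read Content-Length, so every accepted image is effectively downloaded twice. A hedged alternative is wget's --spider mode, which asks for headers only (whether Content-Length is present depends on the server); this is a drop-in replacement for the size= line above:
# Header-only size check; avoids downloading the body during the test.
# 'exit' keeps only the first match if redirects print several header blocks.
size=$(wget --spider --server-response "$imageURL" 2>&1 | awk '/Content-Length/ {print $2; exit}')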

How to reduce the file size of JPEG images in batch (Mac)?

I have a list of .JPG files on a Mac. I want to export them to a format taking less than 500 kilobytes per image.
I know how to do that using the Preview application one image at a time, but I want to be able to do the same in batch, meaning on several files at once. Is there a command line way to do it so I could write a script and run it in the terminal?
Or some other way that I could use?
This is an example from the command line using convert (from ImageMagick: brew install imagemagick), converting all *.jpg images in one directory to .png:
$ for i in *.jpg; do
    convert "$i" "${i%.jpg}.png"
done
To test first (a dry run), you could use echo in place of the convert command:
$ for i in *.jpg; do
    echo "$i" "${i%.jpg}.png"
done
This searches the directory for files with the .jpg extension and runs convert on each one, passing the file name $i as input and, as output, the same name with the extension replaced by .png. The replacement is done using:
"${i%.jpg}.png"
The double quotes are there in case a file name contains spaces; see the bash documentation on shell parameter expansion for more details.
For example, to just change the quality of the file you could use:
convert "$i" -quality 80% "${i%.jpg}-new.jpg"
Or, if you don't need to keep the original:
mogrify -quality 80% *.jpg
The main difference is that convert tends to be for working on individual images, whereas mogrify is for batch processing multiple files.
Install ImageMagick. (Really, it's lightweight and amazing.) It's easiest to install using Homebrew. Then...
Open terminal.
cd [FilepathWithImages] && mogrify -define jpeg:extent=60kb -resize 400 *.JPG
Wait until the process is complete (may take a few minutes if you have many images)
To check file sizes, try du -sh * to see the size of each file in the directory you're in.
NOTE: the glob is case-sensitive, so *.JPG only matches files whose extension is uppercase; use *.jpg for lowercase extensions
How this works:
cd [yourfilepath] will navigate to the directory you want to be in
&& is used for chaining commands
mogrify is used when you want to keep the same filename
-define jpeg:extent=60kb sets the maximum filesize to 60kb
-resize 400 will set the width to 400 pixels; the height scales to preserve the aspect ratio
*.JPG is for all files in the directory you're in.
There are many additional options you can use with ImageMagick's convert and mogrify. After installing it, you can use man mogrify to see the options you can combine.
According to the docs, "Restrict the maximum JPEG file size, for example -define jpeg:extent=400KB. The JPEG encoder will search for the highest compression quality level that results in an output file that does not exceed the value. The -quality option also will be respected starting with version 6.9.2-5. Between 6.9.1-0 and 6.9.2-4, add -quality 100 in order for the jpeg:extent to work properly. Prior to 6.9.1-0, the -quality setting was ignored."
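Following that quote, on the affected 6.9.1-0 through 6.9.2-4 releases the size cap only takes effect when paired with -quality 100. A sketch for the 500KB target from the question:
# -quality 100 makes jpeg:extent honored on ImageMagick 6.9.1-0 .. 6.9.2-4
mogrify -quality 100 -define jpeg:extent=500KB *.jpg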
Install ImageMagick from Homebrew or MacPorts or from https://imagemagick.org/script/download.php#macosx. Then use mogrify to process all files in a folder using -define jpeg:extent=500KB saving to JPG.
I have two files in folder test1 on my desktop. Processing will put them into folder test2 on my desktop.
Before Processing:
mandril.tif 3.22428MB (3.2 MB)
zelda.png 726153B (726 KB)
cd
cd desktop/test1
mogrify -path ../test2 -format jpg -define jpeg:extent=500KB *
After Processing:
mandril.jpg 358570B (359 KB)
zelda.jpg 461810B (462 KB)
See https://imagemagick.org/Usage/basics/#mogrify
The * at the end means to process all files in the folder. If you want to restrict it to only jpg files, change it to *.jpg. The -format option means you intend the output to be jpg.
DISCLAIMER: BE CAREFUL BECAUSE THE FOLLOWING SOLUTION IS A "DESTRUCTIVE" COMMAND, FILES ARE REPLACED WITH LOWER QUALITY DIRECTLY
Now that you have read my disclaimer, I would recommend getting cwebp, which you can download here.
You will also need parallel (sudo apt-get install -y parallel, or brew install parallel on macOS), and then I came up with the following script:
parallel cwebp {} -resize 0 640 -m 6 -sns 100 -q 80 -preset photo -segments 4 -f 100 -o {} ::: *.jpg && \
find . -name "*.jpg" | parallel 'f="{}" ; mv -- {} ${f:0:-3}webp'
640 is the resulting image height in pixels, and the 0 before it means the width will adapt to preserve the aspect ratio.
I reduced the quality to 80% (-q 80); you will not notice much difference.
The second line finds all the files that have been converted but still have the wrong extension (.jpg), removes the last three characters (jpg), and adds webp instead.
I went from 5 MB to about 50 KB per file (the .jpg images were 4000x4000 pixels), saving 20 GB of storage. I hope you enjoy it!
If you don't want to bother with the webp format, you can use the following instead (you may need to install ImageMagick first):
parallel convert {} -resize x640 -sampling-factor 4:2:0 -strip -quality 85 \
-interlace JPEG -colorspace RGB -define jpeg:dct-method=float {} ::: *.jpg

How do you save a file to an output.txt when using tessedit_char_whitelist in tesseract?

I have managed to use
tesseract image.jpg output.txt
to read the text in an image file and save it as a text file. But now that I am trying to use more specific options with tesseract, it tries to open the output file rather than save into it.
I am trying to use
tesseract image.jpg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-15 TextOutput
I have literally just started using tesseract, so I may well be making a stupid mistake.
I figured out that if you insert a > (shell output redirection) after the options, it works like this:
tesseract image.jpg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-1250 > TextOutput.txt
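Note that tesseract's usage is tesseract imagename outputbase [options...] [configfile...], so the trailing TextOutput in the original command was being parsed as a config file, which is why tesseract tried to open it. Putting the output base in second position avoids the redirect entirely; tesseract then writes TextOutput.txt itself:
tesseract image.jpg TextOutput -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-1250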

How to create a single file with the dvipng command?

Using the dvipng command, I can generate PNG images from a DVI file.
dvipng -D 200 worksheet.dvi
However, the command generates one file per page. How can I create a single image file containing all pages?
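One hedged approach: let dvipng write its per-page files, then stack them with ImageMagick. A minimal sketch (the %03d placeholder in -o zero-pads the page number so the shell glob sorts pages in order; -append stacks the images vertically):
dvipng -D 200 -o page%03d.png worksheet.dvi
convert page*.png -append worksheet.png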