Training Tesseract for a new font - tesseract

When creating the CLUSTERING data using
mftraining -F font_properties -U unicharset -O lan.unicharset *.tr
I get the following message
C:\Users\ \AppData\Local\Tesseract-OCR>mftraining -F font_properties -U unicharset -O eng1.unicharset eng.lucidaconsole.box.tr
Warning: No shape table file present: shapetable
Failed to load unicharset from file unicharset
Building unicharset for training from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Reading eng.lucidaconsole.box.tr ...
Flat shape table summary: Number of shapes = 0 max unichars = 0 number with multiple unichars = 0
Done!
It rebuilt the unicharset I had already created and gave me one with about 1 KB of data, containing only this:
1
NULL 0 NULL 0
At this point I don't know what to do. I am a first-time user of this program, but this doesn't seem right to me.

It looks like you need to cluster the character features of the training pages, as described here.
I believe the basic command for this is something like:
shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
This appears to be something that was added in version 3.02.
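In other words, shapeclustering runs on the same .tr files before mftraining, so that mftraining can pick up the shapetable it writes (which is what the "No shape table file present" warning refers to). A rough sketch of the order, reusing the generic file names from the command above:
shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr
mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr
cntraining lang.fontname.exp0.tr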

If you're using Windows, I think this tool can help make the training process much, MUCH easier. I went through a lot of trouble learning how to train Tesseract before using it. Just download the latest version and read the user manual, and you will be able to train your Tesseract without touching the keyboard!

raster2pgsql: "Could not allocate memory for INSERT statement"

I'm very new to raster2pgsql so please bear with me. I'm trying to load a 60 MB .tif (from the High-Resolution Settlements Layer project) to my PostGIS-enabled database with the following code:
raster2pgsql -s 5235 -C -F [path to the .tif] public.hrsl_lka | psql -h localhost -U postgres -p 5432 -d project
However, I get the following error:
ERROR: insert_records: Could not allocate memory for INSERT statement
ERROR: process_rasters: Could not convert raster tiles into INSERT or COPY statements
ERROR: Unable to process rasters
However, loading smaller .tifs of around 3 MB from other sources into the same database works fine.
Is there a size limit with raster2pgsql? I'm on PostgreSQL 12.4.
With many thanks,
Gregor
Have you tried setting the tile size with -t?
According to the documentation:
-t: Tile size - expressed as width x height. If not provided, a default is worked out automatically in the range of 32-100 so it best matches the raster dimensions. It is worth remembering that when importing multiple files, tiles will be computed for the first raster and then applied to others.
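For example, an explicit tile size could be passed like this (the 256x256 value here is only an illustration; pick whatever suits your raster):
raster2pgsql -s 5235 -t 256x256 -C -F [path to the .tif] public.hrsl_lka | psql -h localhost -U postgres -p 5432 -d project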
Alternatively, you can let the tool compute it for you by setting -t to auto, e.g.
raster2pgsql -s 5235 -t auto -C -F file.tif public.hrsl_lka | psql -d db
Related answer: Are there limitations using a PostGIS out-db raster?

What is the MongoDB archive format?

I've backed up some MongoDB databases using the archive option, but I can't simply untar them. When I go through some steps to decompress the data, it looks like the archive is the whole DB in one big file.
I wanted to get at the files for the individual collections.
Is there a way to do that?
$ tar -xvf valk.archive
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
$ file valk.archive
valk.archive: gzip compressed data, original size 13953183
$ gunzip valk.archive
gunzip: valk.archive: unknown suffix -- ignored
$ unzip valk.archive
Archive: valk.archive
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of valk.archive or
valk.archive.zip, and cannot find valk.archive.ZIP, period.
$ mv valk.zip valk.gz
$ gunzip valk.gz
$ open .
$ tar -xvf valk
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
$ head valk
TemplateDatametadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.TemplateData"}],"uuid":"f52402b5aba24856b072d57cc3e46a72"}size-dbvalkcollectioMetricsmetadata�{"options":{"capped":true,"size":10485760,"max":1000000},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.Metrics"},{"v":2,"key":{"openid":1},"name":"openid_1","ns":"valk.Metrics"}],"uuid":"43d92ff01815432c95dac5a2e05a64c0"}size�dbvalkcollection
AppConfigmetadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.AppConfig"}],"uuid":"df633b0a43184de38e8b8ea7489cda3e"}size�dbvalkcollecMinibotZonesmetadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.MinibotZones"}],"uuid":"095bbac0d17640be9e27dffe681b7d83"}size�dbvalkcollection ChatLogsmetadataQ{"options":{"capped":true,"size":104857600,"max":10000000},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.ChatLogs"},{"v":2,"key":{"openid":1,"createdAt":1},"name":"openid_1_createdAt_1","ns":"valk.ChatLogs"},{"v":2,"key":{"createdAt":1},"name":"createdAt_1","ns":"valk.ChatLogs"}],"uuid":"70586c82b3ae42cf8d9c47ad339ea55b"}size�dbvalkcollection
The mongodump archive format is a special purpose format; you need to use mongorestore --archive with any other options that are appropriate.
For example, you can use the --nsInclude option (mongorestore 3.4+) to selectively restore multiple collections by namespace.
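For example, since file identified the archive above as gzip-compressed, something along these lines should restore a single collection from it (the valk.Metrics namespace is taken from the head output in the question; substitute the collection you actually want):
mongorestore --gzip --archive=valk.archive --nsInclude="valk.Metrics"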
For more information on the MongoDB archive format (and why tar wasn't suitable), see:
Archiving and Compression in MongoDB Tools. The gist of this is:
General purpose archive formats, like tar, only support contiguous file packing within the archive. Using these archive formats for mongodump and mongorestore will create an unacceptable performance degradation as data from all collections will have to be written to and read from, in order. To support the concurrent behavior of these tools, we developed a special purpose archive format that supports non-contiguous file writes. The new archiving feature provides major gains in the efficiency of backup and restore operations.

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ l -h /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The below demonstrates the result, using your given example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
For individual files (without --recursive), the option -O /dev/null will make wget not create a file for the output. For recursive fetches /dev/null is not accepted (I don't know why). But why not just write all the output concatenated into one single file via -O tmpfile and delete that file afterwards, as sketched below?
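A minimal sketch of that workaround, reusing the command from the question (the /tmp/epfl.body path is just a placeholder name for the throwaway concatenated output):
$ wget --warc-file=/tmp/epfl --recursive --level=1 -O /tmp/epfl.body http://www.epfl.ch/
$ rm /tmp/epfl.body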

Mahout clustering: How to retrieve the name of a named vector

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the option --namedVector when creating the sparse files, but where does it take the ID from, and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]
k-means in Mahout is only a toy.
You can use it for howtos and tutorials, but for real use it is too slow, too limited, and too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)
Benchmark other tools, and you'll be surprised big time.
I found a way. You can use the seqdumper to extract the cluster mapping:
./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt
Then you can use a regex to extract the mapping of the vector IDs to cluster IDs, for example along the lines sketched below.
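For illustration only (the exact layout of the seqdumper output varies between Mahout versions, so the pattern below is an assumption and may need adjusting): the cluster ID appears after Key: and the named-vector document ID after vec:, so something like this pulls out the document-to-cluster mapping:
# assumed record shape: "Key: <clusterId>: Value: wt: ... vec: <documentId> = [...]"
sed -nE 's/^Key: ([0-9]+):.*vec: ([^ ]+).*/\2 cluster-\1/p' /tmp/cluster-points.txt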

font_properties error while training tesseract

While training Tesseract, I encountered an error: "Failed to load font_properties from font_properties". I am running the command:
shapeclustering -F font_properties -U unicharset pristina.tr
My font_properties file looks like this: pristina 0 1 0 0 0
I am following this blog for guidance.
You need to follow the Tesseract filename standard for image and box files: [lang].[fontname].exp[num]
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
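For example (using eng as the language code purely for illustration), the training files and the font_properties entry would need to line up like this, with the font name in font_properties matching the fontname part of the file names:
eng.pristina.exp0.tif
eng.pristina.exp0.box
eng.pristina.exp0.tr
font_properties contains: pristina 0 1 0 0 0
shapeclustering -F font_properties -U unicharset eng.pristina.exp0.tr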