Now I want to train Tesseract for the Arabic language with a specific font, specifically just the digits.
According to the Tesseract docs, you must create the training data first, and I have done that.
Training
This is the command to create the training data and eval lists:
$TRAINING/tesstrain.sh --fonts_dir $FOLDER/simplified-arabic --lang ara --linedata_only \
--noextract_font_properties --langdata_dir $FOLDER \
--tessdata_dir $FOLDER/arabox/tessdata \
--fontlist "Simplified Arabic Bold" --output_dir $FOLDER/araeval
Output
This is the output
=== Starting training for language 'ara'
[ر أبر 25 22:17:23 EET 2018] /usr/local/bin/text2image --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --font=Simplified Arabic Bold --outputbase=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --text=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT
Rendered page 0 to file /tmp/font_tmp.sUvtJ9ehJT/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Simplified Arabic Bold
[ر أبر 25 22:17:24 EET 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 --max_pages=3 --font=Simplified Arabic Bold --text=/home/amir-paymob/WorkSpace/learnopencv/ara/ara.training_text
Rendered page 0 to file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --norm_mode 2 /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Extracting unicharset from box file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Wrote unicharset file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -O /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -X /tmp/tmp.bBjBa5bzUW/ara/ara.xheights --script_dir=/home/amir-paymob/WorkSpace/learnopencv
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/amir-paymob/WorkSpace/learnopencv/facerec/tessdata
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/tesseract /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 lstm.train /home/amir-paymob/WorkSpace/learnopencv/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1-69-g10f4 with Leptonica
Page 1
=== Constructing LSTM training data ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --script_dir /home/amir-paymob/WorkSpace/learnopencv --words /home/amir-paymob/WorkSpace/learnopencv/ara/ara.wordlist --numbers /home/amir-paymob/WorkSpace/learnopencv/ara/ara.numbers --puncs /home/amir-paymob/WorkSpace/learnopencv/ara/ara.punc --output_dir /home/amir-paymob/WorkSpace/learnopencv/araeval --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.lstmf to /home/amir-paymob/WorkSpace/learnopencv/araeval
Completed training for language 'ara'
The problem
Error during conversion of wordlists to DAWGs!!
I don't understand what that means.
In the training itself, using the lstmf files:
$TRAINING/lstmtraining --debug_interval 0 --max_iterations 1 \
--traineddata $FOLDER/arabox/tessdata/ara.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output $OUTPUT/base --learning_rate 20e-4 \
--train_listfile $FOLDER/araeval/ara.training_files.txt \
--max_iterations 5000 &>$OUTPUT/basetrain.log
it enters an infinite encoding loop; it cannot encode the characters.
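Both the DAWG error and the encoding loop can happen when the wordlist or training text contains characters that are missing from the generated unicharset. A minimal diagnostic sketch (my own check, not from the docs; it assumes a UTF-8 locale and GNU grep, and the paths are illustrative):
# List characters that appear in the wordlist but not in the unicharset;
# any output from comm points at characters that cannot be encoded.
awk 'NR > 1 { print $1 }' ara.unicharset | sort -u > /tmp/unicharset_chars
grep -o . ara/ara.wordlist | sort -u > /tmp/wordlist_chars
comm -23 /tmp/wordlist_chars /tmp/unicharset_chars
# Run the same check against ara.training_text as well.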
NOW
According to the documentation:
Bold elements must be provided. Others are optional, but if any of the dawgs are provided, the punctuation dawg must also be provided. A new tool: combine_lang_data is provided to make a starter traineddata from a unicharset and optional wordlists.
But what is combine_lang_data? (The log above invokes combine_lang_model, so the docs may be referring to that tool.)
Related
I am trying to use pgBadger to make an HTML report for PostgreSQL slow-query log files. My PostgreSQL log files are in csvlog format in the pg_log folder. I transferred all the log files (80 files of about 10 MB each) to my local Windows machine and am trying to generate a single HTML report for all of them. I combined all the files into one file as follows:
type postgresql-2020-06-18_075333.csv > postgresql.csv
type postgresql-2020-06-18_080011.csv >> postgresql.csv
....
....
type postgresql-2020-06-18_094812.csv >> postgresql.csv
I downloaded "pgbadger-11.2" and tried below command but getting error.
D:\pgbadger-11.2>perl --version
This is perl 5, version 28, subversion 1 (v5.28.1) built for MSWin32-x64-multi-thread
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql.csv" -o postgresql.html
[========================>] Parsed 923009530 bytes of 923009530 (100.00%), queries: 1254764, events: 53
can not load incompatible binary data, binary file is from version < 4.0.
LOG: Ok, generating html report...
postgresql.html is created, but there is no data in any tab. However, it works when I create a separate report for each individual CSV file, like below:
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_075333.csv" -o postgresql-2020-06-18_075333.html
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_080011.csv" -o postgresql-2020-06-18_080011.html
...
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_094812.csv" -o postgresql-2020-06-18_094812.html
Please suggest how I can fix this issue.
I'm going to say this is due to:
type postgresql-2020-06-18_075333.csv > postgresql.csv
type postgresql-2020-06-18_080011.csv >> postgresql.csv
Pretty sure that is introducing Windows line endings while pgBadger is looking for Unix line endings. Can you do the concatenation on the server?
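That said, pgBadger accepts several log files in a single invocation (its usage is pgbadger [options] logfile [...]), so you could sidestep the concatenation entirely. I have not tried this on Windows, but listing the CSV files directly should work; append the remaining files the same way:
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_075333.csv" "D:\June-Logs\postgresql-2020-06-18_080011.csv" -o postgresql.html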
UPDATE: Hmm. I ran across this:
https://github.com/darold/pgbadger/releases
"This new release breaks backward compatibility with old binary or JSON
files. This also mean that incremental mode will not be able to read
old binary file [...] Add a warning about version and skip loading incompatible binary file.
Update code formatter to pgFormatter 4.0."
Not sure why it is failing on CSV logs; still, what version of pgBadger is generating the logs?
I've installed fail2ban 0.10.5-2.el7 from EPEL on CentOS 7.8. I'm trying to get it to work with systemd for processing a Tomcat log (which also goes to the systemd journal).
In jail.local I added:
[guacamole]
enabled = true
port = http,https
backend = systemd
In filter.d/guacamole.conf:
[Definition]
failregex = Authentication attempt from <HOST> for user "[^"]*" failed\.$
ignoreregex =
journalmatch = _SYSTEMD_UNIT=tomcat.service + _COMM=java
If I run journalctl -u tomcat.service I see all the log lines. The ones I am interested in look like this:
May 18 13:58:26 myhost catalina.sh[42065]: 13:58:26.485 [http-nio-8080-exec-6] WARN o.a.g.r.auth.AuthenticationService - Authentication attempt from 1.2.3.4 for user "test" failed.
If I redirect journalctl -u tomcat.service to a log file and process it with fail2ban-regex, it works exactly the way I want, finding all the lines it needs:
% fail2ban-regex /tmp/j9 /etc/fail2ban/filter.d/guacamole.conf
Running tests
=============
Use failregex filter file : guacamole, basedir: /etc/fail2ban
Use log file : /tmp/j9
Use encoding : UTF-8
Results
=======
Failregex: 47 total
|- #) [# of hits] regular expression
| 1) [47] Authentication attempt from <HOST> for user "[^"]*" failed\.$
`-
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [1] ExYear(?P<_sep>[-/.])Month(?P=_sep)Day(?:T| ?)24hour:Minute:Second(?:[.,]Microseconds)?(?:\s*Zone offset)?
| [570] {^LN-BEG}(?:DAY )?MON Day %k:Minute:Second(?:\.Microseconds)?(?: ExYear)?
`-
Lines: 571 lines, 0 ignored, 47 matched, 524 missed
[processed in 0.12 sec]
However, if fail2ban reads the journal directly then it does not work:
fail2ban-regex systemd-journal /etc/fail2ban/filter.d/guacamole.conf
It comes back right away, and processes 0 lines!
Running tests
=============
Use failregex filter file : guacamole, basedir: /etc/fail2ban
Use systemd journal
Use encoding : UTF-8
Use journal match : _SYSTEMD_UNIT=tomcat.service + _COMM=java
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Lines: 0 lines, 0 ignored, 0 matched, 0 missed
[processed in 0.00 sec]
I've tried to remove _COMM=java. It doesn't make a difference.
If I leave out the journal match line altogether, it at least processes all the lines from the journal, but does not find any matches (even though, as I mentioned, it processes a dump of the log file fine):
Running tests
=============
Use failregex filter file : guacamole, basedir: /etc/fail2ban
Use systemd journal
Use encoding : UTF-8
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Lines: 202271 lines, 0 ignored, 0 matched, 202271 missed
[processed in 34.54 sec]
Missed line(s): too many to print. Use --print-all-missed to print all 202271 lines
Either this is a bug, or I'm missing a small detail.
Thanks for any help you can provide.
To make sure the filter definition is properly initialised, it would be good to include the common definition. Your filter definition (/etc/fail2ban/filter.d/guacamole.conf) would therefore look like:
[INCLUDES]
before = common.conf
[Definition]
journalmatch = _SYSTEMD_UNIT='tomcat.service'
failregex = Authentication attempt from <HOST> for user "[^"]*" failed\.$
ignoreregex =
A small note: given that your issue only occurs with systemd and not with flat files, could you try the same pattern without the $ at the end? Maybe there is an issue with the end of line when it is printed to the journal.
In your jail definition (/etc/fail2ban/jail.d/guacamole.conf), remember to define the ban time/find time/retries if they haven't already been defined in the default configuration:
[guacamole]
enabled = true
port = http,https
maxretry = 3
findtime = 1h
bantime = 1d
# "backend" specifies the backend used to get files modification.
# systemd: uses systemd python library to access the systemd journal.
# Specifying "logpath" is not valid for this backend.
# See "journalmatch" in the jails associated filter config
backend = systemd
Remember to restart the fail2ban service after doing such changes.
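Independently of the filter file, you can verify the journal match itself from the shell (plain journalctl usage, nothing fail2ban-specific); if the second command prints nothing, the systemd backend will see nothing either:
# Dump one entry with all metadata fields so that _SYSTEMD_UNIT and
# _COMM can be compared against the journalmatch expression.
journalctl -u tomcat.service -o verbose -n 1
# Query with the same predicates the filter uses.
journalctl _SYSTEMD_UNIT=tomcat.service _COMM=java -n 20
If the verbose output shows, for example, _COMM=catalina.sh rather than _COMM=java, that would explain why the journal match returns nothing.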
My group is using Doxygen 1.8.5 on RHEL7 to generate HTML-only documentation for a large C++ project. We only want HTML documentation and do not desire any other output format. The project's Doxygen configuration file (Doxyfile) carries the following settings which differ from the default (among others, of course):
OUTPUT_DIRECTORY="../Docs"
GENERATE_HTML=YES
GENERATE_LATEX=NO
When we run Doxygen with this config file, errors from LaTeX components start appearing on screen towards the end of processing, and processing is held up until the user presses Enter repeatedly to get past them, e.g.:
...
sh: epstopdf: command not found
error: Problems running epstopdf. Check your TeX installation!
Generating graph info page...
Generating directory documentation...
</home/abilich/Src/GNSS_Software/Lib/LibgpsC++/>:3: warning: Found unknown command `\reference'
</home/abilich/Src/GNSS_Software/Lib/LibgpsC++/>:2: warning: Found unknown command `\reference'
</home/abilich/Src/GNSS_Software/Lib/LibgpsC++/>:2: warning: Found unknown command `\reference'
Generating bitmaps for formulas in HTML...
This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013)
restricted \write18 enabled.
entering extended mode
(./_formulas.tex
LaTeX2e <2011/06/27>
Babel <v3.8m> and hyphenation patterns for english, dumylang, nohyphenation, lo
aded.
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size10.clo))
(/usr/share/texlive/texmf-dist/tex/latex/graphics/epsfig.sty
(/usr/share/texlive/texmf-dist/tex/latex/graphics/graphicx.sty
(/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/share/texlive/texmf-dist/tex/latex/graphics/graphics.sty
(/usr/share/texlive/texmf-dist/tex/latex/graphics/trig.sty)
(/usr/share/texlive/texmf-dist/tex/latex/latexconfig/graphics.cfg)
(/usr/share/texlive/texmf-dist/tex/latex/graphics/dvips.def))))
No file _formulas.aux.
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]
! Undefined control sequence.
l.53 \[ \Vert v - w \Vert \leqslant
p\,\min(\Vert v\Vert, \Vert w\Vert). \]
?
...
! Undefined control sequence.
l.533 $ \vert singular value \vert \leqslant
threshold \times \vert max sing...
?
[177] [178] (./_formulas.aux) )
Output written on _formulas.dvi (178 pages, 20576 bytes).
Transcript written on _formulas.log.
error: Problems running latex. Check your installation or look for typos in _formulas.tex and check _formulas.log!
Generating image form_0.png for formula
sh: gs: command not found
error: Problem running ghostscript gs -q -g60x50 -r384x384x -sDEVICE=ppmraw -sOutputFile=_form0.pnm -dNOPAUSE -dBATCH -- _form0.ps. Check your installation!
Generating index page...
/home/abilich/Src/GNSS_Software/Doxy/mainpage.h:15: warning: image file GNSSSoftwareTimeline.png is not found in IMAGE_PATH: assuming external image.
Generating page index...
Generating module index...
Generating namespace index...
Generating namespace member index...
I want to create a Doxyfile that runs without user intervention and does not throw LaTeX errors. Where have I gone astray?
Thanks to @albert's comment: the newest version of Doxygen does indeed support turning LaTeX generation off by setting GENERATE_LATEX=NO:
#---------------------------------------------------------------------------
# Configuration options related to the LaTeX output
#---------------------------------------------------------------------------
# If the GENERATE_LATEX tag is set to YES, doxygen will generate LaTeX output.
# The default value is: YES.
GENERATE_LATEX = NO
The parameter USE_PDFLATEX in your config file should be set to NO (the default value is YES):
# If the USE_PDFLATEX tag is set to YES, doxygen will use the engine as
# specified with LATEX_CMD_NAME to generate the PDF file directly from the LaTeX
# files. Set this option to YES, to get a higher quality PDF documentation.
#
# See also section LATEX_CMD_NAME for selecting the engine.
# The default value is: YES.
# This tag requires that the tag GENERATE_LATEX is set to YES.
USE_PDFLATEX = NO
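The remaining latex/gs/epstopdf failures in the log come from Doxygen rendering formulas as bitmaps for the HTML pages. If you would rather drop that toolchain dependency altogether, Doxygen can delegate formula rendering to the browser via MathJax; the tag below exists in the 1.8.x series, though I have not verified it on 1.8.5 specifically:
# If the USE_MATHJAX tag is set to YES, formulas are rendered in the
# browser by MathJax instead of being converted to PNG images with
# latex/ghostscript.
USE_MATHJAX = YES
Alternatively, if you keep the bitmap route, the "Undefined control sequence" errors for \leqslant point at a formula that needs the amssymb LaTeX package, which can be added with EXTRA_PACKAGES = amssymb.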
I am trying to create new language data for Japanese with 2 fonts:
1. Arial Unicode MS
2. MS ゴシック (MS Gothic)
I am not sure how to create the font_properties file from the command line for two fonts.
Usually I run echo Arial_Unicode_MS 0 0 1 0 0 > font_properties to create a font_properties file with one font.
Since I am going to use two fonts, I edited the file to add the second font.
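For reference, the edited file has one line per font, with spaces in the font names replaced by underscores (the columns are fontname italic bold fixed serif fraktur; the flag values for the second font are illustrative):
Arial_Unicode_MS 0 0 1 0 0
MS_Gothic 0 0 1 0 0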
But when I execute mftraining, it works fine for the first font, whether that is Arial Unicode MS or MS ゴシック. However, I get "malloc allocation error 2001" if I reference the second font in the file.
I even used the Serak trainer to create the font_properties file.
I want to do the following:
1. mftraining.exe -F font_properties -U unicharset -O lang.unicharset lang.font1.exp0.tr
2. mftraining.exe -F font_properties -U unicharset -O lang.unicharset lang.font2.exp0.tr
#1 throws no error if font1 is the first font in the file, but then #2 errors.
#2 throws no error if font2 is the first font in the file, but then #1 errors.
What is wrong with my steps?
Regards,
Sharon
Your TIFF/Box files should have "Arial_Unicode_MS" in their names, not "font1" or "font2".
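In other words, the font name used in font_properties (with underscores for spaces) has to match the name embedded in the .tr file names. A hypothetical consistent layout would be jpn.Arial_Unicode_MS.exp0.tr and jpn.MS_Gothic.exp0.tr, after which both files can be passed to a single mftraining run (the classic training flow accepts several .tr files at once):
mftraining.exe -F font_properties -U unicharset -O jpn.unicharset jpn.Arial_Unicode_MS.exp0.tr jpn.MS_Gothic.exp0.tr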
It appears that most SWF files, if not all, are actually SWF "archives" containing compressed versions of themselves. I have seen that you can extract the file using a few tools:
$ flasm -x player.swf
Flasm configuration file flasm.ini not found, using default values
player.swf successfully decompressed, 206239 bytes
$ 7z x player.swf
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
Processing archive: player.swf
Extracting player~.swf
Everything is Ok
Size: 206239
Compressed: 106427
However, I was hoping to extract these using something a little more "conventional", e.g. tar or gzip.
Relevant quote from http://www.adobe.com/content/dam/Adobe/en/devnet/swf/pdf/swf_file_format_spec_v10.pdf
The header begins with a three-byte signature of either 0x46, 0x57, 0x53 (“FWS”); or 0x43, 0x57, 0x53 (“CWS”).
An FWS signature indicates an uncompressed SWF file;
CWS indicates that the entire file after the first 8 bytes (that is, after the FileLength field) was compressed by using the ZLIB open standard.
The data format that the ZLIB library uses is described by Request for Comments (RFCs) documents 1950 to 1952. CWS file compression is permitted in SWF 6 or later only.
Update In response to the comment, here's a little bash script that is a literal translation of what the above seems to describe:
#!/bin/bash
for swf in "$@"
do
signature=$(dd if="$swf" bs=1 count=3 2> /dev/null)
case "$signature" in
FWS)
echo -e "uncompressed\t$swf"
;;
CWS)
targetname="$(dirname "$swf")/uncompressed_$(basename "$swf")"
echo "uncompressing to $targetname"
dd if="$swf" bs=1 skip=8 2>/dev/null |
(echo -n 'FWS';
dd if="$swf" bs=1 skip=3 count=5 2>/dev/null;
zlib-flate -uncompress) > "$targetname"
;;
*)
{
echo -e "unrecognized\t$swf"
file "$swf"
} >&2
;;
esac
done
Which you'd then run across a set of *.swf files (assume you saved it as uncompress_swf.sh):
uncompress_swf.sh /some/folder/*.swf
It will say stuff like
uncompressed /some/folder/a.swf
uncompressed /some/folder/b.swf
uncompressing to /some/folder/uncompressed_c.swf
If something doesn't look like a Flash file at all, it will print an error to stderr.
DISCLAIMER: This is just the way I read the quoted spec. I have checked that this script produces output identical to what I got when using 7z x on the input SWF.
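If you want to double-check a particular file yourself, one quick way (assuming 7z and cmp are available; -so streams the extracted data to stdout) is:
# Extract with 7z to stdout and compare byte-for-byte against the
# script's output; cmp is silent when the two files are identical.
7z x -so player.swf > player_7z.swf
cmp player_7z.swf uncompressed_player.swf && echo identical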