Need help in training Tesseract with application images - tesseract

We use Tesseract to perform actions and verifications in our application, but it has trouble recognizing some characters. I have never tried training it, so I would appreciate help with the following:
Is there a guide on how to train with application images?
How do I prepare the training data?
How do I do the training on Windows, if possible?
Thanks in advance.

The tesseract-ocr/tesstrain project can train Tesseract on specific characters, using line images and their ground-truth transcriptions.
To run it on Windows, you need the following tools:
Set up Cygwin on your computer and make sure the wget, bc and make packages are installed.
Set up python3.
Install Tesseract 4.0+ and its training tools.
Then you need to change a few places in the Makefile in tesseract-ocr/tesstrain:
Change TESSDATA to the location of the .traineddata files on your computer.
Change the paths of WORDLIST_FILE, NUMBERS_FILE and PUNC_FILE:
WORDLIST_FILE = data/$(MODEL_NAME).wordlist
NUMBERS_FILE := data/$(MODEL_NAME).numbers
PUNC_FILE := data/$(MODEL_NAME).punc
Download the .wordlist, .numbers and .punc files from langdata_lstm. For example, if your language is English, download them from langdata_lstm/eng. Rename them to $(MODEL_NAME).wordlist, $(MODEL_NAME).numbers and $(MODEL_NAME).punc.
Find the code:
$(ALL_GT): $(shell find $(GROUND_TRUTH_DIR) -name '*.gt.txt')
and change it to:
$(ALL_GT): $(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt)
Find the code:
$(ALL_LSTMF): $(patsubst %.gt.txt,%.lstmf,$(shell find $(GROUND_TRUTH_DIR) -name '*.gt.txt'))
	@mkdir -p $(OUTPUT_DIR)
	find $(GROUND_TRUTH_DIR) -name '*.lstmf' | python3 shuffle.py $(RANDOM_SEED) > "$@"
and change it to:
$(ALL_LSTMF): $(patsubst %.gt.txt,%.lstmf,$(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt))
	@mkdir -p $(OUTPUT_DIR)
	find $(GROUND_TRUTH_DIR) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
Change every python3 to python.
Then follow the instructions in the tesseract-ocr/tesstrain README. Searching GitHub for "tesstrain windows" may turn up further help.
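On preparing the training data: as far as I know, tesstrain expects each line image (.png or .tif) in data/MODEL_NAME-ground-truth to sit next to a one-line transcription with the same base name and the extension .gt.txt. Below is a minimal sketch of a check for missing transcriptions. The function name check_ground_truth is my own, and the sketch ignores the .bin.png and .nrm.png image variants.

```shell
#!/usr/bin/env bash
# Sketch: report line images that have no (or an empty) transcription.
# Assumes the layout foo.png or foo.tif next to foo.gt.txt.
check_ground_truth() {
    local dir=$1 img base missing=0
    for img in "$dir"/*.png "$dir"/*.tif; do
        [ -e "$img" ] || continue          # skip glob patterns that matched nothing
        base=${img%.*}                     # strip the image extension
        if [ ! -s "$base.gt.txt" ]; then   # transcription missing or empty
            echo "missing ground truth: $img"
            missing=1
        fi
    done
    return "$missing"                      # 0 if every image has a transcription
}
```

If the check passes for data/myapp-ground-truth (myapp is a placeholder model name), running make training MODEL_NAME=myapp should pick the pairs up.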

Related

Git Bash find exec recursively on folders and files containing spaces

Question: In Git Bash on Windows, how would you run the following so that it also searches folders with spaces in their names and executes on files with spaces in their names?
$ find ./ -type f -name '*.png' -exec sh -c 'cwebp -q 75 $1 -o "${1%.png}.webp"' _ {} \;
Context: I'm running Git Bash on Windows, trying to execute a command on all found .png files to convert them to .webp format. It works for all files without spaces in the path, but it's failing to find files with spaces in the filename or files within folders that have spaces in the folder name.
A few considerations:
I have many, many levels of folders to iterate through, and I can't run this command separately for each. I really need the recursion to work.
I cannot change the folder names; it will break other dependencies (nor did I create the folder or filenames originally, so cut me some slack!)
I arrived here by following the suggestions from this article: https://www.smashingmagazine.com/2018/07/converting-images-to-webp/
The program, to my knowledge, doesn't ship with any built-in recursive command... golly that'd be handy
Any help you can provide will be appreciated. Thanks!
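A likely cause is the unquoted $1 inside the sh -c script: find passes each path as a single argument, but the inner shell splits an unquoted $1 on spaces. Here is a minimal sketch with the expansion quoted, wrapped in a function so the converter can be swapped out for a dry run. The names convert_pngs and its second argument are illustrative, not part of cwebp.

```shell
#!/usr/bin/env bash
# Sketch: convert every .png below a root directory, tolerating spaces in paths.
# $1 = root directory, $2 = converter command (defaults to cwebp).
convert_pngs() {
    local root=$1 converter=${2:-cwebp}
    # "{} +" hands many paths to one inner shell; each arrives as its own argument.
    find "$root" -type f -name '*.png' -exec sh -c '
        converter=$1; shift
        for src in "$@"; do
            # Quoting "$src" keeps a path with spaces as a single word.
            "$converter" -q 75 "$src" -o "${src%.png}.webp"
        done' _ "$converter" {} +
}
```

For a one-off run, quoting alone should be enough: replace $1 with "$1" in the original command.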

tcsh autocompletion for modulefiles

I found this piece of code, which does auto-completion for module files in tcsh at
https://opensource.apple.com/source/tcsh/tcsh-66/tcsh/complete.tcsh.
Could somebody help me understand how the 'alias Compl_module' works?
#from Dan Nicolaescu <dann@ics.uci.edu>
if ( $?MODULESHOME ) then
alias Compl_module 'find ${MODULEPATH:as/:/ /} -name .version -o -name .modulea\* -prune -o -print | sed `echo "-e s#${MODULEPATH:as%:%/\*##g -e s#%}/\*##g"`'
complete module 'p%1%(add load unload switch display avail use unuse update purge list clear help initadd initrm initswitch initlist initclear)%' \
'n%{unl*,sw*,inits*}%`echo "$LOADEDMODULES:as/:/ /"`%' \
'n%{lo*,di*,he*,inita*,initr*}%`eval Compl_module`%' \
'N%{sw*,initsw*}%`eval Compl_module`%' 'C%-%(-append)%' 'n%{use,unu*,av*}%d%' 'n%-append%d%' \
'C%[^-]*%`eval Compl_module`%'
endif
Thanks a lot.
I'm not sure this Compl_module alias works well, since it tries to determine the existing modulefiles in the modulepaths just by looking at the files present. Modulefiles can also be aliases, symbolic versions and, in newer Modules versions (>= 4.1), virtual modules, so the Compl_module alias will miss those.
You will find a full completion script for the module command in the source repository of the Modules project.
This completion script calls module avail to correctly get all existing modulefiles in the enabled modulepaths.
The tcsh completion script is enabled automatically starting with Modules version 4.0.
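For readers trying to follow the alias itself: it appears to list everything under each directory in MODULEPATH, skip .version files and anything matching .modulea*, and then use sed to strip the modulepath prefix so that names like gcc/9.3 remain. Below is a rough bash approximation, not a translation of the tcsh. The name list_modulefiles is made up, and the sed pattern here is anchored at the start of the line, unlike the original.

```shell
#!/usr/bin/env bash
# Sketch: approximate what the Compl_module alias seems to print.
# $1 = colon-separated list of module directories (like $MODULEPATH).
# Assumes directory names contain no '#' or regex metacharacters that matter.
list_modulefiles() {
    local modulepath=$1 dir
    local IFS=:                      # split the path list on colons
    for dir in $modulepath; do
        [ -d "$dir" ] || continue
        # Same find expression as the alias: ignore .version, prune .modulea*,
        # print everything else (directories included).
        find "$dir" -name .version -o -name '.modulea*' -prune -o -print |
            sed -e "s#^$dir/*##"     # drop the leading module directory
    done
}
```

Like the alias, this only sees what is on disk, so the limitations described in the answer above apply to it as well.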

Recursively replace colons with underscores in Linux

First of all, this is my first post here and I must specify that I'm a total Linux newb.
We have recently bought a QNAP NAS box for the office, on this box we have a large amount of data which was copied off an old Mac XServe machine. A lot of files and folders originally had forward slashes in the name (HFS+ should never have allowed this in the first place), which when copied to the NAS were all replaced with a colon.
I now want to rename all colons to underscores, and have found the following commands in another thread here: pitfalls in renaming files in bash
However, the flavour of Linux that is on this box does not understand the rename command, so I'm having to use mv instead. I have tried using the code below, but this will only work for the files in the current folder, is there a way I can change this to include all subfolders?
for f in *.*; do mv -- "$f" "${f//:/_}"; done
I have found that I can find all the files and folders in question using the find command as follows:
Files:
find . -type f -name "*:*"
Folders:
find . -type d -name "*:*"
I have been able to export a list of the results above by using
find . -type f -name "*:*" > files.txt
I tried using the command below but I'm getting an error message from find saying it doesn't understand the exec switch, so is there a way to pipe this all into one command, or could I somehow use the files I exported previously?
find . -depth -name "*:*" -exec bash -c 'dir=${1%/*} base=${1##*/}; mv "$1" "$dir/${base//:/_}"' _ {} \;
Thank you!
Vincent
So your for loop code works, but only in the current dir. Also, you are able to use find to build a file with all the files with : in the filename.
So, as you've already done all this, I would just loop over each line of your file, and perform the same mv command.
Something like this:
for f in `cat files.txt`; do mv $f "${f//:/_}"; done
EDIT:
As pointed out by tripleee, a while loop is the better solution, since it copes with spaces in filenames.
E.g.:
while read -r f; do mv "$f" "${f//:/_}"; done <files.txt
Hope this helps.
Will
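If find on the NAS really lacks -exec, the same depth-first rename can be done by piping find into a while loop, which also handles folders and needs no files.txt. This is a sketch under a few assumptions: the shell supports ${var//:/_} (the for loop in the question suggests it does), no filename contains a newline, and the starting directory itself has no colon in its name. The name rename_colons is just illustrative.

```shell
#!/usr/bin/env bash
# Sketch: replace ':' with '_' in every file and folder name below a root.
rename_colons() {
    local root=$1 path dir base
    # -depth lists a directory's contents before the directory itself,
    # so children are renamed while their parent's path is still valid.
    find "$root" -depth -name '*:*' | while IFS= read -r path; do
        dir=${path%/*}       # everything up to the last slash
        base=${path##*/}     # last path component only
        mv -- "$path" "$dir/${base//:/_}"
    done
}
```

Call it as rename_colons . from the top of the share, ideally after trying it on a copy first.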

Find unused resource files (.jsp, .xhtml, images) in Eclipse

I'm developing a large web application in Eclipse, and some of the resources (I'm talking about files, NOT code) are becoming obsolete. However, I don't know which ones, so they still end up in my final WAR file.
I know Eclipse resolves file paths within the project, because I can follow the link to an image or another page while I'm editing one of my xhtml pages (using the Control key). But is there a way to locate the unused resources in order to remove them?
Following these 3 steps would work for sites with a relatively finite number of dynamic pages:
Install your site on a filesystem mount'ed with atime (access time).
Try harvesting the whole site with wget.
Use find to see which files were not accessed recently.
Done.
As far as I know, Eclipse doesn't have this (I need it too).
I use grep together with a bash script: the script takes the files in my resource folder and puts their filenames in a list, then greps through the source code for every entry in the list. If grep finds an entry, it is removed from the list.
At the end the list is printed to the console, so only the unused resources remain in it.
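The grep approach described above could look roughly like this. It is a sketch, not the script from the answer: find_unused_resources and its two directory arguments are names I made up. It only matches on the bare filename, so resources referenced through computed paths will be reported even though they are in use.

```shell
#!/usr/bin/env bash
# Sketch: print resource files whose filename never appears in the sources.
# $1 = resource directory, $2 = source directory to search.
find_unused_resources() {
    local res_dir=$1 src_dir=$2 file name
    find "$res_dir" -type f | while IFS= read -r file; do
        name=${file##*/}     # bare filename without its directory
        # -r recurse, -q exit status only, -F treat the name as a literal string
        if ! grep -rqF -e "$name" "$src_dir"; then
            echo "$file"
        fi
    done
}
```

Keeping the resource folder outside the searched source folder avoids a file matching itself.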
UCDetector might be your best bet, specifically, the custom marker aspects of this tool.
I have not found a way to do this in Eclipse, so I used the following shell commands.
Find .ftl template files which are NOT referenced in .java files
cd myfolder
find . -name "*.ftl" -printf "%f\n" |while read fname; do grep --include \*.java -rl "$fname" . > /dev/null || echo "${fname} not referenced" ; done;
or
Find all .ftl template files which are NOT referenced in .java, .ftl, .inc files
cd myfolder
find . -name "*.ftl" -printf "%f\n" |while read fname; do grep --include \*.java --include \*.ftl --include \*.inc -rl "$fname" . > /dev/null || echo "${fname} not referenced" ; done;
Note: on MacOSX you can use gfind instead of find in case -printf is not working.
Example output
productIndex2.ftl not referenced
showTestpage.ftl not referenced

Definition of a symbol in a .so file on Solaris

Could somebody tell me how to find the definition of a symbol in a shared object file on Solaris.
Thanks
Raj
On the Solaris machines I have access to, nm is available and can be used for this. For instance:
nm /usr/lib/libc.so
This shows all of the symbols in libc.so; checking whether a symbol is defined in this library is then simply a matter of reading through the output.
You probably want to pass the -g and -D options too in most cases. If you want to search a bunch of libraries, you could try:
find /usr/lib -name '*.so' -exec nm -gD {} \; |grep "symbol_name"
Or similar
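One drawback of the pipeline above is that the output does not say which library a match came from. Here is a small variation, assuming your nm accepts -A (it is specified by POSIX and supported by GNU nm) to prefix each line with the file name. The wrapper name find_symbol is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: search every .so below a directory for a symbol name.
# $1 = directory to search, $2 = symbol name (used as a grep pattern).
find_symbol() {
    local dir=$1 symbol=$2
    # nm -A prefixes each output line with the library path, so a match
    # shows where it came from. Errors from non-ELF files are discarded.
    # grep exits non-zero when nothing matches.
    find "$dir" -name '*.so' -exec nm -A -gD {} + 2>/dev/null | grep "$symbol"
}
```

For example, find_symbol /usr/lib symbol_name. Lines that nm marks as undefined are references to the symbol, not its definition.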