How to train Tesseract on multiple files at once? - command-line

When I first trained Tesseract, the tutorial I used showed a way to run the commands on each relevant file, but I can no longer find it.
How could I run this command for each file:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

For a quick and dirty loop, you can try:
for i in *.tif ; do tesseract "$i" "${i%.tif}" ; done
(Quoting the file name and stripping the .tif extension keeps the output names sane, e.g. page1.tif -> page1.txt, since Tesseract appends .txt itself.)
You can also use find -iname ____ on a path to select a subset of files.
If you really want to "parse" filenames, you may want to use a scripting language, or get your bash-fu out.
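If you specifically want to run the box-file command from the question on every training image, a loop along these lines should work (just a sketch, assuming the usual [lang].[fontname].exp[num].tif naming, so the output base is the file name minus its extension):
for i in *.tif; do
    tesseract "$i" "${i%.tif}" batch.nochop makebox   # e.g. eng.arial.exp0.tif -> eng.arial.exp0.box
done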

Related

Can Tesseract be used for Sinhala handwritten text recognition?

I wish to restore damaged Sinhala handwritten documents. Please let me know: Can Tesseract be used for Sinhala language also?
Check out the tessdata folder from the tesseract-ocr GitHub repository:
There's sin.traineddata for the actual Sinhala language, and
there's script/Sinhala.traineddata for the Sinhala script.
Copy one of them (or both) to your tessdata folder, maybe located at C:\tesseract\tessdata on some Windows machine.
For example, running Tesseract from the command line, you can then use
tesseract myimage.png output -l sin
or
tesseract myimage.png output -l Sinhala
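If Tesseract doesn't find the language data on its own, you can also point it at the folder explicitly with the --tessdata-dir option (the path below is just an example):
tesseract myimage.png output --tessdata-dir C:\tesseract\tessdata -l sin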
I took a screenshot of the Sinhala script Wikipedia page and cropped a small part of it as a test image.
Both commands, with -l sin and with -l Sinhala, give the following output:
සිංහල අක්ෂර මාලාව
That seems fine to me, but I don't claim to be able to read or understand any Sinhala script or language!
So, in general: Yes, it seems, you can OCR Sinhala texts!
BUT: as with any script, and maybe even more so for non-Latin scripts, you probably won't get good results on handwritten text. OCR of handwriting is a field of research in its own right.

Bourne Shell delete oldest file with DOY extension

I am relatively new to Bourne shell scripting (running on Solaris), and for some reason I am struggling with this simple problem. I am creating a script that will run in a directory and try to delete the files older than a certain date.
The files are of the form: log.DOY, so for example log.364, log.365, log.001, etc.
Now, this would be easy if it weren't for the pesky rollover, especially since the maximum isn't always 365 (leap years).
I have debated using find -mtime, but it would be preferable to use the file extension if possible.
Do any of you scripting magicians have any suggestions?
Your choice of find with -mtime is close, but there is a potentially easier way. You say you would like to remove files older than the date of some measuring file (say all files older than log.287 -- including log.287).
find provides the -newer option, which does just that. The following is a short script that takes the measuring filename as its first argument and prints (you can add the delete on your own) all files in that directory that are not newer than it, non-recursively thanks to the -maxdepth 1 option. The printf operation is provided for testing, to ensure there are no "oops" accidents. Let me know if you have questions:
#!/bin/sh
find . -maxdepth 1 -type f ! -newer "$1" |
while read filenm; do
    printf "%s\n" "$filenm"    ## you can add rm to remove the file
done
Note: check your version of read. The POSIX compliant use is shown above, but if you have the -r option, I would suggest its use as well.
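For example, assuming you saved the script as prune.sh (a name picked purely for illustration) and run it from the log directory, listing everything at least as old as log.287 would be:
sh prune.sh log.287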
I don't have Solaris handy to check, but I don't think this is practical purely in shell script unless you happen to have non-standard CLI tools available (such as GNU Coreutils).
Specifically, figuring out the end-of-year wrap depends on knowing what day of the year it is right now, and I don't see a way to do that in the documentation I can find. (It can be done in GNU date using +%j as the format.)
However, the docs do say that you should have perl, so I would look to use that.
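That said, if GNU coreutils does happen to be installed, here is a minimal sketch of the idea in shell; it sidesteps the year rollover entirely by enumerating the DOY values worth keeping rather than comparing numbers (the KEEP count and the log.NNN pattern are assumptions to adjust):
#!/bin/sh
KEEP=30                                   # how many days of logs to keep (an assumption -- adjust)
keep_list=""
i=0
while [ "$i" -lt "$KEEP" ]; do
    keep_list="$keep_list log.$(date -d "$i days ago" +%j)"   # GNU date: zero-padded day of year
    i=$((i + 1))
done
for f in log.[0-9][0-9][0-9]; do
    case " $keep_list " in
        *" $f "*) ;;                                  # inside the keep window: leave it alone
        *) printf "would remove %s\n" "$f" ;;         # swap printf for rm when you're satisfied
    esac
done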

SAS- Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My file names are ordered, so I have for example: o_equities_20080528.tas.zip, o_equities_20080529.tas.zip, o_equities_20080530.tas.zip, ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
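If the archives still need unpacking first and you're on a Unix-like system, a shell loop along these lines (assuming unzip is installed; the paths are only illustrative) can extract everything into one folder beforehand, so the wildcard infile above can see the .tas files:
for z in /yourdir/o_equities_*.tas.zip; do
    unzip -o "$z" -d /yourdir/extracted/    # each archive's .tas file lands in a single folder
done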
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename statement for each zip file, or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro to process one file at a time, using the above techniques, and to add this information while importing each dataset.

What's the best way to perform a differential between a list of directories?

I am interested in comparing a previous list of directories with the current list, and in setting up a script to do so, maybe in Perl or as a shell script.
Should I use something like diff? Programmatically, what would be an ideal way to do this? For example, say I write the diff to an output file: if there is no difference, exit; if there are results, I want to see them.
Say, for example, I have the following directories today:
/foo/bar/staging/abc
/foo/bar/staging/def
/foo/bar/staging/a1b2c3
The next day would look like this, where a directory has been added or renamed:
/foo/bar/staging/abc
/foo/bar/staging/def
/foo/bar/staging/ghi
/foo/bar/staging/a1b2c4
There might be better ways, but the way I typically do something like this is to run a find command in each directory root, and pipe the output to separate files. You can then diff the files using the diff tool of your choice. If you want to filter out certain directories or files, you can throw in some grep or grep -v commands in the pipeline, or you can experiment with options on the find command.
The other main option is to find a diff tool that offers directory/folder comparisons. Most of the good ones support this, but I like the command-line method, because you get more control over what you're diffing.
cd /my/directory/one
find . -print | sort > /temp/one.txt
cd /my/directory/two
find . -print | sort > /temp/two.txt
diff /temp/one.txt /temp/two.txt
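To get the "exit if nothing changed, show the results otherwise" behaviour asked about, you can lean on diff's exit status (0 means identical). A minimal sketch, with purely illustrative paths; seed /tmp/yesterday.txt with an initial listing before the first run:
#!/bin/sh
find /foo/bar/staging -print | sort > /tmp/today.txt
if diff /tmp/yesterday.txt /tmp/today.txt > /tmp/dirs.diff; then
    status=0                              # listings identical: nothing to report
else
    cat /tmp/dirs.diff                    # something was added, removed or renamed: show it
    status=1
fi
mv /tmp/today.txt /tmp/yesterday.txt      # today's listing becomes the next run's baseline
exit $status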
Also check out the inotifywait command; it allows you to monitor files in real time.
You might also consider the find command using the -newer switch.
The usage is:
find . -newer timefile.txt -print
The -newer switch makes find return a list of files that were created or updated after the specified file's modification time. In the example above, any file created or updated after timefile.txt would be returned. You'd have to create (or touch) the timefile.txt file, most likely once per day. Some versions of find have variations of -newer that compare against other timestamps of a file (last modified, last accessed, last created, etc.).
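A minimal sketch of that daily routine (the file locations are just placeholders):
find /foo/bar/staging -newer /var/tmp/timefile.txt -print    # everything created or updated since the last run
touch /var/tmp/timefile.txt                                   # reset the reference point for the next run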
This technique would not report a file that was deleted, however. A daily diff of the file listings could report that.

How to search a text among c files under a directory

I've looked through several similar questions, but either I didn't understand their answers or my question is different from theirs. I have a project that contains many subdirectories and different types of files, and I would like to search for a function name among the .C files only.
Some information on the web suggests using "Esc x dired-do-query-replace-regexp". However, this will search not just C files but also other files like .elf, which isn't helpful in my case. Other people suggest using the TAGS mechanism, but that would require me to type "etags *.c" for every subdirectory, which is also impossible.
How should I do this while working on such a large-scale software project?
Thanks
Lee
Use ack-grep on Linux:
ack-grep "keyword" -G '\.c$'
(-G restricts the search to files whose names match the regex.)
My favorite: igrep-find, found in the package igrep.el. Usage is:
M-x igrep-find some_thing RET *.C
There's the built-in grep-find (see its documentation), but I find it awkward to use.
For a more general answer, see this similar question: Using Emacs For Big Big Projects.
If you're on Linux, you can use grep to find files containing a certain text. You would then do this outside of Emacs, in your shell/command prompt. Here's a nice syntax:
grep --color=auto --include='*.c' -iRnH 'string to search for' /dir/to/search/
The directory to search can be given as a relative path, so if you're already in the directory you want as the root of the recursive search, you can skip the full directory path and just give a single dot:
grep --color=auto --include='*.c' -iRnH 'string to search for' .
The --color=auto part highlights the matches. --include='*.c' specifies which files to search; in this case, only files with the .c extension. The -i flag makes the search case-insensitive, -R makes it recursive, -n adds the line number to each match, and -H adds the file path.
To combine find and grep there is the find-grep function, where you can change the invocation string to find . -name '*.c' etc. Make it a function if you like. Then you can use e.g. C-x ` et al. to navigate the results.
To search among the files in one directory I use lgrep; it prompts you for which files to search.
You can use cscope and xcscope.el : http://www.emacswiki.org/emacs/CScopeAndEmacs
Try it with dired: place the cursor on the directory name to search, type A, and enter the text to find in the minibuffer.