How to use multiple files at once using bash - perl

I have a Perl script that processes data files from a given directory. I have written the bash script below to find the most recently updated file in that directory and process it.
cd $data_dir
find \( -type f -mtime -1 \) -exec ./script.pl {} \;
Sometimes a user copies multiple files into the data directory, and the earlier ones are skipped: the Perl script only processes the last updated file. Can you suggest how to fix this in the bash script?

Try
cd $data_dir
find \( -type f -mtime -1 \) -exec ./script.pl {} +
Note the termination of -exec with a + vs your \;
From the man page
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end;
Now that one or more file names are passed to your perl script, you can alter the script to iterate over each file name it receives.
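To see the difference between the two terminators, here is a minimal sketch that uses an inline sh script as a stand-in for ./script.pl. The temporary directory and the .dat file names are made up for the demo; each invocation simply reports how many file names it received.

```shell
#!/usr/bin/env bash
# Demo directory with three files (hypothetical stand-in for $data_dir).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.dat" "$demo_dir/b.dat" "$demo_dir/c.dat"

# The inline sh script stands in for ./script.pl; "$#" is its argument count.
report='echo "got $# file(s)"'

# \; starts one process per file.
per_file=$(find "$demo_dir" -type f -exec sh -c "$report" sh {} \;)
# + appends as many file names as fit onto one command line.
batched=$(find "$demo_dir" -type f -exec sh -c "$report" sh {} +)

printf 'with \\;:\n%s\n' "$per_file"
printf 'with +:\n%s\n' "$batched"
rm -rf "$demo_dir"
```

With \; the report line appears once per file; with + a single invocation receives all three names, which is why the perl script then has to loop over its arguments.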

If I understood the question correctly, you need to process any files that were created or modified in a directory since the last time your script was run.
In my opinion find is not the right tool to determine those files, because it has no notion of which files it has already seen.
Using any of the -atime/-ctime/-mtime options will either produce duplicates if you run your script twice in the specified period, or miss some files if it is not executed at the right time. The timing intricacies of using these options for something like this are not easy to deal with.
I can propose a few alternatives:
a) Use three directories instead of one: incoming/ processing/ done/. Your users should only be allowed to put files in incoming/. You move any files in there to processing/ with a simple mv incoming/* processing/ before running your perl script. Then you move them from processing/ to done/ when it's over.
In my opinion this is the simplest and best solution, and the one used by mail servers etc when dealing with this issue. If I were you and there were not any special circumstances preventing you from doing this, I'd stop reading here.
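A minimal sketch of the three-directory workflow, assuming all three directories live under one base directory on the same filesystem (so that mv is a rename rather than a copy). The function name process_batch is made up for this illustration, and the command passed to it stands in for ./script.pl.

```shell
#!/usr/bin/env bash
# process_batch BASE CMD...
# BASE holds incoming/, processing/ and done/; CMD is run once per file.
process_batch() {
    local base=$1; shift
    local f
    mkdir -p "$base/incoming" "$base/processing" "$base/done"

    # Claim whatever is in incoming/ right now; later arrivals wait for the next run.
    for f in "$base"/incoming/*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "$base/processing/"
    done

    # Process the claimed files and archive each one only if CMD succeeded.
    for f in "$base"/processing/*; do
        [ -f "$f" ] || continue
        if "$@" "$f"; then
            mv -- "$f" "$base/done/"
        fi
    done
}
```

It could be called as process_batch /path/to/data ./script.pl. Files whose processing failed stay in processing/ and are retried on the next run.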
b) Have your finder script touch a special file (e.g. .timestamp, perhaps in a different directory, so that your users will not tamper with it) when it's done. This will allow your script to remember the last time it was run. Then use
find \( -cnewer .timestamp -o -newer .timestamp \) -type f -exec ./script.pl '{}' ';'
to run your perl script for each file. You should modify your perl script so that it can run repeatedly with a different file name each time. If you can modify it to accept multiple files in one go, you can also run it with
find \( -cnewer .timestamp -o -newer .timestamp \) -type f -exec ./script.pl '{}' +
which will minimise the number of ./script.pl processes. Take care to handle the first run of the find script, when the .timestamp file is missing. A good solution would be to simply ignore it by not using the -*newer options at all in that case. Also keep in mind that there is a race condition where files added after find was started but before touching the timestamp file will not be processed.
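The timestamp approach, including the first-run case, might be sketched as follows. The function name run_since_last is hypothetical, and the sketch assumes the timestamp file lives outside the scanned directory. It also makes one design choice that differs from the description above: the new timestamp is created before find starts, so a file that arrives during the scan is picked up again on the next run (a possible duplicate) instead of being missed.

```shell
#!/usr/bin/env bash
# run_since_last DIR STAMP CMD...
# Runs CMD on every file in DIR changed since STAMP was last written.
# Assumption: STAMP is stored outside DIR.
run_since_last() {
    local dir=$1 stamp=$2; shift 2
    local next="$stamp.next"

    touch "$next"                        # remember when this scan started
    if [ -e "$stamp" ]; then
        find "$dir" -type f \( -cnewer "$stamp" -o -newer "$stamp" \) -exec "$@" {} +
    else
        # First run: no timestamp yet, so take every file.
        find "$dir" -type f -exec "$@" {} +
    fi
    mv -- "$next" "$stamp"               # becomes the reference for the next run
}
```

A call such as run_since_last "$data_dir" "$HOME/.last_run" ./script.pl would then replace the bare find command.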
c) As a variation of (b), have your script update the timestamp with the time of the processed file that was created/modified most recently. This is tricky, because find cannot order its output on its own. You could use a wrapper around your perl script to handle this:
#!/bin/bash
for i in "$@"; do
find "$i" \( -cnewer .timestamp -o -newer .timestamp \) -exec touch -r '{}' .timestamp ';'
done
./script.pl "$@"
This will update the timestamp if it is called to process a file with a newer mtime or ctime, minimising (but not eliminating) the race condition. It is however somewhat awkward - unavoidable since bash's [[ -nt option seems to only check the mtime. It might be better if your perl script handled that on its own.
d) Have your script store each processed filename and its timestamps somewhere and then skip duplicates. That would allow you to just pass all files in the directory to it and let it sort out the mess. Kinda tricky though...
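One way to sketch the ledger idea in a shell wrapper is shown below. Note the substitution: instead of timestamps it records a content signature (the checksum and size printed by cksum) next to each file name, so a file is processed again only when its content changes. The function name process_new and the ledger format are assumptions made for this example, and file names containing newlines are not handled.

```shell
#!/usr/bin/env bash
# process_new LEDGER CMD FILE...
# Runs CMD on each FILE whose "checksum size name" line is not yet in LEDGER.
process_new() {
    local ledger=$1 cmd=$2; shift 2
    local f sig
    touch "$ledger"
    for f in "$@"; do
        [ -f "$f" ] || continue
        sig="$(cksum < "$f") $f"         # e.g. "<checksum> <size> <name>"
        if ! grep -Fxq -- "$sig" "$ledger"; then
            # Record the file only after CMD succeeded, so failures are retried.
            "$cmd" "$f" && printf '%s\n' "$sig" >> "$ledger"
        fi
    done
}
```

With this wrapper every file in the directory can be passed on each run, for example process_new "$HOME/.processed" ./script.pl "$data_dir"/*.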
e) Since you are using Linux, you might want to have a look at inotify and the inotify-tools package - specifically the inotifywait tool. With a bit of scripting it would allow you to process files as they are added in the directory:
inotifywait -e MOVED_TO -e CLOSE_WRITE -m -r testd/ | grep --line-buffered -e MOVED_TO -e CLOSE_WRITE | while read d e f; do ./script.pl "$f"; done
This has no race conditions, as long as your users do not create/copy/move any directories rather than just files.

The perl script will only execute against the file which find gives it. Perhaps you should remove the -mtime -1 option from the find command so that it picks up all the files in the directory?

Related

Using xargs arguments twice

I need to check if local file is same as remote host file.
The file locations are like below:
File1 at Local machine
./remotehostname/home/a/b/scripts/xyz.cpp
File2 at remote machine
remotehostname:/home/a/b/scripts/xyz.cpp
I intend to compare these 2 files, using the command
diff ./remotehostname/home/a/b/scripts/xyz.cpp remotehostname:/home/a/b/scripts/xyz.cpp
find . -type f | grep -v .svn |xargs -I % diff %
I need to change % to take remotehost and compare the file.
Not sure how to apply sed on %. Or is there a better way to compare such files.
One way could be to save the list of files and then apply sed to that file, but I think there should be a better way. Also, diff doesn't work on remote hosts; maybe I need to use the output of an rsync dry run?
This can be done with xargs, but I prefer to use while read in bash.
xargs method
find . -type f | grep -v .svn | sed 's/.*/& remotehostname:&/' | xargs -n2 diff
The sed command duplicates the input and makes whatever modifications you need. The xargs then passes the inputs to diff two at a time. This will not work if any filename contains spaces.
bash method
find . -type f | grep -v .svn | while read line; do
diff "$line" "remotehostname:$line"
done
The bash read command reads a line from stdin, places it in the name variable, $line, and returns true. You can then put whatever you like inside the loop, so you get total freedom to rewrite the filename however you need. When the input runs out, read returns false, and the loop exits.
Note that piping things into loops has some interesting side effects that are not relevant here, but might bite you one day.
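The best-known of those side effects is that, in bash's default configuration, a loop fed by a pipe runs in a subshell, so variables set inside it are lost when the loop ends. A small sketch (the variable names are made up for the demo):

```shell
#!/usr/bin/env bash
# Loop fed by a pipe: the loop body runs in a subshell.
count_piped=0
printf '%s\n' a b c | while IFS= read -r line; do
    count_piped=$((count_piped + 1))
done
echo "after pipe: $count_piped"                        # prints 0 in default bash

# Loop fed by process substitution: the loop body runs in the current shell.
count_direct=0
while IFS= read -r line; do
    count_direct=$((count_direct + 1))
done < <(printf '%s\n' a b c)
echo "after process substitution: $count_direct"      # prints 3
```

If the loop only runs commands such as diff, as it does here, the subshell makes no difference; it matters once you try to collect results in a variable.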
If you are interested in the actual difference (and not just whether they differ - which rsync is brilliant for telling you) then you can do this using GNU Parallel:
find . -type f | grep -v .svn |
parallel diff {} '<(ssh {= s:./::;s:/.*:: =} cat {= s:([^/]+/){2,2}::;$_=::shell_quote_scalar($_) =})'
s:./::;s:/.*:: = hostname from path
s:([^/]+/){2,2}:: = rest of path
::shell_quote_scalar = backslash-quote special chars as needed by the shell
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Create symbolic link from find

I'm trying to create a symbolic link (soft link) from the results of a find command. I'm using sed to remove the ./ that precedes the file name. I'm doing this so I can paste the file name to the end of the path where the link will be saved. I'm working on this with Ubuntu Server 8.04.
I learned from this post, which is kind of the solution to my problem but not quite-
How do I selectively create symbolic links to specific files in another directory in LINUX?
The resulting file name didn't work, though, so I started trying to learn awk and then decided on sed.
I'm using a one-line loop to accomplish this. The problem is that the structure of the loop is separating the filename, creating a link for each word in the filename. There are quite a few files and I would like to automate the process with each link taking the filename of the file it's linked to.
I'm comfortable with basic bash commands but I'm far from being a command line expert. I started this with ls and awk and moved to find and sed. My sed syntax could probably be better but I've learned this in two days and I'm kind of stuck now.
for t in `find -type f -name "*txt*" | sed -e 's/.//' -e 's$/$$'`; do echo ln -s $t ../folder2/$t; done
Any help or tips would be greatly appreciated. Thanks.
Easier:
Go to the folder where you want to have the files in and do:
find /path/with/files -type f -name "*txt*" -exec ln -s {} . ';'
Execute your for loop like this:
(IFS=$'\n'; for t in `find -type f -name "*txt*" | sed 's|.*/||'`; do ln -s "$t" "../folder2/$t"; done)
By setting IFS to only a newline, each entire filename is read without being split at spaces.
The parentheses make sure the loop is executed in a subshell, so the IFS of the current shell does not get changed.
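An alternative sketch that avoids touching IFS for word splitting at all: have find emit NUL-terminated names (-print0, available in GNU and BSD find) and read them with read -d ''. The function name link_matches is made up, and using absolute paths as link targets is a choice made here so that the links resolve from any directory.

```shell
#!/usr/bin/env bash
# link_matches SRC DEST
# Creates a symlink in DEST for every file under SRC whose name contains "txt".
link_matches() {
    local src=$1 dest=$2 f dir name
    mkdir -p "$dest"
    find "$src" -type f -name '*txt*' -print0 |
    while IFS= read -r -d '' f; do
        dir=$(cd "$(dirname "$f")" && pwd)   # absolute directory of the file
        name=$(basename "$f")
        ln -s -- "$dir/$name" "$dest/$name"
    done
}
```

For the layout in the question this could be called as link_matches . ../folder2. Names with spaces survive because nothing ever splits them into words.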

Grep data and output to file

I'm attempting to extract data from log files and organise it systematically. I have about 9 log files which are ~100mb each in size.
What I'm trying to do is: Extract multiple chunks from each log file, and for each chunk extracted, I would like to create a new file and save this extracted data to it. Each chunk has a clear start and end point.
Basically, I have made some progress and am able to extract the data I need, however, I've hit a wall in trying to figure out how to create a new file for each matched chunk.
I'm unable to use a programming language like Python or Perl, due to the constraints of my environment. So please excuse the messy command.
My command thus far:
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' {} \; | \
grep -v -A1 -B1 "Starting chunk" > Logs\ 13Sept/Chunks/test.txt
The LRE Starting chunk and LRE Ending chunk are my boundaries. Right now my command works, but it saves all matched chunks to one file (whose size is becoming excessive).
How do I go about creating a new file for each match and adding the matched content to it? Keep in mind that each log file could hold multiple chunks and is not limited to one chunk per file.
Probably need something more programmable than sed: I'm assuming awk is available.
awk '
/LRE Ending chunk/ {printing = 0}
printing {print > "chunk" n ".txt"}
/LRE Starting chunk/ {printing = 1; n++}
' *.log
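A variation on the same idea, wrapped in a shell function, might name each chunk after the log file it came from and close every chunk file once it is complete, which keeps the number of simultaneously open files low. The function name split_chunks and the output naming scheme (logname.chunkN.txt) are assumptions for this sketch.

```shell
#!/usr/bin/env bash
# split_chunks OUTDIR LOGFILE...
# Writes the lines between the start and end markers of each chunk to
# OUTDIR/<logname>.chunk<N>.txt, numbering the chunks per log file.
split_chunks() {
    local outdir=$1; shift
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        # New input file: reset state and remember its base name.
        FNR == 1 {
            if (printing) close(out)
            printing = 0; n = 0
            base = FILENAME; sub(/.*\//, "", base)
        }
        /LRE Ending chunk/   { if (printing) close(out); printing = 0 }
        printing             { print > out }
        /LRE Starting chunk/ { printing = 1; n++; out = outdir "/" base ".chunk" n ".txt" }
    ' "$@"
}
```

For the layout in the question it could be run as split_chunks "Logs 13Sept/Chunks" "Logs 13Sept"/Log_00000000*.log.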
Try something like this:
find Logs\ 13Sept/Log_00000000*.log -type f -print | while IFS= read -r file; do
sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' "$file" | \
grep -v -A1 -B1 "Starting chunk" > "Logs 13Sept/Chunks/$(basename "$file").chunk.txt";
done
This loops over the find results, runs the extraction on each log file, and creates one .chunk.txt file per log file in the Chunks directory. basename is needed because $file already contains the Logs 13Sept/ directory part.
Something like this perhaps?
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/{;/LRE .*ing chunk/d;w\
'"{}.chunk"';}' {} \;
This uses sed's w command to write to a file named (inputfile).chunk. If that is not acceptable, perhaps you can use sh -c '...' to pass in a small shell script to wrap the sed command with. (Or is a shell script also prohibited for some reason?)
Perhaps you could use csplit to do the splitting, then truncate the output files at the chunk end.

run program multiple times using one line shell command

I have the following gifs on my linux system:
$ find . -name *.gif
./gifs/02.gif17.gif
./gifs/fit_logo_en.gif
./gifs/halloween_eyes_63.gif
./gifs/importing-pcs.gif
./gifs/portal.gif
./gifs/Sunflower_as_gif_small.gif
./gifs/weird.gif
./gifs2/00p5dr69.gif
./gifs2/iss013e48788.gif
...and so on
What I have written is a program that converts GIF files to BMP with the following interface:
./gif2bmp -i inputfile -o outputfile
My question is, is it possible to write a one line command using xargs, awk, find etc. to run my program once for each one of these files? Or do I have to write a shell script with a loop?
For that kind of work, it may be worth looking at the find man page, especially the -exec option.
You can write something along the lines of:
find . -name "*.gif" -exec gif2bmp -i {} -o {}.bmp \;
You can play with combinations of dirname and basename to obtain better naming for the output file, though in this case, I would prefer to use a shell for loop, something like:
for i in `find . -name "*.gif"`; do
DIR=`dirname "$i"`
NAME=`basename "$i" .gif`
gif2bmp -i "$i" -o "${DIR}/${NAME}.bmp"
done
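The loop above still splits file names on whitespace, because the output of find is expanded unquoted. A sketch that avoids this lets find hand the names to a small inline sh script and derives the output name with parameter expansion. The function name convert_all is made up; the converter is assumed to take -i inputfile -o outputfile as described in the question.

```shell
#!/usr/bin/env bash
# convert_all CONVERTER ROOT
# Runs CONVERTER -i x.gif -o x.bmp for every *.gif below ROOT.
convert_all() {
    local converter=$1 root=$2
    find "$root" -type f -name '*.gif' -exec sh -c '
        conv=$1; shift
        for gif in "$@"; do
            # ${gif%.gif} strips the trailing .gif from the path.
            "$conv" -i "$gif" -o "${gif%.gif}.bmp"
        done
    ' sh "$converter" {} +
}
```

It could be called as convert_all "$PWD/gif2bmp" . so that the converter path stays valid whatever directory find is scanning.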
Using GNU Parallel you can do:
parallel ./gif2bmp -i {} -o {.}.bmp ::: *.gif
The added benefit is that it will run one job for each cpu core in parallel.
Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.

Why do I have to specify the -i switch with a backup extension when using ActivePerl?

I cannot get in-place editing Perl one-liners running under ActivePerl to work unless I specify them with a backup extension:
C:\> perl -i -ape "splice (@F, 2, 0, q(inserted text)); $_ = qq(@F\n);" file1.txt
Can't do inplace edit without backup.
The same command with -i.bak or -i.orig works a treat but creates an unwanted backup file in the process.
Is there a way around this?
This is a Windows/MS-DOS limitation. According to perldiag:
You're on a system such as MS-DOS that gets confused if you try reading from a deleted (but still opened) file. You have to say -i.bak, or some such.
Perl's -i implementation causes it to delete file1.txt while keeping an open handle to it, then re-create the file with the same name. This allows you to 'read' file1.txt even though it has been deleted and is being re-created. Unfortunately, Windows/MS-DOS does not allow you to delete a file that has an open handle attached to it, so this mechanism does not work.
Your best shot is to use -i.bak and then delete the backup file. This at least gives you some protection - for example, you could opt not to delete the backup if perl exits with a non-zero exit code. Something like:
perl -i.bak -ape "splice...." file1.txt && del file1.bak
Here is a sample where both the recursive modification and the deletion of the backups are driven by find. It works on e.g. MinGW Git Bash on Windows.
$ find . -name "*.xml" -print0 | xargs -0 perl -p -i.bak -e 's#\s*<property name="blah" value="false" />\s*##g'
$ find . -name "*.bak" -print0 | xargs -0 rm
NUL-terminated names are passed between find and xargs (-print0 and -0) to handle spaces. The unusual # delimiter for s avoids having to escape the slashes in the XML search term. This assumes you didn't have any .bak files lying around to begin with.