Inotifywait for large directories - callback

In the inotifywait man page, the following is stated:
-r, --recursive
    Watch all subdirectories of any directories passed as arguments. Watches will be set up recursively to an unlimited depth. Symbolic links are not traversed. Newly created subdirectories will also be watched.
    Warning: If you use this option while watching the root directory of a large tree, it may take quite a while until all inotify watches are established, and events will not be received in this time. Also, since one inotify watch will be established per subdirectory, it is possible that the maximum amount of inotify watches per user will be reached. The default maximum is 8192; it can be increased by writing to /proc/sys/fs/inotify/max_user_watches.
I take this to mean that every time inotifywait is called, there is a setup delay for large directories. Therefore, constantly monitoring a large directory in monitor mode, like so:
inotifywait -m /home/user/Documents
is more efficient than repeatedly invoking inotifywait in a loop, like so (from an example in the man pages):
while inotifywait /home/user/Documents; do
#Do Something for each file change
done
since every iteration of the while loop has to set up inotifywait again. But with the first option, I can't execute anything based on the output. Ideally what I want is a callback function, like so:
inotifywait -m --callback ./callback.sh /home/user/Documents
so that callback.sh gets called each time with the output of inotifywait. How would I implement this?

You can pipe it like:
inotifywait -m /my/directory | while read -r dir events file; do ./do_something.sh "$dir" "$events" "$file"; done
Keep in mind that you get many events for certain operations, each of which will trigger the launch of your script.
You can also use perl or some other language to use the API directly, which gives you tons of flexibility.
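If you do want the API directly, here is a minimal Perl sketch using Linux::Inotify2; the watched path and the event mask are assumptions to adapt:

use strict;
use warnings;
use Linux::Inotify2;

my $inotify = Linux::Inotify2->new
    or die "unable to create inotify object: $!";

# Watch for new files and for files that finish being written.
$inotify->watch('/home/user/Documents', IN_CREATE | IN_CLOSE_WRITE, sub {
    my ($event) = @_;
    # One invocation per event; this sub is effectively your callback.sh.
    printf "%s (mask %d)\n", $event->fullname, $event->mask;
}) or die "watch creation failed: $!";

# Block until events arrive; callbacks run once per event.
1 while $inotify->poll;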

Related

Reduce relocatable win32 Perl to as few files and bytes as possible

I'm trying to use a perl program on a Windows HTCondor computing cluster. The way HTCondor on windows works is it copies all dependencies into a temporary directory (used as a chroot of sorts) and then it deletes the directory after the specified outputs are moved to a designated place.
If I take only perl.exe and perl514.dll and make a job like this: perl -e "print qq/hello\n/" and tell the cluster to run it 200 times, then each replication winds up taking about 15 seconds, which is acceptable overhead. That's almost all time spent repeatedly copying the files over the network and then deleting them. echo_hello.bat run 200 times takes more like two seconds per replication.
The problem I have is that when I try to use my full blown perl distribution of 55MB and 2,289 files, a single "hello" rep takes something like four minutes of copying and deleting, which is unacceptable. When I try to do many runs the disks on the machines grind to a halt trying to concurrently handle all the file operations across all the reps, so it doesn't work at all. I don't know how long it might take to eventually finish because I gave up after half an hour and no jobs had finished.
I figured PAR::Packer might fix the issue, but nope. I tried print_hello.exe created like this: pp -o print_hello.exe -e "print qq/hello\n/". It still makes things grind to a halt, apparently by swamping the filesystem. I think a PAR::Packer executable makes a ton of temporary files as it pulls out files it needs from the archive. I think the windows file system totally chokes when there are a bunch of concurrent small file operations.
So how can I go about cutting down the perl I built to something like 6MB and a dozen files? I'm really only using a tiny number of core modules and don't need most of the crap in bin and lib, but I have no idea how to proceed ripping out stuff in a sane way.
Is there an automated way to strip away un-needed files and modules?
I know TCL has a bunch of facilities for packing files into a single uncompressed archive that can then be accessed through a "virtual filesystem" without expanding the file. Is there some way to do this with perl itself sort of like with PAR? The problem is PAR compresses everything and then has to extract to temporary files, rather than directly work through a virtual filesystem layer. (If I understand correctly.)
My usage of perl is actually as a scripting layer. It's embedded in a simulation. So I'm really running my_simulation.exe, which depends on perl514.dll, but you get the idea. I also cannot realistically do anything to the HTCondor cluster other than use it. So there's no need to think outside the box on what I should be using instead of perl or what I could administratively tweak in Windows and HTCondor, thanks.
You can use Module::ScanDeps to get the list of your perl program's actual dependencies. It was terrible that PAR::Packer took a significant amount of time unpacking the whole application, so I decided to build the executable myself.
Here is my ready-to-use script which gathers perl dependencies into some directory; it might be useful for you to reduce the number of perl modules, e.g. by manually removing some dependencies after copying.
In theory (I have never tried it), your next step could be to merge all pure-perl dependencies into a single file (like deps.pm), although that might be non-trivial due to perl's autoload magic and some other tricks.
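Since the script itself is not reproduced here, a minimal sketch of the gathering approach with Module::ScanDeps might look like this (the driver script name and the output directory are placeholders):

use strict;
use warnings;
use Module::ScanDeps;
use File::Basename qw(dirname);
use File::Copy qw(copy);
use File::Path qw(make_path);

my $script  = 'my_driver.pl';   # the program to analyze
my $destdir = 'deps';           # where to gather the modules

# recurse => 1 follows dependencies of dependencies.
my $deps = scan_deps(files => [$script], recurse => 1);

for my $key (sort keys %$deps) {
    my $src = $deps->{$key}{file};   # absolute path to the module file
    my $dst = "$destdir/$key";       # keep the lib-relative layout
    make_path(dirname($dst));
    copy($src, $dst) or die "copy $src -> $dst failed: $!";
}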
You can list the modules that are needed by your program using the very nice ListDependencies module
To my knowledge it isn't downloadable anywhere, but it is simple to copy and paste into your own ListDependencies.pm file
You should read the POD documentation within the module for usage instructions

Copying a constantly changing directory

I am trying to copy files from a directory that is in constant use by a security cam program. To archive these .jpg files to another HD, I first need to copy them. The problem is, the directory is being filled as the copying proceeds, at a rate of about 10 .jpgs per second. I have the option of stopping the program, doing the copy, then starting it again, which is not what I want to do for many reasons. Or I could do the find/mtime approach. I have tried the following:
find /var/cache/zm/events/* -mmin +5 -exec cp -r {} /media/events_cache/ \;
Which under normal circumstances would work. But it seems the directories are also changing their timestamps and branching off in different directions, so it never comes out logically, and for some reason each directory is very deep, like /var/cache/zm/events/../../../../../../../001.jpg x 3000. All I want to do is copy the files and directories via cron with a simple command line if possible. With the directories constantly changing, is there a way to make this copy without stopping the program?
Any insight would be appreciated.
rsync should be a better option in this case, but you will need to try it out. Try setting it up at off-peak hours when the traffic is not that high.
Another option would be setting up the directory on a volume which uses, say, mirroring or RAID 5; this way you do not have to worry about losing data (if that indeed is your concern).
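If you would rather script the find/mtime approach in Perl, here is a minimal sketch that copies only files untouched for at least five minutes; the paths come from the question, and the five-minute threshold is an assumption:

use strict;
use warnings;
use File::Find;
use File::Basename qw(dirname);
use File::Copy qw(copy);
use File::Path qw(make_path);

my $src = '/var/cache/zm/events';
my $dst = '/media/events_cache';

find(sub {
    return unless -f $_;
    # Skip files modified in the last 5 minutes: likely still being written.
    return if time - (stat $_)[9] < 5 * 60;

    (my $rel = $File::Find::name) =~ s{^\Q$src\E/?}{};
    my $dest = "$dst/$rel";
    make_path(dirname($dest));
    copy($File::Find::name, $dest) or warn "copy $File::Find::name: $!";
}, $src);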

Automatically running a script to read particular information from a .txt file? (Perl script, or suggest an alternative)

My scenario: text files will keep coming into, say, a folder. I need to detect each new text file and read particular information from it, the format being something like (word : info, or a word with a column of info under it, etc.). And this process needs to keep going on always.
Problem: How should I go about doing this? I guess I'd use a Perl script, but where do I go from there? I am getting ideas, and also help on the internet, but I thought asking here might make my thoughts clearer.
Kindly help; please suggest a path to do this.
First thing: you want a daemon process, so you may want to have a look at Proc::Daemon.
Second thing, you need to read and parse your file. Parsing depends on its format, and your question is not really clear about that.
Finally, you may want to consider moving (or renaming) a newly detected file while processing it, and then possibly deleting it once processed, depending on your requirements. Alternatively, you may want to move newly detected files into an archive directory after processing them.
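A minimal sketch combining those pieces; the watch directory, the .txt filter, the poll interval, and the empty process() routine are all assumptions to fill in:

use strict;
use warnings;
use Proc::Daemon;

# Detach from the terminal and keep running in the background.
# Note: Init closes STDOUT/STDERR, so real code should log to a file.
Proc::Daemon::Init();

my $dir = '/path/to/incoming';
my %seen;   # files already handled during this run

while (1) {
    opendir my $dh, $dir or die "opendir $dir: $!";
    for my $file (grep { /\.txt\z/ && !$seen{$_}++ } readdir $dh) {
        process("$dir/$file");
    }
    closedir $dh;
    sleep 5;   # poll interval: your call
}

sub process {
    my ($path) = @_;
    # Parsing depends on the file format, which the question leaves open.
}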
One approach might be to have a perl process that regularly (say every 5 seconds, every 5 minutes or every 5 hours, your call really) scans said directory and, as soon as any new text file appears, spawns a child process to process it.
The child process might be another perl script which gets the name of the text file as its argument and which reads the file, detects the word you mention and then extracts the information you are interested in (and then does whatever you consider necessary with that information).
Things to look out for include what to do with the text files once they are processed. Are they supposed to stay around? Then you need to keep track of which of them you have processed, so they do not get processed again in case your master process (the one that scans the directory and spawns perl children) has to be restarted (due to either a crash or a deliberate restart).
If the text files are supposed to disappear once they are processed, then it could be a good idea to either let the children remove them after completion or let the master process remove them, provided the master process always waits for the children to complete before it continues running. The drawback with a master process waiting for children to complete is that children then cannot run in parallel but have to run in strict sequence (not necessarily a drawback, depending on your situation).
(If you have a master process always waiting for the child process to run, you can actually skip having child processes altogether and create a subroutine in the master program which reads and processes the text file).
High level description but hope it helps.
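A minimal sketch of the spawning part of that master process; the worker script name process_file.pl is hypothetical:

use strict;
use warnings;

my @new_files = @ARGV;   # in the real daemon this comes from the directory scan

for my $file (@new_files) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: hand the file name to the worker script as its argument.
        exec 'perl', 'process_file.pl', $file or die "exec failed: $!";
    }
}

# Master waits for all children before continuing; forking and waiting
# inside the loop instead would give the strictly sequential behaviour.
1 while wait() != -1;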
What is the operating system you are using?
On Windows, you can use Win32::ChangeNotify and on Linux, you can use Linux::Inotify2 to be notified of changes to the contents of a directory.
Your script can simply wait to be notified and take action when notified, instead of polling the contents of the directory, which would either waste resources or potentially miss some changes.

What's the best way to perform a differential between a list of directories?

I am interested in looking at a list of directories and comparing the previous list with a current list of directories and setting up a script to do so. Maybe in perl or as a shell script.
Should I use something like diff? Programmatically, what would be an ideal way to do this? For example, let's say I output the diff to a file: if there is no difference, exit; if there are results, I want to see them.
Let's say, for example, I have the following directories today:
/foo/bar/staging/abc
/foo/bar/staging/def
/foo/bar/staging/a1b2c3
The next day would look like this, where a directory is either added or renamed:
/foo/bar/staging/abc
/foo/bar/staging/def
/foo/bar/staging/ghi
/foo/bar/staging/a1b2c4
There might be better ways, but the way I typically do something like this is to run a find command in each directory root, and pipe the output to separate files. You can then diff the files using the diff tool of your choice. If you want to filter out certain directories or files, you can throw in some grep or grep -v commands in the pipeline, or you can experiment with options on the find command.
The other main option is to find a diff tool that offers directory/folder comparisons. Most of the good ones support this, but I like the command line method, because you get more control over what you're diffing.
cd /my/directory/one
find . -print | sort > /temp/one.txt
cd /my/directory/two
find . -print | sort > /temp/two.txt
diff /temp/one.txt /temp/two.txt
Also check the inotifywait command. It allows you to monitor files in real time.
You might also consider the find command using the -newer switch.
The usage is:
find . -newer timefile.txt -print
The -newer switch makes find return a list of files that were created or updated after the specified file's modification time. In the example above, any file created or updated after timefile.txt would be returned. You'd have to create a timefile.txt file, most likely once per day. Some versions of find have variations of -newer that compare against other timestamps for a file (last modified, last accessed, last created, etc.).
This technique would not report a file that was deleted, however. A daily diff of the file listings could report that.

How to detect changing directory size in Perl

I am trying to find a way of monitoring directories in Perl, in particular the size of a directory, and upon detecting a change in directory size, perform a particular action.
The issue I have is with large files that require a noticeable amount of time to copy into this directory, i.e. > 100MB. What happens (in Windows, not Unix) is the system reserves enough disk space for the entire file even though the copy is still in progress. This causes problems for me, because my script will try to perform an action on a file that has not finished copying over. I can easily detect directory size changes in Unix via 'du', but 'du' in Windows does not behave the same way.
Are there any accurate methods of detecting directory size changes in Perl?
Edit: Some points to clarify:
- My Perl script is only monitoring a particular directory, and upon detecting a new file or a new directory, performs an action on it. It is not copying any files; users on the network will be copying files into the directory I am monitoring.
- The problem occurs when a new file or directory appears (copied, not moved) that is significantly large (> 100MB, but usually a couple GB) and my program fires before this copy completes
- In Unix I can easily 'du' to see that the file/directory in question is growing in size, and take the appropriate action
- In Windows the size is static, so I cannot detect this change
- opendir/readdir/closedir is not feasible, as some of the directories that appear may contain thousands of files, and I want to avoid the overhead of enumerating them all on every poll
Ideally I would like my program to be triggered on change, but I am not sure how to do this. As of right now it busy waits until it detects a change. The change in file/directory size is not in my control.
You seem to be working around the underlying issue rather than addressing it -- your program is not properly sending a notification when it is finished copying a file. Why not do that instead of using OS-specific mechanisms to try to indirectly determine when the operation is complete?
You can use Linux::Inotify2 or Win32::ChangeNotify to detect directory/file changes.
EDIT: File::ChangeNotify seems a better option (cross-platform, and used by Catalyst).
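A minimal sketch with File::ChangeNotify; the watched directory is a placeholder:

use strict;
use warnings;
use File::ChangeNotify;

# Picks the best backend available (e.g. inotify on Linux) and falls
# back to a portable polling implementation elsewhere.
my $watcher = File::ChangeNotify->instantiate_watcher(
    directories => ['/dir/to/monitor'],
);

# Block until something changes, then handle each event.
while (my @events = $watcher->wait_for_events) {
    for my $event (@events) {
        printf "%s: %s\n", $event->path, $event->type;   # create/modify/delete
    }
}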
As I understand it, you are polling a directory with thousands of files. When you see a new file, an action is taken on it. This causes problems if the file is in use or still being copied, correct?
There are potentially several solutions:
1) Use flock to detect if the file is still in use by another process (test whether it works properly on your OS, file system, and Perl version); see the sketch after this list.
2) Use a LockFile call on Windows. If it fails, the OS or another process is using that file.
3) Change the poll interval to a non busy time on the server and take the directory off line while your process completes.
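A minimal sketch of option 1; note that flock only helps if the writing process also takes a lock, which is an assumption here (as is the file path):

use strict;
use warnings;
use Fcntl qw(:flock);

# Returns true if an exclusive, non-blocking lock cannot be taken,
# i.e. some cooperating process presumably still holds the file.
sub file_in_use {
    my ($path) = @_;
    open my $fh, '<', $path or return 1;   # cannot even open: treat as busy
    my $locked = flock $fh, LOCK_EX | LOCK_NB;
    flock $fh, LOCK_UN if $locked;         # release immediately
    close $fh;
    return !$locked;
}

print "still busy\n" if file_in_use('/incoming/bigfile.dat');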
Evaluating the size of a directory is something all but the most inexperienced Perl programmers should be able to do. You can write your own portable version of du in 15 lines of code if you know about the following (a sketch follows the list):
Either glob or opendir / readdir / closedir to iterate through the files in a directory
The filetest operators (-f file, -d file, etc.) to distinguish between regular files and directory names
The stat function or file size operator -s file to obtain the size of a file
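A minimal sketch built from exactly those three ingredients:

use strict;
use warnings;

# Recursively sum the sizes of all regular files under a directory.
sub du {
    my ($dir) = @_;
    my $total = 0;
    opendir my $dh, $dir or die "opendir $dir: $!";
    for my $entry (readdir $dh) {
        next if $entry eq '.' or $entry eq '..';
        my $path = "$dir/$entry";
        if (-f $path)                 { $total += -s $path }   # file size in bytes
        elsif (-d $path && !-l $path) { $total += du($path) }  # recurse, skip symlinks
    }
    closedir $dh;
    return $total;
}

printf "%d bytes\n", du($ARGV[0] // '.');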
There is a nice module called File::Monitor. It will detect new files, deleted files, changes in size, and any other attribute that can be checked with stat. It will then report the changed files for you.
http://metacpan.org/pod/File::Monitor
You set up a baseline scan, then set up a callback for each event you are looking for, so you can see new changes via:
use File::Monitor;

my $monitor = File::Monitor->new;

$monitor->watch( {
    name     => 'somedir',
    recurse  => 1,
    callback => {
        files_created => sub {
            my ($name, $event, $change) = @_;
            # Do stuff
        },
    },
} );

$monitor->scan;   # first call establishes the baseline; later calls report changes
If you need to go deeper than one level, just do it to whatever level you need. After this is done and it finds new files, you can trigger your application to do what you want with the files.