I am writing a Perl script (on Windows) that uses File::Find to index a network file system. It works great, but it takes a very long time to crawl the file system. I was thinking it would be nice to somehow get a checksum of a directory before traversing it, and if the checksum matches the checksum taken on a previous run, skip traversing that directory. This would eliminate a lot of processing, since the files on this file system do not change often.
On my AIX box, I use this command:
csum -h MD5 /directory
which returns something like this:
5cfe4faf4ad739219b6140054005d506 /directory
The command takes very little time:
time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506 /directory
real 0m0.00s
user 0m0.00s
sys 0m0.00s
I have searched CPAN for a module that will do this, but it looks like all the modules will give me the MD5sum for every file in a directory, not for the directory itself.
Is there a way to get the MD5sum for a directory in Perl, or even in Windows for that matter as I could call a Win32 command from Perl?
Thanks in advance!
Can you just read the last modified dates of the files and folders? Surely that's going to be faster than building MD5's?
In order to get a checksum you must read the files; this means you will need to walk the filesystem, which puts you back in the same boat you are trying to get out of.
From what I know, you cannot get an MD5 of a directory. md5sum on other systems complains when you give it a directory. csum is most likely hashing the top-level directory's entry listing, not traversing the tree.
You can grab the modified times for the files and hash them how you like by doing something like this:
sub dirModified {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @dircontents = readdir $dh;
    closedir $dh;
    foreach my $item (@dircontents) {
        next if $item eq '.' || $item eq '..';
        my $path = "$dir/$item";
        if ( -f $path ) {
            printf "%s : %s - do stuff here\n", -M $path, $path;
        }
        elsif ( -d $path ) {
            dirModified($path);
        }
    }
}
Yes it will take some time to run.
In addition to the other good answers, let me add this: if you want a checksum, then please use a checksum algorithm instead of a (broken!) hash function.
I don't think you need a cryptographically secure hash function in your file indexer -- instead you need a way to see whether a directory listing has changed without storing the entire listing. Checksum algorithms do exactly that: they return a different output when the input changes. They can also be faster, since they are simpler than hash functions.
It is true that a user could change a directory in a way that wouldn't be discovered by the checksum. However, a user would have to change the file names like this on purpose since normal changes in file names will (with high probability) give different checksums. Is it then necessary to guard against this "attack"?
One should always consider the consequences of each attack and choose the appropriate tools.
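For example, here is a rough sketch of that idea: checksum the directory listing itself (names plus modification times) instead of hashing every file's contents. The Digest::CRC module and the directory path are assumptions for illustration, not part of the original question.
use strict;
use warnings;
use Digest::CRC qw(crc32);

# Checksum the directory listing (names and mtimes) rather than file contents,
# so any rename or edit in the directory changes the checksum.
sub dir_listing_checksum {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    my $data = join "\n",
        map { $_ . '|' . ((stat "$dir/$_")[9] // '') } @entries;
    return crc32($data);
}

# Compare this value against one saved from a previous run before descending.
print dir_listing_checksum('/some/network/dir'), "\n";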
I did one of these in Python if you're interested:
http://akiscode.com/articles/sha-1directoryhash.shtml
I have a Perl script that I wrote that gets some image URLs, puts the URLs into an input file, and proceeds to run wget with the --input-file option. This works perfectly... or at least it did as long as the image filenames were unique.
I have a new company sending me data and they use a very TROUBLESOME naming scheme. All files have the same name, 0.jpg, in different folders.
for example:
cdn.blah.com/folder/folder/202793000/202793123/0.jpg
cdn.blah.com/folder/folder/198478000/198478725/0.jpg
cdn.blah.com/folder/folder/198594000/198594080/0.jpg
When I run my script with this, wget works fine and downloads all the images, but they are titled 0.jpg.1, 0.jpg.2, 0.jpg.3, etc. I can't just count them and rename them because files can be broken, not available, whatever.
I tried running wget once for each file with -O, but it's embarrassingly slow: starting the program, connecting to the site, downloading, and ending the program. Thousands of times. It's an hour vs minutes.
So, I'm trying to find a method to change the output filenames from wget without it taking so long. The original approach works so well that I don't want to change it too much unless necessary, but I am open to suggestions.
Additional:
LWP::Simple is too simple for this. Yes, it works, but very slowly. It has the same problem as running individual wget commands: each get() or getstore() call makes the system re-connect to the server. The files are so small (60 kB on average) and there are so many to process (1,851 for this one test file alone) that the connection time is considerable.
The filename I will be using can be found with /\/(\d+)\/(\d+.jpg)/i, where the filename will simply be $1$2, giving 2027931230.jpg. Not really important for this question.
I'm now looking at LWP::UserAgent with LWP::ConnCache, but it times out and/or hangs on my PC. I will need to adjust the timeout and retry values. The inaugural run of the code downloaded 693 images (43 MB) in just a couple of minutes before it hung. Using LWP::Simple, I only got 200 images in 5 minutes.
use LWP::UserAgent;
use LWP::ConnCache;

# INPUTFILE is assumed to be opened elsewhere, and $folder set to the output directory.
chomp(my @filelist = <INPUTFILE>);

my $browser = LWP::UserAgent->new;
$browser->conn_cache(LWP::ConnCache->new());

foreach (@filelist) {
    next unless /\/(\d+)\/(\d+\.jpg)/i;
    my $newfilename = $1 . $2;
    my $response = $browser->mirror($_, $folder . $newfilename);
    die 'response failure' if $response->is_error();
}
LWP::Simple's getstore function allows you to specify a URL to fetch from and the filename to store the data from it in. It's an excellent module for many of the same use cases as wget, but with the benefit of being a Perl module (i.e. no need to outsource to the shell or spawn off child processes).
use LWP::Simple;

# Grab the filename from the end of the URL
my $filename = (split '/', $url)[-1];

# If the file exists, increment its name
while (-e $filename)
{
    $filename =~ s{ (\d+)[.]jpg }{ $1+1 . '.jpg' }ex
        or die "Unexpected filename encountered";
}

getstore($url, $filename);
The question doesn't specify exactly what kind of renaming scheme you need, but this will work for the examples given by simply incrementing the filename until the current directory doesn't contain that filename.
I think I've read how to do this somewhere but I can't find where. Maybe it's only possible in new(ish) versions of Perl. I am using 5.14.2:
I have a Perl script that writes down results into a file if certain criteria are met. It's more logical given the structure of the script to write down the results and later on check if the criteria to save the results into a file are met.
I think I've read somewhere that I can write content into a filehandle, which in Linux I guess will correspond to a temporary file or a pipe of some sort, and then give that file its name, including the directory where it should live, later on. If not, the content will be discarded when the script finishes.
Other than faffing around with temporary files and deleting them manually, is there a straightforward way of doing this in Perl?
There's no simple (UNIX) facility for what you describe, but the behavior can be composed out of basic system operations. Perl's File::Temp already does most of what you want:
use File::Temp;

my $tmp = File::Temp->new;       # Will be unlinked at end of program.
while ($work_to_do) {
    print $tmp a_lot_of_stuff(); # $tmp is a filehandle
}
if ($save_it) {
    rename($tmp, $new_file);     # $tmp is also a string. Move (rename) the file.
                                 # If you need this to work across filesystems, you
                                 # might want to "use File::Copy qw(move)" instead.
}
exit;                            # $tmp will be unlinked here if it was not renamed
I use File::Temp for this.
But you should keep in mind that File::Temp deletes the file by default. That is OK, but in my case I don't want that when debugging: if the script terminates and the output is not the desired one, I cannot check the temp file.
So I prefer to set $File::Temp::KEEP_ALL = 1, or call $fh->unlink_on_destroy(0) when using the OO interface, or use ($fh, $filename) = tempfile($template, UNLINK => 0), and then unlink the file myself or move it to a proper place.
It would be safer to move the file after closing the filehandle (just in case there is some buffering going on). So I would prefer an approach where the temp file is not deleted by default, and then, when all is done, a conditional that either deletes it or moves it to the desired place and name.
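A minimal sketch of that keep-then-decide pattern, where the final path and the $keep flag are placeholders for whatever your script actually uses:
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Copy qw(move);

# UNLINK => 0 keeps the file around so it can be inspected if something goes wrong.
my ($fh, $tmpname) = tempfile('resultsXXXX', TMPDIR => 1, UNLINK => 0);
print {$fh} "some results\n";
close $fh or die "close failed: $!";    # close first so everything is flushed

my $keep = 1;                           # stand-in for your real criteria
if ($keep) {
    move($tmpname, '/path/to/final_name.txt') or die "move failed: $!";
}
else {
    unlink $tmpname or warn "could not remove $tmpname: $!";
}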
I'm debugging a piece of code which uses the Perl '-s' function to get the size of some files.
my $File1 = '/myfolder/.../mysubfolder1/document.pdf';
my $File2 = '/myfolder/.../mysubfolder2/document.pdf';
my $File3 = '/myfolder/.../mysubfolder1/document2.pdf';  # $File3 is actually a link to /myfolder/.../mysubfolder2/document.pdf, aka $File2
The code which is buggy is:
my $size = int((-s $File)/1024);
Where $File is replaced with $File1 - $File3.
For some reason I can't explain, this does not work on every file.
For $File1 and $File3 it works, but not for $File2. I could understand if both $File2 and $File3 did not work; it would mean that the file /myfolder/.../mysubfolder2/document.pdf is somehow corrupt.
I even added a test, if (-e $File) {, before the -s to be sure the file exists, but all three files do exist.
There is an even stranger thing: there is an .htaccess in /myfolder/.../mysubfolder1/ but no .htaccess in /myfolder/.../mysubfolder2/. If it were the other way around, I would think the .htaccess was somehow blocking the -s call.
Any thoughts?
If -s fails, it returns undef and sets the error in $!. What is $!?
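For example, a quick check along those lines, using the three paths from the question, would print the reason for any failure:
for my $File ($File1, $File2, $File3) {
    my $size = -s $File;
    warn "-s failed for $File: $!\n" unless defined $size;
}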
I suppose that if you check the size of that file with "stat" you will get something less than 1024 bytes :)
Your int((-s $fn)/1024) will return 0 if the size is less than 1024 bytes.
To address the end of your comment, the .htaccess file controls access to files requested through the web server. Once the user requests a URL which executes a valid, permissible CGI/whatever script (I'm assuming your Perl code runs in a web context), THAT script has absolutely no permission issues regarding .htaccess (unless you actually code your Perl to read its contents and respect them explicitly by hand).
The only permissions that can get in your Perl script's way are the file system permissions in your OS.
To get the file size, your web user needs:
Execute permission on the directory containing the file
Possibly, read permission on the directory containing the file (not sure if the file size is stored in the inode?)
Possibly, read permission on the file itself.
If all your 3 files (2 good and 1 bad) are in the same directory, check the file's read permissions.
If they are in different directories, check the file's read perms AND directory perms.
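A small diagnostic along those lines might look like the following; the path is a placeholder, and note that -x and -r test against the effective UID of the process running the script, which may not be the web server user.
use strict;
use warnings;
use File::Basename qw(dirname);

my $file = '/myfolder/mysubfolder2/document.pdf';   # substitute the failing file
my $dir  = dirname($file);

printf "dir  %s: execute=%d read=%d\n", $dir,  (-x $dir  ? 1 : 0), (-r $dir  ? 1 : 0);
printf "file %s: exists=%d read=%d\n",  $file, (-e $file ? 1 : 0), (-r $file ? 1 : 0);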
Change int((-s $file)/1024) to sprintf('%.0f', (-s $file)/1024) and you'll see something then; the file is probably under 1024 bytes, so int() will happily return 0.
Apache version: 2.2.11 (Unix)
Architecture: x86_64
Operating system: Linux
Kernel version: 2.6.18-164.el5
Ok, here is what I have working. However, I may not be using File::Util for anything else in the rest of the script.
My directory names are 8 digits, starting at 10000000.
I was comparing the highest number found with the most recently created directory (via stat) as a double check, but I believe that is overkill.
Another issue is that I did not know how to put a regex into the list_dir call so that only 8-digit names (e.g. m!^([0-9]{8})\z!x) would end up in the list. Reading the man page, the example reads ... '--pattern=\.txt$'), but my attempt, '--pattern=m!^([0-9]{8})\z!x)', was just that: futile.
So, would there be a "better" way to grab the latest folder/directory?
use File::Util;

my $f    = File::Util->new();
my @dirs = $f->list_dir('/home/accountname/public_html/topdir', '--no-fsdots');
my @last = sort { $b <=> $a } @dirs;
my $new  = $last[0] + 1;

print "Content-type: text/html\n\n";
print "I will now create dir $new\n";
And... how would I ignore anything not matching my regex?
I was thinking an answer may lie in ls -d as well, but as a beginner here I am new to system calls from a script (if that is in fact what that would be ;-) ).
More specifically:
Best way to open a directory, return the name of the latest 8 digit directory in that directory ignoring all else. Increase the 8 digit dir name by 1 and create the new directory.
Whichever is most efficient: stat or actual 8 digit file name. (directory names are going to be 8 digits either way.) Better to use File::Util or just built in Perl calls?
What are you doing? It sounds really weird and fraught with danger. I certainly wouldn't want to let a CGI script create new directories. There might be a better solution for what you are trying to achieve.
How many directories do you expect to have? The more entries you have in any directory, the slower things are going to get. You should work out a scheme where you can hash things into a directory structure that spreads out the files so no directory holds that many items. Say, if you have the name '0123456789', you create a directory structure like:
0/01/0123456789
You can have as many directory levels as you like. See the directory structure of CPAN, for instance. My author name is BDFOY, so my author directory is authors/id/B/BD/BDFOY. That way there isn't any directory that has a large number of entries (unless your author id is ADAMK or RJBS).
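A rough sketch of deriving such a nested path from a name; the one- and two-character prefixes are arbitrary choices for illustration, not a fixed rule:
use strict;
use warnings;
use File::Spec::Functions qw(catdir);

# Build a CPAN-style nested path so no single directory holds millions of entries.
sub nested_path {
    my ($name) = @_;
    return catdir(substr($name, 0, 1), substr($name, 0, 2), $name);
}

print nested_path('0123456789'), "\n";   # prints 0/01/0123456789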
You also have a potential contention issue to work out. Between the time you discover the latest and the time you try to make the next one, you might already create the directory.
As for the task at hand, I think I'd punt to system for this one if you are going to have a million directories. With something like:
ls -t -d -1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -1
I don't think you'll be able to get any faster than ls for this task. If there are a large number of directories, the cost of the fork should be outweighed by the work you have to do to go through everything yourself.
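If you go that route, one way to drive it from Perl might look like this, assuming a Unix-ish shell and that the glob only matches the directories you care about:
chdir '/home/accountname/public_html/topdir' or die "chdir failed: $!";
chomp(my $latest = `ls -t -d -1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -1`);
die "no 8-digit directory found\n" unless $latest =~ /\A[0-9]{8}\z/;
my $next = sprintf '%08d', $latest + 1;
mkdir $next or die "mkdir $next failed: $!";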
I suspect, however, that what you really need is some sort of database.
Best way to open a directory, return the name of the latest 8 digit directory in that directory ignoring all else. Increase the 8 digit dir name by 1 and create the new directory. Whichever is most efficient: stat or actual 8 digit file name?
First, I should point out that having about 100,000,000 subdirectories in a directory is likely to be very inefficient.
How do you get only the directory names that consist of eight digits?
use File::Slurp;
my @dirs = grep { /\A[0-9]{8}\z/ and -d "$top/$_" } read_dir $top;
How do you get the largest?
use List::Util qw( max );
my $latest = max #dirs;
Now, the problem is, between the determination of $latest and the attempt to create the directory, some other process can create the same directory. So, I would use $latest as the starting point and keep trying to create the next directory until I succeed or run out of numbers.
#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;
use File::Spec::Functions qw( catfile );
use List::Util qw( max );

sub make_numbered_dir {
    my $max = 100_000_000;
    my $top = '/home/accountname/public_html/topdir';

    my $latest = max grep { /\A[0-9]{8}\z/ } read_dir $top;

    while ( ++$latest < $max ) {
        mkdir catfile($top, sprintf '%8.8d', $latest)
            and return 1;
    }

    return;
}
If you try to do it the way I originally recommended, you will invoke mkdir way too many times.
As for how you use File::Util::list_dir to filter entries:
#!/usr/bin/perl

use strict;
use warnings;

use File::Util;

my $fu = File::Util->new;

print "$_\n" for $fu->list_dir('.',
    '--no-fsdots',
    '--pattern=\A[0-9]{8}\z'
);
C:\Temp> ks
10001010
12345678
However, I must point out that I did not much like this module in the few minutes I spent with it, especially the module author's obsession with invoking methods and functions in list context. I do not think I will be using it again.
I have written a Perl script which opens a directory containing various files. It seems that the script does not read the files in any sequential order (neither alphabetically nor by size); instead it reads them in what looks like random order. I was wondering what could be the reason for this?
It's never random, it's just in a pattern that you don't recognize. If you look at the documentation describing the implementation of whatever function you're using to read the directory, it will probably say something like "the order in which files are returned is not guaranteed".
If you need them in a specific order, sort the names before you operate on them.
The files are probably read in an order that's convenient for the underlying file system. So, in a sense, the files are ordered, but not in an order you expect (size or alphabetical). Sometimes, files have an internal numerical id, and the files may be returned in numerical order given this id. But this id is something you probably won't encounter often if ever.
Again, the results are ordered, not random. They're just in an order that you're not expecting. If you need them ordered, order them explicitly.
See also: http://www.perlmonks.org/?node_id=175864
It's probably reading them according to the order they're stored in the directory's list of files. On certain Unix-like filesystems, the directory is essentially an unordered list of filenames and inodes that point to the contents (this is tremendously simplified).
Directory entries are not stored in sorted order and you should not assume they're stored that way. If you want to sort them, you have to sort them. For example, compare the output of:
perl -e 'opendir DIR, "."; print join("\n", sort readdir(DIR)); print "\n";'
perl -e 'opendir DIR, "."; print join("\n", readdir(DIR)); print "\n";'
If your script uses opendir() (directly or indirectly), you cannot assume anything about the ordering of the files it returns; it will depend on the OS and the type of filesystem you are accessing. A couple of options are:
use two loops: one to read all the filenames, and a second to process them in the order you require (see the sketch after this list).
use some other command (like invoking "ls") to force the filenames to be returned in the order you require.
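A minimal sketch of that first option, with the directory path as a placeholder:
use strict;
use warnings;

my $dir = '/some/dir';
opendir my $dh, $dir or die "Cannot open $dir: $!";
# First pass: collect the names.
my @names = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# Second pass: process them in the order you need (alphabetical here).
for my $name (@names) {
    print "$name\n";    # do the real work on "$dir/$name" here
}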