Apache version 2.2.11 (Unix)
Architecture x86_64
Operating system Linux
Kernel version 2.6.18-164.el5
Ok, here is what I have working. However, I may not be using File::Util for anything else in the rest of the script.
My directory names are 8 digits, starting at 10000000.
I was comparing the highest number found with the stat of the most recently created directory as a double check, but I believe that is overkill.
Another issue is that I did not know how to put a regex into the list_dir call so that only 8-digit names, e.g. m!^([0-9]{8})\z!x, would be returned. Reading the documentation, the example reads '--pattern=\.txt$', but my attempt, '--pattern=m!^([0-9]{8})\z!x', was futile.
So, would there be a "better" way to grab the latest folder/directory?
use File::Util;
my ($f) = File::Util->new();
my @dirs = $f->list_dir('/home/accountname/public_html/topdir', '--no-fsdots');
my @last = sort { $b <=> $a } @dirs;
my $new = $last[0] + 1;
print "Content-type: text/html\n\n";
print "I will now create dir $new\n";
And how would I ignore anything not matching my regex?
I was also thinking an answer might lie in ls -d but, as a beginner, I am new to making system calls from a script (if that is in fact what that would be ;-) ).
More specifically:
Best way to open a directory, return the name of the latest 8-digit directory in that directory ignoring all else, increase the 8-digit dir name by 1, and create the new directory.
Whichever is most efficient: stat or the actual 8-digit directory name? (Directory names are going to be 8 digits either way.) Is it better to use File::Util or just built-in Perl calls?
What are you doing? It sounds really weird and fraught with danger. I certainly wouldn't want to let a CGI script create new directories. There might be a better solution for what you are trying to achieve.
How many directories do you expect to have? The more entries you have in any directory, the slower things are going to get. You should work out a scheme where you can hash things into a directory structure that spreads out the files so no directory holds that many items. Say, if you have the name '0123456789', you create a directory structure like:
0/01/0123456789
You can have as many directory levels as you like. See the directory structure of CPAN, for instance. My author name is BDFOY, so my author directory is authors/id/B/BD/BDFOY. That way there isn't any directory that has a large number of entries (unless your author id is ADAMK or RJBS).
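A minimal sketch of that scheme (the hashed_path helper and the two-level split are purely illustrative, not part of any module):

use strict;
use warnings;
use File::Spec::Functions qw( catdir );

# Derive nesting levels from the leading characters of the name so
# no single directory accumulates too many entries.
sub hashed_path {
    my ($name) = @_;
    return catdir( substr($name, 0, 1), substr($name, 0, 2), $name );
}

print hashed_path('0123456789'), "\n";   # prints 0/01/0123456789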
You also have a potential contention issue to work out. Between the time you discover the latest directory and the time you try to make the next one, another process might already have created it.
As for the task at hand, I think I'd punt to system for this one if you are going to have a million directories. With something like:
ls -t -d -1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -1
I don't think you'll be able to get any faster than ls for this task. If there are a large number of directories, the cost of the fork should be outweighed by the work you have to do to go through everything yourself.
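If you go that route, a hedged sketch of driving that pipeline from Perl (the path, error handling, and zero-padding are illustrative, and ls -t picks the most recently modified entry):

use strict;
use warnings;

# Assumes the 8-digit subdirectories live directly under this path.
chdir '/home/accountname/public_html/topdir' or die "chdir failed: $!";

chomp( my $latest = `ls -t -d -1 [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -1` );
die "No 8-digit directories found\n" unless length $latest;

my $next = sprintf '%08d', $latest + 1;
mkdir $next or die "Cannot mkdir $next: $!";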
I suspect, however, that what you really need is some sort of database.
Best way to open a directory, return the name of the latest 8-digit directory in that directory ignoring all else, increase the 8-digit dir name by 1, and create the new directory. Whichever is most efficient: stat or the actual 8-digit directory name?
First, I should point out that having about 100,000,000 subdirectories in a directory is likely to be very inefficient.
How do you get only the directory names that consist of eight digits?
use File::Slurp;
my @dirs = grep { /\A[0-9]{8}\z/ and -d "$top/$_" } read_dir $top;
How do you get the largest?
use List::Util qw( max );
my $latest = max @dirs;
Now, the problem is, between the determination of $latest and the attempt to create the directory, some other process can create the same directory. So, I would use $latest as the starting point and keep trying to create the next directory until I succeed or run out of numbers.
#!/usr/bin/perl
use strict;
use warnings;

use File::Slurp;
use File::Spec::Functions qw( catfile );
use List::Util qw( max );

sub make_numbered_dir {
    my $max = 100_000_000;
    my $top = '/home/accountname/public_html/topdir';
    my $latest = max grep { /\A[0-9]{8}\z/ } read_dir $top;
    while ( ++$latest < $max ) {
        mkdir catfile($top, sprintf '%8.8d', $latest)
            and return 1;
    }
    return;
}
If you try to do it the way I originally recommended, you will invoke mkdir way too many times.
As for how you use File::Util::list_dir to filter entries:
#!/usr/bin/perl
use strict;
use warnings;

use File::Util;

my $fu = File::Util->new;

print "$_\n" for $fu->list_dir('.',
    '--no-fsdots',
    '--pattern=\A[0-9]{8}\z'
);
C:\Temp> ks
10001010
12345678
However, I must point out that I did not much like this module in the few minutes I spent with it, especially the module author's obsession with invoking methods and functions in list context. I do not think I will be using it again.
Related
I am running Perl in Windows and I am getting a list of all the files in a directory using readdir and storing the result in an array. The first two elements in the array seem to always be "." and "..". Is this order guaranteed (assuming the operating system does not change)?
I would like to do the following to remove these values:
my $directory = 'C:\\foo\\bar';
opendir my $directory_handle, $directory
or die "Could not open '$directory' for reading: $!\n";
my @files = readdir $directory_handle;
splice @files, 0, 2;    # Remove the "." and ".." elements from the array
But I am worried that it might not be safe to do so. All the solutions I have seen use regular expressions or if statements for each element in the array and I would rather not use either of those approaches if I don't have to. Thoughts?
There is no guarantee on the order of readdir. The docs state:
Returns the next directory entry for a directory opened by opendir.
The whole thing is stepping through entries in the directory in whatever order they're provided by the filesystem. There is no guarantee what this order may be.
The usual way to work around this is with a regex or string equality.
my @dirs = grep { !/^\.{1,2}\z/ } readdir $dh;
my @dirs = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
Because this is such a common issue, I'd recommend using Path::Tiny->children instead of rolling your own. They'll have figured out the fastest and safest way to do it, which is to use grep to filter out . and ... Path::Tiny fixes a lot of things about Perl file and directory handling.
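For example, a minimal sketch (the path is just illustrative):

use Path::Tiny qw(path);

# children() never returns '.' or '..', so no filtering is needed.
my @entries = path('C:/foo/bar')->children;

# If you want plain names rather than Path::Tiny objects:
my @names = map { $_->basename } @entries;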
This perlmonks thread from 2001 investigated this very issue, and Perl wizard Randal Schwartz concluded
readdir on Unix returns the underlying raw directory order. Additions and deletions to the directory use and free-up slots. The first two entries to any directory are always created as "dot" and "dotdot", and these entries are never deleted under normal operation.
However, if a directory entry for either of these gets incorrectly deleted (through corruption, or using the perl -U option and letting the superuser unlink it, for example), the next fsck run has to recreate the entry, and it will simply add it. Oops, dot and dotdot are no longer the first two entries!
So, defensive programming mandates that you do not count on the slot order. And there's no promise that dot and dotdot are the first two entries, because Perl can't control that, and the underlying OS doesn't promise it either.
Is there a Perl module (preferably core) that has a function that will tell me if a given filename is inside a directory (or a subdirectory of the directory, recursively)?
For example:
my $f = "/foo/bar/baz";
# prints 1
print is_inside_of($f, "/foo");
# prints 0
print is_inside_of($f, "/foo/asdf");
I could write my own, but there are some complicating factors such as symlinks, relative paths, whether it's OK to examine the filesystem or not, etc. I'd rather not reinvent the wheel.
Path::Tiny is not in core, but it has no non-core dependencies, so is a very quick and easy installation.
use Path::Tiny qw(path);
path("/usr/bin")->subsumes("/usr/bin/perl"); # true
Now, it does this entirely by looking at the file paths (after canonicalizing them), so it may or may not be adequate depending on what sort of behaviour you're expecting in edge cases like symlinks. But for most purposes it should be sufficient. (If you want to take into account hard links, the only way is to search through the entire directory structure and compare inode numbers.)
If you want the function to work for only a filename (without a path) and a path, you can use File::Find:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
sub is_inside_of {
    my ($file, $path) = @_;
    my $found;
    find( sub { $found = 1 if $_ eq $file }, $path );
    return $found;
}
If you don't want to check the filesystem, but only process the path, see File::Spec for some functions that can help you. If you want to process symlinks, though, you can't avoid touching the file system.
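As a hedged, path-only sketch of that approach (no symlink resolution; the function name just mirrors the one in the question):

use File::Spec::Functions qw( rel2abs canonpath splitdir );

# Purely lexical check: normalize both paths to absolute canonical form,
# then verify that every component of $dir prefixes $file.
sub is_inside_of {
    my ($file, $dir) = @_;
    my @file_parts = splitdir( canonpath( rel2abs($file) ) );
    my @dir_parts  = splitdir( canonpath( rel2abs($dir) ) );
    return 0 if @dir_parts > @file_parts;
    for my $i ( 0 .. $#dir_parts ) {
        return 0 unless $file_parts[$i] eq $dir_parts[$i];
    }
    return 1;
}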
I am supposed to traverse through a whole tree of folders and rename everything (including folders) to lower case. I looked around quite a bit and saw that the best way was to use File::Find. I tested this code:
#!/usr/bin/perl -w
use File::Find;
use strict;
print "Folder: ";
chomp(my $dir = <STDIN>);
find(\&lowerCase, $dir);
sub lowerCase {
    print $_, " = ", lc($_), "\n";
    rename $_, lc($_);
}
and it seems to work fine. But can anyone tell me if I might run into trouble with this code? I remember posts on how I might run into trouble because of renaming folders before files or something like that.
If you are on Windows, as comments stated, then no, renaming files or folders in any order won't be a problem, because a path DIR1/file1 is the same as dir1/file1 to Windows.
It MAY be a problem on Unix though, in which case you are better off doing a recursive BFS by hand.
Also, when doing system calls like rename, ALWAYS check the result:
rename($from, $to) || die "Error renaming $from to $to: $!";
As noted in the comments, take care when renaming "ABC" to "abc". On Windows this is not a problem.
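Putting the checked rename together with a bottom-up traversal, here is a hedged sketch using File::Find's finddepth, which visits a directory's contents before the directory itself, so renaming a folder never invalidates paths still waiting to be visited (it assumes a case-sensitive filesystem):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

print "Folder: ";
chomp( my $dir = <STDIN> );

finddepth( sub {
    return if $_ eq '.' or $_ eq '..';
    my $lower = lc $_;
    return if $lower eq $_;    # already lower-case
    rename $_, $lower
        or warn "Could not rename '$File::Find::name': $!\n";
}, $dir );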
Personally, I prefer to:
List files to be renamed using find dir/ > 2b_renamed
Review the list manually, using an editor of choice (vim 2b_renamed, in my case)
Use the rename from CPAN on that list: xargs rename 'y/A-Z/a-z/' < 2b_renamed
That manual review is very important to me, even when I can easily rollback changes (via git or even Time Machine).
Well I am back again, stuck on another seemingly simple routine.
I need to figure out how to do this with Perl.
1- I open a directory full of files named 1.txt, 2.txt ~ 100.txt.
(But sometimes the lowest-numbered filename could in fact be any number, e.g. 27.txt, because 0.txt through 26.txt have already been removed from the directory.)
(I found out how to implement a numeric sort, so the order returned is 1, 2, 3 rather than 1, 10, 11 ~ 2, 20.)
use POSIX;
my @files = </home/****/users/*.txt>;
foreach my $file (@files) {
    ## $file ABS($file)
    ## and so on..
    ## EXAMPLE NOT TRIED
}
2- I just want to return the lowest numbered file name in the directory into a $var.
Do I have to read the whole directory into an array, do a numeric sort, and then grab the first element off the array?
Is there a more efficient way to grab the lowest numbered file?
More info:
The files were created in a loop, so I also contemplated grabbing the oldest file first, if creation time is reliable enough. But I am a beginner and don't know whether creation time is accurate enough, how I would use it, or whether that is in fact a viable solution.
Thanks for the help, I always find the best people here.
use strict;
use warnings;
use File::Slurp qw(read_dir);
use File::Spec::Functions qw(catfile);
my $directory = 'some/directory';
my @files = read_dir($directory);

my @ordered;
{
    no warnings 'numeric';
    @ordered = sort { $a <=> $b } @files;
}

my $lowest_file = catfile $directory, $ordered[0];
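If you only care about the single lowest number, a hedged alternative (assuming every relevant name looks like "<number>.txt") is to extract the numeric prefixes and take the minimum with List::Util, skipping the full sort:

use File::Slurp qw(read_dir);
use File::Spec::Functions qw(catfile);
use List::Util qw(min);

my $directory = 'some/directory';

# Keep only the numeric prefixes of "<number>.txt" names.
my @numbers = map { /\A(\d+)\.txt\z/ ? $1 : () } read_dir($directory);

my $lowest_file = @numbers ? catfile( $directory, min(@numbers) . '.txt' ) : undef;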
I am writing a Perl script (in Windows) that is using File::Find to index a network file system. It works great, but it takes a very long time to crawl the file system. I was thinking it would be nice to somehow get a checksum of a directory before traversing it, and it the checksum matches the checksum that was taken on a previous run, do not traverse the directory. This would eliminate a lot of processing, since the files on this file system do not change often.
On my AIX box, I use this command:
csum -h MD5 /directory
which returns something like this:
5cfe4faf4ad739219b6140054005d506 /directory
The command takes very little time:
time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506 /directory
real 0m0.00s
user 0m0.00s
sys 0m0.00s
I have searched CPAN for a module that will do this, but it looks like all the modules will give me the MD5sum for every file in a directory, not for the directory itself.
Is there a way to get the MD5sum for a directory in Perl, or even in Windows for that matter as I could call a Win32 command from Perl?
Thanks in advance!
Can you just read the last modified dates of the files and folders? Surely that's going to be faster than building MD5's?
In order to get a checksum you must read the files, this means you will need to walk the filesystem, which puts you back in the same boat you are trying to get out of.
From what I know you cannot get an MD5 of a directory; md5sum on other systems complains when you give it a directory. csum is most likely giving you a hash of the top-level directory file's contents, not traversing the tree.
You can grab the modified times for the files and hash them how you like by doing something like this:
sub dirModified {
    my ($dir) = @_;
    opendir my $dh, $dir or return;
    my @dircontents = readdir $dh;
    closedir $dh;
    foreach my $item (@dircontents) {
        my $path = "$dir/$item";
        if ( -f $path ) {
            my $age = -M $path;    # days since the file was last modified
            print "$age : $path - do stuff here\n";
        } elsif ( -d $path && $item !~ /^\.+$/ ) {
            dirModified($path);    # recurse into subdirectories, skipping . and ..
        }
    }
}
Yes it will take some time to run.
In addition to the other good answers, let me add this: if you want a checksum, then please use a checksum algorithm instead of a (broken!) hash function.
I don't think you need a cryptographically secure hash function in your file indexer -- instead you need a way to see if there are changes in the directory listings without storing the entire listing. Checksum algorithms do that: they return a different output when the input is changed. They might do it faster since they are simpler than hash functions.
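A minimal sketch of that idea, assuming it is enough to detect changes in names, sizes, and mtimes (file contents are never read): build a listing string for the directory and fold it into a 32-bit checksum with Perl's built-in unpack '%' feature:

#!/usr/bin/perl
use strict;
use warnings;

# Cheap per-directory fingerprint: name, size, and mtime of each entry,
# combined into a 32-bit additive checksum. This notices added, removed,
# or renamed entries and metadata changes, but not content changes that
# leave size and mtime untouched.
sub dir_checksum {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open '$dir': $!";
    my $listing = '';
    for my $entry ( sort grep { $_ ne '.' && $_ ne '..' } readdir $dh ) {
        my ($size, $mtime) = ( stat "$dir/$entry" )[ 7, 9 ];
        $listing .= join( '|', $entry, $size // 0, $mtime // 0 ) . "\n";
    }
    closedir $dh;
    return unpack '%32C*', $listing;
}

print dir_checksum('/directory'), "\n";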
It is true that a user could change a directory in a way that wouldn't be discovered by the checksum. However, a user would have to change the file names like this on purpose since normal changes in file names will (with high probability) give different checksums. Is it then necessary to guard against this "attack"?
One should always consider the consequences of each attack and choose the appropriate tools.
I did one of these in Python, if you're interested:
http://akiscode.com/articles/sha-1directoryhash.shtml