I am running Perl in Windows and I am getting a list of all the files in a directory using readdir and storing the result in an array. The first two elements in the array seem to always be "." and "..". Is this order guaranteed (assuming the operating system does not change)?
I would like to do the following to remove these values:
my $directory = 'C:\\foo\\bar';
opendir my $directory_handle, $directory
or die "Could not open '$directory' for reading: $!\n";
my @files = readdir $directory_handle;
splice ( @files, 0, 2 ); # Remove the "." and ".." elements from the array
But I am worried that it might not be safe to do so. All the solutions I have seen use regular expressions or if statements for each element in the array and I would rather not use either of those approaches if I don't have to. Thoughts?
There is no guarantee on the order of readdir. The docs state it...
Returns the next directory entry for a directory opened by opendir.
The whole thing is stepping through entries in the directory in whatever order they're provided by the filesystem. There is no guarantee what this order may be.
The usual way to work around this is with a regex or string equality.
my @dirs = grep { !/^\.{1,2}\z/ } readdir $dh;
my @dirs = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
Because this is such a common issue, I'd recommend using Path::Tiny->children instead of rolling your own. They'll have figured out the fastest and safest way to do it, which is to use grep to filter out "." and "..". Path::Tiny fixes a lot of things about Perl file and directory handling.
This perlmonks thread from 2001 investigated this very issue, and Perl wizard Randal Schwartz concluded
readdir on Unix returns the underlying raw directory order. Additions and deletions to the directory use and free-up slots. The first two entries to any directory are always created as "dot" and "dotdot", and these entries are never deleted under normal operation.
However, if a directory entry for either of these gets incorrectly deleted (through corruption, or using the perl -U option and letting the superuser unlink it, for example), the next fsck run has to recreate the entry, and it will simply add it. Oops, dot and dotdot are no longer the first two entries!
So, defensive programming mandates that you do not count on the slot order. And there's no promise that dot and dotdot are the first two entries, because Perl can't control that, and the underlying OS doesn't promise it either.
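To illustrate filtering by name rather than by position, here is a minimal sketch that uses only core modules. The scratch directory built with File::Temp and the two file names are assumptions for the demo, not part of the question. File::Spec's no_upwards does the same filtering if you would rather not write the grep yourself.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical demo directory; in real code this would be your own path.
my $directory = tempdir( CLEANUP => 1 );
for my $name (qw(alpha.txt beta.txt)) {
    open my $fh, '>', File::Spec->catfile( $directory, $name )
        or die "Could not create '$name': $!\n";
    close $fh;
}

opendir my $directory_handle, $directory
    or die "Could not open '$directory' for reading: $!\n";
my @entries = readdir $directory_handle;
closedir $directory_handle;

# Filter by name, so the position of "." and ".." never matters.
my @by_grep = grep { $_ ne '.' && $_ ne '..' } @entries;

# Core alternative: File::Spec strips the current/parent directory entries.
my @by_spec = File::Spec->no_upwards(@entries);

print join( "\n", sort @by_grep ), "\n";
```

Either form leaves only the real entries, whatever order the filesystem returned them in.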
File::Find and the wanted subroutine
This question is much simpler than the original title ("prototypes and forward declaration of subroutines"!) lets on. I'm hoping the answer, however simple, will help me understand subroutines/functions, prototypes and scoping and the File::Find module.
With Perl, subroutines can appear pretty much anywhere and you normally don't need to make forward declarations (except if the sub declares a prototype, which I'm not sure how to do in a "standard" way in Perl). For what I usually do with Perl there's little difference between these different ways of running somefunction:
sub somefunction; # Forward declares the function
&somefunction;
somefunction();
somefunction; # Bare word warning under `strict subs`
I often use find2perl to generate code which I crib/hack into parts of scripts. This could well be bad style and now my dirty laundry is public, but so be it :-) For File::Find the wanted function is a required subroutine - find2perl creates it and adds sub wanted; to the resulting script it creates. Sometimes, when I edit the script I'll remove the "sub" from sub wanted and it ends up as &wanted; or wanted();. But without the sub wanted; forward declaration form I get this warning:
Use of uninitialized value $_ in lstat at findscript.pl line 29
My question is: why does this happen and is it a real problem? It is "just a warning", but I want to understand it better.
The documentation and code say $_ is localized inside of sub wanted {}. Why would it be undefined if I use wanted(); instead of sub wanted;?
Is wanted using prototypes somewhere? Am I missing something obvious in File/Find.pm?
Is it because wanted returns a code reference? (???)
My guess is that the forward declaration form "initializes" wanted in some way so that the first use doesn't have an empty default variable. I guess this would be how prototypes - even Perl prototypes, such as they exist - would work as well. I tried grepping through the Perl source code to get a sense of what sub is doing when a function is called using sub function instead of function(), but that may be beyond me at this point.
Any help deepening (and speeding up) my understanding of this is much appreciated.
EDIT: Here's a recent example script here on Stack Overflow that I created using find2perl's output. If you remove the sub from sub wanted; you should get the same error.
EDIT: As I noted in a comment below (but I'll flag it here too): for several months I've been using Path::Iterator::Rule instead of File::Find. It requires perl >5.10, but I never have to deploy production code at sites with odd, "never upgrade", 5.8.* only policies so Path::Iterator::Rule has become one of those modules I never want to do without. Also useful is Path::Class. Cheers.
I'm not a big fan of File::Find. It just doesn't work right. The find command doesn't return a list of files, so you either have to use a non-local array variable in your find to capture your list of files you've found (not good), or place your entire program in your wanted subroutine (even worse). Plus, the separate subroutine means that your logic is separate from your find command. It's just ugly.
What I do is inline my wanted subroutine inside my find command. Subroutine stays with the find. Plus, my non-local array variable is now just part of my find command and doesn't look so bad
Here's how I handle the File::Find -- assuming I want files that have a .pl suffix:
my @file_list;
find ( sub {
    return unless -f; #Must be a file
    return unless /\.pl$/; #Must end with `.pl` suffix
    push @file_list, $File::Find::name;
}, $directory );
# At this point, @file_list contains all of the files I found.
This is exactly the same as:
my @file_list;
find ( \&wanted, $directory );
sub wanted {
    return unless -f;
    return unless /\.pl$/;
    push @file_list, $File::Find::name;
}
# At this point, @file_list contains all of the files I found.
Inlining just looks nicer, and it keeps my code together. Plus, my non-local array variable doesn't look so freaky.
I also like taking advantage of the shorter syntax in this particular way. Normally, I don't like using the inferred $_, but in this case, it makes the code much easier to read. My original wanted is the same as this:
sub wanted {
    my $file_name = $_;
    if ( -f $file_name and $file_name =~ /\.pl$/ ) {
        push @file_list, $File::Find::name;
    }
}
File::Find isn't that tricky to use. You just have to remember:
When you find a file you don't want, you use return to go to the next file.
$_ contains the file name without the directory, and you can use that for testing the file.
The file's full name is $File::Find::name.
The file's directory is $File::Find::dir.
And, the easiest way is to push the files you want into an array, and then use that array later in your program.
Removing the sub from sub wanted; just makes it a call to the wanted function, not a forward declaration.
However, the wanted function hasn't been designed to be called directly from your code - it's been designed to be called by File::Find. File::Find does useful stuff like populating $_ before calling it.
There's no need to forward-declare wanted here, but if you want to remove the forward declaration, remove the whole sub wanted; line - not just the word sub.
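The difference can be seen in a small sketch. The scratch directory and the @seen array are made up for the demonstration; the only point is what $_ holds when wanted is called directly versus when find calls it.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical scratch directory containing a single file.
my $dir = tempdir( CLEANUP => 1 );
open my $fh, '>', "$dir/example.txt" or die "Cannot create file: $!\n";
close $fh;

my @seen;

sub wanted {
    # Record what $_ holds each time the subroutine runs.
    push @seen, defined $_ ? $_ : '(undef)';
}

# A direct call: nothing has set $_ yet, so it is undefined here.
wanted();

# Called by find: File::Find sets $_ before each call.
find( \&wanted, $dir );

print "$_\n" for @seen;
```

The first recorded value is the undefined one from the direct call, which is the situation that triggers the "uninitialized value $_" warning when the body of wanted runs a file test.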
Instead of File::Find, I would recommend using the find_wanted function from File::Find::Wanted.
find_wanted takes two arguments:
a subroutine that returns true for any filename that you would want.
a list of the directories you want to search.
find_wanted returns an array containing the list of filenames that it found.
I used code like the following to find all the JPEG files in certain directories on a computer:
my @files = find_wanted( sub { -f && /\.jpg$/i }, @dirs );
Explanation of some of the syntax, for those that might need it:
sub {...} is an anonymous subroutine, where ... is replaced with the code of the subroutine.
-f checks that a filename refers to a "plain file"
&& is boolean and
/\.jpg$/i is a regular expression that checks that a filename ends in .jpg (case insensitively).
@dirs is an array containing the directory names to be searched. A single directory could be searched as well, in which case a scalar works too (e.g. $dir).
Why not use open and invoke the shell find? The user can edit $findcommand (below) to be anything they want, or can define it in real time based on arguments and options passed to a script.
#!/usr/bin/perl
use strict; use warnings;
my $findcommand='find . -type f -mtime 0';
open(FILELIST,"$findcommand |")||die("can't open $findcommand |");
my @filelist=<FILELIST>;
close FILELIST;
my $Nfilelist = scalar(@filelist);
print "Number of files is $Nfilelist \n";
I am supposed to traverse through a whole tree of folders and rename everything (including folders) to lower case. I looked around quite a bit and saw that the best way was to use File::Find. I tested this code:
#!/usr/bin/perl -w
use File::Find;
use strict;
print "Folder: ";
chomp(my $dir = <STDIN>);
find(\&lowerCase, $dir);
sub lowerCase{
print $_," = ",lc($_),"\n";
rename $_, lc($_);
}
and it seems to work fine. But can anyone tell me if I might run into trouble with this code? I remember posts on how I might run into trouble because of renaming folders before files or something like that.
If you are on Windows, as comments stated, then no, renaming files or folders in any order won't be a problem, because a path DIR1/file1 is the same as dir1/file1 to Windows.
It MAY be a problem on Unix though, in which case you are better off doing a recursive BFS by hand.
Also, when doing system calls like rename, ALWAYS check result:
rename($from, $to) || die "Error renaming $from to $to: $!";
As noted in comments, take care when renaming "ABC" to "abc". On Windows it is not a problem.
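A minimal sketch of a checked, contents-first rename follows, assuming a case-sensitive filesystem. It uses File::Find's finddepth, which calls the callback for a directory's contents before the directory itself. The lowercase_tree name and the skip-on-collision policy are my own choices for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: lower-case every name below $top (not $top itself).
sub lowercase_tree {
    my ($top) = @_;
    finddepth(
        sub {
            return if $_ eq '.' or $_ eq '..';
            my $lower = lc $_;
            return if $lower eq $_;    # already lower case

            # Do not clobber an existing entry that has the target name.
            if ( -e $lower ) {
                warn "Skipping $File::Find::name: '$lower' already exists\n";
                return;
            }
            rename $_, $lower
                or die "Error renaming $File::Find::name to $lower: $!\n";
        },
        $top
    );
    return;
}
```

Call it as lowercase_tree($dir). On a case-insensitive filesystem the -e test matches the file itself, so this sketch would skip every entry there and would need a different collision check.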
Personally, I prefer to:
List files to be renamed using find dir/ > 2b_renamed
Review the list manually, using an editor of choice (vim 2b_renamed, in my case)
Use the rename from CPAN on that list: xargs rename 'y/A-Z/a-z/' < 2b_renamed
That manual review is very important to me, even when I can easily rollback changes (via git or even Time Machine).
I would like to find a file pattern on a directory pattern in Perl that will return many thousands of entries, like this:
find ~/mydir/*/??/???/???? -name "*.$refinfilebase.search" -print
I've been told there are different ways to handle it? I.e.:
File::Find
glob()
opendir, readdir, grep
Diamond operator, e.g.: my @files = <$refinfilebase.search>
Which one would be most adequate to be able to run the script on older versions of Perl or minimal installations of Perl?
For very large directories, opendir() is probably safest, as it doesn't need to read everything in at once or do any filtering on it. It can also be faster because it returns entries in no particular order, whereas sorting a very large directory can be a performance hit on some operating systems. opendir is also built in on all systems.
Note that the actual way it behaves may differ between platforms, so you need to be careful when coding with it. This mainly affects what it returns for things like the parent and current directory, which you may need to treat specially.
glob() is more useful when you only want some files, matching by a pattern. File::Find is more useful when recursing through a set of nested directories. If you don't need either, opendir() is a good base.
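As a rough sketch of the opendir style, the loop below reads one entry at a time instead of building a list, and treats the current and parent directory entries specially. The count_matching name and its arguments are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count entries in $dir whose names end in $suffix.
sub count_matching {
    my ( $dir, $suffix ) = @_;
    opendir my $dh, $dir or die "Could not open '$dir': $!\n";
    my $count = 0;

    # Explicit defined() so an entry named "0" does not end the loop early.
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry eq '.' or $entry eq '..';
        $count++ if $entry =~ /\Q$suffix\E\z/;
    }
    closedir $dh;
    return $count;
}
```

Because nothing here depends on non-core modules, the same code should run on minimal installations.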
Also you have DirHandle
DirHandle:
use DirHandle;
$d = new DirHandle ".";
if (defined $d) {
while (defined($_ = $d->read)) { something($_); }
$d->rewind;
while (defined($_ = $d->read)) { something_else($_); }
undef $d;
}
For use cases of readdir and glob see
What reasons are there to prefer glob over readdir (or vice-versa) in Perl?
I prefer to use glob to quickly grab a list of files in a directory (no subdirectories) and process them like this:
map { process_bam($_) } glob("bam_files/*.bam");
This is more convenient because it does not return . and .. even if you ask for (*), and it also returns each name with its directory prefix if you use a directory in the glob pattern.
Also you can use glob quickly as a oneliner piped to xargs or in a bash for loop when you need to preprocess the filenames of the list:
perl -lE 'print join("\n", map {s/srf\/(.+).srf/$1/;$_} glob("srf/198*.srf"))' | xargs -n 1.....
readdir has advantages in other scenarios, so you need to use the one that fits your task better.
This question is a spin-off from this one. Some history: when I first learned Perl, I pretty much always used glob rather than opendir + readdir because I found it easier. Then later various posts and readings suggested that glob was bad, and so now I pretty much always use readdir.
After thinking over this recent question I realized that my reasons for one or the other choice may be bunk. So, I'm going to lay out some pros and cons, and I'm hoping that more experienced Perl folks can chime in and clarify. The question in a nutshell is are there compelling reasons to prefer glob to readdir or readdir to glob (in some or all cases)?
glob pros:
No dotfiles (unless you ask for them)
Order of items is guaranteed
No need to prepend the directory name onto items manually
Better name (c'mon - glob versus readdir is no contest if we're judging by names alone)
(From ysth's answer; cf. glob cons 4 below) Can return non-existent filenames:
@deck = glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
glob cons:
Older versions are just plain broken (but 'older' means pre 5.6, I think, and frankly if you're using pre 5.6 Perl, you have bigger problems)
Calls stat each time (i.e., useless use of stat in most cases).
Problems with spaces in directory names (is this still true?)
(From brian's answer) Can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
readdir pros:
(From brian's answer) opendir returns a filehandle which you can pass around in your program (and reuse), but glob simply returns a list
(From brian's answer) readdir is a proper iterator and provides functions to rewinddir, seekdir, telldir
Faster? (Pure guess based on some of glob's features from above. I'm not really worried about this level of optimization anyhow, but it's a theoretical pro.)
Less prone to edge-case bugs than glob?
Reads everything (dotfiles too) by default (this is also a con)
May convince you not to name a file 0 (a con also - see Brad's answer)
Anyone? Bueller? Bueller?
readdir cons:
If you don't remember to prepend the directory name, you will get bit when you try to do filetests or copy items or edit items or...
If you don't remember to grep out the . and .. items, you will get bit when you count items, or try to walk recursively down the file tree or...
Did I mention prepending the directory name? (A sidenote, but my very first post to the Perl Beginners mail list was the classic, "Why does this code involving filetests not work some of the time?" problem related to this gotcha. Apparently, I'm still bitter.)
Items are returned in no particular order. This means you will often have to remember to sort them in some manner. (This could be a pro if it means more speed, and if it means that you actually think about how and if you need to sort items.) Edit: Horrifically small sample, but on a Mac readdir returns items in alphabetical order, case insensitive. On a Debian box and an OpenBSD server, the order is utterly random. I tested the Mac with Apple's built-in Perl (5.8.8) and my own compiled 5.10.1. The Debian box is 5.10.0, as is the OpenBSD machine. I wonder if this is a filesystem issue, rather than Perl?
Reads everything (dotfiles too) by default (this is also a pro)
Doesn't necessarily deal well with a file named 0 (see pros also - see Brad's answer)
You missed the most important, biggest difference between them: glob gives you back a list, but opendir gives you a directory handle. You can pass that directory handle around to let other objects or subroutines use it. With the directory handle, the subroutine or object doesn't have to know anything about where it came from, who else is using it, and so on:
sub use_any_dir_handle {
    my( $dh ) = @_;
    rewinddir $dh;
    # ...do some filtering, collecting names into @files...
    return \@files;
}
With the dirhandle, you have a controllable iterator where you can move around with seekdir, whereas with glob you just get the next item.
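A runnable sketch of this idea follows. The subroutine only receives the handle and knows nothing about where it came from. The name names_from_handle is made up, and the filtering step is just the usual removal of . and .. entries.

```perl
use strict;
use warnings;

# Hypothetical helper: works with any directory handle it is given.
sub names_from_handle {
    my ($dh) = @_;
    rewinddir $dh;    # start from the beginning, whoever used it before
    my @files = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    return \@files;
}

opendir my $dh, '.' or die "Could not open current directory: $!\n";
my $first  = names_from_handle($dh);
my $second = names_from_handle($dh);    # same handle, reused
closedir $dh;

printf "%d entries both times\n", scalar @$first
    if @$first == @$second;
```

The second call works only because of the rewinddir; without it the handle would already be at the end and the list would come back empty.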
As with anything though, the costs and benefits only make sense when applied to a certain context. They do not exist outside of a particular use. You have an excellent list of their differences, but I wouldn't classify those differences without knowing what you were trying to do with them.
Some other things to remember:
You can implement your own glob with opendir, but not the other way around.
glob uses its own wildcard syntax, and that's all you get.
glob can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
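With comma-separated alternatives the effect is easier to see. In this small sketch none of the four names needs to exist on disk; glob simply expands the braces.

```perl
use strict;
use warnings;

# Brace expansion produces every combination, whether or not the files exist.
my @names = glob "{a,b}{c,d}";
print "@names\n";    # ac ad bc bd
```

Whether this counts as a pro or a con depends on whether you wanted filenames or just a convenient list generator.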
glob pros: Can return 'filenames' that don't exist:
my @deck = List::Util::shuffle glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
while (my @hand = splice @deck,0,13) {
    say join ",", @hand;
}
__END__
6♥,8♠,7♠,Q♠,K♣,Q♦,A♣,3♦,6♦,5♥,10♣,Q♣,2♠
2♥,2♣,K♥,A♥,8♦,6♠,8♣,10♠,10♥,5♣,3♥,Q♥,K♦
5♠,5♦,J♣,J♥,J♦,9♠,2♦,8♥,9♣,4♥,10♦,6♣,3♠
3♣,A♦,K♠,4♦,7♣,4♣,A♠,4♠,7♥,J♠,9♥,7♦,9♦
glob makes it convenient to read all the subdirectories of a given fixed depth, as in glob "*/*/*". I've found this handy on several occasions.
Here is a disadvantage for opendir and readdir.
{
open my $file, '>', 0;
print {$file} 'Breaks while( readdir ){ ... }'
}
opendir my $dir, '.';
my $a = 0;
++$a for readdir $dir;
print $a, "\n";
rewinddir $dir;
my $b = 0;
++$b while readdir $dir;
print $b, "\n";
You would expect that code would print the same number twice, but it doesn't because there is a file with the name of 0. On my computer it prints 251, and 188, tested with Perl v5.10.0 and v5.10.1
This problem also makes it so that this just prints out a bunch of empty lines, regardless of the existence of file 0:
use 5.10.0;
opendir my $dir, '.';
say while readdir $dir;
Whereas this always works just fine:
use 5.10.0;
my $a = 0;
++$a for glob '*';
say $a;
my $b = 0;
++$b while glob '*';
say $b;
say for glob '*';
say while glob '*';
I fixed these issues, and sent in a patch which made it into Perl v5.11.2, so this will work properly with Perl v5.12.0 when it comes out.
My fix converts this:
while( readdir $dir ){ ... }
into this:
while( defined( $_ = readdir $dir ) ){ ... }
Which makes it work the same way that read has worked on files. Actually it is the same bit of code, I just added another element to the corresponding if statements.
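On Perl versions before that fix, or simply to be explicit, the defined test can be written out by hand. A small sketch, using a scratch directory (an assumption for the demo) that contains a file named 0:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory with a file literally named "0".
my $scratch = tempdir( CLEANUP => 1 );
for my $name (qw(0 one two)) {
    open my $fh, '>', "$scratch/$name" or die "Cannot create '$name': $!\n";
    close $fh;
}

opendir my $dir, $scratch or die "Cannot open '$scratch': $!\n";
my $count = 0;

# defined() keeps the loop going even when readdir returns "0".
while ( defined( my $entry = readdir $dir ) ) {
    next if $entry eq '.' or $entry eq '..';
    ++$count;
}
closedir $dir;

print "$count\n";    # 3
```

Assigning to a named lexical instead of $_ also avoids clobbering whatever the caller had in $_.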
Well, you pretty much cover it. All that taken into account, I would tend to use glob when I'm throwing together a quick one-off script and its behavior is just what I want, and use opendir and readdir in ongoing production code or libraries where I can take my time and clearer, cleaner code is helpful.
That was a pretty comprehensive list. readdir (and readdir + grep) has less overhead than glob and so that is a plus for readdir if you need to analyze lots and lots of directories.
For small, simple things, I prefer glob. Just the other day, I used it and a twenty line perl script to retag a large portion of my music library. glob, however, has a pretty strange name. Glob? It's not intuitive at all, as far as a name goes.
My biggest hangup with readdir is that it treats a directory in a way that's somewhat odd to most people. Usually, programmers don't think of a directory as a stream, they think of it as a resource, or list, which glob provides. The name is better, the functionality is better, but the interface still leaves something to be desired.
glob pros:
3) No need to prepend the directory name onto items manually
Exception:
say for glob "*";
--output:--
1perl.pl
2perl.pl
2perl.pl.bak
3perl.pl
3perl.pl.bak
4perl.pl
data.txt
data1.txt
data2.txt
data2.txt.out
As far as I can tell, the rule for glob is: you must provide a full path to the directory to get full paths back. The Perl docs do not seem to mention that, and neither do any of the posts here.
That means that glob can be used in place of readdir when you want just filenames (rather than full paths), and you don't want hidden files returned, i.e. ones starting with '.'. For example,
chdir ("../..");
say for glob("*");
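To make the comparison concrete, here is a sketch that lists the same scratch directory both ways. The directory and file names are invented for the demo, and the directory path is assumed to contain no spaces, backslashes, or wildcard characters, all of which glob would interpret.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(data.txt .hidden)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$name': $!\n";
    close $fh;
}

# glob: the directory part of the pattern is kept, dotfiles are skipped.
my @globbed = glob "$dir/*";

# readdir: bare names only, dotfiles included.
opendir my $dh, $dir or die "Cannot open '$dir': $!\n";
my @read = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

print "glob:    @globbed\n";
print "readdir: @{[ sort @read ]}\n";
```

The glob result carries the directory prefix because the pattern did; a bare "*" pattern would return bare names relative to the current directory.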
On a similar note, File::Slurp has a function called read_dir.
Since I use File::Slurp's other functions a lot in my scripts, read_dir has also become a habit.
It also has the following options: err_mode, prefix, and keep_dot_dot.
First, do some reading. Chapter 9.6. of the Perl Cookbook outlines the point I want to get to nicely, just under the discussion heading.
Secondly, do a search for glob and dosglob in your Perl directory. While many different sources (ways to get the file list) can be used, the reason why I point you to dosglob is that if you happen to be on a Windows platform (and using the dosglob solution), it is actually using opendir/readdir/closedir. Other versions use built-in shell commands or precompiled OS specific executables.
If you know you are targeting a specific platform, you can use this information to your advantage. Just for reference, I looked into this on Strawberry Perl Portable edition 5.12.2, so things may be slightly different on newer or original versions of Perl.