Perl's 'readdir' Function Result Order? - perl

I am running Perl in Windows and I am getting a list of all the files in a directory using readdir and storing the result in an array. The first two elements in the array seem to always be "." and "..". Is this order guaranteed (assuming the operating system does not change)?
I would like to do the following to remove these values:
my $directory = 'C:\\foo\\bar';
opendir my $directory_handle, $directory
or die "Could not open '$directory' for reading: $!\n";
my @files = readdir $directory_handle;
splice @files, 0, 2; # Remove the "." and ".." elements from the array
But I am worried that it might not be safe to do so. All the solutions I have seen use regular expressions or if statements for each element in the array and I would rather not use either of those approaches if I don't have to. Thoughts?

There is no guarantee on the order of readdir. The docs state:
Returns the next directory entry for a directory opened by opendir.
The whole thing is stepping through entries in the directory in whatever order they're provided by the filesystem. There is no guarantee what this order may be.
The usual way to work around this is with a regex or string equality.
my @dirs = grep { !/^\.{1,2}\z/ } readdir $dh;
my @dirs = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
Because this is such a common issue, I'd recommend using Path::Tiny->children instead of rolling your own. They'll have figured out the fastest and safest way to do it, which is to use grep to filter out . and ... Path::Tiny fixes a lot of things about Perl file and directory handling.
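A minimal sketch of the Path::Tiny approach, assuming the module is installed from CPAN (the C:/foo/bar path is just the asker's example directory; forward slashes work fine on Windows):

```perl
use strict;
use warnings;
use Path::Tiny;

# children() returns Path::Tiny objects for every entry in the
# directory, with "." and ".." already filtered out.
my @entries = path('C:/foo/bar')->children;
print "$_\n" for @entries;
```

Each element stringifies to a full path, so there is no need to prepend the directory name before doing file tests on it.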

This perlmonks thread from 2001 investigated this very issue, and Perl wizard Randal Schwartz concluded
readdir on Unix returns the underlying raw directory order. Additions and deletions to the directory use and free-up slots. The first two entries to any directory are always created as "dot" and "dotdot", and these entries are never deleted under normal operation.
However, if a directory entry for either of these gets incorrectly deleted (through corruption, or using the perl -U option and letting the superuser unlink it, for example), the next fsck run has to recreate the entry, and it will simply add it. Oops, dot and dotdot are no longer the first two entries!
So, defensive programming mandates that you do not count on the slot order. And there's no promise that dot and dotdot are the first two entries, because Perl can't control that, and the underlying OS doesn't promise it either.

Related

perl glob < > (star) operator returning unexpected results?

I am using the
my $file = <*.ext>;
function within Perl to test if a file exists (in this case, I need to know if there is a file with the .ext in the current working directory, otherwise I do not proceed) and throw an error if it doesn't. Such as:
my $file = <*.ext>;
if (-e $file) {
# we are good
}
else {
# file not found
}
As you can see I am bringing the input of the <*.ext> into a scalar $variable, not an @array. This is probably not a good idea, but it's been working for me up until now, and I have spent a while figuring out where my code was failing... and this seems to be it.
It seems that when switching directories (I am using chdir on a Windows machine), the current working directory changes properly, but the glob operator's results are unreliable: it looks in previous directories, or retains past values.
I've been able to fix getting this to work by doing
my @file = <*.ext>;
if (-e $file[0]) {
}
and I'm wondering if anybody can explain this, because I've been unable to find where the return value of the glob operator is defined as an array or a scalar (if there is only one file, etc.)
Just trying to learn here to avoid future bugs. This was a requirement that it be in Perl on Windows and not something I regularly have to do, so my experience in this case is very thin. (More of a Python/C++ guy.)
Thanks.
perldoc -f glob explains this behavior
In list context, returns a (possibly empty) list of filename expansions on the value of EXPR such as the standard Unix shell /bin/csh would do. In scalar context, glob iterates through such filename expansions, returning undef when the list is exhausted.
So you are using the iterator version, which should be used with while to loop over it (all the way until it is exhausted).
Since you clearly want to get only the first value, you can use list context:
my ($file) = <*.ext>;
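A short sketch contrasting the two contexts (run in whatever directory holds your .ext files; the filenames here are illustrative):

```perl
use strict;
use warnings;

# Scalar context: glob is an iterator. Each call at this spot returns
# the next match, then undef once the list is exhausted -- the iterator
# state persists across calls, which is what caused the surprise above.
while (defined(my $file = glob('*.ext'))) {
    print "found: $file\n";
}

# List context: the whole expansion at once. A one-element list
# assignment grabs just the first match and avoids the iterator.
my ($first) = glob('*.ext');
print defined $first ? "first: $first\n" : "no .ext files\n";
```

The parentheses around `my ($first)` are what force list context; `my $first = glob('*.ext')` would be the iterator form again.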
mpapec has already covered the perldoc glob documentation concerning the behavior in scalar context.
However, I'd just like to add that you can simplify your logic by putting the glob directly in the if
if (<*.ext>) {
# we are good
}
else {
# no files found
}
The testing of -e is superfluous as a file wouldn't have been returned by the glob if it didn't exist. Additionally, if you want to actually perform an operation on the found files, you can capture them inside the if
if (my @files = <*.ext>) {

Does readdir() return its results in any particular order?

I have a directory that I am currently processing with a while loop. The file names are actually dates, as such:
updates.20130831.1445.hr
updates.20140831.1445.hr
updates.20150831.1445.hr
updates.20160831.1445.hr
updates.20170831.1445.hr
However, I noticed from my print statement that the while loop doesn't start from the first file; rather, it picks the second or the fifth, etc.
If while loops are sequential, why is the first file not processed first, then the second, third, etc.? Am I missing something?
my while loop is as such :
opendir(DIRECTORY, $Dir) or die $!;
while (my $file = readdir(DIRECTORY)) {
print "$file\n";
open (IN, $file) or die "error reading file: $file\n";
The readdir builtin does not guarantee any order, so you would have to sort the names manually:
for my $file (sort readdir DIRECTORY) { ... }
Note that readdir does not output a complete path, so you have to prepend the directory name if you wish to open that file:
my $path = "$dir/$file";
Note that entries may be all kinds of file system objects including files, directories, symlinks, and named pipes. You should skip those that aren't files before trying to open them:
next if not -f $path;
In case you want to process directories, consider that the readdir output includes the . and .. directories (current and parent dir), which you should always filter out:
next if $file eq "." or $file eq "..";
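Putting those pieces together, a sketch (the $Dir variable comes from the question; '.' is used here only for illustration):

```perl
use strict;
use warnings;

my $Dir = '.';    # the question's directory variable
opendir my $dh, $Dir or die "Cannot open '$Dir': $!";
for my $file (sort readdir $dh) {
    next if $file eq '.' or $file eq '..';   # skip the dot entries
    my $path = "$Dir/$file";                 # readdir gives bare names
    next unless -f $path;                    # plain files only
    open my $in, '<', $path or die "error reading '$path': $!";
    print "$file\n";
    close $in;
}
closedir $dh;
```

Since the question's filenames embed dates as zero-padded YYYYMMDD.HHMM, a plain lexical sort also happens to be chronological.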
If you want them sorted - then sort them.
opendir etc fetches the files in the order that they are stored on disk. Not alphabetic. It does not guarantee order
It doesn't really make sense to ask whether while is sequential. A while loop (in any C-inspired language) just repeatedly executes a block until a particular condition tests false.
If I'm hammering a nail into a block of wood, it could be described as "while the nail protrudes, hit it with the hammer". It doesn't really make sense to ask if my hammer hits were in the correct order.
The constructs that it makes sense to ask whether they are sequential are things that operate on an ordered data structure, such as an array or list. For example, grep, map, and foreach. These all operate sequentially in Perl.
What it also makes sense to ask is "does readdir() return its results in any particular order?" The answer to this question is that it does, but it's not a particularly useful order - certainly not alphabetic/lexicographic order. If you want your files listed in a particular order, you should slurp readdir() into an array and then sort that array.

finding many thousands of files in a directory pattern in Perl

I would like to find a file pattern on a directory pattern in Perl that will return many thousands of entries, like this:
find ~/mydir/*/??/???/???? -name "\*.$refinfilebase.search" -print
I've been told there are different ways to handle it? I.e.:
File::Find
glob()
opendir, readdir, grep
Diamond operator, e.g.: my @files = <$refinfilebase.search>
Which one would be most adequate to be able to run the script on older versions of Perl or minimal installations of Perl?
For very large directories, opendir() is probably safest, as it doesn't need to read everything in or do any filtering on it. It can be faster because it doesn't sort the entries, and on very large directories on some operating systems that sorting can be a performance hit. opendir is also built in on all systems.
Note that its actual behavior may differ between platforms, so you need to be careful when coding with it. This mainly affects what it returns for things like the parent and current directories, which you may need to treat specially.
glob() is more useful when you only want some files, matching by a pattern. File::Find is more useful when recursing through a set of nested directories. If you don't need either, opendir() is a good base.
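For the recursive case in the question, a core-module sketch with File::Find (the $refinfilebase name is taken from the question; the 'example' value and the ~/mydir root are placeholders for the asker's actual setup):

```perl
use strict;
use warnings;
use File::Find;

my $refinfilebase = 'example';          # placeholder value
my $root = "$ENV{HOME}/mydir";          # the question's search root

my @found;
find(sub {
    # Inside the callback, $_ is the entry's basename and
    # $File::Find::name is its path relative to $root.
    push @found, $File::Find::name
        if -f $_ && /\.\Q$refinfilebase\E\.search\z/;
}, $root);

print "$_\n" for @found;
```

Unlike the shell find pattern, this walks every depth under $root; if the fixed */??/???/???? depth matters, you would add a check on the number of path components in the callback.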
Also you have DirHandle
DirHandle:
use DirHandle;
my $d = DirHandle->new(".");
if (defined $d) {
while (defined($_ = $d->read)) { something($_); }
$d->rewind;
while (defined($_ = $d->read)) { something_else($_); }
undef $d;
}
For use cases of readdir and glob see
What reasons are there to prefer glob over readdir (or vice-versa) in Perl?
I prefer to use glob to quickly grab a list of files in a dir (no subdirs) and process them like
map { process_bam($_) } glob('bam_files/*.bam');
This is more convenient because it does not return . and .. even if you ask for *, and it also returns the full path if you use a dir in the glob pattern.
Also you can use glob quickly as a oneliner piped to xargs or in a bash for loop when you need to preprocess the filenames of the list:
perl -lE 'print join("\n", map {s/srf\/(.+).srf/$1/;$_} glob("srf/198*.srf"))' | xargs -n 1.....
readdir has advantages in other scenarios, so use the one that fits your needs better.

Should Perl's opendir always return . and .. first?

opendir MYDIR, "$dir";
my @FILES = readdir MYDIR;
closedir MYDIR;
It appears that 99.9% of the time the first two entries in the array are always "." and "..". Later logic in the script has issues if that is not true, and I ran into a case where those entries appeared later in the list. Is this indicative of the file system being corrupt, or something else? Is there a known order to what opendir returns?
It's always the operating-system order, presented unsorted raw.
While . and .. are very often the first two entries, that's because they were the first two entries created. If for some reason, one of them were deleted (via unnatural sequences, since it's normally prevented), the next fsck (or equivalent) would fix the directory to have both again. This would place one of the names at a later place in the list.
Hence, do not just "skip the first two entries". Instead, match them explicitly to reject them.
The order is down to the OS and is explicitly not otherwise defined.
They're easy enough to filter out.
opendir MYDIR, "$dir";
my @FILES = grep !/^\.\.?$/, readdir MYDIR;
closedir MYDIR;
Use File::Slurp::read_dir which, by default, returns a list that does not include . and ...
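A brief sketch, assuming File::Slurp is installed from CPAN (prefix => 1 is an option from its documentation; '.' stands in for your directory):

```perl
use strict;
use warnings;
use File::Slurp qw(read_dir);

my $dir = '.';

# read_dir omits "." and ".." by default.
my @names = read_dir($dir);

# With prefix => 1 each returned name has the directory prepended,
# which sidesteps the classic "forgot to prepend $dir" bug.
my @paths = read_dir($dir, prefix => 1);

print "$_\n" for @paths;
```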

What reasons are there to prefer glob over readdir (or vice-versa) in Perl?

This question is a spin-off from this one. Some history: when I first learned Perl, I pretty much always used glob rather than opendir + readdir because I found it easier. Then later various posts and readings suggested that glob was bad, and so now I pretty much always use readdir.
After thinking over this recent question I realized that my reasons for one or the other choice may be bunk. So, I'm going to lay out some pros and cons, and I'm hoping that more experienced Perl folks can chime in and clarify. The question in a nutshell is are there compelling reasons to prefer glob to readdir or readdir to glob (in some or all cases)?
glob pros:
No dotfiles (unless you ask for them)
Order of items is guaranteed
No need to prepend the directory name onto items manually
Better name (c'mon - glob versus readdir is no contest if we're judging by names alone)
(From ysth's answer; cf. glob cons 4 below) Can return non-existent filenames:
@deck = glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
glob cons:
Older versions are just plain broken (but 'older' means pre 5.6, I think, and frankly if you're using pre 5.6 Perl, you have bigger problems)
Calls stat each time (i.e., useless use of stat in most cases).
Problems with spaces in directory names (is this still true?)
(From brian's answer) Can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
readdir pros:
(From brian's answer) opendir returns a filehandle which you can pass around in your program (and reuse), but glob simply returns a list
(From brian's answer) readdir is a proper iterator and provides functions to rewinddir, seekdir, telldir
Faster? (Pure guess based on some of glob's features from above. I'm not really worried about this level of optimization anyhow, but it's a theoretical pro.)
Less prone to edge-case bugs than glob?
Reads everything (dotfiles too) by default (this is also a con)
May convince you not to name a file 0 (a con also - see Brad's answer)
Anyone? Bueller? Bueller?
readdir cons:
If you don't remember to prepend the directory name, you will get bit when you try to do filetests or copy items or edit items or...
If you don't remember to grep out the . and .. items, you will get bit when you count items, or try to walk recursively down the file tree or...
Did I mention prepending the directory name? (A sidenote, but my very first post to the Perl Beginners mail list was the classic, "Why does this code involving filetests not work some of the time?" problem related to this gotcha. Apparently, I'm still bitter.)
Items are returned in no particular order. This means you will often have to remember to sort them in some manner. (This could be a pro if it means more speed, and if it means that you actually think about how and if you need to sort items.) Edit: Horrifically small sample, but on a Mac readdir returns items in alphabetical order, case insensitive. On a Debian box and an OpenBSD server, the order is utterly random. I tested the Mac with Apple's built-in Perl (5.8.8) and my own compiled 5.10.1. The Debian box is 5.10.0, as is the OpenBSD machine. I wonder if this is a filesystem issue, rather than Perl?
Reads everything (dotfiles too) by default (this is also a pro)
Doesn't necessarily deal well with a file named 0 (see pros also - see Brad's answer)
You missed the most important, biggest difference between them: glob gives you back a list, but opendir gives you a directory handle. You can pass that directory handle around to let other objects or subroutines use it. With the directory handle, the subroutine or object doesn't have to know anything about where it came from, who else is using it, and so on:
sub use_any_dir_handle {
my( $dh ) = @_;
rewinddir $dh;
...do some filtering...
return \@files;
}
With the dirhandle, you have a controllable iterator where you can move around with seekdir, although with glob you just get the next item.
As with anything though, the costs and benefits only make sense when applied to a certain context. They do not exist outside of a particular use. You have an excellent list of their differences, but I wouldn't classify those differences without knowing what you were trying to do with them.
Some other things to remember:
You can implement your own glob with opendir, but not the other way around.
glob uses its own wildcard syntax, and that's all you get.
glob can return filenames that don't exist:
$ perl -le 'print glob "{ab}{cd}"'
glob pros: Can return 'filenames' that don't exist:
my @deck = List::Util::shuffle glob "{A,K,Q,J,10,9,8,7,6,5,4,3,2}{\x{2660},\x{2665},\x{2666},\x{2663}}";
while (my @hand = splice @deck, 0, 13) {
say join ",", @hand;
}
__END__
6♥,8♠,7♠,Q♠,K♣,Q♦,A♣,3♦,6♦,5♥,10♣,Q♣,2♠
2♥,2♣,K♥,A♥,8♦,6♠,8♣,10♠,10♥,5♣,3♥,Q♥,K♦
5♠,5♦,J♣,J♥,J♦,9♠,2♦,8♥,9♣,4♥,10♦,6♣,3♠
3♣,A♦,K♠,4♦,7♣,4♣,A♠,4♠,7♥,J♠,9♥,7♦,9♦
glob makes it convenient to read all the subdirectories of a given fixed depth, as in glob "*/*/*". I have found this handy on several occasions.
Here is a disadvantage for opendir and readdir.
{
open my $file, '>', 0;
print {$file} 'Breaks while( readdir ){ ... }'
}
opendir my $dir, '.';
my $a = 0;
++$a for readdir $dir;
print $a, "\n";
rewinddir $dir;
my $b = 0;
++$b while readdir $dir;
print $b, "\n";
You would expect that code to print the same number twice, but it doesn't, because there is a file with the name 0. On my computer it prints 251 and 188, tested with Perl v5.10.0 and v5.10.1.
This problem also makes it so that this just prints out a bunch of empty lines, regardless of the existence of file 0:
use 5.10.0;
opendir my $dir, '.';
say while readdir $dir;
Whereas this always works just fine:
use 5.10.0;
my $a = 0;
++$a for glob '*';
say $a;
my $b = 0;
++$b while glob '*';
say $b;
say for glob '*';
say while glob '*';
I fixed these issues, and sent in a patch which made it into Perl v5.11.2, so this will work properly with Perl v5.12.0 when it comes out.
My fix converts this:
while( readdir $dir ){ ... }
into this:
while( defined( $_ = readdir $dir ) ){ ... }
This makes readdir work the same way that readline (<$fh>) has worked on filehandles. Actually it is the same bit of code; I just added another element to the corresponding if statements.
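On Perls before v5.12 (or anywhere you don't want to rely on the implicit behavior), the defensive spelling is to add the defined test yourself; a minimal core-only sketch:

```perl
use strict;
use warnings;

opendir my $dh, '.' or die "opendir: $!";

# A file literally named "0" is false in boolean context, so on older
# Perls a bare  while (my $f = readdir $dh)  stops early at that
# entry. The explicit defined() guard always works.
while (defined(my $file = readdir $dh)) {
    print "$file\n";
}

closedir $dh;
```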
Well, you pretty much cover it. All that taken into account, I would tend to use glob when I'm throwing together a quick one-off script and its behavior is just what I want, and use opendir and readdir in ongoing production code or libraries where I can take my time and clearer, cleaner code is helpful.
That was a pretty comprehensive list. readdir (and readdir + grep) has less overhead than glob and so that is a plus for readdir if you need to analyze lots and lots of directories.
For small, simple things, I prefer glob. Just the other day, I used it and a twenty line perl script to retag a large portion of my music library. glob, however, has a pretty strange name. Glob? It's not intuitive at all, as far as a name goes.
My biggest hangup with readdir is that it treats a directory in a way that's somewhat odd to most people. Usually, programmers don't think of a directory as a stream, they think of it as a resource, or list, which glob provides. The name is better, the functionality is better, but the interface still leaves something to be desired.
glob pros:
3) No need to prepend the directory name onto items manually
Exception:
say for glob "*";
--output:--
1perl.pl
2perl.pl
2perl.pl.bak
3perl.pl
3perl.pl.bak
4perl.pl
data.txt
data1.txt
data2.txt
data2.txt.out
As far as I can tell, the rule for glob is: you must provide a full path to the directory to get full paths back. The Perl docs do not seem to mention that, and neither do any of the posts here.
That means that glob can be used in place of readdir when you want just filenames (rather than full paths), and you don't want hidden files returned, i.e. ones starting with '.'. For example,
chdir ("../..");
say for glob("*");
On a similar note, File::Slurp has a function called read_dir.
Since I use File::Slurp's other functions a lot in my scripts, read_dir has also become a habit.
It also has following options: err_mode, prefix, and keep_dot_dot.
First, do some reading. Chapter 9.6. of the Perl Cookbook outlines the point I want to get to nicely, just under the discussion heading.
Secondly, do a search for glob and dosglob in your Perl directory. While many different sources (ways to get the file list) can be used, the reason why I point you to dosglob is that if you happen to be on a Windows platform (and using the dosglob solution), it is actually using opendir/readdir/closedir. Other versions use built-in shell commands or precompiled OS specific executables.
If you know you are targeting a specific platform, you can use this information to your advantage. Just for reference, I looked into this on Strawberry Perl Portable edition 5.12.2, so things may be slightly different on newer or original versions of Perl.