Perl glob returning a false positive

What seemed like a straightforward piece of code most certainly didn't do what I wanted it to do.
Can somebody explain to me what it does do and why?
my $dir = './some/directory';
if ( -d $dir && <$dir/*> ) {
print "Dir exists and has non-hidden files in it\n";
}
else {
print "Dir either does not exist or has no non-hidden files in it\n";
}
In my test case, the directory did exist and it was empty. However, the then (first) section of the if triggered instead of the else section as expected.
I don't need anybody to suggest how to accomplish what I want to accomplish. I just want to understand Perl's interpretation of this code, which definitely does not match mine.

Using glob (aka <filepattern>) in scalar context makes it an iterator: it returns one file each time it is called, and it will not respond to changes in the pattern (e.g. a different $dir) until it has finished iterating over the initial results. I suspect this is what is causing the trouble you see.
The easy answer is to always use it in list context, like so:
if( -d $dir && ( () = <$dir/*> ) ) {
glob can only really be used safely in scalar context, in code that will execute more than once, if you are absolutely sure you will exhaust the iterator before you start a new iteration. Most of the time it's just easier to avoid glob in scalar context altogether.
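For comparison, here is a minimal sketch (reusing the hypothetical ./some/directory from the question) that does the whole check in list context, so no iterator state is left lying around:
my $dir = './some/directory';
my @entries = glob "$dir/*";    # list context: all matches returned at once
if ( -d $dir && @entries ) {
    print "Dir exists and has non-hidden files in it\n";
}
else {
    print "Dir either does not exist or has no non-hidden files in it\n";
}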

I believe that @ysth is on the right track, but repeated calls to glob in scalar context don't generate false positives.
For example
use strict;
use warnings;
use 5.010;
say scalar glob('/usr/*');
say scalar glob('/usr/*');
output
/usr/bin
/usr/bin
But what is true is that any single call to glob maintains a state, so if I have
use strict;
use warnings;
use 5.010;
for my $dir ( '/sys', '/usr', '/sys', '/usr' ) {
say scalar glob("$dir/*"), "\n";
}
output
/sys/block
/sys/bus
/sys/class
/sys/dev
So clearly that glob statement inside the loop is maintaining a state, and ignoring the changes to $dir.
This is similar to the way that pos (and the corresponding \G regex anchor) keeps separate state per scalar variable, and how print without an explicit file handle prints to the last selected handle. In the end it is how all of Perl works, with the "it" variable $_ being the ultimate example.
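As a minimal sketch of the workaround (the same directories as above; output will depend on your system): forcing list context drains the glob completely on every pass, so each new $dir really does get a fresh expansion:
use strict;
use warnings;
use 5.010;

for my $dir ( '/sys', '/usr', '/sys', '/usr' ) {
    my @entries = glob "$dir/*";    # list context: no iterator state is carried over
    say $entries[0];                # the first entry of the current $dir each time
}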

Line Input operator with glob returning old values

The following excerpt of code, when run on perl 5.16.3 and older versions, has a strange behavior: subsequent uses of a glob in the line input operator cause the glob to continue returning previous values, rather than running the glob anew.
#!/usr/bin/env perl
use strict;
use warnings;
my @dirs = ("/tmp/foo", "/tmp/bar");
foreach my $dir (@dirs) {
my $count = 0;
my $glob = "*";
print "Processing $glob in $dir\n";
while (<$dir/$glob>) {
print "Processing file $_\n";
$count++;
last if $count > 0;
}
}
If you put two files in /tmp/foo and one or more in /tmp/bar and run the code, you get the following output:
Processing * in /tmp/foo
Processing file /tmp/foo/foo.1
Processing * in /tmp/bar
Processing file /tmp/foo/foo.2
I thought that when the while terminates after the last, the new invocation of the while on the second iteration would re-run the glob and give me the files listed in /tmp/bar, but instead I get a continuation of what's in /tmp/foo.
It's almost like the angle operator glob is acting like a precompiled pattern. My hypothesis is that the angle operator is creating a filehandle in the symbol table that's still open and being reused behind the scenes, and that it's scoped to the containing foreach, or possibly the whole subroutine.
From I/O Operators in perlop
(my emphasis)
A (file)glob evaluates its (embedded) argument only when it is starting a
new list. All values must be read before it will start over. In list
context, this isn't important because you automatically get them all
anyway. However, in scalar context the operator returns the next value
each time it's called, or undef when the list has run out.
Since <> is called in scalar context here and you exit the loop with last after the first iteration, the next time you enter it, it keeps reading from the original list.
It is clarified in comments that there is a practical need behind this question: to process only some of the files from a directory, and never to build the full list of filenames, since there can be many.
So assigning from glob to a list and working with that, or better yet using for instead of while as suggested by ysth in the comments, doesn't help here, as it builds the whole huge list.
I haven't found a way to make glob (which is what <> uses with a file pattern) drop and rebuild its list once it has been generated, short of iterating to the end first.
Apparently, each instance of the operator gets its own list. So using another <> inside the while loop in the hope of resetting it, in any way and even with the same pattern, doesn't affect the list being iterated over in while (<$dir/$glob>).
Just to note, breaking out of the loop with a die (with while in an eval) doesn't help either; the next time we come to that while the same list is continued. Wrapping it in a closure
sub iter_glob { my $dir = shift; return sub { scalar <"$dir/*"> } }
for my $d (@dirs) {
my $iter = iter_glob($d);
while (my $f = $iter->()) {
# ...
}
}
met with the same fate; the original list keeps being used.
The solution then is to use readdir instead.
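A sketch of that readdir-based approach (directory names taken from the question, and the early exit mirrors the original last): readdir hands back one entry per call and keeps no hidden list, so breaking out of the loop leaves nothing behind:
use strict;
use warnings;

for my $dir ( '/tmp/foo', '/tmp/bar' ) {
    opendir my $dh, $dir or die "Cannot opendir $dir: $!";
    print "Processing files in $dir\n";
    my $count = 0;
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry =~ /^\./;       # skip hidden entries, '.' and '..'
        print "Processing file $dir/$entry\n";
        $count++;
        last if $count > 0;            # stop after the first file, as in the question
    }
    closedir $dh;
}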

Perl: DirHandle

I was reading how-do-i-read-in-the-contents-of-a-directory and wanted to find out more about doing it without opening and closing directories, as shown in @davidprecious' answer. I tried to read up on DirHandle (hoping for more explanation and examples), but several other places simply redirected me to the same perldoc page. I'm still unsure about where to stipulate the path to read.
Say I wanted the contents of "E:\parent\sub1\sub2\" and put that into a string variable like $p; where do I mention $p when using DirHandle?
Would appreciate some guidance. Thanks.
Personally, I'd suggest that's too complicated, and what you probably want is glob:
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $file ( glob "E:\\parent\\sub1\\sub2\\*" ) {
print $file,"\n";
}
Although note - glob gives you the path to the file, not just the filename. That's (IMO) generally more useful, because you can just pass the result to open, whereas if you're doing a readdir you get a bare file name and need to stick a path on it.
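A small sketch of that difference (the E:/ path is just the one from the question): glob hands back something you can open directly, while readdir makes you prepend the directory yourself:
my $dir = 'E:/parent/sub1/sub2';

for my $path ( glob "$dir/*" ) {
    next unless -f $path;
    open my $fh, '<', $path or die "Cannot open $path: $!";    # full path, opens directly
    close $fh;
}

opendir my $dh, $dir or die "Cannot opendir $dir: $!";
for my $name ( readdir $dh ) {
    next unless -f "$dir/$name";
    open my $fh, '<', "$dir/$name" or die "Cannot open $dir/$name: $!";    # bare name: prepend $dir
    close $fh;
}
closedir $dh;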
However if you do want to persist with doing it via DirHandle:
#!/usr/bin/env perl
use strict;
use warnings;
use DirHandle;
my $dir_handle = DirHandle -> new ( "C:\\Users\\Rolison\\" );
while ( my $entry = $dir_handle -> read ) {
print $entry,"\n";
}
Don't use $p as a variable name - single character variable names are almost always bad style.
It's probably worth pointing out that Windows is quite happy to use forward slashes (/) as directory separators - which avoids having to have all those ugly double backslashes.
my $dir_handle = DirHandle->new('E:/parent/sub1/sub2/');
while ( my $entry = $dir_handle->read ) {
say $entry;
}

Recursive directory traversal in perl

I intend to recursively traverse the directory that contains this piece of Perl script.
The idea is to traverse all directories whose parent directory contains the Perl script and collect all file paths into a single array variable, then return the list.
Here are the error messages:
readdir() attempted on invalid dirhandle $DIR at xxx
closedir() attempted on invalid dirhandle $DIR at xxx
Code is attached for reference. Thank you in advance.
use strict;
use warnings;
use Cwd;
our @childfile = ();
sub recursive_dir{
my $current_dir = $_[0]; # a parameter
opendir(my $DIR,$current_dir) or die "Fail to open current directory,error: $!";
while(my $contents = readdir($DIR)){
next if ($contents =~ m/^\./); # filter out "." and ".."
#if-else clause separate dirs from files
if(-d "$contents"){
#print getcwd;
#print $contents;
#closedir($DIR);
recursive_dir(getcwd."/$contents");
print getcwd."/$contents";
}
else{
if($contents =~ /(?<!\.pl)$/){
push(@childfile,$contents);
}
}
}
closedir($DIR);
#print @childfile;
return @childfile;
}
recursive_dir(getcwd);
Please tell us if this is homework. You are welcome to ask for help with assignments, but it changes the sort of answer you should be given.
You are relying on getcwd to give you the current directory that you are processing, yet you never change the current working directory, so your program will loop endlessly and eventually run out of memory. You should simply use $current_dir instead (see the sketch after the list below).
I don't believe that those error messages can be produced by the program you show. Your code checks whether opendir has succeeded and the program dies unless $DIR is valid, so the subsequent readdir and closedir must be using a valid handle.
Some other points:
Comments like # a parameter are ridiculous and only serve to clutter your code
Upper-case letters are generally reserved for global identifiers like package names. And $dir is a poor name for a directory handle, as it could also mean the directory name or the directory path. Use $dir_handle or $dh
It is crazy to use a negative look-behind just to check that a file name doesn't end with .pl. Just use push @childfile, $contents unless $contents =~ /\.pl$/
You never use the return value from your subroutine, so it is wasteful of memory to return what could be an enormous array from every call. @childfile is accessible throughout the program so you can just access it directly from anywhere
Don't put scalar variables inside double quotes. It simply forces the value to a string, which is probably unnecessary and may cause arcane bugs. Use just -d $contents
You probably want to ignore symbolic links, as otherwise you could be looping endlessly. You should change else { ... } to elsif (-f $contents) { ... }
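Putting those points together, a corrected version might look something like this (a sketch only, collecting full paths rather than bare names):
use strict;
use warnings;
use Cwd;

my @childfile;

sub recursive_dir {
    my $current_dir = $_[0];
    opendir( my $dh, $current_dir ) or die "Fail to open $current_dir, error: $!";
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry =~ m/^\./;              # filter out ".", ".." and hidden entries
        my $path = "$current_dir/$entry";
        if ( -d $path && !-l $path ) {         # recurse into real directories, skip symlinks
            recursive_dir($path);
        }
        elsif ( -f $path ) {
            push @childfile, $path unless $path =~ /\.pl$/;
        }
    }
    closedir($dh);
}

recursive_dir(getcwd);
print "$_\n" for @childfile;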

Perl - Use of uninitialized value within %frequency in concatenation (.) or string

Not entirely sure why, but for some reason I can't print the hash value outside the while loop.
#!/usr/bin/perl -w
opendir(D, "cwd" );
my @files = readdir(D);
closedir(D);
foreach $file (@files)
{
open F, $file or die "$0: Can't open $file : $!\n";
while ($line = <F>) {
chomp($line);
$line=~ s/[-':!?,;".()]//g;
$line=~ s/^[a-z]/\U/g;
@words = split(/\s/, $line);
foreach $word (@words) {
$frequency{$word}++;
$counter++;
}
}
close(F);
print "$file\n";
print "$ARGV[0]\n";
print "$frequency{$ARGV[0]}\n";
print "$counter\n";
}
Any help would be much appreciated!
cheers.
This line
print "$frequency{$ARGV[0]}\n";
expects you to have passed an argument to your script, e.g. perl script.pl argument. If you have no argument, $ARGV[0] is undefined, but it will stringify to the empty string. This empty string is a valid key in the hash, but its value is undefined, hence your warning
Use of uninitialized value within %frequency in concatenation (.) or string
But you should also see the warning
Use of uninitialized value $ARGV[0] in hash element
And it is a very big mistake not to include that error in this question.
Also, when using readdir, you get all the files in the directory, including directories. You might consider filtering the files somewhat.
Using
use strict;
use warnings;
is something that will benefit you very much, so add that to your script.
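Here is a sketch with those suggestions applied (it assumes the current directory was meant rather than a directory literally named "cwd", and it leaves out the question's case-folding substitution):
use strict;
use warnings;

die "Usage: $0 <word>\n" unless @ARGV;
my $target = $ARGV[0];

my $dir = '.';    # the question passed the literal name "cwd" to opendir
opendir my $dh, $dir or die "$0: Can't open $dir: $!";
my @files = grep { -f "$dir/$_" } readdir $dh;    # keep plain files only
closedir $dh;

my ( %frequency, $counter );
for my $file (@files) {
    open my $fh, '<', "$dir/$file" or die "$0: Can't open $file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/[-':!?,;".()]//g;              # strip punctuation, as in the question
        for my $word ( split ' ', $line ) {
            $frequency{$word}++;
            $counter++;
        }
    }
    close $fh;
}

print "$target: ", $frequency{$target} // 0, "\n";
print "total words: ", $counter // 0, "\n";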
I had originally written this,
There is no %frequency defined at the top level of your program.
When perl sees you reference %frequency inside the inner-most
loop, it will auto-vivify it, in that scratchpad (lexical scope).
This means that when you exit the inner-most loop (foreach $word
(@words)), the auto-vivified %frequency is out of scope and
garbage-collected. Each time you enter that loop, a new, different
variable will be auto-vivified, and then discarded.
When you later refer to %frequency in your print, yet another new,
different %frequency will be created.
… but then realized that you had forgotten to use strict, and Perl was being generous and giving you a global %frequency, which ironically is probably what you meant. So, this answer is wrong in your case … but declaring the scope of %frequency would probably be good form, regardless.
These other, “unrelated” notes are still useful perhaps, or else I'd delete the answer altogether:
As @TLP mentioned, you should probably also skip directories (at least) in your file loop. A quick way to do this would be my @files = grep { -f "cwd/$_" } (readdir D); this will filter the list to contain only files.
I'm further suspicious that you named a directory "cwd" … are you perhaps meaning the current working directory? In all the major OSes in use today, that directory is referred to as “.”, so are you really looking for a directory literally named "cwd"?

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my @dirList = glob("$dirN/*");
for(@dirList) {
push @fileList, $_ if -f $_;
}
@ARGV = @fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines, or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expected your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my @lines = <FILE>;
close FILE or die $!;
foreach my $line (@lines) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Whether this processing lives in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
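A sketch of how that might look from Perl (the directory and pattern here are purely hypothetical; note that grep exits non-zero when nothing matches, so a failed close is not treated as fatal):
use strict;
use warnings;

my $dir     = '/some/dir';    # hypothetical
my $pattern = 'needle';       # hypothetical fixed string

my @files = glob "$dir/*";
die "No files to search in $dir\n" unless @files;

# grep -l prints only the names of matching files; the list form of open
# avoids any shell quoting issues with $pattern or the file names.
open my $grep, '-|', 'grep', '-l', '--', $pattern, @files
    or die "Cannot run grep: $!";
chomp( my @matching = <$grep> );
close $grep;    # exit status is non-zero when there are no matches

print "$_\n" for @matching;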
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (it reads each file in @ARGV, or STDIN if @ARGV is empty, and it tests for definedness rather than truth). To reduce the amount of magic that needs to be understood, I would avoid it, except for quickie hack jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just #ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think: OK, check perlvar and find that $. gives the line number in the last handle accessed - great. But there is an issue lurking here: $. is not reset between the files in @ARGV. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my @fileList = grep { -f $_ } glob("$dirN/*");
return unless @fileList; # Don't want to process STDIN.
local @ARGV;
@ARGV = @fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}