Perl: DirHandle

I was reading how-do-i-read-in-the-contents-of-a-directory and wanted to find out more about doing it without opening and closing directories, as shown in @davidprecious' answer. I tried to read up on DirHandle (hoping for more explanation and examples), but several other places simply redirected me to the same perldoc page. I'm still unsure where to specify the path to read.
Say I wanted the contents of "E:\parent\sub1\sub2\" and put that path into a string variable like $p: where do I mention $p when using DirHandle?
Would appreciate some guidance. Thanks.

Personally, I'd suggest that's too complicated, and what you probably want is glob:
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $file ( glob "E:\\parent\\sub1\\sub2\\*" ) {
print $file,"\n";
}
Note, though, that glob gives you the path to the file, not just the filename. That's (IMO) generally more useful, because you can pass the result straight to open, whereas with readdir you get a bare filename and need to prepend the directory yourself.
However if you do want to persist with doing it via DirHandle:
#!/usr/bin/env perl
use strict;
use warnings;
use DirHandle;
my $dir_handle = DirHandle -> new ( "C:\\Users\\Rolison\\" );
while ( my $entry = $dir_handle -> read ) {
print $entry,"\n";
}
Don't use $p as a variable name - single character variable names are almost always bad style.

It's probably worth pointing out that Windows is quite happy to use forward slashes (/) as directory separators - which avoids having to have all those ugly double backslashes.
use feature 'say'; # say is not enabled by default
my $dir_handle = DirHandle->new('E:/parent/sub1/sub2/');
while ( my $entry = $dir_handle->read ) {
say $entry;
}
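As the first answer notes, a readdir-style read hands back bare entry names, so the directory has to be stuck back on before the result can be passed to open. Here is a minimal sketch of that step with DirHandle. The helper name list_dir_paths is made up for illustration, and File::Spec is only one way to join the pieces.

```perl
use strict;
use warnings;
use DirHandle;
use File::Spec;

# Hypothetical helper: return the full paths of the entries in $dir,
# skipping the "." and ".." entries that read() also returns.
sub list_dir_paths {
    my ($dir) = @_;
    my $dir_handle = DirHandle->new($dir)
        or die "Cannot open directory $dir: $!";
    my @paths;
    while ( defined( my $entry = $dir_handle->read ) ) {
        next if $entry eq '.' or $entry eq '..';
        push @paths, File::Spec->catfile( $dir, $entry );
    }
    $dir_handle->close;
    my @sorted = sort @paths;
    return @sorted;
}

# Usage: list the current directory, one path per line.
print "$_\n" for list_dir_paths('.');
```

The defined() test is there so that an entry literally named "0" does not end the loop early.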

Related

Perl, find a match and read next line in perl

I would like to use
myscript.pl targetfolder/*
to read some number from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
match_replace()
}
sub match_replace {
# if I want to read the 5th line in downward, how to do that?
# if I would like to read multi lines in multi array[row],
# how to do that?
if ( /^\sName\s+/ ) {
$name = $1;
}
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well.
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
print Dumper($file); # debug
open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
next if 1..4;
chomp($line); # remove line ending
match_replace($line);
}
close $fh;
}
sub match_replace {
my ($line) = @_; # get argument(s) to sub
my $name;
if ( $line =~ /^\sName\s+(.*)$/ ) {
$name = $1;
}
print Data::Dumper->Dump([$line,$name],['line','name']); # debug
# ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing: if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below; this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
next if 1..4;
chomp; # automatically uses $_ variable
match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)
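For the question about keeping several lines in an array, here is a small sketch of the push approach combined with a capture group. The helper name collect_names and the sample lines are invented for illustration. It takes the lines as a list so it can be tried without any files.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the captured value from every line that
# looks like " Name   something" in the given list of lines.
sub collect_names {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        chomp $line;
        # The parentheses form a capture group; its match lands in $1.
        if ( $line =~ /^\sName\s+(.*)$/ ) {
            push @names, $1;
        }
    }
    return @names;
}

my @names = collect_names( " Name   alpha\n", "other line\n", " Name beta\n" );
print "$_\n" for @names;    # prints alpha, then beta
```

Only the captured values are kept here rather than whole lines, which keeps memory use down on large files.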

Perl glob returning a false positive

What seemed like a straightforward piece of code most certainly didn't do what I wanted it to do.
Can somebody explain to me what it does do and why?
my $dir = './some/directory';
if ( -d $dir && <$dir/*> ) {
print "Dir exists and has non-hidden files in it\n";
}
else {
print "Dir either does not exist or has no non-hidden files in it\n";
}
In my test case, the directory did exist and it was empty. However, the then (first) section of the if triggered instead of the else section as expected.
I don't need anybody to suggest how to accomplish what I want to accomplish. I just want to understand Perl's interpretation of this code, which definitely does not match mine.
Using glob (aka <filepattern>) in a scalar context makes it an iterator; it will return one file at a time each time it is called, and will not respond to changes in the pattern (e.g. a different $dir) until it has finished iterating over the initial results; I suspect this is causing the trouble you see.
The easy answer is to always use it in list context, like so:
if( -d $dir && ( () = <$dir/*> ) ) {
glob may only really be used safely in scalar context in code you will execute more than once if you are absolutely sure you will exhaust the iterator before you try to start a new iteration. Most of the time it's just easier to avoid glob in scalar context altogether.
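One way to keep the iterator out of the picture is to wrap the list-context call in a small sub. This is only a sketch: the name has_visible_entries is hypothetical, and it assumes the directory name contains no whitespace, since the built-in glob splits its pattern on whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $dir exists and holds at least one
# entry that "$dir/*" matches (names starting with a dot are skipped).
sub has_visible_entries {
    my ($dir) = @_;
    return 0 unless -d $dir;
    my @entries = glob("$dir/*");    # list context: all matches at once
    return @entries ? 1 : 0;
}

print has_visible_entries('.')
    ? "Dir exists and has non-hidden files in it\n"
    : "Dir either does not exist or has no non-hidden files in it\n";
```

Because the array assignment puts glob in list context, each call starts from scratch, so calling the sub with a different $dir gives a fresh answer.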
I believe that @ysth is on the right track, but repeated calls to glob in scalar context don't generate false positives.
For example
use strict;
use warnings;
use 5.010;
say scalar glob('/usr/*'), "\n";
say scalar glob('/usr/*'), "\n";
output
/usr/bin
/usr/bin
But what is true is that any single call to glob maintains a state, so if I have
use strict;
use warnings;
use 5.010;
for my $dir ( '/sys', '/usr', '/sys', '/usr' ) {
say scalar glob("$dir/*"), "\n";
}
output
/sys/block
/sys/bus
/sys/class
/sys/dev
So clearly that glob statement inside the loop is maintaining a state, and ignoring the changes to $dir.
This is similar to the way that the pos (and corresponding \G regex anchor) has a state per scalar variable, and how print without a specific file handle prints to the last selected handle. In the end it is how all of Perl works, with the it variable $_ being the ultimate example.

PERL: String Replacement on file

I am working on a script that does string replacement in a file. It will read the variables, values, and file names from a configuration file and then do the replacement.
Here is my logic to do a string replacement.
sub expansion($$$){
my $f = shift(@_) ; # file Name
my $vname = shift(@_) ; # variable name for pattern match
my $value = shift(@_) ; # value to replace
my $n = "$f".".new";
open ( O, "<$f") or print( "Can't open $f file: $!");
open ( N ,">$n" ) or print( "Can't open $n file: $!");
while (<O>)
{
$_ =~ s/$vname/$value/g; #check for pattern
print N "$_" ;
}
close (O);
close (N);
}
In my logic I am reading the input file ($f) line by line for the pattern and writing to a new file ($n).
Instead of writing to a new file, is there any way to do the string replacement in the original file? When I try to do that, I end up with an empty file with no contents.
Do not. Never, ever [1]. Don't you dare, don't even think of it: do not use subroutine prototyping. It is horribly broken (that is, it doesn't do what you think it does) and is dangerous.
Now, we got that out of the way:
Yes, you can do what you want. You can open a file as both readable and writable by using the mode +<. So far, so good.
However, due to buffering, you cannot use the standard read and write methods to read and write to the file. Instead, you need to use sysread and syswrite.
Then, what you need to do is read the line, use sysseek to go back to the start of where you read, and then write to that spot.
Not only is it very complex to do, but it is full of peril. Let's take a simple example. I have a document, and I want to replace my curly quotes with straight quotes.
$line =~ s/“|”/"/g;
That should work. I'm replacing one character with another. What could go wrong?
If this is a UTF-8 file (what Macs and Linux systems use by default), those curly quotes are three-byte characters and that straight quote is a single-byte character. I would be writing back a line that was shorter than the line I read in. My buffer is going to be off.
Back in the days when computer memory and storage were measured in kilobytes, and you used serial devices like reel-to-reel tapes, this type of operation was quite common. However, in this age where storage is vast, it's simply not worth the complexity and error-prone process that this entails. Stick with reading from one file, and writing to another. Then use unlink and rename to delete the original and to rename the copy to the original's name.
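Here is a rough sketch of that read-one-file, write-another approach, using die, lexical filehandles, and three-argument open. The sub name replace_in_file and the ".new" suffix are just for illustration, and \Q...\E is added on the assumption that the search string should match literally.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $search
# with $replacement in $file, going through a temporary copy.
sub replace_in_file {
    my ( $file, $search, $replacement ) = @_;
    my $tmp = "$file.new";

    open my $in_fh,  '<', $file or die "Can't open $file: $!";
    open my $out_fh, '>', $tmp  or die "Can't open $tmp: $!";
    while ( my $line = <$in_fh> ) {
        $line =~ s/\Q$search\E/$replacement/g;
        print {$out_fh} $line;
    }
    close $in_fh;
    close $out_fh or die "Can't close $tmp: $!";

    # Delete the original, then give the copy the original's name.
    unlink $file or die "Can't remove $file: $!";
    rename $tmp, $file or die "Can't rename $tmp to $file: $!";
    return;
}
```

It would be called as, say, replace_in_file( 'config.txt', '%NAME%', 'value' ), where the file name and strings are placeholders.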
A few more pointers:
Don't print if the file can't be opened. Use die. Otherwise, your program will simply continue on blithely unaware that it is not working. Even better, use the pragma use autodie;, and you won't have to worry about testing whether or not a read/write failed.
Use scalars for file handles.
That is instead of
open OUT, ">my_file.txt";
use
open my $out_fh, ">my_file.txt";
And, it is highly recommended to use the three parameter open:
Use
open my $out_fh, ">", "my_file.txt";
If you aren't already, always add use strict; and use warnings;.
In fact, your Perl syntax is a bit ancient. You need to get a book on Modern Perl. Perl originally was written as a hack language to replace shell and awk programming. However, Perl has morphed into a full-fledged language that can handle complex data types, object orientation, and large projects. Learning the modern syntax of Perl will help you find errors, and become a better developer.
1. Like all rules, this can be broken, but only if you have a clear and careful understanding what is going on. It's like those shows that say "Don't do this at home. We're professionals."
sub inplace_expansion($$$){
my $f = shift(@_) ; # file Name
my $vname = shift(@_) ; # variable name for pattern match
my $value = shift(@_) ; # value to replace
local @ARGV = ( $f );
local $^I = '';
while (<>)
{
s/\Q$vname/$value/g; #check for pattern
print;
}
}
or, my preference would run closer to this (basically equivalent, changes mostly in formatting, variable names, etc.):
use English;
sub inplace_expansion {
my ( $filename, $pattern, $replacement ) = @_;
local @ARGV = ( $filename );
local $INPLACE_EDIT = ''; # separate statement, so this one is localized too
while ( <> ) {
s/\Q$pattern/$replacement/g;
print;
}
}
The trick with local basically simulates a command-line script (as one would run with perl -i -e); for more details, see perldoc perlrun. For more on $^I (aka $INPLACE_EDIT), see perldoc perlvar.
(For the business with \Q (in the s// expression), see perldoc -f quotemeta. This is unrelated to your question, but good to know. Also be aware that passing regex patterns around in variables—as opposed to, e.g., using literal regexes exclusively— can be vulnerable to injection attacks; Perl's built-in taint mode is useful here.)
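To make the \Q point concrete, here is a tiny sketch with made-up strings. It runs the same substitution with and without quoting on a pattern that contains a regex metacharacter.

```perl
use strict;
use warnings;

my $pattern = 'a.c';        # the dot is a regex metacharacter
my $text    = 'abc a.c';

( my $unquoted = $text ) =~ s/$pattern/X/g;        # dot matches any character
( my $quoted   = $text ) =~ s/\Q$pattern\E/X/g;    # dot matches only a literal dot

print "$unquoted\n";    # X X
print "$quoted\n";      # abc X
```

Without \Q the pattern also eats "abc", which is rarely what someone replacing a variable name in a file has in mind.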
EDIT: David W. is right about prototypes.

Escape whitespace when using backticks

I've had a search around, and from my perspective using backticks is the only way I can solve this problem. I'm trying to call the mdls command from Perl for each file in a directory to find its last accessed time. The issue I'm having is that the file names I get from find contain unescaped spaces, which bash obviously doesn't like. Is there an easy way to escape all of the white space in my file names before passing them to mdls? Please forgive me if this is an obvious question. I'm quite new to Perl.
my $top_dir = '/Volumes/hydrogen/FLAC';
sub wanted { # Learn about sub routines
if ($File::Find::name) {
my $curr_file_path = $File::Find::name. "\n";
`mdls $curr_file_path`;
print $_;
}
}
find(\&wanted, $top_dir);
If you are JUST wanting "last access time" in terms of the OS last access time, mdls is the wrong tool. Use perl's stat. If you want last access time in terms of the Mac registered application (i.e., a song by Quicktime or iTunes) then mdls is potentially the right tool. (You could also use osascript to query the Mac app directly...)
Backticks are for capturing the text return. Since you are using mdls, I assume capturing and parsing the text is still to come.
So there are several methods:
Use the list form of system and the quoting is not necessary (if you don't care about the return text);
Use String::ShellQuote to escape the file name before sending to sh;
Build the string and enclose it in single quotes prior to sending it to the shell. This is harder than it sounds because file names containing single quotes defeat your quoting! For example, sam's song.mp4 is a legal file name, but if you surround it with single quotes you get 'sam's song.mp4', which is not what you meant...
Use open to open a pipe to the output of the child process like this: open my $fh, '-|', "mdls", "$curr_file" or die "$!";
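The first option, the list form of system, could look roughly like the sketch below. So that it runs anywhere, the running perl ($^X) stands in for mdls, and the file name is the awkward example from above. Both are assumptions for the demo, not part of the original question.

```perl
use strict;
use warnings;

# A file name with a space and a single quote, the awkward case above.
my $file_name = "sam's song.mp4";

# List form: each element reaches the child as one argument, so no shell
# parses the name. $^X (the running perl) stands in for mdls here.
my @command = ( $^X, '-e', 'print "got: $ARGV[0]\n"', $file_name );
my $status  = system(@command);
die "command failed with status $status" if $status != 0;
```

With mdls the call would simply be system( 'mdls', $file_name ), with the output going straight to the terminal rather than back to the script.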
Example of String::ShellQuote:
use strict; use warnings;
use String::ShellQuote;
use File::Find;
my $top_dir = '/Users/andrew/music/iTunes/iTunes Music/Music';
sub wanted {
if ($File::Find::name) {
my $curr_file = "$File::Find::name";
my $rtr;
return if -d;
my $exec="mdls ".shell_quote($curr_file);
$rtr=`$exec`;
print "$rtr\n\n";
}
}
find(\&wanted, $top_dir);
Example of pipe:
use strict; use warnings;
use String::ShellQuote;
use File::Find;
my $top_dir = '/Users/andrew/music/iTunes/iTunes Music/Music';
sub wanted {
if ($File::Find::name) {
my $curr_file = "$File::Find::name";
my $rtr;
return if -d;
open my $fh, '-|', "mdls", "$curr_file" or die "$!";
{ local $/; $rtr=<$fh>; }
close $fh or die "$!";
print "$rtr\n\n";
}
}
find(\&wanted, $top_dir);
If you're sure the filenames don't contain newlines (either CR or LF), then pretty much all Unix shells accept backslash quoting, and Perl has the quotemeta function to apply it.
my $curr_file_path = quotemeta($File::Find::name);
my $time = `mdls $curr_file_path`;
Unfortunately, that doesn't work for filenames with newlines, because the shell handles a backslash followed by a newline by deleting both characters instead of just the backslash. So to be really safe, use String::ShellQuote:
use String::ShellQuote;
...
my $curr_file_path = shell_quote($File::Find::name);
my $time = `mdls $curr_file_path`;
That should work on filenames containing anything except a NUL character, which you really shouldn't be using in filenames.
Both of these solutions are for Unix-style shells only. If you're on Windows, proper shell quoting is much trickier.
If you just want to find the last access time, is there some weird Mac reason you aren't using stat? When would it be worse than kMDItemLastUsedDate?
my $last_access = ( stat($file) )[8];
It seems kMDItemLastUsedDate isn't always updated to the last access time. If you work with a file through the terminal (e.g. cat, more), kMDItemLastUsedDate doesn't change but the value that comes back from stat is right. touch appears to do the right thing in both cases.
It looks like you need stat for the real answer, but mdls if you're looking for access through applications.
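For the plain stat route, a small sketch that turns the access time into something readable. The sub name last_access_string is hypothetical, and $0 (the script itself) is used only so there is a file to look at.

```perl
use strict;
use warnings;

# Hypothetical helper: last access time of $file as a readable string,
# or undef if the file cannot be stat'ed.
sub last_access_string {
    my ($file) = @_;
    my @info = stat($file) or return;
    return scalar localtime( $info[8] );    # element 8 is atime
}

# Usage: report on the running script itself.
print last_access_string($0) // 'unknown', "\n";
```

Since no external command is involved, spaces and quotes in the file name need no escaping at all.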
You can bypass the shell by expressing the command as a list, combined with capture() from IPC::System::Simple:
use IPC::System::Simple qw(capture);
my $output = capture('mdls', $curr_file_path);
Quote the variable name inside the backticks:
`mdls "$curr_file_path"`;
`mdls '$curr_file_path'`;

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my @dirList = glob("$dirN/*");
for(@dirList) {
push @fileList, $_ if -f $_;
}
@ARGV = @fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my @lines = <FILE>;
close FILE or die $!;
foreach my $line (@lines) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Whether to put this processing in the main program or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (reading each file in @ARGV or processing STDIN if @ARGV is empty, testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie-hack-jobs.
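The "definedness rather than truth" part is easy to miss, so here is a tiny sketch of it using an in-memory file (the data is made up). The last line is a bare "0", which is false but still defined.

```perl
use strict;
use warnings;

# An in-memory file whose last line is "0" with no trailing newline.
my $data = "1\n0";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @seen;
while ( my $line = <$fh> ) {    # Perl adds an implicit defined() test here
    chomp $line;
    push @seen, $line;
}
close $fh;
print scalar(@seen), " lines read\n";    # 2 lines read
```

If the loop tested plain truth, it would stop before recording that final "0".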
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find $. gives the line number in the last handle accessed--great. But there is an issue lurking here--$. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my @fileList = grep { -f $_ } glob("$dirN/*");
return unless @fileList; # Don't want to process STDIN.
local @ARGV;
@ARGV = @fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}
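For contrast, here is a sketch that puts my and local side by side under strict. The variable and sub names are invented for the example.

```perl
use strict;
use warnings;

our $setting = 'global';    # package variable

sub show { return "setting=$setting" }

sub with_local {
    local $setting = 'localized';    # visible to subs called from here
    return show();
}

sub with_my {
    my $setting = 'lexical';    # a separate variable; show() never sees it
    return show();
}

print with_local(), "\n";    # setting=localized
print with_my(),    "\n";    # setting=global
print show(),       "\n";    # setting=global
```

The last line shows that the localized value is gone again once with_local returns, which is the property that makes local a reasonable fit for @ARGV and $_.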