Perl Globbing a Variable stops on first match - perl

I'm trying to loop over some files in my directory in Perl. Let's say my current directory contains: song0.txt, song1.txt, song2.txt, song3.txt, song4.txt.
I supply "song?.txt" as an argument to my program.
When I do:
foreach $file (glob "$ARGV[0]") {
printf "$file\n";
}
It stops after printing "song0.txt".
However, if I replace "$ARGV[0]" with "song?.txt", it prints out all 5 of them as expected. Why doesn't Perl glob work with variables and what can I do to fix this?

When you call your program with song?.txt the shell expands that ? so
prog.pl song?.txt --> prog.pl song0.txt song1.txt ...
Thus "$ARGV[0]" in the program is song0.txt and there is nothing for Perl's glob to do with it.
So you'd either do
foreach my $file (#ARGV) { }
and call the program with prog.pl song?.txt, or do the globbing in Perl
foreach my $file (glob "song?.txt") { ... }
where now Perl's glob will construct the list of files using ? in the pattern.
Which of the two is "better" depends on the context. But I'd rather submit to a program a straight-up list of files, if that is an equal option, than get entangled in glob-ing patterns in the program.
Also note that Perl's glob is an ancient "demon", with "interesting" behaviors in some cases.

Related

Perl in place editing within a script (rather than one liner)

So, I'm used to the perl -i to use perl as I would sed and in place edit.
The docs for $^I in perlvar:
$^I
The current value of the inplace-edit extension. Use undef to disable inplace editing.
OK. So this implies that I can perhaps mess around with 'in place' editing in a script?
The thing I'm having trouble with is this:
If I run:
perl -pi -e 's/^/fish/' test_file
And then deparse it:
BEGIN { $^I = ""; }
LINE: while (defined($_ = <ARGV>)) {
s/^/fish/;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Now - if I were to want to use $^I within a script, say to:
foreach my $file ( glob "*.csv" ) {
#inplace edit these files - maybe using Text::CSV to manipulate?
}
How do I 'enable' this to happen? Is it a question of changing $_ (as s/something/somethingelse/ does by default) and letting perl implicitly print it? Or is there something else going on?
My major question is - can I do an 'in place edit' that applies a CSV transform (or XML tweak, or similar).
I appreciate I can open separate file handles, read/print etc. I was wondering if there was another way. (even if it is only situationally useful).
The edit-in-place behaviour that is enabled by the -i command-line option or by setting $^I works only on the ARGV file handle. That means the files must either be named on the command line or #ARGV must be set up within the program
This program will change all lower-case letters to upper-case in all CSV files. Note that I have set $^I to a non-null string, which is advisable while you are testing so that your original data files are retained
use strict;
use warnings;
our $^I = '.bak';
while ( my $file = glob '*.csv' ) {
print "Processing $file\n";
our #ARGV = ($file);
while ( <ARGV> ) {
tr/a-z/A-Z/;
print;
}
}
There is a much simpler answer, if your script is always going to do in-place editing and your OS uses shebang:
#!perl -i
while (<>) {
print "LINE: $_"
}
Will add 'LINE: ' at the beginning of a line for each file it's given. (Note that you'd probably use the full path to perl, i.e., "#!/usr/bin/perl -i")
You can also call your script as:
% perl -i <script> <file1> <file2> ...
To run script as an in-place editor on file1, file2, etc.., if you don't have shebang support.

Recursive directory traversal in perl

I intend to recursively traverse a directory containing this piece of perl script.
The idea is to traverse all directories whose parent directory contains the perl script and list all files path into a single array variable. Then return the list.
Here comes the error msg:
readdir() attempted on invalid dirhandle $DIR at xxx
closedir() attempted on invalid dirhandle $DIR at xxx
Code is attached for reference, Thank you in advance.
use strict;
use warnings;
use Cwd;
our #childfile = ();
sub recursive_dir{
my $current_dir = $_[0]; # a parameter
opendir(my $DIR,$current_dir) or die "Fail to open current directory,error: $!";
while(my $contents = readdir($DIR)){
next if ($contents =~ m/^\./); # filter out "." and ".."
#if-else clause separate dirs from files
if(-d "$contents"){
#print getcwd;
#print $contents;
#closedir($DIR);
recursive_dir(getcwd."/$contents");
print getcwd."/$contents";
}
else{
if($contents =~ /(?<!\.pl)$/){
push(#childfile,$contents);
}
}
}
closedir($DIR);
#print #childfile;
return #childfile;
}
recursive_dir(getcwd);
Please tell us if this is homework? You are welcome to ask for help with assignments, but it changes the sort of answer you should be given.
You are relying on getcwd to give you the current directory that you are processing, yet you never change the current working directory so your program will loop endlessly and eventually run out of memory. You should simply use $current_dir instead.
I don't believe that those error messages can be produced by the program you show. Your code checks whether opendir has succeeded and the program dies unless $DIR is valid, so the subsequent readdir and closedir must be using a valid handle.
Some other points:
Comments like # a parameter are ridiculous and only serve to clutter your code
Upper-case letters are generally reserved for global identifiers like package names. And $dir is a poor name for a directory handle, as it could also mean the directory name or the directory path. Use $dir_handle or $dh
It is crazy to use a negative look-behind just to check that a file name doesn't end with .pl. Just use push #childfile, $contents unless $contents =~ /\.pl$/
You never use the return value from your subroutine, so it is wasteful of memory to return what could be an enormous array from every call. #childfile is accessible throughout the program so you can just access it directly from anywhere
Don't put scalar variables inside double quotes. It simply forces the value to a string, which is probably unnecessary and may cause arcane bugs. Use just -d $contents
You probably want to ignore symbolic links, as otherwise you could be looping endlessly. You should change else { ... } to elsif (-f $contents) { ... }

What does -l $_ do in Perl, and how does it work?

What is the meaning of the nest code
foreach (#items)
{
if (-l $_) ## this is what I don't understand: the meaning of -l
{
...
}
}
Thanks for any help.
Let's look at each thing:
foreach (#items) {
...
}
This for loop (foreach and for are the same command in Perl) is taking each item from the #items list, and setting it to $_. The $_ is a special variable in Perl that is used as sort of a default variable. The idea is that you could do things like this:
foreach (#items) {
s/foo/bar/;
uc;
print;
}
And each of those command would operate on that $_ variable! If you simply say print with nothing else, it would print whatever is in $_. If you say uc and didn't mention a variable, it would uppercase whatever is in $_.
This is now discouraged for several reasons. First, $_ is global, so there might be side effects that are not intended. For example, imagine you call a subroutine that mucked with the value of $_. You would suddenly be surprised that your program doesn't work.
The other -l is a test operator. This operator checks whether the file given is a symbolic link or not. I've linked to the Perldoc that explains all of the test operators.
If you're not knowledgeable in Unix or BASH/Korn/Bourne shell scripting, having a command that starts with a dash just looks weird. However, much of Perl's syntax was stolen... I mean borrowed from Unix shell and awk commands. In Unix, there's a command called test which you can use like this:
if test -L $FILE
then
....
fi
In Unix, that -L is a parameter to the test command, and in Unix, most parameters to commands start with dashes. Perl simply borrowed the same syntax dash and all.
Interestingly, if you read the Perldoc for these test commands, you will notice that like the foreach loop, the various test commands will use the $_ variable if you don't give it a variable or file name. Whoever wrote that script could have written their loop like this:
foreach (#items)
{
if (-l) ## Notice no mention of the `$_` variable
{
...
}
}
Yeah, that's soooo much clear!
Just for your information, The modern way as recommended by many Perl experts (cough Damian Conway cough) is to avoid the $_ variable whenever possible since it doesn't really add clarity and can cause problems. He also recommends just saying for and forgetting foreach, and using curly braces on the same line:
for my $file (#items) {
if ( -l $file ) {
...
}
}
That might not help with the -l command, but at least you can see you're dealing with files, so you might suspect that -l has something to do with files.
Unfortunately, the Perldoc puts all of these file tests under the -X section and alphabetized under X, so if you're searching the Perldoc for a -l command, or any command that starts with a dash, you won't find it unless you know. However, at least you know now for the future where to look when you see something like this: -s $file.
It's an operator that checks if a file is a symbolic link.
The -l filetest operator checks whether a file is a symbolic link.
The way -l works under the hood resembles the code below.
#! /usr/bin/env perl
use strict;
use warnings;
use Fcntl ':mode';
sub is_symlink {
my($path) = #_;
my $mode = (lstat $path)[2];
die "$0: lstat $path: $!" unless defined $mode;
return S_ISLNK $mode;
}
my #items = #ARGV;
foreach (#items) {
if (is_symlink $_) {
print "$0: link: $_\n";
}
}
Sample output:
$ ln -s foo/bar/baz quux
$ ./flag-links flag-links quux
./flag-links: link: quux
Note the call to lstat and not stat because the latter would attempt to follow symlinks but never identify them!
To understand how Unix mode bits work, see the accepted answer to “understanding and decoding the file mode value from stat function output.”
From perldoc :
-l File is a symbolic link.

Perl Porter Stemmer

I was checking this porter stemmer. Below they said I should change my first line. To what exactly I tried every thing but the stemmer ain't working. What a good example might be?
#!/usr/local/bin/perl -w
#
# Perl implementation of the porter stemming algorithm
# described in the paper: "An algorithm for suffix stripping, M F Porter"
# http://www.muscat.com/~martin/stem.html
#
# Daniel van Balen (vdaniel#ldc.usb.ve)
#
# October-1999
#
# To Use:
#
# Put the line "use porter;" in your code. This will import the subroutine
# porter into your current name space (by default this is Main:: ). Make
# sure this file, "porter.pm" is in your #INC path (it includes the current
# directory).
# Afterwards use by calling "porter(<word>)" where <word> is the word to strip.
# The stripped word will be the returned value.
#
# REMEMBER TO CHANGE THE FIRST LINE TO POINT TO THE PATH TO YOUR PERL
# BINARY
#
As A code I am writing what follows:
use Lingua::StopWords qw(getStopWords);
use Main::porter;
my $stopwords = getStopWords('en');
#stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
chdir("c:/perl/input");
#files = <*>;
foreach $file (#files)
{
open (input, $file);
while (<input>)
{
open (output,">>c:/perl/normalized/".$file);
chomp;
porter<$_>;
for my $stop (#stopwords)
{
s/\b\Q$stop\E\b//ig;
}
$_ =~s/<[^>]*>//g;
$_ =~ s/[[:punct:]]//g;
print output "$_\n";
}
}
close (input);
close (output);
The code gives no errors except it is not stemming anything!!!
That comment block is full of incorrect advice.
A #! line in a .pm file has no effect. It's a common mistake. The #! line tells Unix which interpreter to run the program with if and only if you run the file as a command line program.
./somefile # uses #! to determine what to run somefile with
/usr/bin/perl somefile # runs somefile with /usr/bin/perl regardless of #!
The #! line does nothing in a module, a .pm file which you use. Perl is already running at that point. The line is nothing but a comment.
The second problem is that your default namespace is main not Main. Casing matters.
Moving on to your code, use Main::porter; should not work. It should be use porter. You should get an error message like Can't locate Main/porter.pm in #INC (#INC contains: ...). If that code runs, perhaps you moved porter.pm into a Main/ directory? Move it out, it will confuse the importing of the porter function.
porter<$_>; says "try to read a line from the filehandle $_ and pass that into porter". $_ isn't a filehandle, it's a line from the file you just opened. You want porter($_) to pass the line into the porter function. If you turn on warnings (add use warnings to the top of your script) Perl will warn you about mistakes like that.
You'll also presumably want to do something with the return value from porter, otherwise it will truly do nothing. my #whatever_porter_returns = porter($_).
Likely one or more of your chdir or opens have silently failed so your program may have no input. Unfortunately, Perl does not let you know when this happens, you have to check. Normally you add an or die $! after the function to check for the error. This is busy work and often one forgets, instead you can use autodie which will automatically produce an error if any system calls like chdir or open fail.
With that stuff fixed your code should work, or at least produce useful error messages.
Finally, there are many stemming modules on CPAN which are likely to be higher quality than the one you've found with documentation and tests and updates and all that. Lingua::Stem and Text::English specifically use the porter algorithm. You might want to give those a shot.

Is there a simple way to do bulk file text substitution in place?

I've been trying to code a Perl script to substitute some text on all source files of my project. I'm in need of something like:
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" *.{cs,aspx,ascx}
But that parses all the files of a directory recursively.
I just started a script:
use File::Find::Rule;
use strict;
my #files = (File::Find::Rule->file()->name('*.cs','*.aspx','*.ascx')->in('.'));
foreach my $f (#files){
if ($f =~ s/thisgoesout/thisgoesin/gi) {
# In-place file editing, or something like that
}
}
But now I'm stuck. Is there a simple way to edit all files in place using Perl?
Please note that I don't need to keep a copy of every modified file; I'm have 'em all subversioned =)
Update: I tried this on Cygwin,
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" {*,*/*,*/*/*}.{cs,aspx,ascx
But it looks like my arguments list exploded to the maximum size allowed. In fact, I'm getting very strange errors on Cygwin...
If you assign #ARGV before using *ARGV (aka the diamond <>), $^I/-i will work on those files instead of what was specified on the command line.
use File::Find::Rule;
use strict;
#ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
$^I = '.bak'; # or set `-i` in the #! line or on the command-line
while (<>) {
s/thisgoesout/thisgoesin/gi;
print;
}
This should do exactly what you want.
If your pattern can span multiple lines, add in a undef $/; before the <> so that Perl operates on a whole file at a time instead of line-by-line.
You may be interested in File::Transaction::Atomic or File::Transaction
The SYNOPSIS for F::T::A looks very similar with what you're trying to do:
# In this example, we wish to replace
# the word 'foo' with the word 'bar' in several files,
# with no risk of ending up with the replacement done
# in some files but not in others.
use File::Transaction::Atomic;
my $ft = File::Transaction::Atomic->new;
eval {
foreach my $file (#list_of_file_names) {
$ft->linewise_rewrite($file, sub {
s#\bfoo\b#bar#g;
});
}
};
if ($#) {
$ft->revert;
die "update aborted: $#";
}
else {
$ft->commit;
}
Couple that with the File::Find you've already written, and you should be good to go.
You can use Tie::File to scalably access large files and change them in place. See the manpage (man 3perl Tie::File).
Change
foreach my $f (#files){
if ($f =~ s/thisgoesout/thisgoesin/gi) {
#inplace file editing, or something like that
}
}
To
foreach my $f (#files){
open my $in, '<', $f;
open my $out, '>', "$f.out";
while (my $line = <$in>){
chomp $line;
$line =~ s/thisgoesout/thisgoesin/gi
print $out "$line\n";
}
}
This assumes that the pattern doesn't span multiple lines. If the pattern might span lines, you'll need to slurp in the file contents. ("slurp" is a pretty common Perl term).
The chomp isn't actually necessary, I've just been bitten by lines that weren't chomped one too many times (if you drop the chomp, change print $out "$line\n"; to print $out $line;).
Likewise, you can change open my $out, '>', "$f.out"; to open my $out, '>', undef; to open a temporary file and then copy that file back over the original when the substitution's done. In fact, and especially if you slurp in the whole file, you can simply make the substitution in memory and then write over the original file. But I've made enough mistakes doing that that I always write to a new file, and verify the contents.
Note, I originally had an if statement in that code. That was most likely wrong. That would have only copied over lines that matched the regular expression "thisgoesout" (replacing it with "thisgoesin" of course) while silently gobbling up the rest.
You could use find:
find . -name '*.{cs,aspx,ascx}' | xargs perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi"
This will list all the filenames recursively, then xargs will read its stdin and run the remainder of the command line with the filenames appended on the end. One nice thing about xargs is it will run the command line more than once if the command line it builds gets too long to run in one go.
Note that I'm not sure whether find completely understands all the shell methods of selecting files, so if the above doesn't work then perhaps try:
find . | grep -E '(cs|aspx|ascx)$' | xargs ...
When using pipelines like this, I like to build up the command line and run each part individually before proceeding, to make sure each program is getting the input it wants. So you could run the part without xargs first to check it.
It just occurred to me that although you didn't say so, you're probably on Windows due to the file suffixes you're looking for. In that case, the above pipeline could be run using Cygwin. It's possible to write a Perl script to do the same thing, as you started to do, but you'll have to do the in-place editing yourself because you can't take advantage of the -i switch in that situation.
Thanks to ephemient on this question and on this answer, I got this:
use File::Find::Rule;
use strict;
sub ReplaceText {
my $regex = shift;
my $replace = shift;
#ARGV = (File::Find::Rule->file()->name('*.cs','*.aspx','*.ascx')->in('.'));
$^I = '.bak';
while (<>) {
s/$regex/$replace->()/gie;
print;
}
}
ReplaceText qr/some(crazy)regexp/, sub { "some $1 text" };
Now I can even loop through a hash containing regexp=>subs entries!