In Perl, why does the `while(<HANDLE>) {...}` construct not localize `$_`?

What was the design (or technical) reason for Perl not automatically localizing $_ with the following syntax:
while (<HANDLE>) {...}
Which gets rewritten as:
while (defined( $_ = <HANDLE> )) {...}
All of the other constructs that implicitly write to $_ do so in a localized manner (for/foreach, map, grep), but with while, you must explicitly localize the variable:
local $_;
while (<HANDLE>) {...}
My guess is that it has something to do with using Perl in "Super-AWK" mode with command line switches, but that might be wrong.
So if anyone knows (or better yet was involved in the language design discussion), could you share with us the reasoning behind this behavior? More specifically, why was allowing the value of $_ to persist outside of the loop deemed important, despite the bugs it can cause (which I tend to see all over the place on SO and in other Perl code)?
In case it is not clear from the above, the reason why $_ must be localized with while is shown in this example:
sub read_handle {
    while (<HANDLE>) { ... }
}
for (1 .. 10) {
    print "$_: \n"; # works, prints a number from 1 .. 10
    read_handle;
    print "done with $_\n"; # does not work, prints the last line read from
                            # HANDLE or undef if the file was finished
}
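The effect can be reproduced in a self-contained way. This is only a minimal sketch: it assumes an in-memory filehandle in place of HANDLE, and the sub names count_lines_clobbering and count_lines_safe as well as the sample data are made up for illustration.

```perl
use strict;
use warnings;

my $data = "line one\nline two\n";   # hypothetical stand-in for a real file

# Reads every line with the implicit assignment to the global $_.
sub count_lines_clobbering {
    open my $fh, '<', \$data or die "open failed: $!";
    my $n = 0;
    while (<$fh>) { $n++ }           # same as: while (defined($_ = <$fh>))
    return $n;
}

# Same loop, but the caller's $_ is restored when the sub returns.
sub count_lines_safe {
    open my $fh, '<', \$data or die "open failed: $!";
    my $n = 0;
    local $_;
    while (<$fh>) { $n++ }
    return $n;
}

my (@seen_clobbered, @seen_safe);
for (1 .. 3) {
    count_lines_clobbering();
    push @seen_clobbered, $_;        # undef: the final read at EOF overwrote $_
}
for (1 .. 3) {
    count_lines_safe();
    push @seen_safe, $_;             # the loop value survives the call
}
print "safe loop saw: @seen_safe\n";
```

Writing while (my $line = <$fh>) avoids touching $_ at all and still gets the implicit defined() test.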

From the thread on perlmonks.org:
There is a difference between foreach
and while because they are two totally
different things. foreach always
assigns to a variable when looping
over a list, while while normally
doesn't. It's just that while (<>) is
an exception and only when there's a
single diamond operator there's an
implicit assignment to $_.
And also:
One possible reason for why while(<>)
does not implicitly localize $_ as
part of its magic is that sometimes
you want to access the last value of
$_ outside the loop.

Quite simply, while never localises. No variable is associated with a while construct, so it doesn't even have anything to localise.
If you change some variable in the while loop expression or in a while loop body, it's your responsibility to adequately scope it.

Speculation: for and foreach are iterators that loop over values, whereas while operates on a condition. In the case of while (<FH>), the condition is that data was read from the file. The <FH> is what writes to $_, not the while. The implicit defined() test is just an affordance to prevent naive code from terminating the loop when a read returns a false value.
For other forms of while loops, e.g. while (/foo/) you wouldn't want to localize $_.
While I agree that it would be nice if while (<FH>) localized $_, it would have to be a very special case, which could cause other problems with recognizing when to trigger it and when not to, much like the rules for <EXPR> distinguishing being a handle read or a call to glob.
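A small sketch of how narrow the special case is, assuming an in-memory filehandle and made-up sample data: reading a line outside a while condition leaves $_ alone, while the bare while (<$fh>) form assigns to it on every read, including the final undef at end of file.

```perl
use strict;
use warnings;

my $data = "first\nsecond\nthird\n";   # hypothetical sample input
open my $fh, '<', \$data or die "open failed: $!";

$_ = 'untouched';

my $first = <$fh>;                     # ordinary read: $_ is not involved
my $after_plain_read = $_;             # still 'untouched'

my @in_loop;
while (<$fh>) {                        # implicit: defined($_ = <$fh>)
    chomp;
    push @in_loop, $_;
}
my $after_while = $_;                  # undef, left over from the read at EOF

print "lines seen in the loop: @in_loop\n";
```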

As a side note, we only write while(<$fh>) because Perl doesn't have real iterators. If Perl had proper iterators, <$fh> would return one. for would use that to iterate a line at a time rather than slurping the whole file into an array. There would be no need for while(<$fh>) or the special cases associated with it.

Related

what does print for mean in Perl?

I need to edit some Perl script and I'm new to this language.
I encountered the following statement:
print for (@$result);
I know that $result is a reference to an array and @$result returns the whole array.
But what does print for mean?
Thank you in advance.
In Perl, there's such a thing as an implicit variable. You may have seen it already as $_. There's a lot of built in functions in perl that will work on $_ by default.
$_ is set in a variety of places, such as loops. So you can do:
while ( <$filehandle> ) {
    chomp;
    tr/A-Z/a-z/;
    s/oldword/newword/;
    print;
}
Each of these lines is using $_ and modifying it as it goes. Your for loop is doing the same - each iteration of the loop sets $_ to the current value and print is then doing that by default.
I would point out though - whilst useful and clever, it's also a really good way to make confusing and inscrutable code. In nested loops, for example, it can be quite unclear what's actually going on with $_.
So I'd typically:
avoid writing it explicitly - if you need to do that, you should consider actually naming your variable properly.
only use it in places where it makes it clearer what's going on. As a rule of thumb - if you use it more than twice, you should probably use a named variable instead.
I find it particularly useful if iterating on a file handle. E.g.:
while ( <$filehandle> ) {
    next unless m/keyword/; #skips any line without 'keyword' in it.
    my ( $wiggle, $wobble, $fronk ) = split ( /:/ ); #split $_ into 3 variables on ':'
    print $wobble, "\n";
}
It would be redundant to assign a variable name to capture a line from <$filehandle>, only to immediately discard it - thus instead we use split which by default uses $_ to extract 3 values.
If it's hard to figure out what's going on, then one of the more useful ways is to use perl -MO=Deparse which'll re-print the 'parsed' version of the script. So in the example you give:
foreach $_ (@$result) {
    print $_;
}
It is equivalent to for (@$result) { print; }, which is equivalent to for (@$result) { print $_; }. $_ refers to the current element.
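A minimal sketch of the statement-modifier form, assuming a made-up array reference and an in-memory buffer that is temporarily selected as the default output handle so the result can be inspected:

```perl
use strict;
use warnings;

my $result = [ "alpha\n", "beta\n" ];   # hypothetical data

my $captured = '';
open my $buf, '>', \$captured or die "open failed: $!";

my $old = select $buf;    # make $buf the default handle for print
print for @$result;       # each element is aliased to $_ and printed
select $old;              # restore the previous default handle
close $buf;

print $captured;          # now goes to the original default handle
```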

How to tell perl to print to a file handle instead of printing the file handle?

I'm trying to wrap my head around the way Perl handles the parsing of arguments to print.
Why does this
print $fh $stufftowrite
write to the file handle as expected, but
print($fh, $stufftowrite)
writes the file handle to STDOUT instead?
My guess is that it has something to do with the warning in the documentation of print:
Be careful not to follow the print keyword with a left parenthesis unless you want the corresponding right parenthesis to terminate the arguments to the print; put parentheses around all arguments (or interpose a + , but that doesn't look as good).
Should I just get used to the first form (which just doesn't seem right to me, coming from languages that all use parentheses around function arguments), or is there a way to tell Perl to do what I want?
So far I've tried a lot of combination of parentheses around the first, second and both parameters, without success.
On lists
The structure bareword (LIST1), LIST2 means "apply the function bareword to the arguments LIST1", while bareword +(LIST1), LIST2 can, but doesn't necessarily, mean "apply bareword to the arguments of the combined list LIST1, LIST2". This is important for grouping arguments:
my ($a, $b, $c) = (0..2);
print ($a or $b), $c; # print $b
print +($a or $b), $c; # print $b, $c
The prefix + can also be used to distinguish hashrefs from blocks, and functions from barewords, e.g. when subscripting a hash: $hash{shift} returns the element whose key is the string "shift", while $hash{+shift} calls the function shift and returns the hash element whose key is the value that shift returned.
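The hash-subscript case can be sketched as follows; the hash contents and the sub names by_bareword and by_call are invented for illustration.

```perl
use strict;
use warnings;

my %hash = (
    shift => 'element stored under the literal key "shift"',
    apple => 'element stored under "apple"',
);

# The bareword inside the braces is treated as the string 'shift'.
sub by_bareword { return $hash{shift} }

# The unary plus makes it an expression, so shift(@_) is called.
sub by_call { return $hash{+shift} }

print by_bareword('apple'), "\n";
print by_call('apple'), "\n";
```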
Indirect syntax
In object oriented Perl, you normally call methods on an object with the arrow syntax:
$object->method(LIST); # call `method` on `$object` with args `LIST`.
However, it is possible, but not recommended, to use an indirect notation that puts the verb first:
method $object (LIST); # the same, but stupid.
Because classes are just instances of themselves (in a syntactic sense), you can also call methods on them. This is why
new Class (ARGS); # bad style, but pretty
is the same as
Class->new(ARGS); # good style, but ugly
However, this can sometimes confuse the parser, so indirect style is not recommended.
But it does hint at what print does:
print $fh ARGS
is the same as
$fh->print(ARGS)
Indeed, the filehandle $fh is treated as an object of the class IO::Handle.
(While this is a valid syntactic explanation, it is not quite true. The source of IO::Handle itself uses the line print $this @_;. The print function is just defined this way.)
Looks like you have a typo. You have put a comma between the file handle and the argument in the second print statement. If you do that, the file handle will be seen as an argument. This seems to apply only to lexical file handles. If done with a global file handle, it will produce the fatal error
No comma allowed after filehandle at ...
So, to be clear, if you absolutely have to have parentheses for your print, do this:
print($fh $stufftowrite)
Although personally I prefer to not use parentheses unless I have to, as they just add clutter.
The book Modern Perl states in Chapter 11 ("What to Avoid"), in the section "Indirect Notation Scalar Limitations":
Another danger of the syntax is that the parser expects a single scalar expression as the object. Printing to a filehandle stored in an aggregate variable seems obvious, but it is not:
# DOES NOT WORK AS WRITTEN
say $config->{output} 'Fun diagnostic message!';
Perl will attempt to call say on the $config object.
print, close, and say—all builtins which operate on filehandles—operate in an indirect fashion. This was fine when filehandles were package globals, but lexical filehandles (Filehandle References) make the indirect object syntax problems obvious. To solve this, disambiguate the subexpression which produces the intended invocant:
say {$config->{output}} 'Fun diagnostic message!';
Of course, print({$fh} $stufftowrite) is also possible.
It's how the syntax of print is defined. It's really that simple. There's kind of nothing to fix. If you put a comma between the file handle and the rest of the arguments, the expression is parsed as print LIST rather than print FILEHANDLE LIST. Yes, that looks really weird. It is really weird.
The way not to get parsed as print LIST is to supply an expression that can legally be parsed as print FILEHANDLE LIST. If what you're trying to do is get parentheses around the arguments to print to make it look more like an ordinary function call, you can say
print($fh $stufftowrite); # note the lack of comma
You can also say
(print $fh $stufftowrite);
if what you're trying to do is set off the print expression from surrounding code. The key point is that including the comma changes the parse.
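The forms discussed in these answers can be put side by side. This is a sketch only, assuming an in-memory filehandle and a made-up %config hash; every statement below writes to the handle rather than printing it.

```perl
use strict;
use warnings;
use IO::Handle;   # makes the method form available on older Perls too

my $buffer = '';
open my $fh, '>', \$buffer or die "open failed: $!";
my %config = ( output => $fh );      # hypothetical aggregate holding the handle

print $fh "a\n";                     # print FILEHANDLE LIST, no comma
print($fh "b\n");                    # same, with parentheses around everything
print {$fh} "c\n";                   # block form makes the handle explicit
print { $config{output} } "d\n";     # block form is required for non-simple expressions
$fh->print("e\n");                   # method call through IO::Handle

close $fh;
print $buffer;                       # show what was written
```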

How to get rid of use of an uninitialized value within an 'if' construct using a Perl regex

How do I get rid of use of an uninitialized value within an if construct using a Perl regex?
When using the code below, I get use of uninitialized value messages.
if($arrayOld[$i] =~ /-(.*)/ || $arrayOld[$i] =~ /\#(.*)/)
When using the code below, I get no output.
if(defined($arrayOld[$i]) =~ /-(.*)/ || defined($arrayOld[$i]) =~ /\#(.*)/)
What is the proper way to check if a variable has a value given the code above?
Try:
if (defined $arrayOld[$i] && $arrayOld[$i] =~ /[-#](.*)/)
This first checks that $arrayOld[$i] is defined before running a regex against it.
(I have also combined the || into a single regex.)
From the error message in your comment, you're accessing an element of @arrayOld that isn't defined. Without seeing the rest of the code, this could indicate a bug in your program, or it could just be expected behavior.
If you understand why $arrayOld[$i] is undef, and you want to allow that without getting a warning, there's a couple of things you can do. Perl 5.10.0 introduced the defined-or operator //, which you can use to substitute the empty string for undef:
use 5.010;
...
if(($arrayOld[$i] // '') =~ /-(.*)/ || ($arrayOld[$i] // '') =~ /\#(.*)/)
Or, you can just turn off the warning:
if (do { no warnings 'uninitialized';
         $arrayOld[$i] =~ /-(.*)/ || $arrayOld[$i] =~ /\#(.*)/ })
Here, I'm using do to limit the time the warning is disabled. However, turning off the warning also suppresses the warning you'd get if $i were undef. Using // allows you to specify exactly what is allowed to be undef, and exactly what value should be used instead of undef.
Note: defined($arrayOld[$i]) =~ /-(.*)/ is running a pattern match on the result of the defined function, which is just going to be a true/false value; not the string you want to test.
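A runnable sketch of the defined-or approach, assuming a made-up @arrayOld that contains an undefined element; the @captures array is only there to make the result visible.

```perl
use strict;
use warnings;

my @arrayOld = ('-foo', undef, '#bar', 'plain');   # hypothetical data with a hole

my @captures;
for my $i (0 .. $#arrayOld) {
    # Substitute the empty string for undef so the match never sees undef.
    my $line = $arrayOld[$i] // '';
    if ($line =~ /-(.*)/ || $line =~ /\#(.*)/) {
        push @captures, $1;    # $1 comes from whichever pattern matched
    }
}

print "captured: @captures\n";
```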
To answer your question narrowly, you can prevent undefined-value warnings in that line of code with
if (defined $i && defined $arrayOld[$i]
    && ($arrayOld[$i] =~ /-(.*)/ || $arrayOld[$i] =~ /\#(.*)/))
{
    ...;
}
That is, evaluating either $i or the expression $arrayOld[$i] may result in an undefined value. Note the additional layer of parentheses that are necessary as written above because of the difference in precedence between && and ||, with the former binding more tightly. For the particular patterns in your question, you could sidestep this precedence issue by combining your patterns into one regex, but this can be tricky to do in the general case.
I recommend against using the unpleasing code above. Read on to see an elegant solution to your problem that has Perl do the work for you and is much easier to read.
Looking back
From the slightly broader context of your earlier question, $i is a loop variable and by construction will certainly be defined, so testing $i is overkill. Your code blindly pulls elements from @arrayOld, and Perl happily obliges. In cases where nothing is there, you get the undefined value.
This sort of one-by-one peeking and poking is common in C programs, but in Perl, it is almost always a red flag that you could express your algorithm more elegantly. Consider the complete, working example below.
Working demonstration
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.0; # given/when
*FILEREAD = *DATA; # for demo only
my @interesting_line = (qr/-(.*)/, qr/\#(.*)/);
$/ = ""; # paragraph mode
while(<FILEREAD>) {
    chomp;
    my @arrayOld = split /\n/;
    my @arrayNewLines;
    for (1 .. @arrayOld) {
        given (shift @arrayOld) {
            push @arrayNewLines, $_ when @interesting_line;
            push @arrayOld, $_;
        }
    }
    print "\@arrayOld:\n", map("$_\n", @arrayOld), "\n",
          "\@arrayNewLines:\n", map("$_\n", @arrayNewLines);
}
__DATA__
#SCSI_test # put this line into @arrayNewLines
kdkdkdkdkdkdkdkd
dkdkdkdkdkdkdkdkd
- ccccccccccccccc # put this line into @arrayNewLines
Front matter
The line
use 5.10.0;
enables Perl’s given/when switch statement, and this makes for a nice way to decide which array gets a given line of input.
As the comment indicates
*FILEREAD = *DATA; # for demo only
is for the purpose of this Stack Overflow demonstration. In your real code, you have open FILEREAD, .... Placing the input from your question into Perl’s DATA filehandle allows presenting code and input in one self-contained unit, and then we alias FILEREAD to DATA so the rest of the code will drop into yours with no fuss.
The main event
The core of the processing is
for (1 .. @arrayOld) {
    given (shift @arrayOld) {
        push @arrayNewLines, $_ when @interesting_line;
        push @arrayOld, $_;
    }
}
Notice that there are no defined checks or even explicit regex matches! There’s no $i or $arrayOld[$i]! What’s going on?
You start with @arrayOld containing all the lines from the current paragraph and want to end with the interesting lines in @arrayNewLines and everything else staying in @arrayOld. The code above takes the next line out of @arrayOld with shift. If the line is interesting, we push it onto the end of @arrayNewLines. Otherwise, we put it back on the end of @arrayOld.
The statement modifier when @interesting_line performs an implicit smart-match with the topic from given. As explained in “Smart matching in detail,” when smart matching against an array, Perl implicitly loops over it and stops on the first match. In this case, the array @interesting_line contains compiled regexes that match lines you want to move to @arrayNewLines. If the current line (in $_ thanks to given) does not match any of those patterns, it goes back in @arrayOld.
We do the preceding process exactly scalar @arrayOld times, that is, once for each line in the current paragraph. This way, we process everything exactly once and do not have to worry about fussy bookkeeping over where the current array index is. Whatever is left in @arrayOld after that many shifts must be the lines we pushed back onto it, which are the uninteresting lines in the order that they occurred in the input.
Sample output
For the input in your question, the output is
@arrayOld:
kdkdkdkdkdkdkdkd
dkdkdkdkdkdkdkdkd
@arrayNewLines:
#SCSI_test # put this line into @arrayNewLines
- ccccccccccccccc # put this line into @arrayNewLines
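given/when and smart matching have been flagged as experimental since Perl 5.18, so here is a sketch of the same partitioning that uses only grep. It is not the answer's method, just an alternative under the assumption that the lines are already in a plain array (@lines is a made-up name; the patterns are the ones from the demonstration).

```perl
use strict;
use warnings;

my @interesting_line = (qr/-(.*)/, qr/\#(.*)/);

# Hypothetical input, standing in for one paragraph split on newlines.
my @lines = (
    '#SCSI_test',
    'kdkdkdkdkdkdkdkd',
    'dkdkdkdkdkdkdkdkd',
    '- ccccccccccccccc',
);

my (@arrayOld, @arrayNewLines);
for my $line (@lines) {
    # In scalar context grep returns how many patterns matched this line.
    if (grep { $line =~ $_ } @interesting_line) {
        push @arrayNewLines, $line;
    }
    else {
        push @arrayOld, $line;
    }
}

print "kept: @arrayOld\n";
print "moved: @arrayNewLines\n";
```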

How is $_ different from named input or loop arguments?

As I use $_ a lot I want to understand its usage better. $_ is a global variable for implicit values as far as I understood and used it.
As $_ seems to be set anyway, are there reasons to use named loop variables over $_ besides readability?
In what cases does it matter $_ is a global variable?
So if I use
for (@array){
    print $_;
}
or even
print $_ for @array;
it has the same effect as
for my $var (@array){
    print $var;
}
But does it work the same? I guess it does not exactly but what are the actual differences?
Update:
It seems $_ is even scoped correctly in this example. Is it not global anymore? I am using 5.12.3.
#!/usr/bin/perl
use strict;
use warnings;
my @array = qw/one two three four/;
my @other_array = qw/1 2 3 4/;
for (@array){
    for (@other_array){
        print $_;
    }
    print $_;
}
that prints correctly 1234one1234two1234three1234four.
For global $_ I would have expected 1234 4 1234 4 1234 4 1234 4 .. or am I missing something obvious?
When is $_ global then?
Update:
Ok, after having read the various answers and perlsyn more carefully I came to a conclusion:
Besides readability it is better to avoid using $_ because implicit localisation of $_ must be known and taken account of otherwise one might encounter unexpected behaviour.
Thanks for clarification of that matter.
are there reasons to use named loop variables over $_ besides readability?
The issue is not if they are named or not. The issue is if they are "package variables" or "lexical variables".
See the very good description of the 2 systems of variables used in Perl "Coping with Scoping":
http://perl.plover.com/FAQs/Namespaces.html
package variables are global variables, and should therefore be avoided for all the usual reasons (eg. action at a distance).
Avoiding package variables is a question of "correct operation" or "harder to inject bugs" rather than a question of "readability".
In what cases does it matter $_ is a global variable?
Everywhere.
The better question is:
In what cases is $_ local()ized for me?
There are a few places where Perl will local()ize $_ for you, primarily foreach, grep and map. All other places require that you local()ize it yourself, therefore you will be injecting a potential bug when you inevitably forget to do so. :-)
The classic failure mode of using $_ (implicitly or explicitly) as a loop variable is
for $_ (@myarray) {
    /(\d+)/ or die;
    foo($1);
}
sub foo {
    open(F, "foo_$_[0]") or die;
    while (<F>) {
        ...
    }
}
Because the loop variable in for/foreach is bound (aliased) to the actual list item, the while (<F>) overwrites the elements of @myarray with lines read from the files (and finally with undef at end of file).
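A self-contained sketch of this failure mode, assuming an in-memory filehandle instead of real files; the sub name read_all and the sample data are made up. The first loop uses $_ and loses its array contents, the second uses a named lexical loop variable and keeps them.

```perl
use strict;
use warnings;

my $data = "x\ny\n";                  # hypothetical file contents

# Reads to end of file through the global $_ without localizing it.
sub read_all {
    open my $fh, '<', \$data or die "open failed: $!";
    while (<$fh>) { }
    return;
}

my @with_topic = ('a1', 'b2');
for (@with_topic) {                   # $_ is an alias for each element
    read_all();                       # the final read stores undef in the element
}

my @with_named = ('a1', 'b2');
for my $item (@with_named) {          # $item is the alias, $_ is unrelated
    read_all();
}

my $defined_left = grep { defined } @with_topic;
print "defined elements left in \@with_topic: $defined_left\n";
print "\@with_named is still: @with_named\n";
```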
$_ is the same as naming the variable as in your second example with the way it is usually used. $_ is just a shortcut default variable name for the current item in the current loop to save on typing when doing a quick, simple loop. I tend to use named variables rather than the default. It makes it more clear what it is and if I happen to need to do a nested loop there are no conflicts.
Since $_ is a global variable, you may get unexpected values if you try to use its value that it had from a previous code block. The new code block may be part of a loop or other operation that inserts its own values into $_, overwriting what you expected to be there.
The risk in using $_ is that it is global (unless you localise it with local $_), and so if some function you call in your loop also uses $_, the two uses can interfere.
For reasons which are not clear to me, this has only bitten me occasionally, but I usually localise $_ if I use it inside packages.
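There is nothing special about $_ apart from it being the default parameter for many functions. If you explicitly lexically scope your $_ with my, perl will use the local version of $_ rather than the global one. There is nothing strange in this, it is just like any other named variable. (Lexical my $_ was introduced in Perl 5.10, marked experimental in 5.18 and removed in 5.24, so the examples below only run on those older versions.)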
sub p { print "[$_]"; } # Prints the global $_
# Compare and contrast
for my $_ (b1..b5) { for my $_ (a1..a5) { p } } print "\n"; # ex1
for my $_ (b1..b5) { for (a1..a5) { p } } print "\n"; # ex2
for (b1..b5) { for my $_ (a1..a5) { p } } print "\n"; # ex3
for (b1..b5) { for (a1..a5) { p } } print "\n"; # ex4
You should be slightly mystified by the output until you find out that perl will preserve the original value of the loop variable on loop exit (see perlsyn).
Note ex2 above. Here the second loop is using the lexically scoped $_ declared in the first loop. Subtle, but expected. Again, this value is preserved on exit so the two loops do not interfere.

Is there an easy way to localise (preserve) all "magic variables" like $1, $& etc.?

I know that in a subroutine in Perl, it's a very good idea to preserve the "default variable" $_ with local before doing anything with it, in case the caller is using it, e.g.:
sub f() {
    local $_; # Ensure $_ is restored on dynamic scope exit
    while (<$somefile>) { # Clobbers $_, but that's OK -- it will be restored
        ...
    }
}
Now, often the reason you use $_ in the first place is because you want to use regexes, which may put results in handy "magic" variables like $1, $2 etc. I'd like to preserve those variables too, but I haven't been able to find a way to do that.
All perlvar says is that @+ and @-, which $1 etc. seem to depend on internally, refer to the "last successful submatches in the currently active dynamic scope". But even that seems at odds with my experiments. Empirically, the following code prints "aXaa" as I had hoped:
$_ = 'a';
/(.)/; # Sets $1 to 'a'
print $1; # Prints 'a'
{
    local $_; # Preserve $_
    $_ = 'X';
    /(.)/; # Sets $1 to 'X'
    print $1; # Prints 'X'
}
print $_; # Prints 'a' ('local' restored the earlier value of $_)
print $1; # Prints 'a', suggesting localising $_ does localise $1 etc. too
But what I find truly surprising is that, in my ActivePerl 5.10.0 at least, commenting out the local line still preserves $1 -- that is, the answer "aXXa" is produced! It appears that the lexical (not dynamic) scope of the brace-enclosed block is somehow preserving the value of $1.
So I find this situation confusing at best and would love to hear a definitive explanation. Mind you, I'd actually settle for a bulletproof way to preserve all regex-related magic variables without having to enumerate them all as in:
local @+, @-, $&, $1, $2, $3, $4, ...
which is clearly a disgusting hack. Until then, I will worry that any regex I touch will clobber something the caller was not expecting to be clobbered.
Thanks!
Maybe you can suggest a better wording for the documentation. Dynamic scope means everything up to the start of the enclosing block or subroutine, plus everything up to the start of that block or subroutine call, etc. except that any closed blocks are excluded.
Another way to say it: "last successful submatches in the currently active dynamic scope" means there is implicitly a local $x=$x; at the start of each block for each variable.
Most of the mentions of dynamic scope (for instance, http://perldoc.perl.org/perlglossary.html#scope or http://perldoc.perl.org/perlglossary.html#dynamic-scoping) are approaching it from the other way around. They apply if you think of a successful regex as implicitly doing a local $1, etc.
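The scoping of the capture variables can be observed directly. The sketch below follows the experiment from the question without the local line and adds a hypothetical sub named inner; values are collected in @seen rather than printed one by one.

```perl
use strict;
use warnings;

my @seen;

$_ = 'a';
/(.)/;                       # successful match in the outer scope
push @seen, $1;              # 'a'

{
    $_ = 'X';
    /(.)/;                   # successful match inside the block
    push @seen, $1;          # 'X'
}
push @seen, $1;              # back to 'a' once the block has been left

# A match inside a sub does not leak its captures to the caller either.
sub inner {
    'zzz' =~ /(z+)/ or die "no match";
    my $got = $1;            # copy the capture while it is in scope
    return $got;
}
my $inner = inner();
push @seen, $1;              # still 'a'

print "seen: @seen, inner: $inner\n";
```

Note that $_ itself is not restored here (it stays 'X' after the block); only the capture variables revert.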
I am not sure there is any real reason to be this paranoid about all these variables. I have managed to use Perl for almost ten years without once needing to use an explicit local in this context.
The answer to your specific question is: The number of digit variables is not a given (even though there is a hard memory limit to how many matches you can work with). So, it is not possible to localize all of them at the same time.
I think you are worrying too much. The best thing to do is run your match operator, immediately save the values you want into meaningful variables, then let the special variables do whatever they do without worrying about them:
if( $string =~ m/...(a.c).../ ) {
    my $found = $1;
}
When I want to capture parts of the strings, I most often use the match operator in list context to get a list of the memories back:
my @array = $string =~ m/..../g;
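A short sketch of the list-context idiom, with a made-up input string and patterns: the captures are copied straight into lexical variables, so the special variables never need to be consulted afterwards.

```perl
use strict;
use warnings;

my $string = 'width=10 height=20';    # hypothetical input

# Without /g, a match in list context returns the captured groups.
my ($width, $height) = $string =~ /width=(\d+) height=(\d+)/;

# With /g, it returns every capture from every match.
my @numbers = $string =~ /(\d+)/g;

# A failed match returns the empty list, so the variable stays undef.
my ($missing) = $string =~ /depth=(\d+)/;

print "width=$width height=$height all=@numbers\n";
```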