Why "my #variable" inside a sub and modified by sub/sub behaves so strange - perl

I have a strange behaved (to Python programmer) subroutine, which simplified as the following:
use strict;
use Data::Dumper;
sub a {
my #x;
sub b { push #x, 1; print "inside: ", Dumper(\#x); }
&b;
print "outside: ", Dumper(\#x);
}
&a;
&a;
I found the result is:
inside: $VAR1=[ 1 ]
outside: $VAR1 = [ 1 ]
inside: $VAR1=[1, 1]
outside: $VAR1= []
What I thought is when calling &a, #x is empty array after "my #x" and has one element after "&b", then dead. Every time I call &a, it is the same. so the output should be all $VAR1 = [ 1 ].
Then I read something like named sub routine are defined once in symbol table, then I do "my $b = sub { ... }; &$b;", it seems make sense to me.
How to explain?

As per the "perlref" man page:
named subroutines are created at compile time so their lexical
variables [i.e., their 'my' variables] get assigned to the parent
lexicals from the first execution of the parent block. If a parent
scope is entered a second time, its lexicals are created again, while
the nested subs still reference the old ones.
In other words, a named subroutine (your b), has its #x bound to the parent subroutine's "first" #x, so when a is called the first time, b adds a 1 to #x, and both the inner and outer copies refer to this same version. However, the second time a is called, a new #x lexical is created, but b still points to the old one, so it adds a second 1 to that list and prints it (inner), but when it comes time for a to print its version, it prints out the (empty) brand new lexical (outer).
Anonymous subroutines don't exhibit this problem, so when you write my $b = sub { ... }, the inner #x always refers to the "current" version of a's lexical #x.

Related

Perl Recursive Explanation

I have following script that need to understand
&main('key_1');
sub main {
#{$tmp{'key_1'}} = ("A", "B");
#{$tmp{'A'}} = ("C");
&test_recurse(\%tmp, 'key_1');
}
sub test_recurse {
my ($hash, $cell) = #_;
foreach my $cur_subcell (#{$hash->{$cell}}){
&test_recurse($hash, $cur_subcell);
}
print "X($cell)\n";
}
The output:
X(C)
X(A)
X(B)
X(key_1)
I want to understand why the key_1 is printing at the last? I am expecting the key_1 might not be printed at all.
I am expecting the key_1 might not be printed at all
The function test_recurse ends with print "X($cell)\n". This means that it ends by printing its second argument. Since you initially call it with key_1 as argument, it prints X(key_1).
To understand a bit better how the function test_recurse works, I suggest to add some prints as follows:
sub test_recurse {
my ($hash, $cell, $offset) = #_;
print "${offset}test_recurse($cell)\n";
foreach my $cur_subcell (#{$hash->{$cell}}){
&test_recurse($hash, $cur_subcell, $offset . " ");
}
print "${offset}X($cell)\n";
}
Thanks to the addition of $offset, each time you make a recursive call, the prints within this recursive call are indented further to the right. Calling this modified function with test_recurse(\%tmp, 'key_1', ""), you'll get this output:
test_recurse(key_1)
test_recurse(A)
test_recurse(C)
X(C)
X(A)
test_recurse(B)
X(B)
X(key_1)
So, what happens is:
You call test_recurse with key_1. This prints test_recurse(key_1).
In the foreach loop, it will make two successive calls to test_recurse:
The first one with A as argument. This will print test_recurse(A).
In the foreach loop, it will make a call to
test_recurse with C as argument. This will print test_recurse(C).
Since $tmp{C} does not exist, this call does not enter in the foreach loop, and directly proceed to the final print and prints X(C). We then go back to the caller (test_recurse with A as argument).
Now that the foreach loop is done, this function moves on to the last print and prints X(A). We then go back to the caller (test_recurse with key_1 as argument).
The second recursive call is to test_recurse with B as argument. This will print test_recurse(B). Since $tmp{B} does not exist, we do not enter the foreach loop and move on to the final print, which prints X(B). We then return to the caller (test_recurse with key_1 as argument).
The foreach loop is now over and we move on to the final print, which prints X(key_1).
Some tips:
Always add use strict and use warnings at the beginning of your scripts.
#{$tmp{'key_1'}} = ("A", "B"); would be clearer as $tmp{'key_1'} = [ 'A', 'B' ].
The whole initialization of %tmp could actually be done with:
my %tmp = (
key_1 => [ 'A', 'B' ],
A => [ 'C' ]
);
You call &main('key_1'); with key_1 as argument, but main does not expect any argument.
To call a function, you don't need &: do test_recurse(\%tmp, 'key_1'); instead of &test_recurse(\%tmp, 'key_1');.
In a comment, you say:
I am just thinking that as if the variable $cell has been replace by C, A, B why the key_1 is coming back at the end.
And I think that's probably a good indication of where the confusion lies.
Your test_recurse() subroutine starts with this line:
my ($hash, $cell) = #_;
That defines two new variables called $hash and $cell and then populates them from #_. Because these variables are declared using my, they are lexical variables. That means they are only visible within the block of code where they are declared.
Later on in test_recurse() you call the same subroutine again. And, once again, that subroutine starts with the same declaration statement and creates another two variables called $hash and $cell. These two new variables are completely separate from the original two variables. The original variables still exist and still contain their original values - but you can't currently access them because they are declared in a different call to the subroutine.
So when your various calls to the subroutine end, you rewind back to the original call - and that still has the original two variables which still hold their original values.

Using perl `my` within actual function arguments

I want to use perl to build a document graph as readably as possible. For re-use of nodes, I want to refer to nodes using variables (or constants, if that is easier). The following code works and illustrates the idea with node types represented by literals or factory function calls to a and b. (For simple demo purposes, the functions do not create nodes but just return a string.)
sub a (#) {
return sprintf "a(%s)", join( ' ', #_ );
}
sub b (#) {
return sprintf "b(%s)", join( ' ', #_ );
}
printf "The document is: %s\n", a(
"declare c=",
$c = 1,
$e = b(
"use",
$c,
"to declare d=",
$d = $c + 1
),
"use the result",
$d,
"and document the procedure",
$e
);
The actual and expected output of this is The document is: a(declare c= 1 b(use 1 to declare d= 2) use the result 2 and document the procedure b(use 1 to declare d= 2)).
My problem arises because I want to use strict in the whole program so that variables like $c, $d, $e must be declared using my. I can, of course, write somewhere close to the top of the text my ( $c, $d, $e );. It would be more efficient at edit-time when I could use the my keyword directly at the first mention of the variable like so:
…
printf "The document is: %s\n", a(
"declare c=",
my $c = 1,
my $e = b(
"use",
$c,
"to declare d=",
my $d = $c + 1
),
"use the result",
$d,
"and document the procedure",
$e
);
This would be kind of my favourite syntax. Unfortunately, this code yields several Global symbol "…" requires explicit package name errors. (Moreover, according to documentation, my does not return anything.)
I have the idea of such use of my from uses like in open my $file, '<', 'filename.txt' or die; or in for ( my $i = 0; $i < 100; ++$i ) {…} where declaration and definition go in one.
Since the nodes in the graph are constants, it is acceptable to use something else than lexical variables. (But I think perl's built-in mechanims are strongest and most efficient for lexical variables, which is why I am inclined into this direction.)
My current idea to solve the issue is to define a function named something like define which behind the scenes would manipulate the current set of lexical variables using PadWalker or similar. Yet this would not allow me to use a natural perl like syntax like $c = 1, which would be my preferred syntax.
I am not certain of the exact need but here's one simple way for similar manipulations.
The example in the OP wants a named variable inside the function call statement itself, so that it can be used later in that statement for another call etc. If you must have it that way then you can use a do block to work out your argument list
func1(
do {
my $x = 5;
my $y = func2($x); # etc
say "Return from the do block what is then passed as arguments...";
$x, $y
}
);
This allows you to do things of the kind that your example indicates.†
If you also want to have names available in the subroutine then pass a hash (or a hashref), with suitably chosen key names for variables, and in the sub work with key names.
Alternatively, consider normally declaring your variables ahead of the function call. There's no bad thing about it while there are many good things. Can throw in a little wrapper and make it look nice, too.
† More specifically
printf "The document is: %s\n", a( do {
my $c = 1;
my $d = $c + 1;
my $e = b( "use", $c, "to declare d=", $d );
# Return a list from this `do`, which is then passed as arguments to a()
"declare c=", $c, $e, "use the result", $d,"and document the procedure", $e
} );
(condensed into fewer lines for posting here)
This do block is a half-way measure toward moving this code into a subroutine, as I presume that there are reasons to want this inlined. However, since comments indicate that the reality is even more complex I'd urge you to write a normal sub instead (in which a graph can be built, btw).
according to documentation, my does not return anything
The documentation doesn't say that, and it's not the case.
Haven't you ever done my $x = 123;? If so, you've assigned to the result of my $x. my simply returns the newly created variable as an lvalue (assignable value), so my $x simply returns $x.
Unfortunately, this code yields several [strict vars] errors.
Symbols (variables) created by my are only visible starting with the following statement.
For better of for worse, it allows the following:
my $x = 123;
{
my $x = $x;
$x *= 2;
say $x; # 246
}
say $x; # 123
I want to use perl to build a document graph as readably as possible.
So why not do that? Right now, you are building a string, not a graph. Build a graph of objects that resolve to a string after the graph has been constructed. You can build those object with a tree of sub calls (declare( c => [ use( c => ... ), ... ] )). I'd give a better example, but the grammar of what you are generating isn't clear to me.
Your argument list makes two references each to $c, $d and $e. If you prefix the first reference with my, it will be out of scope by the time Perl gets around to parsing the second reference it won't be in scope until the next statement, so the second reference would refer to a different variable (which may violate strict vars).
Declare my ($c,$d,$e) before your function call. There is nothing wrong or inelegant about doing that.

Why does this Perl variable keep its value

What is the difference between the following two Perl variable declarations?
my $foo = 'bar' if 0;
my $baz;
$baz = 'qux' if 0;
The difference is significant when these appear at the top of a loop. For example:
use warnings;
use strict;
foreach my $n (0,1){
my $foo = 'bar' if 0;
print defined $foo ? "defined\n" : "undefined\n";
$foo = 'bar';
print defined $foo ? "defined\n" : "undefined\n";
}
print "==\n";
foreach my $m (0,1){
my $baz;
$baz = 'qux' if 0;
print defined $baz ? "defined\n" : "undefined\n";
$baz = 'qux';
print defined $baz ? "defined\n" : "undefined\n";
}
results in
undefined
defined
defined
defined
==
undefined
defined
undefined
defined
It seems that if 0 fails, so foo is never reinitialized to undef. In this case, how does it get declared in the first place?
First, note that my $foo = 'bar' if 0; is documented to be undefined behaviour, meaning it's allowed to do anything including crash. But I'll explain what happens anyway.
my $x has three documented effects:
It declares a symbol at compile-time.
It creates an new variable on execution.
It returns the new variable on execution.
In short, it's suppose to be like Java's Scalar x = new Scalar();, except it returns the variable if used in an expression.
But if it actually worked that way, the following would create 100 variables:
for (1..100) {
my $x = rand();
print "$x\n";
}
This would mean two or three memory allocations per loop iteration for the my alone! A very expensive prospect. Instead, Perl only creates one variable and clears it at the end of the scope. So in reality, my $x actually does the following:
It declares a symbol at compile-time.
It creates the variable at compile-time[1].
It puts a directive on the stack that will clear[2] the variable when the scope is exited.
It returns the new variable on execution.
As such, only one variable is ever created[2]. This is much more CPU-efficient than then creating one every time the scope is entered.
Now consider what happens if you execute a my conditionally, or never at all. By doing so, you are preventing it from placing the directive to clear the variable on the stack, so the variable never loses its value. Obviously, that's not meant to happen, so that's why my ... if ...; isn't allowed.
Some take advantage of the implementation as follows:
sub foo {
my $state if 0;
$state = 5 if !defined($state);
print "$state\n";
++$state;
}
foo(); # 5
foo(); # 6
foo(); # 7
But doing so requires ignoring the documentation forbidding it. The above can be achieved safely using
{
my $state = 5;
sub foo {
print "$state\n";
++$state;
}
}
or
use feature qw( state ); # Or: use 5.010;
sub foo {
state $state = 5;
print "$state\n";
++$state;
}
Notes:
"Variable" can mean a couple of things. I'm not sure which definition is accurate here, but it doesn't matter.
If anything but the sub itself holds a reference to the variable (REFCNT>1) or if variable contains an object, the directive replaces the variable with a new one (on scope exit) instead of clearing the existing one. This allows the following to work as it should:
my #a;
for (...) {
my $x = ...;
push #a, \$x;
}
See ikegami's better answer, probably above.
In the first example, you never define $foo inside the loop because of the conditional, so when you use it, you're referencing and then assigning a value to an implicitly declared global variable. Then, the second time through the loop that outside variable is already defined.
In the second example, $baz is defined inside the block each time the block is executed. So the second time through the loop it is a new, not yet defined, local variable.

Shared variables in the context of subroutines vs anonymous subroutines

I saw this bit of code in an answer to another post: Why would I use Perl anonymous subroutines instead of a named one?, but couldn't figure out exactly what as going on, so I wanted to run it myself.
sub outer
{
my $a = 123;
sub inner
{
print $a, "\n"; #line 15 (for your reference, all other comments are the OP's)
}
# At this point, $a is 123, so this call should always print 123, right?
inner();
$a = 456;
}
outer(); # prints 123
outer(); # prints 456! Surprise!
In the above example, I received a warning: "Variable $a will not stay shared at line 15.
Obviously, this is why the output is "unexpected," but I still don't really understand what's happening here.
sub outer2
{
my $a = 123;
my $inner = sub
{
print $a, "\n";
};
# At this point, $a is 123, and since the anonymous subrotine
# whose reference is stored in $inner closes over $a in the
# "expected" way...
$inner->();
$a = 456;
}
# ...we see the "expected" results
outer2(); # prints 123
outer2(); # prints 123
In the same vein, I don't understand what's happening in this example either. Could someone please explain?
Thanks in advance.
It has to do with compile-time vs. run-time parsing of subroutines. As the diagnostics message says,
When the inner subroutine is called, it will see the value of
the outer subroutine's variable as it was before and during the first
call to the outer subroutine; in this case, after the first call to the
outer subroutine is complete, the inner and outer subroutines will no
longer share a common value for the variable. In other words, the
variable will no longer be shared.
Annotating your code:
sub outer
{
# 'my' will reallocate memory for the scalar variable $a
# every time the 'outer' function is called. That is, the address of
# '$a' will be different in the second call to 'outer' than the first call.
my $a = 123;
# the construction 'sub NAME BLOCK' defines a subroutine once,
# at compile-time.
sub inner1
{
# since this subroutine is only getting compiled once, the '$a' below
# refers to the '$a' that is allocated the first time 'outer' is called
print "inner1: ",$a, "\t", \$a, "\n";
}
# the construction sub BLOCK defines an anonymous subroutine, at run time
# '$inner2' is redefined in every call to 'outer'
my $inner2 = sub {
# this '$a' now refers to '$a' from the current call to outer
print "inner2: ", $a, "\t", \$a, "\n";
};
# At this point, $a is 123, so this call should always print 123, right?
inner1();
$inner2->();
# if this is the first call to 'outer', the definition of 'inner1' still
# holds a reference to this instance of the variable '$a', and this
# variable's memory will not be freed when the subroutine ends.
$a = 456;
}
outer();
outer();
Typical output:
inner1: 123 SCALAR(0x80071f50)
inner2: 123 SCALAR(0x80071f50)
inner1: 456 SCALAR(0x80071f50)
inner2: 123 SCALAR(0x8002bcc8)
You can print \&inner; in the first example (after definition), and print $inner; in second.
What you see are hex code references which are equal in first example and differ in second.
So, in the first example inner gets created only once, and it is always closure to $a lexical variable from the first call of the outer().

Perl - What scopes/closures/environments are producing this behaviour?

Given a root directory I wish to identify the most shallow parent directory of any .svn directory and pom.xml .
To achieve this I defined the following function
use File::Find;
sub firstDirWithFileUnder {
$needle=#_[0];
my $result = 0;
sub wanted {
print "\twanted->result is '$result'\n";
my $dir = "${File::Find::dir}";
if ($_ eq $needle and ((not $result) or length($dir) < length($result))) {
$result=$dir;
print "Setting result: '$result'\n";
}
}
find(\&wanted, #_[1]);
print "Result: '$result'\n";
return $result;
}
..and call it thus:
$svnDir = firstDirWithFileUnder(".svn",$projPath);
print "\tIdentified svn dir:\n\t'$svnDir'\n";
$pomDir = firstDirWithFileUnder("pom.xml",$projPath);
print "\tIdentified pom.xml dir:\n\t'$pomDir'\n";
There are two situations which arise that I cannot explain:
When the search for a .svn is successful, the value of $result perceived inside the nested subroutine wanted persists into the next call of firstDirWithFileUnder. So when the pom search begins, although the line my $result = 0; still exists, the wanted subroutine sees its value as the return value from the last firstDirWithFileUnder call.
If the my $result = 0; line is commented out, then the function still executes properly. This means a) outer scope (firstDirWithFileUnder) can still see the $result variable to be able to return it, and b) print shows that wanted still sees $result value from last time, i.e. it seems to have formed a closure that's persisted beyond the first call of firstDirWithFileUnder.
Can somebody explain what's happening, and suggest how I can properly reset the value of $result to zero upon entering the outer scope?
Using warnings and then diagnostics yields this helpful information, including a solution:
Variable "$needle" will not stay shared at ----- line 12 (#1)
(W closure) An inner (nested) named subroutine is referencing a
lexical variable defined in an outer named subroutine.
When the inner subroutine is called, it will see the value of
the outer subroutine's variable as it was before and during the first
call to the outer subroutine; in this case, after the first call to the
outer subroutine is complete, the inner and outer subroutines will no
longer share a common value for the variable. In other words, the
variable will no longer be shared.
This problem can usually be solved by making the inner subroutine
anonymous, using the sub {} syntax. When inner anonymous subs that
reference variables in outer subroutines are created, they
are automatically rebound to the current values of such variables.
$result is lexically scoped, meaning a brand new variable is allocated every time you call &firstDirWithFileUnder.
sub wanted { ... } is a compile-time subroutine declaration, meaning it is compiled by the Perl interpreter one time and stored in your package's symbol table. Since it contains a reference to the lexically scoped $result variable, the subroutine definition that Perl saves will only refer to the first instance of $result. The second time you call &firstDirWithFileUnder and declare a new $result variable, this will be a completely different variable than the $result inside &wanted.
You'll want to change your sub wanted { ... } declaration to a lexically scoped, anonymous sub:
my $wanted = sub {
print "\twanted->result is '$result'\n";
...
};
and invoke File::Find::find as
find($wanted, $_[1])
Here, $wanted is a run-time declaration for a subroutine, and it gets redefined with the current reference to $result in every separate call to &firstDirWithFileUnder.
Update: This code snippet may prove instructive:
sub foo {
my $foo = 0; # lexical variable
$bar = 0; # global variable
sub compiletime {
print "compile foo is ", ++$foo, " ", \$foo, "\n";
print "compile bar is ", ++$bar, " ", \$bar, "\n";
}
my $runtime = sub {
print "runtime foo is ", ++$foo, " ", \$foo, "\n";
print "runtime bar is ", ++$bar, " ", \$bar, "\n";
};
&compiletime;
&$runtime;
print "----------------\n";
push #baz, \$foo; # explained below
}
&foo for 1..3;
Typical output:
compile foo is 1 SCALAR(0xac18c0)
compile bar is 1 SCALAR(0xac1938)
runtime foo is 2 SCALAR(0xac18c0)
runtime bar is 2 SCALAR(0xac1938)
----------------
compile foo is 3 SCALAR(0xac18c0)
compile bar is 1 SCALAR(0xac1938)
runtime foo is 1 SCALAR(0xa63d18)
runtime bar is 2 SCALAR(0xac1938)
----------------
compile foo is 4 SCALAR(0xac18c0)
compile bar is 1 SCALAR(0xac1938)
runtime foo is 1 SCALAR(0xac1db8)
runtime bar is 2 SCALAR(0xac1938)
----------------
Note that the compile time $foo always refers to the same variable SCALAR(0xac18c0), and that this is also the run time $foo THE FIRST TIME the function is run.
The last line of &foo, push #baz,\$foo is included in this example so that $foo doesn't get garbage collected at the end of &foo. Otherwise, the 2nd and 3rd runtime $foo might point to the same address, even though they refer to different variables (the memory is reallocated each time the variable is declared).