Documentation for inlining of built-ins? - perl

I ran into a situation where I can't inhibit warnings in an intuitive way because perl appears to be inlining a call to a built-in function, e.g.
use strict;
use warnings;
{
no warnings 'substr'; # no effect
foo(substr('123', 4, 6)); # out of range but shouldn't emit a warning
}
sub foo {
my $s = shift; # warning reported here
# do something
}
Running this code results in
substr outside of string at c:\temp\foo.pl line 10.
In order to inhibit the warning I have to move the no warnings 'substr' inside the function.
sub foo {
no warnings 'substr'; # works here, but there's no call to substr
my $s = shift; # no warnings here
# do something
}
I can see that the call to substr is being inlined by passing the code through perl -MO=Terse
LISTOP (0x27dcaa8) leave [1]
OP (0x27a402c) enter
COP (0x27dcac8) nextstate
BINOP (0x27dcb00) leaveloop
LOOP (0x27dcb20) enterloop
LISTOP (0x27dcb68) lineseq
COP (0x27dcb88) nextstate
UNOP (0x27dcbc0) entersub [5] # entry point for foo
UNOP (0x27dcbf4) null [148]
OP (0x27dcbdc) pushmark
LISTOP (0x27dcc48) substr [4] # substr gets called here
OP (0x27dcc30) null [3]
SVOP (0x27dcc84) const [6] PV (0x2319944) "123"
SVOP (0x27dcc68) const [7] IV (0x2319904) 4
SVOP (0x27dcc14) const [8] IV (0x231944c) 6
UNOP (0x27dcca0) null [17]
PADOP (0x27dccf4) gv GV (0x2318e5c) *foo
Is this optimizer behavior documented anywhere? perlsub only mentions inlining of constant functions. Given that the warning is being reported on the wrong line and that no warnings isn't working in the lexical scope where the call is being made I'm inclined to report this as a bug, although I can't think of how it could reasonably be fixed while preserving the optimization.
Note: This behavior was observed under Perl 5.16.1.

This is a documented behaviour (in perldiag):
substr outside of string
(W substr),(F) You tried to reference a substr() that pointed
outside of a string. That is, the absolute value of the offset was
larger than the length of the string. See "substr" in perlfunc.
This warning is fatal if substr is used in an lvalue context (as
the left hand side of an assignment or as a subroutine argument for
example).
Emphasis mine.
Changing the call to
foo(my $o = substr('123', 4, 6));
makes the warnings disappear.
Moving the no warnings into the sub doesn't change the behaviour for me. What Perl version do you have? (5.14.4 here).
The code I used for testing:
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;
print 1, foo(my $s1 = substr('abc', 4, 6));
print 2, bar(my $s2 = substr('def', 4, 6));
{
no warnings 'substr';
print 3, foo(my $s3 = substr('ghi', 4, 6));
print 4, bar(my $s4 = substr('jkl', 4, 6));
print 5, bar(substr('mno', 4, 6)); # Stops here, reports line 12.
print 6, foo(substr('pqr', 4, 6));
}
print "ok\n";
sub foo {
my $s = shift;
}
sub bar {
no warnings 'substr';
my $s = shift;
}
Update:
I'm getting the same behaviour in 5.10.1, but in 5.20.1, the behaviour is as you described.

As you saw from B::Terse, the substr is not inlined.
$ perl -MO=Concise,-exec -e'f(substr($_, 3, 4))'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <0> pushmark s
4 <#> gvsv[*_] s
5 <$> const[IV 3] s
6 <$> const[IV 4] s
7 <#> substr[t4] sKM/3 <-- The substr operator is evaluated first.
8 <#> gv[*f] s/EARLYCV
9 <1> entersub[t5] vKS/TARG <-- The sub call second.
a <#> leave[1 ref] vKP/REFC
-e syntax OK
When substr is called in lvalue context, it returns a magical scalar that contains the operands passed to substr.
$ perl -MDevel::Peek -e'$_ = "abcdef"; Dump(${\ substr($_, 3, 4) })'
SV = PVLV(0x2865d60) at 0x283fbd8
REFCNT = 2
FLAGS = (GMG,SMG) <--- Gets and sets are magical.
IV = 0 GMG: A function that mods the scalar
NV = 0 is called before fetches.
PV = 0 SMG: A function is called after the
MAGIC = 0x2856810 scalar is modified.
MG_VIRTUAL = &PL_vtbl_substr
MG_TYPE = PERL_MAGIC_substr(x)
TYPE = x
TARGOFF = 3 <--- substr's second arg
TARGLEN = 4 <--- substr's third arg
TARG = 0x287bfd0 <--- substr's first arg
FLAGS = 0
SV = PV(0x28407f0) at 0x287bfd0 <--- A dump of substr's first arg
REFCNT = 2
FLAGS = (POK,IsCOW,pPOK)
PV = 0x2865d20 "abcdef"\0
CUR = 6
LEN = 10
COW_REFCNT = 1
Subroutine arguments are evaluated in lvalue context because subroutine arguments are always passed by reference in Perl[1].
$ perl -E'sub f { $_[0] = "def"; } $x = "abc"; f($x); say $x;'
def
The substring operation happens when the magical scalar is accessed.
$ perl -E'$x = "abc"; $r = \substr($x, 0, 1); $x = "def"; say $$r;'
d
This is done to allow substr(...) = "abc";
This is probably documented using language similar to the following: "The elements of @_ are aliased to the subroutine arguments."
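The two points above (arguments are aliased through @_, and an argument-position substr is evaluated lazily as an lvalue) can be sketched as follows. The sub names overwrite_first and first_arg are made up for illustration, and the workaround in the second half is simply the "copy into a variable first" idea from the earlier answer, not an official recommendation.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so the effect is observable.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Hypothetical helper: writes through the alias in $_[0].
sub overwrite_first { $_[0] = "XY" }

# Hypothetical helper: mirrors foo() from the question.
sub first_arg { my $s = shift; return $s }

# 1. substr in argument position yields an lvalue, so assigning to
#    $_[0] inside the sub modifies the original string.
my $str = "abcdef";
overwrite_first(substr($str, 1, 2));

# 2. Copying the result first evaluates substr in rvalue context,
#    inside the lexical scope where the warning is disabled.
{
    no warnings 'substr';
    my $copy = substr('123', 4, 6);   # out of range, yields undef
    first_arg($copy);
}
my $warning_count = scalar @warnings;
```

With the copy in place, the out-of-range access happens where `no warnings 'substr'` is in effect, so nothing is deferred into the callee.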

Perl's odd behavior of index() when called with empty substr with vs without Encode::decode()

I have created a very small sample code below to illustrate how Perl's index() function's return value changes for empty substr ("") on string that is passed or not passed through Encode::decode().
use strict;
use Encode;
my $mainString = (@ARGV >= 2) ? $ARGV[1] : "abc";
my $subString = (@ARGV >= 3) ? $ARGV[2] : "";
if (@ARGV >= 1) {
$mainString = Encode::decode("utf8", $mainString);
}
my $position = index($mainString, $subString, 0);
my $loopCount = 0;
my $stopLoop = 7; # It goes for ever so set a stopping value
while ($position >= 0) {
if ($loopCount >= $stopLoop) {
last;
}
$loopCount++;
print "[$loopCount]: $position \"$mainString\" [".length($mainString)."] ($subString)\n";
$position = index($mainString, $subString, $position + 1);
}
Before getting into the difference with and without Encode::decode(): what should index() return for an empty substring ("")? Perl's documentation does not say. Here is the execution result without calling Encode::decode() for the ASCII string "abc" (@ARGV = 0):
>perl StringIndex.pl
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 3 "abc" [3] ()
[6]: 3 "abc" [3] ()
[7]: 3 "abc" [3] ()
However when encoding is involved, the return value changes. The return value changes as if the string being searched is not bounded by its length when called with Encode::decode() for ASCII characters "abc" ($ARGV[0] = 1):
>perl StringIndex.pl 1
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 4 "abc" [3] ()
[6]: 5 "abc" [3] ()
[7]: 6 "abc" [3] ()
Side notes:
The substring is set to the empty string ("") in the example above, but in my real program it is a variable whose value depends on a condition.
I understand the simplest solution is to check whether the substring is empty and skip the while loop.
I am using "This is perl 5, version 28, subversion 1 (v5.28.1) built for MSWin32-x64-multi-thread".
This would be considered a bug, which I've reported here.
Minimal code to reproduce:
use strict;
use warnings;
no warnings qw( void );
use feature qw( say );
my $s = "abc";
my $len = length($s);
utf8::upgrade($s);
length($s) if $ARGV[0];
say index($s, "", $len+1);
$ perl a.pl 0
3
$ perl a.pl 1
4
Perl has two string storage formats: the "upgraded" format and the "downgraded" format.
Encode::decode always returns an upgraded string, and utf8::upgrade tells Perl to switch the storage format used by a scalar.
Each character of a downgraded string can store a number between 0 and 255. Each character of the string is stored as a byte of the appropriate value. This, of course, is fine if you have bytes or ASCII text. But this is insufficient for arbitrary text.
Each character of an upgraded string can store a number between 0 and 2^32-1 or between 0 and 2^64-1, depending on how your Perl was compiled. This is more than enough to store any Unicode Code Point (even those outside the BMP). Each character is encoded using "utf8", a nonstandard extension of UTF-8.
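As a rough illustration of the two storage formats, the sketch below upgrades a string and compares its character length with its internal byte length. The variable names are arbitrary; utf8::is_utf8, utf8::upgrade and the bytes pragma are standard, but peeking at internals like this is meant only for demonstration.

```perl
use strict;
use warnings;

# "caf\xE9" holds four characters, all below 256, so it starts downgraded.
my $text = "caf\xE9";

my $flag_before  = utf8::is_utf8($text) ? 1 : 0;
my $bytes_before = do { use bytes; length($text) };   # internal byte count

# Switch the storage format; the string's value does not change.
utf8::upgrade($text);

my $flag_after  = utf8::is_utf8($text) ? 1 : 0;
my $chars_after = length($text);                       # still characters
my $bytes_after = do { use bytes; length($text) };     # \xE9 now takes 2 bytes
```

The character count stays the same while the internal byte count grows, which is why character positions no longer map directly to byte offsets.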
utf8 (like UTF-8) is variable-length encoding. This presents two problems:
Determining the length of an upgraded string requires iterating over the entire string.
Determining the position of characters in an upgraded string requires iterating over the entire string.
Let's consider the following snippet:
index($str, $substr, $pos)
With a downgraded string, index can jump directly to the position indicated by $pos. It's a question of simple pointer arithmetic.
But because each character of an upgraded string can require a different amount of storage, index can't use pointer arithmetic to find the character at position $pos. Without optimizations, each call to index would have to start at offset 0 and move through the string until it finds the character indicated by $pos.
That would be unfortunate. Imagine if index was being used in a loop to find all matches. So Perl optimizes this! When the length of an upgraded string becomes known, Perl attaches it to the scalar.
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
length($s);
Dump($s);
'
SV = PV(0x56483dda7e80) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x56483de0ecf0) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x56483ddd4050
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 3 <-- Attached length
Similarly, the offset of characters is sometimes attached to the scalar as well!
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
index($s, "", 2);
Dump($s);
'
SV = PV(0x558d5c970e80) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x558d5c9d7d10) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x558d5c9af690
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = -1
MG_PTR = 0x558d5c99cb80
0: 2 -> 2 <-- Attached character offset
1: 0 -> 0 <-- Attached character offset
The difference in behaviour is due to different code paths being exercised depending on the string format and on what information is cached.
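Until the bug is fixed, the workaround mentioned in the question (skip the search when the substring is empty) might look like the sketch below. find_all_positions is a hypothetical helper name, and returning an empty list for an empty needle is a design choice of this sketch, not something index itself prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every position at which $needle occurs
# in $haystack. An empty or undefined needle yields no matches, which
# avoids calling index() with "" at all.
sub find_all_positions {
    my ($haystack, $needle) = @_;
    return () if !defined $needle || $needle eq '';

    my @positions;
    my $pos = index($haystack, $needle);
    while ($pos >= 0) {
        push @positions, $pos;
        $pos = index($haystack, $needle, $pos + 1);
    }
    return @positions;
}

my @hits = find_all_positions("abcabc", "bc");
my @none = find_all_positions("abc", "");
```

Because the loop only runs for a non-empty needle, index eventually returns -1 and the loop terminates regardless of the string's storage format.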

Why is "keys ::" not a syntax error?

I tried the following one-liner more out of curiosity than anything and was surprised that it actually worked without the % sigil.
$ perl -E 'say for keys ::'
It works on both versions 5.8.8 and 5.16.3; though the latter version emits this warning:
Hash %:: missing the % in argument of keys() at -e line 1.
How does this even work? What is so special about %:: that allows it to run and print its keys, even without the sigil?
Note that the keys do not get printed with %main::.
$ perl -E 'say for keys main::'
Hash main:: missing the % in argument 1 of keys() at -e line 1.
TL;DR
:: isn't special; prior to Perl 5.22.0, you can omit the % and pass any identifier to keys.
However:
keys main:: is equivalent to keys %{'main'} or just keys %main
keys :: is equivalent to keys %{'::'} or just keys %::.
Note that %main:: (but not %main) is an alias for %::.
The relevant code is in toke.c (the following is from 5.8.8):
/* Look for a subroutine with this name in current package,
unless name is "Foo::", in which case Foo is a bearword
(and a package name). */
if (len > 2 &&
PL_tokenbuf[len - 2] == ':' && PL_tokenbuf[len - 1] == ':')
{
if (ckWARN(WARN_BAREWORD) && ! gv_fetchpv(PL_tokenbuf, FALSE, SVt_PVHV))
Perl_warner(aTHX_ packWARN(WARN_BAREWORD),
"Bareword \"%s\" refers to nonexistent package",
PL_tokenbuf);
len -= 2;
PL_tokenbuf[len] = '\0';
gv = Nullgv;
gvp = 0;
}
else {
len = 0;
if (!gv)
gv = gv_fetchpv(PL_tokenbuf, FALSE, SVt_PVCV);
}
/* if we saw a global override before, get the right name */
if (gvp) {
sv = newSVpvn("CORE::GLOBAL::",14);
sv_catpv(sv,PL_tokenbuf);
}
else {
/* If len is 0, newSVpv does strlen(), which is correct.
If len is non-zero, then it will be the true length,
and so the scalar will be created correctly. */
sv = newSVpv(PL_tokenbuf,len);
}
len is the length of the current token.
If the token is main::, a new scalar is created with its PV (string component) set to main.
If the token is ::, a typeglob is fetched with gv_fetchpv.
gv_fetchpv lives in gv.c and has special logic for handling :::
if (*namend == ':')
namend++;
namend++;
name = namend;
if (!*name)
return gv ? gv : (GV*)*hv_fetch(PL_defstash, "main::", 6, TRUE);
This fetches the typeglob stored in the default stash under key main:: (i.e. typeglob *main::).
Finally, keys expects its argument to be a hash, but if you pass it an identifier, it treats it as the name of a hash. See Perl_ck_fun in op.c:
case OA_HVREF:
if (kid->op_type == OP_CONST &&
(kid->op_private & OPpCONST_BARE))
{
char *name = SvPVx(((SVOP*)kid)->op_sv, n_a);
OP * const newop = newHVREF(newGVOP(OP_GV, 0,
gv_fetchpv(name, TRUE, SVt_PVHV) ));
if (ckWARN2(WARN_DEPRECATED, WARN_SYNTAX))
Perl_warner(aTHX_ packWARN2(WARN_DEPRECATED, WARN_SYNTAX),
"Hash %%%s missing the %% in argument %"IVdf" of %s()",
name, (IV)numargs, PL_op_desc[type]);
op_free(kid);
kid = newop;
kid->op_sibling = sibl;
*tokid = kid;
}
else if (kid->op_type != OP_RV2HV && kid->op_type != OP_PADHV)
bad_type(numargs, "hash", PL_op_desc[type], kid);
mod(kid, type);
break;
This works for things other than ::, too:
$ perl -e'%h = (foo => "bar"); print for keys h'
foo
(As of 5.22.0, you're no longer allowed to omit the % sigil.)
You can also see this with B::Concise:
$ perl -MO=Concise -e'keys main::'
Hash %main missing the % in argument 1 of keys() at -e line 1.
6 <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
5 <1> keys[t2] vK/1 ->6
4 <1> rv2hv[t1] lKRM/1 ->5
3 <$> gv(*main) s ->4
-e syntax OK
$ perl -MO=Concise -e'keys ::'
Hash %:: missing the % in argument 1 of keys() at -e line 1.
6 <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
5 <1> keys[t2] vK/1 ->6
4 <1> rv2hv[t1] lKRM/1 ->5
3 <$> gv(*main::) s ->4
-e syntax OK
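On any Perl version, including 5.22 and later where the bare form is rejected, the same information is available by writing the sigil explicitly. The following is a small sketch, with arbitrary variable names, that checks that %:: and %main:: are the same hash and that the main stash contains an entry for itself.

```perl
use strict;
use warnings;

# %:: and %main:: name the same symbol table hash, so references
# to them compare equal.
my $same_hash = ( (\%::) == (\%main::) ) ? 1 : 0;

# The main stash contains an entry for itself under the key "main::".
my $has_self_entry = exists $::{'main::'} ? 1 : 0;

# With the sigil written out, keys works as usual.
my @symbol_names = sort keys %::;
```

Here @symbol_names holds what `keys ::` printed on the older versions, sorted for readability.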
Using:
perl -MO=Deparse -E 'say for keys ::'
Says:
use feature 'current_sub', 'evalbytes', 'fc', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
say $_ foreach (keys %main::);
So it treats :: as %:: in these perl versions without a strict

Trouble with shift and dereference operator

I have a question regarding how the left and right sides of the -> operator are evaluated. Consider the following code:
#! /usr/bin/perl
use strict;
use warnings;
use feature ':5.10';
$, = ': ';
$" = ', ';
my $sub = sub { "@_" };
sub u { shift->(@_) }
sub v { my $s = shift; $s->(@_) }
say 'u', u($sub, 'foo', 'bar');
say 'v', v($sub, 'foo', 'bar');
Output:
u: CODE(0x324718), foo, bar
v: foo, bar
I expect u and v to behave identically but they don't. I always assumed perl evaluated things left to right in these situations. Code like shift->another_method(@_) and even shift->another_method(shift, 'stuff', @_) is pretty common.
Why does this break if the first argument happens to be a code reference? Am I on undefined / undocumented territory here?
The operand evaluation order of ->() is undocumented. It happens to evaluate the arguments before the LHS (lines 3-4 and 5 respectively below).
>perl -MO=Concise,u,-exec a.pl
main::u:
1 <;> nextstate(main 51 a.pl:11) v:%,*,&,x*,x&,x$,$,469762048
2 <0> pushmark s
3 <#> gv[*_] s
4 <1> rv2av[t2] lKM/3
5 <0> shift s*
6 <1> entersub[t3] KS/TARG,2
7 <1> leavesub[1 ref] K/REFC,1
a.pl syntax OK
Both using and modifying a variable in the same expression can be dangerous. It's best to avoid it unless you can explain the following:
>perl -E"$i=5; say $i,++$i,$i"
666
You could use
$_[0]->(@_[1..$#_])
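Both safe forms can be compared side by side in a short sketch. The sub names call_explicit and call_sliced are invented for this example; neither relies on the undocumented evaluation order, because @_ is never modified and read in the same expression.

```perl
use strict;
use warnings;

my $joiner = sub { join ',', @_ };

# Remove the code ref in its own statement, then call it.
sub call_explicit {
    my $code = shift;
    return $code->(@_);
}

# Leave @_ untouched and pass everything after the first element.
sub call_sliced {
    return $_[0]->(@_[1 .. $#_]);
}

my $explicit_result = call_explicit($joiner, 'foo', 'bar');
my $sliced_result   = call_sliced($joiner, 'foo', 'bar');
```

Both calls pass only 'foo' and 'bar' to the code reference.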

Explanation for 'uninitialized value' warning

Why does perl -we '$c = $c+3' raise
Use of uninitialized value $c in addition (+) at -e line 1.
while perl -we '$c += 3' doesn't complain about an uninitialized value?
UPDATE
Does documentation or some book like 'Perl best practices' mention such behavior?
I think perldoc perlop has a little explanation:
Assignment Operators
"=" is the ordinary assignment operator.
Assignment operators work as in C. That is,
$a += 2;
is equivalent to
$a = $a + 2;
although without duplicating any side effects that dereferencing the
lvalue might trigger, such as from tie()
With B::Concise helper, we can see the trick:
$ perl -MO=Concise,-exec -e '$c += 3'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <#> gvsv[*c] s
4 <$> const[IV 3] s
5 <2> add[t2] vKS/2
6 <#> leave[1 ref] vKP/REFC
-e syntax OK
$ perl -MO=Concise,-exec -e '$c = $c + 3'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <#> gvsv[*c] s
4 <$> const[IV 3] s
5 <2> add[t3] sK/2
6 <#> gvsv[*c] s
7 <2> sassign vKS/2
8 <#> leave[1 ref] vKP/REFC
-e syntax OK
Update
After searching in perldoc, I saw that this problem had been documented in perlsyn:
Declarations
The only things you need to declare in Perl are report formats and subroutines (and sometimes not even subroutines). A variable
holds the undefined value ("undef") until it has been assigned a defined value, which is anything other than "undef". When used as a
number, "undef" is treated as 0; when used as a string, it is treated as the empty string, ""; and when used as a reference that
isn't being assigned to, it is treated as an error. If you enable warnings, you'll be notified of an uninitialized value whenever
you treat "undef" as a string or a number. Well, usually. Boolean contexts, such as:
my $a;
if ($a) {}
are exempt from warnings (because they care about truth rather than definedness). Operators such as "++", "--", "+=", "-=", and
".=", that operate on undefined left values such as:
my $a;
$a++;
are also always exempt from such warnings.
Because it makes sense for addition to warn when adding things other than numbers, but it's very convenient for += not to warn for undefined values.
As Gnouc found, this is documented in perlsyn:
Operators such as ++ , -- , += , -= , and .= , that operate on undefined variables such as:
undef $a;
$a++;
are also always exempt from such warnings.
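The exemption can be observed directly by capturing warnings in a handler. This is a minimal sketch with arbitrary variable names; it installs a $SIG{__WARN__} handler purely to count warnings.

```perl
use strict;
use warnings;

# Collect warnings so the two forms can be compared.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my ($plain, $compound);

$plain = $plain + 3;                 # ordinary addition on undef: warns
my $count_after_plain = scalar @warnings;

$compound += 3;                      # compound assignment: exempt
my $count_after_compound = scalar @warnings;
```

Both variables end up holding 3; only the first statement adds an entry to @warnings.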

In Perl, is there any difference between direct glob aliasing and aliasing via the stash?

In Perl, is there ever any difference between the following two constructs:
*main::foo = *main::bar
and
$main::{foo} = $main::{bar}
They appear to have the same function (aliasing all of the slots in *main::foo to those defined in *main::bar), but I am just wondering if this equivalency always holds.
Maybe not the kind of difference you were looking for, but there are two big differences between *main::foo and $main::{foo}; the former looks up the glob in the stash at compile time, creating it if necessary, while the latter looks for the glob in the stash at run time, and won't create it.
This may make a difference to anything else poking about in the stash, and it certainly can affect whether you get a used only once warning.
The following script:
#!/usr/bin/env perl
#mytest.pl
no warnings;
$bar = "this";
@bar = qw/ 1 2 3 4 5 /;
%bar = qw/ key value /;
open bar, '<', 'mytest.pl' or die $!;
sub bar {
return "Sub defined as 'bar()'";
}
$main::{foo} = $main::{bar};
print "The scalar \$foo holds $foo\n";
print "The array \@foo holds @foo\n";
print "The hash \%foo holds ", %foo, "\n";
my $line = <foo>;
print "The filehandle 'foo' is reads ", $line;
print 'The function foo() replies "', foo(), "\"\n";
Outputs:
The scalar $foo holds this
The array @foo holds 1 2 3 4 5
The hash %foo holds keyvalue
The filehandle 'foo' is reads #!/usr/bin/env perl
The function foo() replies "Sub defined as 'bar()'"
So if *main::foo = *main::bar; doesn't do the same thing as $main::{foo} = $main::{bar};, I'm at a loss as to how to detect a practical difference. ;) However, from a syntax perspective, there may be situations where it's easier to use one method versus another. ...the usual warnings about mucking around in the symbol table always apply.
Accessing the stash as $A::{foo} = $obj allows you to place anything in the symbol table, while *A::foo = $obj places $obj in the slot of the typeglob that matches $obj's type.
For example:
DB<1> $ST::{foo} = [1,2,3]
DB<2> *ST::bar = [1,2,3]
DB<3> x @ST::foo
Cannot convert a reference to ARRAY to typeglob at (eval 7)[/usr/local/perl/blead-debug/lib/5.15.0/perl5db.pl:646] line 2.
at (eval 7)[/usr/local/perl/blead-debug/lib/5.15.0/perl5db.pl:646] line 2
eval '($@, $!, $^E, $,, $/, $\\, $^W) = @saved;package main; $^D = $^D | $DB::db_stop;
@ST::foo;
;' called at /usr/local/perl/blead-debug/lib/5.15.0/perl5db.pl line 646
DB::eval called at /usr/local/perl/blead-debug/lib/5.15.0/perl5db.pl line 3442
DB::DB called at -e line 1
DB<4> x @ST::bar
0 1
1 2
2 3
DB<5> x \%ST::
0 HASH(0x1d55810)
'bar' => *ST::bar
'foo' => ARRAY(0x1923e30)
0 1
1 2
2 3
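The same contrast can be reproduced outside the debugger. The sketch below uses a made-up package name, Demo, and deliberately avoids touching @Demo::foo, since that is the access that dies in the session above.

```perl
use strict;
use warnings;

# Glob assignment: the array reference goes into the ARRAY slot of
# *Demo::bar, so @Demo::bar becomes usable.
*Demo::bar = [1, 2, 3];
my @via_glob = @Demo::bar;

# Stash assignment: the reference is stored as the stash entry itself,
# with no typeglob wrapped around it.
$Demo::{foo} = [1, 2, 3];

my $bar_entry_type = ref(\$Demo::{bar});   # a real typeglob
my $foo_entry_type = ref($Demo::{foo});    # the raw array reference
```

So *Demo::bar ends up as an ordinary glob with a populated array slot, while the foo entry is just a reference sitting where a glob would normally be.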
See also "Scalars vs globs (*{} should not return FAKE globs)"
https://github.com/perl/perl5/issues/10625