Find the C APIs used for perl script - perl

Trying to understand the C code that's behind a perl script. For example, the following contrived code:
$name = "john";
$greeting = "hi $name, how old are you?";
if ($greeting =~ /hi (\S+)/) {
$b = $1;
print "got $b as expected\n";
}
Would like to know how the variable $name is substituted in $greeting string, also would like to know what c API is used for the regular expression match.
I heard something like perl -MO=Bytecode,-H test.pl where test.pl has the above content, but the output is bindary.

There isn't a direct mapping of Perl code to C code. Instead, Perl is a bytecode compiler. What you can get is the bytecode, the tree of opcodes. There's several modules to get this in a human readable form, one is B::Concise.
perl -MO=Concise test.pl
Produces this...
w <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 test.plx:1) v:{ ->3
5 <2> sassign vKS/2 ->6
3 <$> const[PV "john"] s ->4
- <1> ex-rv2sv sKRM*/1 ->5
4 <#> gvsv[*name] s ->5
6 <;> nextstate(main 1 test.plx:2) v:{ ->7
d <2> sassign vKS/2 ->e
- <1> ex-stringify sK/1 ->c
- <0> ex-pushmark s ->7
b <2> concat[t5] sKS/2 ->c
9 <2> concat[t4] sK/2 ->a
7 <$> const[PV "hi "] s ->8
- <1> ex-rv2sv sK/1 ->9
8 <#> gvsv[*name] s ->9
a <$> const[PV ", how old are you?"] s ->b
- <1> ex-rv2sv sKRM*/1 ->d
c <#> gvsv[*greeting] s ->d
The documentation for B::Concise explains all this. This tells you the operator sequence, type, name, flags, and the next op in the sequence. For example...
7 <$> const[PV "hi "] s ->8
This is operator 7, it is an SVOP (it applies to scalars), its name is "const" and it's for the scalar string (PV) "hi ", it's in scalar context, and the next operator is 8.
More about operators can be learned from perlguts and the Illustrated Perl Guts and by poking around in the Perl source code. Each operator has a C function associated with it called pp_OPNAME so to find the "const" operator look for pp_const.
The Perl regular expression engine is completely custom and has its own perlreguts documentation.

Related

Perl increment operator

$a = 10;
$b = (++$a) + (++$a) + (++$a);
print $b;
I am getting the answer 37.
Can anybody explain how this operation is proceeding and how the result is getting 37.
As per my logic it should be 36:
(++$a) + (++$a) + (++$a)
11 + 12 + 13 = 36
But I am getting the answer 37
Perl's is executing this as
( ( $a = $a + 1 ) + ( $a = $a + 1 ) ) + ( $a = $a + 1 )
You have even put the ++$a in parentheses so to say that they should happen first, before the additions, although they are of higher priority anyway
This is centred around the fact that the assignment operator = returns its first operand, which allows operations like
(my $x = $y) =~ tr/A-Z/a-z/
If the result of the assignment were simply the value copied from $y to $x then the tr/// would cause a Can't modify a constant item or the equivalent, and it would have no effect on what was stored in either variable
Here is the variable $a, and the execution is as follows
Execute the first increment, returning $a
$a is now 11
Execute the second increment, returning $a again
$a is now 12
Execute the first addition, which adds what was returned by the two increments—both $a
$a is 12, so $a + $a is 24
Execute the third increment, returning $a again
$a is now 13
Execute the second addition, which adds the what was returned by the first addition (24) and the third increment ($a)
$a is 13, so 24 + $a is 37
Note that this should not be relied on. It is not documented anywhere except to say that it us undefined, and the behaviour could change with any release of Perl
As a complement to mob and Borodin's answer, you can see what's happening clearly if you think about how the operations are interacting with the stack and recognize that preinc returns the variable, not its value.
op | a's value | stack
$a | 10 | $a
++ | 11 | $a
$a | 11 | $a $a
++ | 12 | $a $a
+ | 12 | 24
$a | 12 | 24 $a
++ | 13 | 24 $a
+ | 13 | 37
As it has been noted in comments, changing a variable multiple times within a single statement leads to undefined behavior, as explained in perlop.
So the exact behavior is not specified and may vary between versions and implementations.
As to how it works out, here is one way to see it. Since + is a binary operator, at each operation its left-hand-side operand does get involved when ++ is executed on the other. So at each position $a gets ++ed, and picks up another increment as a LHS operand.
That means that the LHS $a gets incremented additionally (to its ++) once in each + operation. The + operations after the first one must accumulate these, one extra for each extra term. With three terms here that's another +3, once. So there are altogether 7 increments.
Yet another (fourth) term incurs an extra +4, etc
perl -wE'$x=10; $y = ++$x + ++$x + ++$x + ++$x; say $y' # 4*10 + 2+2+3+4
This is interesting to tweak by changing ++$x to $x++ -- the effect depends on position.
Increments in steps
first $a gets incremented (to 11)
in the first addition, as the second $a is incremented (to 11) the first one gets a bump as well being an operand (to 12)
in the second addition, the second $a gets incremented (to 12) as an operand
as the second addition comes up, the third $a is updated and thus picks up increments from both additions, plus its increment (to 13)
The enumeration of $a above refers to their presence at multiple places in the statement.
As #Håkon Hægland pointed out, running this code under B::Concise, which outputs the opcodes that the Perl script generates, is illuminating. Here's are two slightly different examples than the one you provided:
$ perl -E 'say $b=$a + ((++$a)+(++$a))'
6
$ perl -E 'say $b=($a+(++$a)) + (++$a)'
4
So what's going on here? Let's look at the opcodes:
$ perl -MO=Concise -E 'say $b=$a+((++$a)+(++$a))'
e <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 47 -e:1) v:%,{,469764096 ->3
d <#> say vK ->e
3 <0> pushmark s ->4
c <2> sassign sKS/2 ->d
a <2> add[t6] sK/2 ->b
- <1> ex-rv2sv sK/1 ->5
4 <#> gvsv[*a] s ->5
9 <2> add[t5] sKP/2 ->a
6 <1> preinc sKP/1 ->7
- <1> ex-rv2sv sKRM/1 ->6
5 <#> gvsv[*a] s ->6
8 <1> preinc sKP/1 ->9
- <1> ex-rv2sv sKRM/1 ->8
7 <#> gvsv[*a] s ->8
- <1> ex-rv2sv sKRM*/1 ->c
b <#> gvsv[*b] s ->c
-e syntax OK
There are no conditionals in this program. The left most column indicates the order of operations in this program. Whereever you see the ex-rv2sv token, that is where Perl is reading the value of an expression like a global scalar variable.
The preinc operations occur at labels 6 and 8. The add operations occur at labels 9 and a. This tells us that both increments occurred before Perl performed the additions, and so the final expression would be something like 2 + (2 + 2) = 6.
In the other example, the opcodes look like
$ perl -MO=Concise -E 'say $b=($a+(++$a)) + (++$a)'
e <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 47 -e:1) v:%,{,469764096 ->3
d <#> say vK ->e
3 <0> pushmark s ->4
c <2> sassign sKS/2 ->d
a <2> add[t6] sK/2 ->b
7 <2> add[t4] sKP/2 ->8
- <1> ex-rv2sv sK/1 ->5
4 <#> gvsv[*a] s ->5
6 <1> preinc sKP/1 ->7
- <1> ex-rv2sv sKRM/1 ->6
5 <#> gvsv[*a] s ->6
9 <1> preinc sKP/1 ->a
- <1> ex-rv2sv sKRM/1 ->9
8 <#> gvsv[*a] s ->9
- <1> ex-rv2sv sKRM*/1 ->c
b <#> gvsv[*b] s ->c
-e syntax OK
Now the preinc operations still occur at 6 and 9, but there is an add operation at label 7, after $a has only be incremented one time. This makes the values used in the final expression (1 + 1) + 2 = 4.
So in your example:
$ perl -MO=Concise -E '$a=10;$b=(++$a)+(++$a)+(++$a);say $b'
l <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 47 -e:1) v:%,{,469764096 ->3
5 <2> sassign vKS/2 ->6
3 <$> const[IV 10] s ->4
- <1> ex-rv2sv sKRM*/1 ->5
4 <#> gvsv[*a] s ->5
6 <;> nextstate(main 47 -e:1) v:%,{,469764096 ->7
g <2> sassign vKS/2 ->h
e <2> add[t7] sK/2 ->f
b <2> add[t5] sK/2 ->c
8 <1> preinc sKP/1 ->9
- <1> ex-rv2sv sKRM/1 ->8
7 <#> gvsv[*a] s ->8
a <1> preinc sKP/1 ->b
- <1> ex-rv2sv sKRM/1 ->a
9 <#> gvsv[*a] s ->a
d <1> preinc sKP/1 ->e
- <1> ex-rv2sv sKRM/1 ->d
c <#> gvsv[*a] s ->d
- <1> ex-rv2sv sKRM*/1 ->g
f <#> gvsv[*b] s ->g
h <;> nextstate(main 47 -e:1) v:%,{,469764096 ->i
k <#> say vK ->l
i <0> pushmark s ->j
- <1> ex-rv2sv sK/1 ->k
j <#> gvsv[*b] s ->k
-e syntax OK
We see preinc occurring at labels 8, a, and d. The add operations occur at b and e. That is, $a is incremented twice, then two $a's are added together. Then $a is incremented again. Then $a is added to the result. So the output is (12 + 12) + 13 = 37.

perl string interpolation when using the package delimiter

Am fighting some legacy perl looking like the following:
sub UNIVERSAL::has_sub_class {
my ($package,$class) = #_;
my $all = all_packages();
print "$package - $class", "\n";
print "$package::$class", "\n";
return exists $all->{"$package::$class"};
}
On two different systems, two different PERL installations / versions, this code behaves differently, i.e. the "$package::$class" construct is properly resolved to the correct package name on one system, but not on the other.
The following different print outputs can be seen when running has_sub_class on the two different systems:
# print output on system 1 (perl v5.8.6):
webmars::parameter=HASH(0xee93d0) - webmars::parameter::date
webmars::parameter::date
# print output on system 2 (perl v5.18.1):
webmars::parameter=HASH(0x251c500) - webmars::parameter::date
webmars::parameter=HASH(0x251c500)::webmars::parameter::date
Has there been any string interpolation changes in between perl v5.8.6 and perl v5.18.1 that you know might cause this behaviour ? Or should I look somewhere else ? I've really tried google-ing around and reading through the perl change notes, but could not find anything of interest.
With my limited knowledge of perl, I've tried getting the smallest piece of code that could reproduce the problem I am having. I have come up with the following which I hope is relevant:
# system 1 (perl v5.8.6):
$ perl -e 'my %x=(),$x=bless(\%x),$y='bar';print "$x::$y\n";'
bar
# system 2 (perl v5.18.1):
$ perl -e 'my %x=(),$x=bless(\%x),$y='bar';print "$x::$y\n";'
main=HASH(0xec0ce0)::bar
The outputs are different ! Any ideas ?
Shorter demonstration:
($x::, $x) = (1,2); print "$x::$x"
$ perl5.16.3 -e '($x::, $x) = (1,2); print "$x::$x"'
12
$ perl5.18.1 -e '($x::, $x) = (1,2); print "$x::$x"'
2::2
Getting warmer.
$ perl5.16.3 -MO=Concise =e 'print "$x::$x"'
8 <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
7 <#> print vK ->8
3 <0> pushmark s ->4
- <1> ex-stringify sK/1 ->7
- <0> ex-pushmark s ->4
6 <2> concat[t3] sK/2 ->7
- <1> ex-rv2sv sK/1 ->5
4 <#> gvsv[*x::] s ->5 <- $x::
- <1> ex-rv2sv sK/1 ->6
5 <#> gvsv[*x] s ->6 <- $x
-e syntax OK
$ perl5.18.1 -MO=Concise -e 'print "$x::$x"'
a <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
9 <#> print vK ->a
3 <0> pushmark s ->4
- <1> ex-stringify sK/1 ->9
- <0> ex-pushmark s ->4
8 <2> concat[t4] sKS/2 ->9
6 <2> concat[t2] sK/2 ->7
- <1> ex-rv2sv sK/1 ->5
4 <#> gvsv[*x] s ->5 <- $x
5 <$> const[PV "::"] s ->6 <- "::"
- <1> ex-rv2sv sK/1 ->8
7 <#> gvsv[*x] s ->8 <- $x
-e syntax OK
TL;DR. v5.16 parses "$x::$x" as $x:: . $x. v5.18 as $x . "::" . $x. I don't see any obvious reference to this change in the delta docs, but I'll keep looking.
So, my very quick test confirms the issue - using
perl -Mstrict -we 'my %x=(),$x=bless(\%x),$y="bar";print "$x::$y\n";'
(With your version, I get a bareword warning for 'bar').
The error I get in 5.8.8 is "use of uninitialized value in concatenation".
The difference seems to be that when I run with perl -MO=Deparse I get:
my ( %x ) = ();
my $x = bless ( \%x );
my $y = 'bar';
print "$x::$y\n";
If I run on 5.20.2 though, I get:
my ( %x ) = ();
my $x = bless ( \%x );
my $y = 'bar';
print "${x}::$y\n";
So yes, there has been a change in how the same code is parsed. But I'm not entirely sure how that helps you, apart from perhaps enlightening you as to what's going on?

Is there any difference between &$func($arg) and $func->($arg)?

While trying to understand closures, reading thru perl-faq and coderef in perlref found those examples:
sub add_function_generator {
return sub { shift() + shift() };
}
my $add_sub = add_function_generator();
my $sum = $add_sub->(4,5);
and
sub newprint {
my $x = shift;
return sub { my $y = shift; print "$x, $y!\n"; };
}
$h = newprint("Howdy");
&$h("world");
here are two forms of calling a function stored in a variable.
&$func($arg)
$func->($arg)
Are those totally equivalent (only syntactically different) or here are some differences?
There is no difference. Proof: the opcodes generated by each version:
$ perl -MO=Concise -e'my $func; $func->()'
8 <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
3 <0> padsv[$func:1,2] vM/LVINTRO ->4
4 <;> nextstate(main 2 -e:1) v:{ ->5
7 <1> entersub[t2] vKS/TARG ->8
- <1> ex-list K ->7
5 <0> pushmark s ->6
- <1> ex-rv2cv K ->-
6 <0> padsv[$func:1,2] s ->7
$ perl -MO=Concise -e'my $func; &$func()'
8 <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
3 <0> padsv[$func:1,2] vM/LVINTRO ->4
4 <;> nextstate(main 2 -e:1) v:{ ->5
7 <1> entersub[t2] vKS/TARG ->8
- <1> ex-list K ->7
5 <0> pushmark s ->6
- <1> ex-rv2cv sKPRMS/4 ->-
6 <0> padsv[$func:1,2] s ->7
… wait, there are actually slight differences in the flags for - <1> ex-rv2cv sKPRMS/4 ->-. Anyways, they don't seemt to matter, and both forms behave the same.
But I would recommend to use the form $func->(): I perceive this syntax as more elegant, and you can't accidentally forget to use parens (&$func works but makes the current #_ visible to the function, which is not what you'd usually want).

Difference between $x->{a}{b} and $x->{a}->{b}

Is there any conceivable difference between
$x->{a}{b}
and
$x->{a}->{b}
for any allowed value of $x->{a}, in any of the perl versions >= 5.6?
No. This is just a syntactic shortcut without any semantic difference.
Proof: the opcodes that are produced upon compilation
$ perl -MO=Concise -e'$x->{a}{b}'
b <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
a <2> helem vK/2 ->b
8 <1> rv2hv[t3] sKR/1 ->9
7 <2> helem sKM/DREFHV,2 ->8
5 <1> rv2hv[t2] sKR/1 ->6
4 <1> rv2sv sKM/DREFHV,1 ->5
3 <#> gv[*x] s ->4
6 <$> const[PV "a"] s/BARE ->7
9 <$> const[PV "b"] s/BARE ->a
-e syntax OK
$ perl -MO=Concise -e'$x->{a}->{b}'
b <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3
a <2> helem vK/2 ->b
8 <1> rv2hv[t3] sKR/1 ->9
7 <2> helem sKM/DREFHV,2 ->8
5 <1> rv2hv[t2] sKR/1 ->6
4 <1> rv2sv sKM/DREFHV,1 ->5
3 <#> gv[*x] s ->4
6 <$> const[PV "a"] s/BARE ->7
9 <$> const[PV "b"] s/BARE ->a
-e syntax OK
See also perlref, section Using References, rule 3.
There are some places where -> didn't become optional until later than 5.6. I believe these are some:
$x->('a'){'b'} # coderef called, returning a reference
({a=>42})[0]{'a'} # reference from a list slice
The constructs are identical. Perl allows the -> between any pair of closing and opening brackets to be omitted.
This works, and prints OKOK.
use strict;
use warnings;
my $data = {
a => {
b => [
sub { { c => 'OK' } }
]
}
};
print $data->{a}->{b}->[0]->()->{c};
print $data->{a}{b}[0](){'c'};

Is a parse tree the same thing as bytecode?

What exactly is the difference between bytecode and a parse tree, specifically the one used by Perl? Do they actually refer to the same concept, or is there a distinction?
I'm familiar with the concept of bytecode from Python and Java, but when reading about Perl, I've learned that it supposedly executes a parse tree (instead of bytecode) in its interpreter.
If there actually is a distinction, what are the reasons for Perl not using bytecode (or Python not using parse trees)? Is it mainly historical, or are there differences between the languages that necessitate a different compilation/execution model? Could Perl (with reasonable effort and execution performance) be implemented by using a bytecode interpreter?
What Perl uses is not a parse tree, at least not how Wikipedia defines it. It's an opcode tree.
>perl -MO=Concise -E"for (1..10) { say $i }"
g <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 49 -e:1) v:%,{,2048 ->3
f <2> leaveloop vK/2 ->g
7 <{> enteriter(next->c last->f redo->8) lKS/8 ->d
- <0> ex-pushmark s ->3
- <1> ex-list lK ->6
3 <0> pushmark s ->4
4 <$> const[IV 1] s ->5
5 <$> const[IV 10] s ->6
6 <#> gv[*_] s ->7
- <1> null vK/1 ->f
e <|> and(other->8) vK/1 ->f
d <0> iter s ->e
- <#> lineseq vK ->-
8 <;> nextstate(main 47 -e:1) v:%,2048 ->9
b <#> say vK ->c
9 <0> pushmark s ->a
- <1> ex-rv2sv sK/1 ->b
a <#> gvsv[*i] s ->b
c <0> unstack v ->d
-e syntax OK
Except, despite being called a tree, it's not really a tree. Notice the arrows? It's because it's actually a list-like graph of opcodes (like any other executable).
>perl -MO=Concise,-exec -E"for (1..10) { say $i }"
1 <0> enter
2 <;> nextstate(main 49 -e:1) v:%,{,2048
3 <0> pushmark s
4 <$> const[IV 1] s
5 <$> const[IV 10] s
6 <#> gv[*_] s
7 <{> enteriter(next->c last->f redo->8) lKS/8
d <0> iter s
e <|> and(other->8) vK/1
8 <;> nextstate(main 47 -e:1) v:%,2048
9 <0> pushmark s
a <#> gvsv[*i] s
b <#> say vK
c <0> unstack v
goto d
f <2> leaveloop vK/2
g <#> leave[1 ref] vKP/REFC
-e syntax OK
The difference between Perl's opcodes and Java's bytecodes is that Java's bytecodes are designed to be serialisable (stored in a file).
Parse tree is the tokens of a program, stored in a structure that shows their nesting (which arguments belong to which function calls, what statements are inside which loops etc.) while bytecode is the program code converted into a binary notation for quicker execution in a virtual machine. For example if you had the following code in an imaginary language:
loop i from 1 to 10 {
print i
}
The parse tree might look for example like:
loop
variable
i
integer
1
integer
10
block
print
variable
i
Meanwhile the bytecode in raw and symbolic form, compiled for a stack-oriented virtual machine might look like:
0x01 0x01 PUSH 1
START:
0x02 DUP
0x03 PRINT
0x05 INCREMENT
0x02 DUP
0x01 0x0a PUSH 10
0x04 LESSTHAN
0x06 0xf9 JUMPCOND START
When compiling a program, you'd first need to parse the source code (usually producing a parse tree) and then transform it to bytecode. It can be easier to skip the second step and execute straight from the parse tree. Also, if the language syntax is very hairy (for example it allows modifying the code) then producing the bytecode gets more complicated. If there's an eval type function to execute any code, the whole compiler has to be distributed with the application to use the virtual machine for such code. Including only a parser is simpler.
In Perl 6, the next version of perl, the code is supposed to be compiled to bytecode and run on the Parrot virtual machine. It's expected to improve performance. Bytecode is fairly straightforward to compile further to the processor's native instructions (this is called a JIT compiler) to approach speed of compiled languages like C.