% perl -Ilib -MDevel::Peek -le '$a="34567"; $a=~s/...//; Dump($a)'
SV = PV(0x8171048) at 0x8186f48 # replaced "12345" with "34567"
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
OFFSET = 3
PV = 0x8181bdb ( "34\003" . ) "67"\0
CUR = 2
LEN = 9
Where do the 2 zeros in the chomped part ( "12\003" . ) between 2 and 3 come from?
Why do I get this kind of output in the chomped part ( "34\003" . )?
A bug? "\003" is chr(3) in octal form. However:
$ perl -Ilib -MDevel::Peek -le '$a="12345"; $a=~s/...//; Dump($a)'
SV = PVIV(0x869b0bc) at 0x86a5060
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 3 (OFFSET)
PV = 0x869fac3 ( "123" . ) "45"\0
CUR = 2
LEN = 5
I can't duplicate that; what version of perl are you using?
Note that the part of the string buffer in () is reserved but not currently in use.
I am getting the same result as sid_com using perl 5.12.2 on Windows. However, the string length is taken from the CUR field of the structure anyway. I don't see why this should be a bug; there can be any bytes in the rest of the string buffer.
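A quick way to check that the leftover bytes are harmless is to confirm that the visible value and length are unaffected. This is only a minimal sketch that reuses the substitution from the question:

```perl
use strict;
use warnings;

my $s = "12345";
$s =~ s/...//;    # drops the first three characters; perl may simply skip over them in the buffer

# The visible string is defined by CUR, not by whatever bytes remain in the unused prefix.
print length($s), "\n";                        # 2
print $s eq "45" ? "same\n" : "different\n";   # same
```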
I have created a very small sample code below to illustrate how Perl's index() function's return value changes for empty substr ("") on string that is passed or not passed through Encode::decode().
use strict;
use Encode;
my $mainString = (@ARGV >= 2) ? $ARGV[1] : "abc";
my $subString = (@ARGV >= 3) ? $ARGV[2] : "";
if (@ARGV >= 1) {
$mainString = Encode::decode("utf8", $mainString);
}
my $position = index($mainString, $subString, 0);
my $loopCount = 0;
my $stopLoop = 7; # It goes for ever so set a stopping value
while ($position >= 0) {
if ($loopCount >= $stopLoop) {
last;
}
$loopCount++;
print "[$loopCount]: $position \"$mainString\" [".length($mainString)."] ($subString)\n";
$position = index($mainString, $subString, $position + 1);
}
Before getting into the difference with and without Encode::decode(): what should index() return for an empty substring ("")? Perl's documentation does not say. Here is the result without calling Encode::decode() for the ASCII string "abc" (@ARGV = 0):
>perl StringIndex.pl
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 3 "abc" [3] ()
[6]: 3 "abc" [3] ()
[7]: 3 "abc" [3] ()
However when encoding is involved, the return value changes. The return value changes as if the string being searched is not bounded by its length when called with Encode::decode() for ASCII characters "abc" ($ARGV[0] = 1):
>perl StringIndex.pl 1
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 4 "abc" [3] ()
[6]: 5 "abc" [3] ()
[7]: 6 "abc" [3] ()
As a Side Note:
substr is set to empty string ("") in above example, but in my real program it is a variable that changes value depending on condition.
I understand the simplest solution is to check if substr is empty and not enter the while loop
I am using "This is perl 5, version 28, subversion 1 (v5.28.1) built
for MSWin32-x64-multi-thread"
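The workaround mentioned in the side note (skip the loop when the substring is empty) could look roughly like this. find_all() is a hypothetical helper name, not something from the original script:

```perl
use strict;
use warnings;

# find_all() is a hypothetical helper: it returns every position at which
# $sub occurs in $str, and nothing at all for an empty or undefined $sub.
sub find_all {
    my ($str, $sub) = @_;
    return () if !defined $sub || $sub eq '';   # guard: never loop on an empty needle
    my @positions;
    my $pos = index($str, $sub);
    while ($pos >= 0) {
        push @positions, $pos;
        $pos = index($str, $sub, $pos + 1);
    }
    return @positions;
}

print join(",", find_all("abcabc", "bc")), "\n";   # 1,4
my @none = find_all("abc", "");
print scalar(@none), "\n";                         # 0
```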
This would be considered a bug, which I've reported here.
Minimal code to reproduce:
use strict;
use warnings;
no warnings qw( void );
use feature qw( say );
my $s = "abc";
my $len = length($s);
utf8::upgrade($s);
length($s) if $ARGV[0];
say index($s, "", $len+1);
$ perl a.pl 0
3
$ perl a.pl 1
4
Perl has two string storage formats. The "upgraded" format, and the "downgraded" format.
Encode::decode always returns an upgraded string. And utf8::upgrade tells Perl to switch the storage format used by a scalar.
Each character of a downgraded string can store a number between 0 and 255. Each character of the string is stored as a byte of the appropriate value. This, of course, is fine if you have bytes or ASCII text. But this is insufficient for arbitrary text.
Each character of an upgraded string can store a number between 0 and 2^32-1 or between 0 and 2^64-1, depending on how your Perl was compiled. This is more than enough to store any Unicode Code Point (even those outside the BMP). Each character is encoded using "utf8", a nonstandard extension of UTF-8.
utf8 (like UTF-8) is variable-length encoding. This presents two problems:
Determining the length of an upgraded string requires iterating over the entire string.
Determining the position of characters in an upgraded string requires iterating over the entire string.
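To make the variable-length storage concrete, here is a small sketch comparing character length and storage size for the same string in both formats. It assumes only the core bytes module, used here for bytes::length:

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling byte semantics

my $down = "caf\xE9";   # 4 characters, each stored in one byte
my $up   = $down;
utf8::upgrade($up);     # same characters, now stored in the variable-length format

print length($up), "\n";            # 4 (characters)
print bytes::length($down), "\n";   # 4 (bytes of storage)
print bytes::length($up), "\n";     # 5 (the e-acute now takes two bytes)
print $down eq $up ? "equal\n" : "different\n";   # equal
```

The two scalars compare equal because the storage format is an internal detail; only the cost of finding a character position differs.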
Let's consider the following snippet:
index($str, $substr, $pos)
With a downgraded string, index can jump directly to the position indicated by $pos. It's a question of simple pointer arithmetic.
But because each character of an upgraded string can require a different amount of storage, index can't use pointer arithmetic to find the character at position $pos. Without optimizations, each call to index would have to start at offset 0 and move through the string until it finds the character indicated by $pos.
That would be unfortunate. Imagine if index was being used in a loop to find all matches. So Perl optimizes this! When the length of an upgraded string becomes known, Perl attaches it to the scalar.
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
length($s);
Dump($s);
'
SV = PV(0x56483dda7e80) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x56483de0ecf0) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x56483ddd4050
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 3 <-- Attached length
Similarly, the offset of characters is sometimes attached to the scalar as well!
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
index($s, "", 2);
Dump($s);
'
SV = PV(0x558d5c970e80) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x558d5c9d7d10) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x558d5c9af690
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = -1
MG_PTR = 0x558d5c99cb80
0: 2 -> 2 <-- Attached character offset
1: 0 -> 0 <-- Attached character offset
The difference in behaviour is due to different code paths being exercised depending on the string format and on what information is cached.
I define a Perl module, like so:
#!/usr/bin/env perl
use strict;
use warnings;
package Sample;
use Data::Dumper;
our $VERSION = v1.10;
sub VERSION
{
my ($class, $version) = @_;
print ("version is $version\n");
print Dumper ($version);
}
The nature of the value passed in $version changes depending on how the module is imported:
$ perl -e 'use Sample 1.0'
version is 1
$VAR1 = '1';
However, if the required module version is specified as a v-string:
$ perl -e 'use Sample v1.0'
version is
$VAR1 = v1.0;
What data type is being passed in $version in the second case? It's apparently not a simple scalar, and it's not a reference.
A v-string is a string. Each number is assumed to be a Unicode code point and is converted to that character, so what you are actually printing out is chr(1) . chr(0). You can prove this with the following script:
my $vstring = v80.101.114.108;
print $vstring, "\n";
This will print Perl
Each dot-separated number is converted into a character with the ordinal value of the number.[1] In other words,
v1.0 ≡ "\x01\x00" ≡ chr(1).chr(0) ≡ pack('W*', 1, 0)
You can convert a v-string into something human readable using the %vd format specifier of sprintf.[2]
$ perl -e'CORE::say sprintf("%vd", v1.0)'
1.0
But it's better to use the version module.
$ perl -Mversion -e'CORE::say version->parse(v1.0)'
v1.0
It's better because the version module can handle version strings in general (not just v-strings).
$ perl -Mversion -e'
my $v1 = version->parse(1.0);
my $v2 = version->parse("1.0");
my $v3 = version->parse(v1.0);
my $v4 = version->parse("v1.0");
CORE::say "equal"
if $v1 == $v2
&& $v1 == $v3
&& $v1 == $v4
'
equal
One can use any numerical or string comparison operator[3] to compare version objects.
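As a rough illustration of why comparing version objects beats comparing plain strings, consider the sketch below. The version numbers are made up for the example:

```perl
use strict;
use warnings;
use version;

my $old = version->parse("v1.2.3");
my $new = version->parse("v1.10.0");

# Version objects compare component by component.
print "version compare: ", ($old < $new ? "older" : "not older"), "\n";   # older

# A plain string comparison sees "2" gt "1" in the second component and gets it wrong.
print "string compare: ", ("1.2.3" lt "1.10.0" ? "older" : "not older"), "\n";   # not older
```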
It's more than that, though. A scalar containing a v-string has magic (of type V) applied, so it's possible to detect that it's a v-string.
$ perl -MDevel::Peek -e'Dump("\x01\x00"); Dump(v1.0);'
SV = PV(0xbc9d70) at 0xbe7998
REFCNT = 1
FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK)
PV = 0xbf1ed0 "\1\0"\0
CUR = 2
LEN = 10
COW_REFCNT = 0
SV = PVMG(0xc20480) at 0xbe7938
REFCNT = 1
FLAGS = (RMG,POK,IsCOW,READONLY,PROTECT,pPOK)
IV = 0
NV = 0
PV = 0xbf0190 "\1\0"\0
CUR = 2
LEN = 10
COW_REFCNT = 0
MAGIC = 0xbf3a80
MG_VIRTUAL = 0
MG_TYPE = PERL_MAGIC_vstring(V)
MG_LEN = 4
MG_PTR = 0xbf1700 "v1.0"
This magic is even applied to any scalar to which the v-string is copied!
$ perl -MDevel::Peek -e'my $v1 = v1.0; my $v2 = $v1; Dump($v2)'
SV = PVMG(0x9dc500) at 0x9a3a00
REFCNT = 1
FLAGS = (RMG,POK,IsCOW,pPOK)
IV = 0
NV = 0
PV = 0x9ac1b0 "\1\0"\0
CUR = 2
LEN = 10
COW_REFCNT = 2
MAGIC = 0x9b8090
MG_VIRTUAL = 0
MG_TYPE = PERL_MAGIC_vstring(V)
MG_LEN = 4
MG_PTR = 0x9adef0 "v1.0"
I believe the version module takes advantage of this information.
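From Perl code, one way to test for this magic is isvstring from the core Scalar::Util module. The following is a minimal sketch; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Scalar::Util qw( isvstring );

my $plain = "\x01\x00";   # same two characters, but an ordinary string
my $vstr  = v1.0;
my $copy  = $vstr;        # the v-string magic travels with the copy

print isvstring($plain) ? "v-string\n" : "plain\n";   # plain
print isvstring($vstr)  ? "v-string\n" : "plain\n";   # v-string
print isvstring($copy)  ? "v-string\n" : "plain\n";   # v-string
print $plain eq $vstr ? "same characters\n" : "different\n";   # same characters
```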
[2] This format specifier works on any string, so it's convenient for checking for hidden or special characters when debugging.
$ perl -e'CORE::say sprintf "%v02X", "abc\r\n"'
61.62.63.0D.0A
[3] ==, <, >, <=, >=, <=>, eq, lt, gt, le, ge and cmp.
Disclaimer: It's been ages since I've done any perl, so if I'm asking/saying something stupid please correct me.
Is it possible to view a byte/bit representation of a perl variable? That is, if I say something like
my $foo = 'a';
I know (think?) the computer sees $foo as something like
0b1100001
Is there a way to get perl to print out the binary representation of a variable?
(Not asking for any practical purpose, just tinkering around with a old friend and trying to understand it more deeply than I did in 1997)
Sure, using unpack:
print unpack "B*", $foo;
Example:
% perl -e 'print unpack "B*", "bar";'
011000100110000101110010
The perldoc pages for pack and perlpacktut give a nice overview about converting between different representations.
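The conversion also works in the other direction with pack. This is a short sketch of the round trip using the same "B*" template:

```perl
use strict;
use warnings;

my $bits = unpack "B*", "bar";   # string of 0s and 1s, most significant bit first
print "$bits\n";                 # 011000100110000101110010

my $back = pack "B*", $bits;     # turn the bit string back into bytes
print "$back\n";                 # bar
```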
The place to start if you want the actual internals is a document called "perlguts". Either perldoc perlguts or read it here: http://perldoc.perl.org/perlguts.html
After seeing the way that Andy interpreted your question, I can follow up by saying that Devel::Peek has a Dump function which can show the internal representation of a variable. It won't take it to the binary level, but if what you are interested in is the internals, you might look at this.
$ perl -MDevel::Peek -e 'my $foo="a";Dump $foo';
SV = PV(0x7fa8a3004e78) at 0x7fa8a3031150
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x7fa8a2c06190 "a"\0
CUR = 1
LEN = 16
$ perl -MDevel::Peek -e 'my %bar=(x=>"y",a=>"b");Dump \%bar'
SV = IV(0x7fbc5182d6e8) at 0x7fbc5182d6f0
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x7fbc51831168
SV = PVHV(0x7fbc5180c268) at 0x7fbc51831168
REFCNT = 2
FLAGS = (PADMY,SHAREKEYS)
ARRAY = 0x7fbc5140f9f0 (0:6, 1:2)
hash quality = 125.0%
KEYS = 2
FILL = 2
MAX = 7
RITER = -1
EITER = 0x0
Elt "a" HASH = 0xca2e9442
SV = PV(0x7fbc51804f78) at 0x7fbc51807340
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7fbc5140fa60 "b"\0
CUR = 1
LEN = 16
Elt "x" HASH = 0x9303a5e5
SV = PV(0x7fbc51804e78) at 0x7fbc518070d0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7fbc514061a0 "y"\0
CUR = 1
LEN = 16
And one more way:
printf "%v08b\n", 'abc';
output:
01100001.01100010.01100011
(The v flag is a perl-only printf/sprintf feature and also works with numeric formats other than b.)
This differs from the unpack suggestion where there are characters greater than "\xff": unpack will only return the 8 low bits (with a warning), printf '%v...' will show all the bits:
$ perl -we'printf "%vX\n", "\cA\13P\x{1337}"'
1.B.50.1337
You can use ord to return the numeric value of a character, and printf with a %b format to display that value in binary.
printf "%08b\n", ord 'a';
output
01100001
I am looking at Perl script written by someone else, and I found this:
$num2 = '000000';
substr($num2, length($num2)-length($num), length($num)) = $num;
my $id_string = $text."_".$num2;
Forgive my ignorance, but to an untrained Perl programmer the second line looks as if the author is assigning the string $num to the result of the function substr. What exactly does this line do?
Exactly what you think it would do:
$ perldoc -f substr
You can use the substr() function as an lvalue, in which case
EXPR must itself be an lvalue. If you assign something shorter
than LENGTH, the string will shrink, and if you assign
something longer than LENGTH, the string will grow to
accommodate it. To keep the string the same length, you may
need to pad or chop your value using "sprintf".
In Perl, unlike (say) Python, where strings and tuples cannot be modified in place, strings can be modified in situ. That is what substr is doing here: it modifies only a part of the string. Instead of this syntax, you can use the more cryptic four-argument form:
substr($num2, length($num2)-length($num), length($num),$num);
which accomplishes the same thing. You can further stretch it. Imagine you want to replace all instances of foo by bar in a string, but only within the first 50 characters. Perl will let you do it in a one-liner:
substr($target,0,50) =~ s/foo/bar/g;
Great, isn't it?
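Here is a small runnable sketch of both forms on short, made-up strings, so the effect is easy to see:

```perl
use strict;
use warnings;

my $target = "foo foo foo";
substr($target, 0, 7) =~ s/foo/bar/g;   # only the first 7 characters are searched
print "$target\n";                      # bar bar foo

my $s = "abcdef";
my $removed = substr($s, 0, 3, "XY");   # 4-argument form returns the part it replaced
print "$removed $s\n";                  # abc XYdef
```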
"Exactly", you ask?
Normally, substr returns a boring string (PV with POK).
$ perl -MDevel::Peek -e'$_="abcd"; Dump("".substr($_, 1, 2));'
SV = PV(0x99f2828) at 0x9a0de38
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK)
PV = 0x9a12510 "bc"\0
CUR = 2
LEN = 12
However, when substr is evaluated where an lvalue (assignable value) is expected, it returns a magical scalar (PVLV with GMG (get magic) and SMG (set magic)).
$ perl -MDevel::Peek -e'$_="abcd"; Dump(substr($_, 1, 2));'
SV = PVLV(0x8941b90) at 0x891f7d0
REFCNT = 1
FLAGS = (TEMP,GMG,SMG)
IV = 0
NV = 0
PV = 0
MAGIC = 0x8944900
MG_VIRTUAL = &PL_vtbl_substr
MG_TYPE = PERL_MAGIC_substr(x)
TYPE = x
TARGOFF = 1
TARGLEN = 2
TARG = 0x8948c18
FLAGS = 0
SV = PV(0x891d798) at 0x8948c18
REFCNT = 2
FLAGS = (POK,pPOK)
PV = 0x89340e0 "abcd"\0
CUR = 4
LEN = 12
This magical scalar holds the parameters passed to substr (TARG, TARGOFF and TARGLEN). You can see the scalar pointed to by TARG (the original scalar passed to substr) repeated at the end (the SV at 0x8948c18 you see at the bottom).
Any read of this magical scalar results in an associated function being called instead. Similarly, a write calls a different associated function. These functions cause the selected part of the string passed to substr to be read or modified.
perl -E'
$_ = "abcde";
my $ref = \substr($_, 1, 3); # $$ref is magical
say $$ref; # bcd
$$ref = "123";
say $_; # a123e
'
Looks to me like it's overwriting the last length($num) characters of $num2 with the contents of $num in order to get a '0' filled number.
I imagine most folks would accomplish this same task w/ sprintf()
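For comparison, here is a sketch of both approaches side by side. The value 42 is just an example, and the sprintf format assumes a non-negative integer of at most six digits:

```perl
use strict;
use warnings;

my $num = 42;

# Approach from the script: overwrite the tail of a zero-filled template.
my $num2 = '000000';
substr($num2, length($num2) - length($num), length($num)) = $num;

# sprintf equivalent: pad with zeros on the left to a width of six.
my $padded = sprintf "%06d", $num;

print "$num2 $padded\n";   # 000042 000042
```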
What's happening behind the scenes when I do a concatenation on a string?
my $short = 'short';
$short .= 'cake';
Is Perl effectively creating a new string, then assigning it the correct variable reference, or are Perl strings always mutable by nature?
The motivation for this question came from a discussion I had with a colleague, who said that scripting languages can utilize immutable strings.
Perl strings are mutable. Perl automatically creates new buffers, if required.
use Devel::Peek;
my $short = 'short';
Dump($short);
Dump($short .= 'cake');
Dump($short = "");
SV = PV(0x28403038) at 0x284766f4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x28459078 "short"\0
CUR = 5
LEN = 8
SV = PV(0x28403038) at 0x284766f4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x28458120 "shortcake"\0
CUR = 9
LEN = 12
SV = PV(0x28403038) at 0x284766f4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x28458120 ""\0
CUR = 0
LEN = 12
Note that no new buffer is allocated in the third case.
Perl strings are definitely mutable. Each will store an allocated buffer size in addition to the used length and beginning offset, and the buffer will be expanded as needed. (The beginning offset is useful to allow consumptive operations like s/^abc// to not have to move the actual data.)
$short = 'short';
print \$short;
$short .= 'cake';
print \$short;
After executing this code I get "SCALAR(0x955f468)SCALAR(0x955f468)". My answer would be 'mutable'.
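The same point can be checked without reading addresses by eye. The sketch below uses refaddr from the core Scalar::Util module and also shows that an earlier copy is unaffected. Note that it only shows the scalar itself stays the same; its internal buffer may still be reallocated when it grows:

```perl
use strict;
use warnings;
use Scalar::Util qw( refaddr );

my $short  = 'short';
my $copy   = $short;             # assignment copies the value
my $before = refaddr(\$short);   # address of the scalar before appending

$short .= 'cake';                # modifies $short in place

print refaddr(\$short) == $before ? "same scalar\n" : "new scalar\n";   # same scalar
print "$short $copy\n";                                                 # shortcake short
```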