How are scalars stored 'under the hood' in perl? - perl

The basic types in perl are different than in most languages, with the types being scalar, array, hash (but apparently not subroutines, &, which I guess are really just scalar references with syntactic sugar). What is most odd about this is that the most common data types (int, boolean, char, string) all fall under the basic data type "scalar". It seems that perl decides whether to treat a scalar as a string, boolean, or number based on the operator that modifies it, implying the scalar itself is not actually defined as "int" or "String" when saved.
This makes me curious as to how these scalars are stored "under the hood", particularly in regard to its effect on efficiency (yes, I know scripting languages sacrifice efficiency for flexibility, but they still need to be as optimized as possible when flexibility concerns are not affected). It's much cheaper for me to store the number 65535 (which takes two bytes) than the string "65535", which takes 6 bytes. As such, recognizing that $val = 65535 is storing an int would allow me to use 1/3 the memory; in large arrays this could mean fewer cache misses as well.
It's not just limited to saving memory, of course. There are times when I can offer more significant optimizations if I know what type of scalar to expect. For instance, if I have a hash using very large integers as keys, it would be far faster to look up a value if I recognized the keys as ints, allowing a simple modulo for creating my hash key, than if I had to run more complex hashing logic on a string that has 3 times the bytes.
So I'm wondering how perl handles these scalars under the hood. Does it store every value as a string, sacrificing the extra memory and the cpu cost of constantly converting string to int in the case that a scalar is always used as an int? Or does it have some logic for inferring the type of scalar used to determine how to save and manipulate it?
Edit:
TJD linked to perlguts, which answers half my question. A scalar is actually stored as a string, an int (signed, unsigned, double) or a pointer. I'm not too surprised; I had mostly expected this behavior to occur under the hood, though it's interesting to see the exact types. I'm leaving this question open, though, because perlguts is actually too low level. Other than telling me that 5 data types exist, it doesn't specify how perl works to alternate between them, i.e. how perl decides which SV type to use when a scalar is saved and how it knows when/how to cast.

There are actually a number of types of scalars. A scalar of type SVt_IV can hold undef, a signed integer (IV) or an unsigned integer (UV). One of type SVt_PVIV can also hold a string[1]. Scalars are silently upgraded from one type to another as needed[2]. The TYPE field indicates the type of a scalar. In fact, arrays (SVt_AV) and hashes (SVt_HV) are really just types of scalars.
While the type of a scalar indicates what the scalar can contain, flags are used to indicate what a scalar does contain. This is stored in the FLAGS field. SVf_IOK signals that a scalar contains a signed integer, while SVf_POK indicates it contains a string[3].
Devel::Peek's Dump is a great tool for looking at the internals of scalars. (The constant prefixes SVt_ and SVf_ are omitted by Dump.)
$ perl -e'
use Devel::Peek qw( Dump );
my $x = 123;
Dump($x);
$x = "456";
Dump($x);
$x + 0;
Dump($x);
'
SV = IV(0x25f0d20) at 0x25f0d30 <-- SvTYPE(sv) == SVt_IV, so it can contain an IV.
REFCNT = 1
FLAGS = (IOK,pIOK) <-- IOK: Contains an IV.
IV = 123 <-- The contained signed integer (IV).
SV = PVIV(0x25f5ce0) at 0x25f0d30 <-- The SV has been upgraded to SVt_PVIV so it can also contain a string now.
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK) <-- POK: Contains a string (but no IV since !IOK).
IV = 123 <-- Meaningless without IOK.
PV = 0x25f9310 "456"\0 <-- The contained string.
CUR = 3 <-- Number of bytes used by PV (not incl \0).
LEN = 10 <-- Number of bytes allocated for PV.
COW_REFCNT = 1
SV = PVIV(0x25f5ce0) at 0x25f0d30
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK) <-- Now contains both a string (POK) and an IV (IOK).
IV = 456 <-- This will be used in numerical contexts.
PV = 0x25f9310 "456"\0 <-- This will be used in string contexts.
CUR = 3
LEN = 10
COW_REFCNT = 1
illguts documents the internal format of variables quite thoroughly, but perlguts might be a better place to start.
If you start writing XS code, keep in mind it's usually a bad idea to check what a scalar contains. Instead, you should request what should have been provided (e.g. using SvIV or SvPVutf8). Perl will automatically convert the value to the requested type (and warn if appropriate). API calls are documented in perlapi.
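The same flag changes can also be watched from plain Perl rather than by reading Dump output. The following is only a minimal sketch: it assumes the core B module is acceptable for inspection, and ok_flags is a hypothetical helper name, not part of any API.

```perl
use strict;
use warnings;
use B qw(svref_2object SVf_IOK SVf_NOK SVf_POK);

# ok_flags is a hypothetical helper: it lists which public OK flags
# are currently set on the scalar behind the given reference.
sub ok_flags {
    my ($ref) = @_;
    my $flags = svref_2object($ref)->FLAGS;
    my @set;
    push @set, 'IOK' if $flags & SVf_IOK;
    push @set, 'NOK' if $flags & SVf_NOK;
    push @set, 'POK' if $flags & SVf_POK;
    return join ',', @set;
}

my $x = 123;
my $as_integer = ok_flags(\$x);   # assigned an integer

$x = "456";
my $as_string = ok_flags(\$x);    # assigned a string

my $unused = $x + 0;              # numeric use caches the integer in $x
my $as_both = ok_flags(\$x);

print "$as_integer -> $as_string -> $as_both\n";
```

This mirrors the three Dump calls above: the scalar starts as an integer, becomes a string, and ends up holding both representations after being used as a number.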
[1] In fact, it can hold a string and either a signed integer or an unsigned integer at the same time.
[2] All scalars (including arrays and hashes, excluding one type of scalar that can only hold undef) have two memory blocks at their base. Pointers to the scalar point to its head, which contains the TYPE field and a pointer to the body. Upgrading a scalar replaces the body of the scalar. That way, pointers to the scalar aren't invalidated by an upgrade.
[3] An undef variable is one without any uppercase OK flags set.

The formats used by Perl for data storage are documented in the perlguts perldoc.
In short, though, a Perl scalar is stored as a SV structure containing one of a number of different types, such as an int, a double, a char *, or a pointer to another scalar. (These types are stored as a C union, so only one of them will be present at a time; the SV contains flags indicating which type is used.)
(With regard to hash keys, there's an important gotcha to note there: hash keys are always strings, and are always stored as strings. They're stored in a different type from other scalars.)
The Perl API includes a number of functions which can be used to access the value of a scalar as a desired C type. For example, SvIV() can be used to return the integer value of a SV: if the SV contains an int, that value is returned directly; if the SV contains another type, it's coerced to an integer as appropriate. These functions are used throughout the Perl interpreter for type conversions. However, there is no automatic inference of types on output; functions which operate on strings will always return a PV (string) scalar, for instance, regardless of whether the string "looks like" a number or not.
If you're curious what a given scalar looks like internally, you can use the Devel::Peek module to dump its contents.
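To illustrate the hash key gotcha, here is a small sketch (the hash and its values are made up for the example). A number used as a key is stringified first, so it can end up as a different key than a string that "looks the same" numerically.

```perl
use strict;
use warnings;

my %h;
$h{1.0}   = 'from a number';   # the numeric literal 1.0 stringifies to "1"
$h{'1.0'} = 'from a string';   # a different key: the three-character string "1.0"

my @keys = sort keys %h;
print join(', ', map { qq("$_") } @keys), "\n";
```

Both entries survive because the keys are compared as strings, not as numbers.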

Others have addressed the "how are scalars stored" part of your question, so I'll skip that. With regard to how Perl decides which representation of a value to use and when to convert between them, the answer is it depends on which operators are applied to the scalar. For example, given this code:
my $score = 0;
The scalar $score will be initialised with an integer value. But then when this line of code is run:
say "Your score is $score";
The double quote operator means that Perl will need a string representation of the value. So the conversion from integer to string will take place as part of the process of assembling the string argument to the say function. Interestingly, after the stringification of $score, the underlying representation of the scalar will now include both an integer and a string representation, allowing subsequent operations to directly grab the relevant value without having to convert again. If a numeric operator is then applied to the string (e.g.: $score++) then the numeric part will be updated and the (now invalid) string part will be discarded.
This is the reason why Perl operators tend to come in two flavours. For example comparing values of numbers is done with <, ==, > while performing the same comparisons with strings would be done with lt, eq, gt. Perl will coerce the value of the scalar(s) to the type which matches the operator. This is why the + operator does numeric addition in Perl but a separate operator . is needed to do string concatenation: + will coerce its arguments to numeric values and . will coerce to strings.
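As a small sketch of that coercion (the variable names are just for illustration), the same two strings compare differently and combine differently depending on which flavour of operator is applied:

```perl
use strict;
use warnings;

my ($x, $y) = ('10', '10.0');

my $numeric_cmp = ($x == $y) ? 'equal' : 'different';   # both coerced to the number 10
my $string_cmp  = ($x eq $y) ? 'equal' : 'different';   # compared character by character

my $sum    = $x + $y;   # numeric addition
my $joined = $x . $y;   # string concatenation

print "==: $numeric_cmp, eq: $string_cmp, +: $sum, .: $joined\n";
```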
There are some operators that will work with both numeric and string values but which perform a different operation depending on the type of value. For example:
$score = 0;
say ++$score; # 1
say ++$score; # 2
say ++$score; # 3
$score = 'aaa';
say ++$score; # 'aab'
say ++$score; # 'aac'
say ++$score; # 'aad'
With regard to questions of efficiency (and bearing in mind the standard disclaimers about premature optimisation), consider this code, which reads a file containing one integer per line. Each integer is validated to check that it is exactly 8 digits long, and the valid ones are stored in an array:
my @numbers;
while (<$fh>) {
    if (/^(\d{8})$/) {
        push @numbers, $1;
    }
}
Any data read from a file will initially come to us as a string. The regex used to validate the data will also require a string value in $_. So the result is that our array @numbers will contain a list of strings. However, if further uses of the values will be solely in a numeric context, we could use this micro-optimisation to ensure that the array contained only numeric values:
push @numbers, 0 + $1;
In my tests with a file of 10,000 lines, populating @numbers with strings used nearly three times as much memory as populating with integer values. However as with most benchmarks, this has little relevance to normal day-to-day coding in Perl. You'd only need to worry about that in situations where you a) had performance or memory issues and b) were working with a large number of values.
It's worth pointing out that some of this behaviour is common to other dynamic languages (e.g.: Javascript will silently coerce numeric values to strings).
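The difference between the two push variants can be checked without measuring memory at all. This is only a sketch: the @lines array is a hypothetical stand-in for the file, has_string is a made-up helper, and the core B module is assumed to be an acceptable way to peek at the flags.

```perl
use strict;
use warnings;
use B qw(svref_2object SVf_POK);

# Hypothetical in-memory stand-in for the lines read from the file.
my @lines = ("12345678\n", "not a number\n", "87654321\n");

my (@as_strings, @as_numbers);
for my $line (@lines) {
    if ($line =~ /^(\d{8})$/) {
        push @as_strings, $1;       # the capture is stored as a string
        push @as_numbers, 0 + $1;   # the addition yields a fresh numeric scalar
    }
}

# has_string (hypothetical helper): 1 if the scalar behind the reference
# currently carries a valid string buffer, 0 otherwise.
sub has_string { return (svref_2object($_[0])->FLAGS & SVf_POK) ? 1 : 0 }

printf "strings hold a PV: %d, numbers hold a PV: %d\n",
    has_string(\$as_strings[0]), has_string(\$as_numbers[0]);
```

The numeric elements carry no string buffer, which is where the memory saving comes from.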

Related

Perl: Why can you use @ or $ when accessing a specific element in an array?

I'm a novice Perl user and have not been able to find a satisfactory answer
my @foo = ("foo","bar")
print "@foo[0]"
foo
print "$foo[1]"
bar
Not only @foo[0] works as expected, but $foo[1] outputs a string as well.
Why? This is even when use strict is enabled.
Both @foo[0] and $foo[1] are legal Perl constructions.
$foo[$n], or more generally $foo[EXPR], is a scalar element, representing the $n-th (with the index starting at 0) element of the array @foo (or whatever EXPR evaluates to).
@foo[LIST] is an array slice, the set of elements of @foo indicated by indices in LIST. When LIST has one element, like @foo[0], then @foo[LIST] is a list with one element.
Although @foo[1] is a valid expression, experience has shown that that construction is usually used inappropriately, when it would be more appropriate to say $foo[1]. So a warning -- not an error -- is issued when warnings are enabled and you use that construction.
$foo[LIST] is also a valid expression in Perl. It's just that Perl will evaluate the LIST in scalar context and return the element of @foo corresponding to that evaluation, not a list of elements.
@foo = ("foo","bar","baz");
$foo[0] returns the scalar "foo"
@foo[1] returns the list ("bar") (and issues warning)
@foo[0,2] returns the list ("foo","baz")
$foo[0,2] returns "baz" ($foo[0,2] evaluates to $foo[2], and issues warning)
$foo[@foo] returns undef (evaluates to $foo[scalar @foo] => $foo[3])
Diversion: the only reason I could come up with to use @foo[SCALAR] is as an lvalue somewhere that distinguishes between scalar/list context. Consider
sub bar { wantarray ? (42,19) : 101 }
$foo[0] = bar(); would assign the value 101 to the 1st element of @foo, but @foo[0] = bar() would assign the value 42. But even that's not anything you couldn't accomplish by saying ($foo[0]) = bar(), unless you're golfing.
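The cases above can be run side by side. This is a sketch only; the arrays @s, @l and @p are hypothetical scratch variables, and the one-element slice warning is silenced locally so the example stays quiet.

```perl
use strict;
use warnings;

my @foo = ("foo", "bar", "baz");

my $single = $foo[0];       # scalar element
my @pair   = @foo[0, 2];    # slice of two elements
my $past   = $foo[@foo];    # @foo in scalar context is 3, so this is $foo[3]

sub bar { return wantarray ? (42, 19) : 101 }

my (@s, @l, @p);
$s[0] = bar();              # scalar assignment: bar() sees scalar context
{
    no warnings 'syntax';   # silence "better written as $l[0]"
    @l[0] = bar();          # slice assignment: bar() sees list context
}
($p[0]) = bar();            # the more common way to get list context

print "single=$single pair=@pair past=", defined $past ? $past : 'undef', "\n";
print "scalar=$s[0] slice=$l[0] parens=$p[0]\n";
```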
Perl sigils tell you how you are treating data and only loosely relate to variable type. Instead of thinking about what something is, think about what it is doing. It's verbs over nouns in Perl:
$ uses a single item
@ uses multiple items
% uses pairs
The scalar variable $foo is a single item and uses the $ sigil. Since there aren't multiples or pairs for a single item, the other sigils don't come into play.
The array variable @foo is potentially many items and the @ refers to all of those items together. The $foo[INDEX] refers to a single item in the array and uses the $ sigil. You can refer to multiple items with an array slice, such as @foo[INDEX,INDEX2,...] and you use the @ for that. Your question focuses on the degenerate case of a slice of one element, @foo[INDEX]. That works, but it's in list context. Sometimes that behaves differently.
The hash variable %foo is a collection of key-value pairs. To get a single value, you use the $ again, like $foo{KEY}. You can also get more than one value with a hash slice, using the @ because it's multiple values, like @hash{KEY1,KEY2,...}.
And, here's a recent development: Perl 5.20 introduces “Key/Value Slices”. You can get a key/value slice of either an array or a hash: %array[INDEX1,INDEX2] or %hash{KEY1,KEY2}. These return a list of key-value pairs. In the array case, the keys are the indices.
For arrays and hashes with single element access or either type of slice, you know the variable type by the indexing character: arrays use [] and hashes use {}. And, here's the other interesting wrinkle: those delimiters supply scalar or list context depending on single or (potentially) multiple items.
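Here is a short sketch of the sigil and delimiter combinations described above. It assumes Perl 5.20 or later for the key/value slices, and the sample data is invented for the example.

```perl
use v5.20;
use warnings;

my @array = qw(a b c);
my %hash  = (x => 1, y => 2, z => 3);

my $one      = $hash{x};          # $ with {}: a single hash value
my @values   = @hash{qw(x y)};    # @ with {}: a hash slice (values only)
my %subset   = %hash{qw(x z)};    # % with {}: key/value slice of a hash
my %by_index = %array[0, 2];      # % with []: key/value slice of an array

say "one=$one values=@values";
say join ',', map { "$_=$subset{$_}" } sort keys %subset;
say join ',', map { "$_=$by_index{$_}" } sort keys %by_index;
```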
Both @foo[0] and $foo[1] are legal Perl constructions.
$foo[EXPR] (where EXPR is an arbitrary expression evaluated in scalar context) returns the single element specified by the result of the expression.
@foo[LIST] (where LIST is an arbitrary expression evaluated in list context) returns every element specified by the result of the expression.
(Contrary to other posts, there's no such thing as $foo[LIST] in Perl. When using $foo[...], the expression is always evaluated in scalar context.)
Although @foo[1] is a valid expression, experience has shown that that construction is usually used inappropriately, when it would be more appropriate to say $foo[1]. So a warning, not an exception, is issued when warnings are enabled and you use that construction.
What this means:
my @foo = ( "foo", "bar", "baz" );
$foo[0]      0 eval'ed in scalar cx.     Returns scalar "foo".               ok
@foo[1]      1 eval'ed in list cx.       Returns scalar "bar".               Weird. Warns.
@foo[0,2]    0,2 eval'ed in list cx.     Returns scalars "foo" and "baz".    ok
$foo[0,2]    0,2 eval'ed in scalar cx.   Returns scalar "baz".               Wrong. Warns.
$foo[@foo]   @foo eval'ed in scalar cx.  Returns undef.                      Probably wrong.
The only reason I could come up with to use @foo[SCALAR] is as an lvalue somewhere that distinguishes between scalar/list context. Consider
sub bar { wantarray ? (42,19) : 101 }
$foo[0] = bar(); would assign the value 101 to the 1st element of @foo, but @foo[0] = bar(); would assign the value 42. It would be far more common to use ($foo[0]) = bar() instead.
Portions of the post Copyrighted by mob under the same terms as this site.
This post addresses numerous issues in mob's post, including 1) the misuse of LIST to mean something other than an arbitrary expression in list context, 2) pretending that parens creates lists, 3) pretending that there's a difference between return scalars and returning a list, and 4) pretending there's no such thing as a non-fatal error.

Why this function uses a lot of memory?

I'm trying to unpack a binary vector of 140 million bits into a list.
I'm checking the memory usage of this function, but it looks weird: the memory usage rises to 35GB (GB and not MB). How can I reduce the memory usage?
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector );
    return @unpacked;
}
Scalars contain a lot of information.
$ perl -MDevel::Peek -e'Dump("0")'
SV = PV(0x42a8330) at 0x42c57b8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x42ce670 "0"\0
CUR = 1
LEN = 16
In order to keep them as small as possible, a scalar consists of two memory blocks[1], a fixed-sized head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is a SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information about the type of scalar (what kind of information it can contain).
For an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right is the string buffer. As you might have noticed, Perl over-allocates. This speeds up concatenation.
Ignore the block on the bottom. It's an alternative to the string buffer format for special strings (e.g. hash keys).
To how much does that add up?
$ perl -MDevel::Size=total_size -E'say total_size("0")'
28 # 32-bit Perl
56 # 64-bit Perl
That's just for the scalar itself. It doesn't take into account the overhead in the memory allocation system of three memory blocks.
These scalars are in an array. An array is really just a scalar.
So an array has overhead.
$ perl -MDevel::Size=total_size -E'say total_size([])'
56 # 32-bit Perl
64 # 64-bit Perl
That's an empty array. You have 140 million of the scalars in yours, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64.
That brings the total up to:
32-bit: 56 + (4 + 28) * 140,000,000 = 4,480,000,056
64-bit: 64 + (8 + 56) * 140,000,000 = 8,960,000,064
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually different than the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to use up 140MB of memory after it exits.
Footnotes
Scalars that are undefined can get away without a body until a value is assigned to them. Scalars that contain only an integer can get away without allocating a memory block for the body by storing the integer in the same field as a SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by Copyright.
A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector );
    return \@unpacked;
}
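If the caller only needs to look at individual bits, another option is to skip the list entirely and read bits straight out of the packed string with the built-in vec. This is a sketch under that assumption, not a drop-in replacement for bin2list; bit_at is a hypothetical helper name and the two-byte vector is sample data.

```perl
use strict;
use warnings;

# bit_at (hypothetical helper): return bit $i (0 or 1) of the packed
# vector, numbered the same way as the characters of unpack "B*".
# The vector is passed by reference so the packed string is never copied.
sub bit_at {
    my ($vector_ref, $i) = @_;
    # vec() numbers bits from the low end of each byte, "B*" from the
    # high end; flipping the low three bits of the index converts
    # between the two numberings.
    return vec($$vector_ref, $i ^ 7, 1);
}

my $vector = "\xA5\x0F";   # hypothetical 16-bit sample
my $bits   = join '', map { bit_at(\$vector, $_) } 0 .. 15;

print "$bits\n";
```

Here the only large allocation is the packed string itself, at one bit per bit, instead of one full scalar per bit.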

Behavior of Scalar::Util::looks_like_number in Perl

I am trying to find out if an input is number or string. I came across looks_like_number and cannot understand the values it returns.
use warnings;
use Scalar::Util qw(looks_like_number);
my $name = 11;
print looks_like_number ($name);
This code prints 1 if $name contains a string and a static number if $name contains an integer (i.e. 4352 for each integer).
I am using Perl on Windows.
You forgot to ask a question! Here are two possibilities.
Why doesn't it always return the same value for true?
Why not? It returns a true value as documented. It makes no difference which true value it is.
What is the value returned?
If the scalar contains a string, it uses grok_number, which has specific documented return values.
The type of the number is returned (0 if unrecognised), otherwise it is a bit-ORed combination of IS_NUMBER_IN_UV, IS_NUMBER_GREATER_THAN_UV_MAX, IS_NUMBER_NOT_INT, IS_NUMBER_NEG, IS_NUMBER_INFINITY, IS_NUMBER_NAN (defined in perl.h).
Otherwise, it uses
SvFLAGS(sv) & (SVf_NOK|SVp_NOK|SVf_IOK|SVp_IOK)
You can't tell which of the two was used, so you can't ascribe meaning to the value, which is why it's undocumented.
Don't rely on the exact numerical value. This is an abstraction leak, which the latest version of Scalar::Util (1.39) fixes. What is important is simply the truth of the result, not its exact numerical value.
See bug https://rt.cpan.org/Ticket/Display.html?id=94806
This is what the documentation says:
looks_like_number EXPR
Returns true if perl thinks EXPR is a number. See "looks_like_number" in perlapi.
The link to perlapi in this quote is not really helping us a lot unfortunately:
Test if the content of an SV looks like a number (or is a number). Inf
and Infinity are treated as numbers (so will not issue a non-numeric
warning), even if your atof() doesn't grok them. Get-magic is ignored.
I32 looks_like_number(SV *const sv)
In my case, your code will return an integer that is not 0, which is true.
I got 4352 when I used 11.
When I used '11' I got 1.
All of these are true, so that works.
When I put 'test' or 'foobar' I got 0, which is not true.
I never got a 1 for anything that did not look like a number.
I tried '1e1' and it printed 4, which is a true value, and the input looked like a number in scientific notation.
So, I'd say it always returns something true if Perl thinks the input looks like a number, though I do not know what exactly that true value represents. I cannot confirm that it also returns true with a name.
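In practice that means only ever using the result as a boolean. A minimal sketch, with inputs taken from the experiments above:

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @inputs = (11, '11', '1e1', 'test');

# Reduce the return value to a plain yes/no; the exact true value is
# not documented and differs between inputs and versions.
my @verdicts = map { looks_like_number($_) ? 'number' : 'not a number' } @inputs;

for my $i (0 .. $#inputs) {
    print "'$inputs[$i]' is $verdicts[$i]\n";
}
```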

Perl autoincrement of string not working as before

I have some code where I am converting some data elements in a flat file. I save the old:new values to a hash which is written to a file at the end of processing. On subsequent executions, I reload it into a hash so I can reuse previously converted values on additional data files. I also save the last conversion value so if I encounter an unconverted value, I can assign it a new converted value and add it to the hash.
I had used this code before (back in Feb) on six files with no issues. I have a variable that is set to ZCKL0 (last character is a zero) which is retrieved from a file holding the last used value. I apply the increment operator
...
$data{$olddata} = ++$dataseed;
...
and the resultant value in $dataseed is 1 instead of ZCKL1. The original starting seed value was ZAAA0.
What am I missing here?
Do you use the $dataseed variable in a numeric context in your code?
From perlop:
If you increment a variable that is
numeric, or that has ever been used in
a numeric context, you get a normal
increment. If, however, the variable
has been used in only string contexts
since it was set, and has a value that
is not the empty string and matches
the pattern /^[a-zA-Z]*[0-9]*\z/ , the
increment is done as a string,
preserving each character within its
range.
As previously mentioned, ++ on strings is "magic" in that it operates differently based on the content of the string and the context in which the string is used.
To illustrate the problem and assuming:
my $s='ZCL0';
then
print ++$s;
will print:
ZCL1
while
$s+=0; print ++$s;
prints
1
NB: In other popular programming languages, the ++ is legal for numeric values only.
Using non-intuitive, "magic" features of Perl is discouraged as they lead to confusing and possibly unsupportable code.
You can write this almost as succinctly without relying on the magic ++ behavior:
s/(\d+)$/ $1 + 1 /e
The e flag makes it an expression substitution.
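Wrapped in a function, the substitution might look like the sketch below (next_seed is a hypothetical name). Note one difference from the magic increment: the trailing number simply grows, so 'ZCKL9' becomes 'ZCKL10', whereas ++ on an untouched string would carry into the letters and give 'ZCKM0'.

```perl
use strict;
use warnings;

# next_seed (hypothetical helper): bump the trailing run of digits by one,
# regardless of whether the value has ever been used in numeric context.
sub next_seed {
    my ($seed) = @_;
    $seed =~ s/(\d+)$/ $1 + 1 /e;
    return $seed;
}

my $first  = next_seed('ZCKL0');
my $rolled = next_seed('ZCKL9');

print "$first $rolled\n";
```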

Why can't I properly encode a boolean from PostgreSQL via JSON::XS via Perl?

I have a query on a PostgreSQL system returning a boolean:
my $sth = $dbh->prepare("select 'f'::boolean");
$sth->execute;
my @vals = $sth->fetchrow_array;
According to the DBD::Pg docs,
The current implementation of
PostgreSQL returns 't' for true and
'f' for false. From the Perl point of
view, this is a rather unfortunate
choice. DBD::Pg therefore translates
the result for the BOOL data type in a
Perlish manner: 'f' becomes the number
0 and 't' becomes the number 1. This
way the application does not have to
check the database-specific returned
values for the data-type BOOL because
Perl treats 0 as false and 1 as true.
You may set the pg_bool_tf attribute
to a true value to change the values
back to 't' and 'f' if you wish.
So, that statement should return a 0, which it does, so long as pg_bool_tf returns 0, which it does. However, somewhere along the way JSON::XS (and plain JSON) interprets the returned 0 as a string:
use JSON::XS qw(encode_json);
my $options =
{
layout => 0,
show_widget_help => $vals[0] // 1,
};
die encode_json($options);
...dies with:
{"layout":0,"show_widget_help":"0"}
...which would be fine, except that my JavaScript is expecting a boolean there, and the non-empty string "0" gets evaluated to true. Why is the latter 0 quoted and the former not?
According to the JSON::XS docs, this is a main feature:
round-trip integrity
When you serialise a perl data
structure using only data types
supported by JSON, the deserialised
data structure is identical on the
Perl level. (e.g. the string "2.0"
doesn't suddenly become "2" just
because it looks like a number). There
minor are exceptions to this, read the
MAPPING section below to learn about
those.
...which says:
Simple Perl scalars (any scalar that
is not a reference) are the most
difficult objects to encode: JSON::XS
will encode undefined scalars as JSON
null values, scalars that have last
been used in a string context before
encoding as JSON strings, and anything
else as number value.
But I never use @vals[0] in a string context. Maybe DBD::Pg uses its boolean 0 as a string somewhere before returning it?
The JSON::XS doc says the following will be converted to true/false
references to the integers 0 and 1, ie. \0 and \1
JSON::XS::true and JSON::XS::false
Using one of these should solve your problem
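A sketch of the difference, with two assumptions stated up front: it uses the core JSON::PP module, which follows the same mapping rules for simple scalars, so that it runs without JSON::XS installed; and $from_db is a hypothetical stand-in for the value fetched via DBD::Pg.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->canonical;   # canonical: sorted keys, stable output

my $from_db = "0";   # hypothetical stand-in: a 0 that arrives as a string

# Encoded as-is, the string flag wins and the value is quoted.
my $as_string = $json->encode({ show_widget_help => $from_db });

# Adding 0 produces a fresh numeric scalar, which is encoded as a number.
my $as_number = $json->encode({ show_widget_help => 0 + $from_db });

# Mapping to the module's boolean objects produces a real JSON boolean.
my $as_boolean = $json->encode({
    show_widget_help => $from_db ? JSON::PP::true() : JSON::PP::false(),
});

print "$as_string\n$as_number\n$as_boolean\n";
```

With JSON::XS the last form would use JSON::XS::true and JSON::XS::false as mentioned above.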