Why does Perl reallocate memory following this pattern?

The memory addresses for anonymous arrays are naturally re-used by perl. As this example shows, they cycle between two addresses for empty arrays:
$ perl -E "say [] for (1..6)"
ARRAY(0x37b23c)
ARRAY(0x37b28c)
ARRAY(0x37b23c)
ARRAY(0x37b28c)
ARRAY(0x37b23c)
ARRAY(0x37b28c)
I had come up with some theories about why it couldn't reallocate the memory immediately, but then I found that the cycle isn't always two addresses long. In the following examples, the cycles are 3 and 4 addresses long.
$ perl -E "say [0] for (1..6)"
ARRAY(0x39b23c)
ARRAY(0x39b2ac)
ARRAY(0x39b28c)
ARRAY(0x39b23c)
ARRAY(0x39b2ac)
ARRAY(0x39b28c)
$ perl -E "say [0,0] for (1..6)"
ARRAY(0x64b23c)
ARRAY(0x64b2cc)
ARRAY(0x64b2ac)
ARRAY(0x64b28c)
ARRAY(0x64b23c)
ARRAY(0x64b2cc)
What causes this peculiarity of memory management?

When SVs are freed, they are actually put into a "free" pool. Perhaps the order in which they enter the pool affects the order in which they exit.

Within the set of examples you've given, the number of addresses is not "two, or sometimes more". It's "the number of elements in the anonymous array, plus two". As ikegami said, the SVs go into a pool when freed, so it is to be expected that the addresses will cycle in some fashion, unless a deliberate effort has been made to retrieve them in a random order (which has obviously not been done).
The remaining question, then, is why the length of the cycle is "number of elements + 2". Perhaps it's using one SV for each element of the array, one for the arrayref itself, and one for $_?
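A quick way to probe that hypothesis (a sketch; address reuse is an implementation detail, so the exact counts aren't guaranteed across perl builds):
use strict;
use warnings;
use feature 'say';

for my $n (0 .. 3) {
    my %seen;
    # stringify an anonymous array of $n elements, 1000 times;
    # each array is freed at the end of its statement
    $seen{ sprintf '%s', [ (0) x $n ] }++ for 1 .. 1000;
    say "$n element(s): ", scalar(keys %seen), " distinct addresses";
}
If the "elements + 2" pattern holds, this prints 2, 3, 4 and 5 distinct addresses.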


Does Perl's Glob have a limitation?

I am running the following, expecting it to return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
    print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
    print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here? Is there a limit of some sort, or is there a way around this?
If it makes any difference, it returns the same result in both perl 5.26 and perl 5.28.
glob first creates all possible file-name expansions: it generates the complete list from the shell-style glob/pattern it is given, and only then, when used in scalar context, does it iterate over that list. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 26^5 strings (11_881_376), each five chars long. So that's a list of ~12 million strings, with a (naive) total in excess of 56MB ... plus the overhead for a scalar, which I think is at minimum 12 bytes or so. So we're on the order of 100MB, at the very least, right there in one list.†
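A quick sanity check on that count:
$ perl -E 'say 26**5'
11881376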
I am not aware of any formal limits on lengths of things in Perl (other than in regexes), but glob does all of that internally, and there must be undocumented limits -- perhaps some buffers get overrun somewhere? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend writing an algorithm that generates and provides one list element at a time (an "iterator"), and working with that.
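For example, a hand-rolled "odometer" iterator along these lines (a sketch; the sub name is made up) produces one string per call and never materialises the full list:
use strict;
use warnings;

sub make_string_iter {
    my ($alphabet, $len) = @_;
    my @idx  = (0) x $len;    # one index per string position
    my $done = 0;
    return sub {
        return undef if $done;
        my $str = join '', @{$alphabet}[@idx];
        # advance the rightmost position, carrying leftward like an odometer
        my $pos = $len - 1;
        while ($pos >= 0 && ++$idx[$pos] == @$alphabet) {
            $idx[$pos--] = 0;
        }
        $done = 1 if $pos < 0;
        return $str;
    };
}

my $next = make_string_iter([ 'a' .. 'z' ], 5);
while (defined( my $s = $next->() )) {
    print "$s\n";
}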
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of the (~12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems, it returns a list with only the strings that actually correspond to files, so you'd quietly get different results.)
† I'm getting 56 bytes for the size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1GB, in one operation.
Update A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.
Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;

my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );

while( my $item = $set->get ) {
    say join '', @$item;
}

Perl variables defined with * vs $

What's the difference between defining a variable with a * vs a $? For example:
local $var;
local *var;
The initial character is known as a sigil, and says what sort of value the identifier represents. You will know most of them. Here's a list
Dollar $ is a scalar value
At sign @ is an array value
Percent % is a hash value
Ampersand & is a code value
Asterisk * is a typeglob
You are less likely to have come across the last two recently, because & hasn't been necessary when calling subroutines since Perl 5.0 was released. And typeglobs are a special type that contains all of the other types, and are much more rarely used.
I'm considering how much deeper to go into all of this, but will leave my answer as it is for now. I may write more depending on the comments that arise.
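If it helps, here's a minimal sketch of what makes typeglobs special (the variable names are invented; strict 'vars' is left off so the package variables keep the demo short):
use v5.10;
no strict 'vars';    # package variables, to keep the glob demo short

$thing = 'a scalar';
@thing = ('an', 'array');

*alias = *thing;     # alias the entire glob: $alias, @alias, %alias, &alias

say $alias;          # prints "a scalar"
say "@alias";        # prints "an array"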
$var is a scalar. *var is a typeglob. http://perldoc.perl.org/perldata.html#Typeglobs-and-Filehandles
It's not a variable in the strictest sense. You shouldn't generally be using it.

Why does this function use a lot of memory?

I'm trying to unpack a binary vector of 140 million bits into a list.
I'm checking the memory usage of this function, but it looks weird: the memory usage rises to 35GB (GB, not MB). How can I reduce the memory usage?
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector);
    return @unpacked;
}
Scalars contain a lot of information.
$ perl -MDevel::Peek -e'Dump("0")'
SV = PV(0x42a8330) at 0x42c57b8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x42ce670 "0"\0
CUR = 1
LEN = 16
In order to keep them as small as possible, a scalar consists of two memory blocks[1], a fixed-sized head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is a SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information about the type of the scalar (what kind of information it can contain).
For an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right is the string buffer. As you might have noticed, Perl over-allocates. This speeds up concatenation.
Ignore the block on the bottom. It's an alternative to the string buffer format for special strings (e.g. hash keys).
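You can watch CUR and LEN diverge by appending to a string and dumping it (a quick check; the exact flags and LEN are build-dependent):
$ perl -MDevel::Peek -e'my $s = "0"; $s .= "0"; Dump($s)'
In the output, CUR reads 2 (the string is now "00") while LEN will typically still read 16, showing the over-allocated buffer.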
How much does that add up to?
$ perl -MDevel::Size=total_size -E'say total_size("0")'
28 # 32-bit Perl
56 # 64-bit Perl
That's just for the scalar itself. It doesn't take into account the memory allocator's overhead for the three memory blocks.
These scalars are in an array. An array is really just a scalar head with an array-typed body.
So an array has overhead too.
$ perl -MDevel::Size=total_size -E'say total_size([])'
56 # 32-bit Perl
64 # 64-bit Perl
That's an empty array. Yours holds 140 million scalars, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64-bit one.
That brings the total up to:
32-bit: 56 + (4 + 28) * 140,000,000 = 4,480,000,056
64-bit: 64 + (8 + 56) * 140,000,000 = 8,960,000,064
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually different from the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit, since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to hold on to that memory even after it exits.
Footnotes
[1] Scalars that are undefined can get away without a body until a value is assigned to them. Scalars that contain only an integer can get away without allocating a memory block for the body by storing the integer in the same field where an SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by Copyright.
A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector);
    return \@unpacked;
}
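The caller then dereferences the result (a usage sketch, with a hypothetical $vector already in scope):
my $bits = bin2list($vector);
print "total bits: ", scalar @$bits, ", first: $bits->[0]\n";
Only one scalar (the reference) is copied back to the caller, rather than 140 million.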

What do these lines in `dna2protein.pl` do?

I'm a newbie to Perl and I found a script to convert a DNA sequence to a protein sequence using Perl. I don't understand what some lines in that script do, especially the following:
my(%g) = (
    'TCA'=>'S', 'TCC'=>'S', 'TCG'=>'S', 'TCT'=>'S',
    'TTC'=>'F', 'TTT'=>'F', 'TTA'=>'L', 'TTG'=>'L',
    'TAC'=>'Y', 'TAT'=>'Y', 'TAA'=>'_', 'TAG'=>'_',
    'TGC'=>'C', 'TGT'=>'C', 'TGA'=>'_', 'TGG'=>'W',
    'CTA'=>'L', 'CTC'=>'L', 'CTG'=>'L', 'CTT'=>'L',
    'CCA'=>'P', 'CCC'=>'P', 'CCG'=>'P', 'CCT'=>'P',
    'CAC'=>'H', 'CAT'=>'H', 'CAA'=>'Q', 'CAG'=>'Q',
    'CGA'=>'R', 'CGC'=>'R', 'CGG'=>'R', 'CGT'=>'R',
    'ATA'=>'I', 'ATC'=>'I', 'ATT'=>'I', 'ATG'=>'M',
    'ACA'=>'T', 'ACC'=>'T', 'ACG'=>'T', 'ACT'=>'T',
    'AAC'=>'N', 'AAT'=>'N', 'AAA'=>'K', 'AAG'=>'K',
    'AGC'=>'S', 'AGT'=>'S', 'AGA'=>'R', 'AGG'=>'R',
    'GTA'=>'V', 'GTC'=>'V', 'GTG'=>'V', 'GTT'=>'V',
    'GCA'=>'A', 'GCC'=>'A', 'GCG'=>'A', 'GCT'=>'A',
    'GAC'=>'D', 'GAT'=>'D', 'GAA'=>'E', 'GAG'=>'E',
    'GGA'=>'G', 'GGC'=>'G', 'GGG'=>'G', 'GGT'=>'G',
);
if(exists $g{$codon})
{
    return $g{$codon};
}
else
{
    print STDERR "Bad codon \"$codon\"!!\n";
    exit;
}
Can someone please explain?
My Perl is rusty, but anyway.
The first line creates a hash (which is Perl's version of a hash table). The variable is called g (a bad name, by the way). The % sigil before g indicates that it is a hash; Perl uses sigils to denote types. The hash is initialised using the fat-comma (=>) syntax: 'TTT'=>'F' creates an entry TTT in the hash table with value F. The my gives the variable a lexical scope.
The next few lines are fairly self-explanatory. They check whether the hash contains an entry with the key $codon. The $ sigil indicates that it's a scalar value. If it exists, you get the value; otherwise, the message is printed to standard error.
Since you're new to Perl, you should read a little about Perl itself before you try to decrypt its syntax on your own. (Perl values a good Huffman encoding, and is also somewhat encrypted. ;-) Start with the 'perldoc perlintro' command, and go from there. If you're using Ubuntu, for instance, this documentation can be installed via
$ sudo apt-get install perl-doc
but it is also available in this file: Perl Reference documentation
In addition to perlintro, some other suggested reading is perlsyn (syntax description), perldata (data structures), perlop (operators, including quotes), perlreftut (intro to references), and perlvar (predefined variables and their meanings), in roughly that order.
I learnt perl from these, and I still refer to them often.
Also, if your DNA script has POD documentation, then you can view that neatly by typing
$ perldoc <script-filename>
(of course, POD documentation is listed in the source, in a rougher form; read perlpod for more details on documentation format)
If you are new to Perl with an interest to understand more quickly, you might begin with this web collection learn.perl. A nice supplement is the online Perl documentation of perldoc. Good luck and have fun.
In this case it looks like the %g hash serves as both a way to identify whether a codon is within the set of valid condons (hash keys) and for some mapping to what type of codon it is (hash value).
Hashes serve as a way to link unique keys with a value, but they also serve as unique lists of keys. In some cases you may see keys added to a hash and set to undef. This is a good sign that the hash is being used to track unique values of some type.
The codon is passed in to the function and upper-cased, and then a hash of codons is checked to see whether a codon of that value is registered. If the codon exists, the registered value for that codon is returned; otherwise an error is output and the program ends.
The my(%g) creates a hash, which is a structure that allows you to quickly look up a value by giving the key for that value. So, for instance, 'TCA'=>'S' maps the key 'TCA' to the value 'S'. If you ask the %g hash for the value held for 'TCA', you will get 'S' ($g{'TCA'} will equal 'S').
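Putting those explanations together, here is a trimmed, runnable sketch of the same lookup pattern (only four codons shown; the sub name is invented):
use strict;
use warnings;

my %g = ('TCA' => 'S', 'TTT' => 'F', 'ATG' => 'M', 'TAA' => '_');

sub codon2aa {
    my $codon = uc shift;           # the original script upper-cases its input too
    return $g{$codon} if exists $g{$codon};
    print STDERR "Bad codon \"$codon\"!!\n";
    exit;
}

print codon2aa('ttt'), "\n";        # prints "F"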

Limiting the amount of information printed by Perl debugger

One of my pet peeves when debugging Perl code (in the command-line debugger, perl -d) is the fact that mistakenly printing (via the x command) the contents of a huge data structure is guaranteed to freeze up your terminal for forever and a half while hundreds of pages of data are printed. Especially if that happens across a slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort, although I see enough pitfalls to make a "perfect" one a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of datastructure (e.g. print the ref, print the count of keys/array elements, etc...)... the whole point is I want to be able to avoid thoughtlessly typing x $huge_pile_of_data without thinking - or stumbling on a bug populating said huge pile of data into what should be a scalar.
The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
DB<1> %h = (a => { b => { c => 1 } } )
DB<2> x %h
0  'a'
1  HASH(0x1d5ff44)
   'b' => HASH(0x1d61424)
      'c' => 1
DB<3> x 2 %h
0  'a'
1  HASH(0x1d5ff44)
   'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
DB<1> o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
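For instance, a one-line ~/.perldb does it (parse_options is the hook that perldebug documents for setting options from that file):
# ~/.perldb -- runs at debugger start-up
parse_options("dumpDepth=1");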
Otherwise, it looks like the x command invokes DB::dumpit() which is just a wrapper for dumpval.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.
The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure