In Perl, how can I release memory to the operating system?

I am having some problems with memory in Perl. When I fill up a big hash, I can not get the memory to be released back to the OS. When I do the same with a scalar and use undef, it will give the memory back to the OS.
Here is a test program I wrote.
#!/usr/bin/perl
###### Memory test
######
## Use Commands
use Number::Bytes::Human qw(format_bytes);
use Data::Dumper;
use Devel::Size qw(size total_size);
## Create Varable
my $share_var;
my %share_hash;
my $type_hash = 1;
my $type_scalar = 1;
## Start Main Loop
while (true) {
&Memory_Check();
print "Hit Enter (add to memory): "; <>;
&Up_Mem(100_000);
&Memory_Check();
print "Hit Enter (Set Varable to nothing): "; <>;
$share_var = "";
$share_hash = ();
&Memory_Check();
print "Hit Enter (clean data): "; <>;
&Clean_Data();
&Memory_Check();
print "Hit Enter (start over): "; <>;
}
exit;
#### Up Memory
sub Up_Mem {
my $total_loops = shift;
my $n = 1;
print "Adding data to shared varable $total_loops times\n";
until ($n > $total_loops) {
if ($type_hash) {
$share_hash{$n} = 'X' x 1111;
}
if ($type_scalar) {
$share_var .= 'X' x 1111;
}
$n += 1;
}
print "Done Adding Data\n";
}
#### Clean up Data
sub Clean_Data {
print "Clean Up Data\n";
if ($type_hash) {
## Method to fix hash (Trying Everything i can think of!
my $n = 1;
my $total_loops = 100_000;
until ($n > $total_loops) {
undef $share_hash{$n};
$n += 1;
}
%share_hash = ();
$share_hash = ();
undef $share_hash;
undef %share_hash;
}
if ($type_scalar) {
undef $share_var;
}
}
#### Check Memory Usage
sub Memory_Check {
## Get current memory from shell
my @mem = `ps aux | grep \"$$\"`;
my($results) = grep !/grep/, @mem;
## Parse Data from Shell
chomp $results;
$results =~ s/^\w*\s*\d*\s*\d*\.\d*\s*\d*\.\d*\s*//g; $results =~ s/pts.*//g;
my ($vsz,$rss) = split(/\s+/,$results);
## Format Numbers to Human Readable
my $h = Number::Bytes::Human->new();
my $virt = $h->format($vsz);
my $h = Number::Bytes::Human->new();
my $res = $h->format($rss);
print "Current Memory Usage: Virt: $virt RES: $res\n";
if ($type_hash) {
my $total_size = total_size(\%share_hash);
my @arr_c = keys %share_hash;
print "Length of Hash: " . ($#arr_c + 1) . " Hash Mem Total Size: $total_size\n";
}
if ($type_scalar) {
my $total_size = total_size($share_var);
print "Length of Scalar: " . length($share_var) . " Scalar Mem Total Size: $total_size\n";
}
}
OUTPUT:
./Memory_Undef_Simple.cgi
Current Memory Usage: Virt: 6.9K RES: 2.7K
Length of Hash: 0 Hash Mem Total Size: 92
Length of Scalar: 0 Scalar Mem Total Size: 12
Hit Enter (add to memory):
Adding data to shared varable 100000 times
Done Adding Data
Current Memory Usage: Virt: 228K RES: 224K
Length of Hash: 100000 Hash Mem Total Size: 116813243
Length of Scalar: 111100000 Scalar Mem Total Size: 111100028
Hit Enter (Set Varable to nothing):
Current Memory Usage: Virt: 228K RES: 224K
Length of Hash: 100000 Hash Mem Total Size: 116813243
Length of Scalar: 0 Scalar Mem Total Size: 111100028
Hit Enter (clean data):
Clean Up Data
Current Memory Usage: Virt: 139K RES: 135K
Length of Hash: 0 Hash Mem Total Size: 92
Length of Scalar: 0 Scalar Mem Total Size: 24
Hit Enter (start over):
So as you can see the memory goes down, but it only goes down the size of the scalar. Any ideas how to free the memory of the hash?
Also Devel::Size shows the hash is only taking up 92 bytes even though the program still is using 139K.

Generally, yeah, that's how memory management on UNIX works. If you are using Linux with a recent glibc, and are using that malloc, you can return free'd memory to the OS. I am not sure Perl does this, though.
If you want to work with large datasets, don't load the whole thing into memory, use something like BerkeleyDB:
https://metacpan.org/pod/BerkeleyDB
Example code, stolen verbatim:
use strict ;
use BerkeleyDB ;
my $filename = "fruit" ;
unlink $filename ;
tie my %h, "BerkeleyDB::Hash",
-Filename => $filename,
-Flags => DB_CREATE
or die "Cannot open file $filename: $! $BerkeleyDB::Error\n" ;
# Add a few key/value pairs to the file
$h{apple} = "red" ;
$h{orange} = "orange" ;
$h{banana} = "yellow" ;
$h{tomato} = "red" ;
# Check for existence of a key
print "Banana Exists\n\n" if $h{banana} ;
# Delete a key/value pair.
delete $h{apple} ;
# print the contents of the file
while (my ($k, $v) = each %h)
{ print "$k -> $v\n" }
untie %h ;
(OK, not verbatim. Their use of use vars is ... legacy ...)
You can store gigabytes of data in a hash this way, and you will only use a tiny bit of memory. (Basically, whatever BDB's pager decides to keep in memory; this is controllable.)

In general, you cannot expect perl to release memory to the OS.
See the FAQ: How can I free an array or hash so my program shrinks?.
You usually can't. Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
On most operating systems, memory allocated to a program can never be returned to the system. That's why long-running programs sometimes re-exec themselves. Some operating systems (notably, systems that use mmap(2) for allocating large chunks of memory) can reclaim memory that is no longer used, but on such systems, perl must be configured and compiled to use the OS's malloc, not perl's.
It is always a good idea to read the FAQ list, also installed on your computer, before wasting your time.
For example, How can I make my Perl program take less memory? is probably relevant to your issue.

Why do you want Perl to release the memory to the OS? You could just use a larger swap.
If you really must, do your work in a forked process, then exit.
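A minimal sketch of the fork approach, assuming a Unix-like system where fork is available. The helper name run_in_child is made up for illustration. The heavy work happens in the child, only a small result travels back through a pipe, and the child's memory goes back to the OS when it exits.

```perl
use strict;
use warnings;
use POSIX ();

# Run a memory-hungry job in a child process and return its (small) result.
# run_in_child is a hypothetical helper name, not part of any module.
sub run_in_child {
    my ($job) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: do the heavy work and send back only the summary.
        close $reader;
        print {$writer} $job->();
        close $writer;
        # Leave without running END blocks or flushing inherited buffers.
        POSIX::_exit(0);
    }
    # Parent: read the result; the child's memory is released when it exits.
    close $writer;
    my $result = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);
    return $result;
}

my $count = run_in_child(sub {
    my %big = map { $_ => 'X' x 1111 } 1 .. 10_000;
    return scalar keys %big;
});
print "Child counted $count keys\n";
```

The parent never builds the big hash itself, so its own footprint stays small no matter how perl's allocator behaves.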

Try recompiling perl with the option -Uusemymalloc to use the system malloc and free. You might see some different results
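Before recompiling, it may help to check how your current perl was built. A small sketch using the core Config module; uses_perl_malloc is a hypothetical helper name.

```perl
use strict;
use warnings;
use Config;

# 'y' means perl was built with its own malloc, 'n' means the system malloc.
sub uses_perl_malloc {
    return ( ( $Config{usemymalloc} // 'n' ) eq 'y' ) ? 1 : '';
}

print "usemymalloc=", ( $Config{usemymalloc} // 'undef' ), "\n";
print uses_perl_malloc()
    ? "This perl uses its own malloc\n"
    : "This perl uses the system malloc\n";
```

The same value is shown by perl -V:usemymalloc on the command line.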

Related

How can I remove perl object from memory

I'm having some issues with the memory usage of a perl script I wrote (code below). The script initiates some variables, fills them with data, and then undefines them again. However, the memory usage of the script after deleting everything is still way too high to contain no data.
According to ps the script uses 1.027 Mb memory (RSS) during the first 39 seconds (so everything before the foreach loop). Then, memory usage starts rising and ends up fluctuating between 204.391 Mb and 172.410 Mb. However, even in the last 10 seconds of the script (where all data is supposed to be removed), memory usage never goes below 172.410 Mb.
Is there a way to permanently delete a variable and all data in it in perl (in order to reduce the memory usage of the script)? If so, how should I do it?
use strict;
use warnings;
sleep(30);
my $ELEMENTS = 1_000_000;
my $MAX_ELEMENT = 1_000_000_000;
my $if_condition = 1;
sleep(5);
my %hash = (1 => {}, 2 => {}, 3 => {}, 4 => {});
foreach my $key (keys %hash){
if( $if_condition ){
my $arrref1 = [ (rand($MAX_ELEMENT)) x $ELEMENTS ];
my $arrref2 = [ (rand($MAX_ELEMENT)) x $ELEMENTS ];
my $arrref3 = [ (rand($MAX_ELEMENT)) x $ELEMENTS ];
sleep(2);
if(!defined($hash{$key}->{'amplification'})){
$hash{$key}->{'amplification'} = [];
}
push(@{$hash{$key}->{'amplification'}},@{$arrref1});
undef($arrref1);
push(@{$hash{$key}->{'amplification'}},@{$arrref2});
undef($arrref2);
push(@{$hash{$key}->{'amplification'}},@{$arrref3});
undef($arrref3);
sleep(3);
delete($hash{$key});
sleep(5);
}
}
sleep(10);
Perl FAQ 3 - How can I free an array or hash so my program shrinks?
You usually can't. Memory allocated to lexicals (i.e. my() variables)
cannot be reclaimed or reused even if they go out of scope. It is
reserved in case the variables come back into scope. Memory allocated
to global variables can be reused (within your program) by using
undef() and/or delete().
On most operating systems, memory allocated
to a program can never be returned to the system. That's why
long-running programs sometimes re-exec themselves. Some operating
systems (notably, systems that use mmap(2) for allocating large chunks
of memory) can reclaim memory that is no longer used, but on such
systems, perl must be configured and compiled to use the OS's malloc,
not perl's.
In general, memory allocation and de-allocation isn't
something you can or should be worrying about much in Perl.
See also
"How can I make my Perl program take less memory?"
In general, perl won't release memory back to the system. It keeps its own pool of memory in case it is required for another purpose. This happens a lot because lexical data is often used in a loop; for instance, each of your $arrref variables refers to a million-element array. If the memory for those arrays was returned to the system and reallocated every time around the loop, there would be an enormous speed penalty.
As I wrote, 170MB isn't a lot, but you can reduce the footprint by dropping your big temporary arrays and adding the list directly to the hash element. As it stands you are unnecessarily keeping two copies of each array.
It would look like this:
use strict;
use warnings 'all';
sleep 30;
use constant ELEMENTS => 1_000_000;
use constant MAX_ELEMENT => 1_000_000_000;
my $if_condition = 1;
sleep 5;
my %hash = ( 1 => {}, 2 => {}, 3 => {}, 4 => {} );
foreach my $key ( keys %hash ) {
next unless $if_condition;
sleep 2;
push @{ $hash{$key}{amplification} }, (rand MAX_ELEMENT) x ELEMENTS;
push @{ $hash{$key}{amplification} }, (rand MAX_ELEMENT) x ELEMENTS;
push @{ $hash{$key}{amplification} }, (rand MAX_ELEMENT) x ELEMENTS;
sleep 3;
delete $hash{$key};
sleep 5;
}
sleep 10;

Count subsequences in hundreds of GB of data

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'
my $seq = <$in>; # read whole file into string
my $len = length($seq);
my $seq_length = 5; # set k-mer length
my %data;
for (my $i = 0; $i <= $len - $seq_length; $i++) {
my $kmer = substr($seq, $i, $seq_length);
$data{$kmer}++;
}
# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
print "$kmer $data{$kmer}\n";
$count++;
last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate this here.
Is substr the most memory efficient tool for this task?
From your own code it looks like your data file has just a single line of data, not broken up by newline characters, so I've assumed that in my solution below. Even if the line has one newline character at the end, selecting the five most frequent subsequences at the end will throw this out, as it happens only once.
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory.
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer.
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file.
I've also used constants for the K-mer size and the chunk size. They are constant after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort.
use strict;
use warnings 'all';
use constant SEQ_LENGTH => 5; # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB
my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};
my %data;
my $chunk;
my $length = 0;
while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
$length += $size;
for my $offset ( 0 .. $length - SEQ_LENGTH ) {
my $kmer = substr $chunk, $offset, SEQ_LENGTH;
++$data{$kmer};
}
$chunk = substr $chunk, -(SEQ_LENGTH-1);
$length = length $chunk;
}
my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
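To see the carry-over at work, here is a small self-contained sketch. It assumes read on an in-memory filehandle instead of sysread on a real file, and the function name count_kmers is mine, not from the answer above. With a chunk size of 7 the sample string is split into several chunks, yet the k-mers that straddle a boundary are still counted.

```perl
use strict;
use warnings;

# Count k-mers from a filehandle in fixed-size chunks, carrying the last
# k-1 characters over so k-mers spanning a chunk boundary are not lost.
sub count_kmers {
    my ($fh, $k, $chunk_size) = @_;
    my %count;
    my $buf = '';
    # Append each new chunk to whatever was carried over from the last one.
    while ( read($fh, $buf, $chunk_size, length $buf) ) {
        for my $offset ( 0 .. length($buf) - $k ) {
            $count{ substr $buf, $offset, $k }++;
        }
        # Drop everything except the last k-1 characters.
        substr($buf, 0, length($buf) - ($k - 1), '') if length($buf) >= $k;
    }
    return \%count;
}

my $data = 'abcdefabcgbacbdebdbbcaebfebfebfeb';
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_kmers($fh, 5, 7);
print "$_ $count->{$_}\n" for qw(ebfeb febfe bfebf);
```

This prints counts of 3, 2 and 2 for the three k-mers, the same as processing the string in one piece.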
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 = 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
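The arithmetic is easy to check. A quick sketch, with helper names (unique_kmers, total_kmers) that are only illustrative:

```perl
use strict;
use warnings;

# Number of distinct k-mers over an alphabet of the given size.
sub unique_kmers {
    my ($alphabet_size, $k) = @_;
    return $alphabet_size ** $k;
}

# Number of overlapping k-mers in a string of $len characters.
sub total_kmers {
    my ($len, $k) = @_;
    return $len < $k ? 0 : $len - $k + 1;
}

printf "distinct 5-mers over 5 letters:  %d\n",   unique_kmers(5, 5);
printf "distinct 20-mers over 5 letters: %.3g\n", unique_kmers(5, 20);
printf "20-mers in a 10 GiB file:        %.3g\n", total_kmers(10 * 1024**3, 20);
```

When the number of possible k-mers is far larger than the number of k-mers in the file, the hash can end up with nearly one key per position in the worst case.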
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
use strict;
use warnings qw( all );
use constant SEQ_LENGTH => 20;
use constant CHUNK_SIZE => 1024 * 1024;
my $buf = "";
while (1) {
my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
die($!) if !defined($size);
last if !$size;
for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
print(substr($buf, $offset, SEQ_LENGTH), "\n");
}
substr($buf, 0, -(SEQ_LENGTH-1), "");
}
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the count along with the sequence in another file.
perl -e'
use strict;
use warnings qw( all );
my $last = "";
my $count;
while (<>) {
chomp;
if ($_ eq $last) {
++$count;
} else {
print("$count $last\n") if $count;
$last = $_;
$count = 1;
}
}
# Print the final group, which the loop above never reaches.
print("$count $last\n") if $count;
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
use strict;
use warnings qw( all );
my $last_count;
while (<>) {
my ($count, $seq) = split;
last if $. > 5 && $count != $last_count;
print("$seq $count\n");
$last_count = $count;
}
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log256(5^20)) = 6 bytes then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
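A rough sketch of the 6-byte packing, assuming a perl built with 64-bit integers and the five letters A, C, G, N and T (the alphabet of the test data used elsewhere on this page; adjust as needed). The function names are hypothetical, and the base64 step is left out.

```perl
use strict;
use warnings;

my @ALPHABET = qw(A C G N T);              # assumed 5-letter alphabet
my %DIGIT;
@DIGIT{@ALPHABET} = 0 .. $#ALPHABET;

# Treat a 20-character sequence as a base-5 number and pack it into 6 bytes.
sub encode_seq {
    my ($seq) = @_;
    my $n = 0;
    for my $char (split //, $seq) {
        my $digit = $DIGIT{$char} // die "unexpected character '$char'";
        $n = $n * 5 + $digit;
    }
    # 5**20 is below 2**48, so 16 high bits plus 32 low bits are enough.
    return pack 'nN', $n >> 32, $n & 0xFFFF_FFFF;
}

# Reverse of encode_seq: unpack the 6 bytes and peel off base-5 digits.
sub decode_seq {
    my ($bytes) = @_;
    my ($hi, $lo) = unpack 'nN', $bytes;
    my $n = ($hi << 32) | $lo;
    my $seq = '';
    for (1 .. 20) {
        my $digit = $n % 5;
        $seq = $ALPHABET[$digit] . $seq;
        $n = ($n - $digit) / 5;
    }
    return $seq;
}

my $packed = encode_seq('NCCANGCTNGGNCGNNANNA');
print length($packed), " bytes, decodes to ", decode_seq($packed), "\n";
```

Because the number is packed big-endian, sorting the packed strings bytewise gives the same order as sorting the original sequences with this alphabet order.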
Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.
Generally speaking, Perl is really slow at character-by-character processing like the solutions posted above; it is much faster with something like regular expressions, since the overhead is mainly the number of operators you execute.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and then we pass over it 5 times generating sequences of 5 characters, after the first pass we start chopping off a character from the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but perl cheats for this special case and just sets the index of the string to 1+ what it was before.
I haven't benchmarked this, but I bet something like this is a much more viable way to do it than the algorithms above. You could of course also do the unique counting in Perl by incrementing a hash (using the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
substr($str, 0, 1) = "" if $pos;
$occur{$_}++ for $str =~ m/(.{5})/gs;
}
for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
while ($s = <>) { for $i (0..length $s) {
$hash{ substr($s, $i, 5) }++ } }
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNot, @borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;
open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );
my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) {
foreach my $i (0..length $string) {
$hash{ substr($string, $i, 5) }++
}
}
my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1Gb in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
open (my $fh, ">>", "nucleotide.data");
for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';
$/ = \1048575; # Read a million character/bytes at a time.
my %hash;
while (my $seq = <>) {
chomp $seq;
foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
next unless length $kmer == 5;
$hash{$kmer}++;
}
}
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWTDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In a larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command line at the beginning of this post. The larger file meant a number of chunk boundaries and a discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.
1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000 of 1% from the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697

Perl memory allocation

The following simple C code allocates about 1.6% of my computer memory and completes in less than 2 seconds:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int i = 0;
char *array = malloc(64000000);
for (i = 0; i < 64000000; i++) {
array[i] = i % 256;
}
getchar();
return 0;
}
How can I do a similar thing in Perl?
The following Perl code consumes about 70% of my computer memory (At which I kill it)
my @array;
for(my $i=0;$i<64000000;$i++)
{
$array[$i]=1;
}
getc();
exit;
How do I malloc in Perl ?
You allocated an array of 64,000,000 SV* plus 64,000,000 scalars. The array alone is already 8 times the size of what you allocated in your C program. That's not counting any of the 64,000,000 scalars or the overhead of allocating 64,000,000 memory blocks.
To allocate 64,000,000 bytes, you can use the following:
my $s = "\0" x 64_000_000;
However, that places two copies in memory.[1] The following doesn't.
use Fcntl qw( SEEK_SET );
my $s;
{
open my $fh, '>', \$s;
seek($fh, 64_000_000-1, SEEK_SET);
print $fh "\0";
}
pack+substr can be used to store a number, and substr+unpack can be used to extract a number.
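A minimal sketch of that idea, assuming fixed-width 32-bit unsigned numbers stored big-endian. The helper names (make_buffer, set_num, get_num) are invented for illustration.

```perl
use strict;
use warnings;

my $WIDTH = 4;    # bytes per element (32-bit unsigned, an assumption)

# Allocate a zero-filled buffer that can hold $n fixed-width numbers.
sub make_buffer {
    my ($n) = @_;
    return "\0" x ($n * $WIDTH);
}

# Store $value at index $i: pack it and overwrite 4 bytes in place.
sub set_num {
    my ($buf_ref, $i, $value) = @_;
    substr($$buf_ref, $i * $WIDTH, $WIDTH) = pack 'N', $value;
    return;
}

# Fetch the number at index $i: slice out 4 bytes and unpack them.
sub get_num {
    my ($buf_ref, $i) = @_;
    return unpack 'N', substr($$buf_ref, $i * $WIDTH, $WIDTH);
}

my $buf = make_buffer(1_000);
set_num(\$buf, 42, 123_456);
print get_num(\$buf, 42), "\n";    # prints 123456
```

Each element costs exactly 4 bytes here, instead of a full Perl scalar per element.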
Finally, rather than dealing with packed numbers, you could use PDL.
Technically, it only places one copy into memory, and it does so at compile-time. Thanks to the copy-on-write (COW) mechanism, the assignment simply causes $s to share the buffer of the constant. But, I presume you intend to modify the buffer in $s, which would require making a writable copy of its buffer.
You are seeing the difference in variable sizes between languages.
See http://perlmaven.com/how-much-memory-do-perl-variables-use
This also has a good explanation of memory usage:
http://search.cpan.org/~nwclark/Devel-Size-0.79/lib/Devel/Size.pm
In short, your perl array will need at least 1536 MB of space to store that array.

Is there a way to record the maximum memory used while running in Perl

As I said, I want to record the max memory used during program run time.
Devel::Size only measures one particular data structure at a time. To measure the total memory used by your script on Unix-like systems, Proc::ProcessTable provides a nice API:
Here is a simple script, comparing it with Devel::Size.
#!/usr/bin/perl -w
use strict;
use Proc::ProcessTable;
use Devel::Size qw(size);
my @arr = ('A' .. 'M');
my $devel_size = size(\@arr);
print "With DEVEL::SIZE I'm $devel_size bytes big\n";
my $t = Proc::ProcessTable->new();
foreach my $p ( @{$t->table} ) {
if($p->pid() == $$) {
print "With Proc::ProcessTable I'm ", $p->size(), " bytes big.\n";
last;
}
}
It gives:
With DEVEL::SIZE I'm 104 bytes big.
With Proc::ProcessTable I'm 5357568 bytes big.
Note: source of Info: http://www.perlmonks.org/
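Both modules above report the current size, not the peak. On Linux, one option (my assumption, not something from the answers here) is to read VmHWM, the peak resident set size, from /proc/self/status near the end of the run. A minimal sketch; status_kb is a hypothetical helper:

```perl
use strict;
use warnings;

# Extract a "Name:   12345 kB" field from /proc/<pid>/status text.
sub status_kb {
    my ($text, $field) = @_;
    return $text =~ /^\Q$field\E:\s+(\d+)\s+kB/m ? $1 : undef;
}

# Linux only: VmRSS is the current resident set size, VmHWM its peak.
if ( open my $fh, '<', '/proc/self/status' ) {
    my $text = do { local $/; <$fh> };
    close $fh;
    for my $field (qw(VmRSS VmHWM)) {
        my $kb = status_kb($text, $field);
        print "$field: ", ( defined $kb ? "$kb kB" : 'n/a' ), "\n";
    }
}
```

On systems without /proc the open simply fails and nothing is printed.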
Found this on perlmonks, http://www.perlmonks.org/?node_id=498401:
use Devel::Size qw/ total_size /;
print total_size( \%:: );

How do I get the size of a file in megabytes using Perl?

I want to get the size of a file on disk in megabytes. Using the -s operator gives me the size in bytes, but I'm going to assume that then dividing this by a magic number is a bad idea:
my $size_in_mb = (-s $fh) / (1024 * 1024);
Should I just use a read-only variable to define 1024 or is there a programmatic way to obtain the amount of bytes in a kilobyte?
EDIT: Updated the incorrect calculation.
If you'd like to avoid magic numbers, try the CPAN module Number::Bytes::Human.
use Number::Bytes::Human qw(format_bytes);
my $size = format_bytes(-s $file); # 4.5M
This is an old question that has already been correctly answered, but in case your program is constrained to the core modules and you cannot use Number::Bytes::Human, here are several other options I have collected over time. I have kept them because each one uses a different Perl approach and is a nice example of TIMTOWTDI:
example 1: uses state to avoid reinitializing the variable on each call (state has to be enabled with use feature 'state', use v5.10 or later, or perl -E)
http://kba49.wordpress.com/2013/02/17/format-file-sizes-human-readable-in-perl/
sub formatSize {
my $size = shift;
my $exp = 0;
state $units = [qw(B KB MB GB TB PB)];
for (@$units) {
last if $size < 1024;
$size /= 1024;
$exp++;
}
return wantarray ? ($size, $units->[$exp]) : sprintf("%.2f %s", $size, $units->[$exp]);
}
example 2: using sort map
sub scaledbytes {
# http://www.perlmonks.org/?node_id=378580
(sort { length $a <=> length $b
} map { sprintf '%.3g%s', $_[0]/1024**$_->[1], $_->[0]
}[" bytes"=>0]
,[KB=>1]
,[MB=>2]
,[GB=>3]
,[TB=>4]
,[PB=>5]
,[EB=>6]
)[0]
}
example 3: Take advantage of the fact that 1 Gb = 1024 Mb, 1 Mb = 1024 Kb and 1024 = 2 ** 10:
# http://www.perlmonks.org/?node_id=378544
my $kb = 1024 * 1024; # set to 1 Gb
my $mb = $kb >> 10;
my $gb = $mb >> 10;
print "$kb kb = $mb mb = $gb gb\n";
__END__
1048576 kb = 1024 mb = 1 gb
example 4: use of ++$n and ... until .. to obtain an index for the array
# http://www.perlmonks.org/?node_id=378542
#! perl -slw
use strict;
sub scaleIt {
my( $size, $n ) =( shift, 0 );
++$n and $size /= 1024 until $size < 1024;
return sprintf "%.2f %s",
$size, ( qw[ bytes KB MB GB ] )[ $n ];
}
my $size = -s $ARGV[ 0 ];
print "$ARGV[ 0 ]: ", scaleIt $size;
Even if you can not use Number::Bytes::Human, take a look at the source code to see all the things that you need to be aware of.
You could of course create a function for calculating this. That is a better solution than creating constants in this instance.
sub size_in_mb {
my $size_in_bytes = shift;
return $size_in_bytes / (1024 * 1024);
}
No need for constants. Changing the 1024 to some kind of variable/constant won't make this code more readable.
Well, there's not 1024 bytes in a meg, there's 1024 bytes in a K, and 1024 K in a meg...
That said, 1024 is a safe "magic" number that will never change in any system you can expect your program to work in.
I would read this into a variable rather than use a magic number. Even if magic numbers are not going to change, like the number of bytes in a megabyte, using a well named constant is a good practice because it makes your code more readable. It makes it immediately apparent to everybody else what your intention is.
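A short sketch of the named-constant variant, using the core constant pragma. BYTES_PER_MB is just a name I picked for illustration.

```perl
use strict;
use warnings;
use constant BYTES_PER_MB => 1024 * 1024;    # hypothetical constant name

# Convert a size in bytes to (binary) megabytes.
sub size_in_mb {
    my ($size_in_bytes) = @_;
    return $size_in_bytes / BYTES_PER_MB;
}

printf "%.2f MB\n", size_in_mb(5 * 1024 * 1024);    # prints 5.00 MB
```

For a real file you would pass -s $file as the argument.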
1) You don't want 1024. That gives you kilobytes. You want 1024*1024, or 1048576.
2) Why would dividing by a magic number be a bad idea? It's not like the number of bytes in a megabyte will ever change. Don't overthink things too much.
Don't get me wrong, but: I think that declaring 1024 as a Magic Variable goes a bit too far, that's a bit like "$ONE = 1; $TWO = 2;" etc.
A Kilobyte has been falsely declared as 1024 Bytes for more than 20 years, and I seriously doubt that the operating system manufacturers will ever correct that bug and change it to 1000.
What could make sense though is to declare non-obvious stuff, like "$megabyte = 1024 * 1024" since that is more readable than 1048576.
Since the -s operator returns the file size in bytes you should probably be doing something like
my $size_in_mb = (-s $fh) / (1024 * 1024);
and use int() if you need a round figure. It's not like the dimensions of KB or MB are going to change anytime in the near future :)