Tcl/Tk progress bar for large file transfer: too much time to copy

I suggested this solution to a colleague who needed to implement a progress bar for a file copy. Here's the code, basically a modified version of another example I found:
proc CopyMore {in out chunk bytes {error {}}} {
    global total done transferredBytes totbytes
    incr total $bytes
    incr transferredBytes $bytes
    if {[string length $error] != 0 || [eof $in]} {
        set perc [expr {int($transferredBytes*100/$totbytes)}]
        updprog $perc
        puts "$transferredBytes $totbytes"
        puts $total
        set done $total
        close $in
        close $out
    } else {
        set perc [expr {int($transferredBytes*100/$totbytes)}]
        updprog $perc
        puts "$transferredBytes $totbytes"
        fcopy $in $out -command [list CopyMore $in $out $chunk] -size $chunk
    }
}
proc Copy {in out chunk} {
    global total done
    set total 0
    set start [clock seconds]
    fcopy $in $out -command [list CopyMore $in $out $chunk] -size $chunk
    vwait done
    set end [clock seconds]
    return [expr {$end - $start}]
}
proc CopyFile {source dest} {
    set chunk [expr {1024*1024*18}]
    set in [open $source]
    set out [open $dest w]
    fconfigure $in -translation binary
    fconfigure $out -translation binary
    return [Copy $in $out $chunk]
}
The CopyFile function is called for every file I need to copy. The fconfigure $in -translation binary seems to be needed here, because I'm handling binary files (I already tried removing this instruction and only a few bytes were actually copied).
I supposed that this copy was designed for large files, but I noticed that the copy process is much slower than the previous solution (without chunks, progress bar, etc.). Here are some numbers for copying about 1.3 GB:
Method              Time (secs)
-----------------   -----------
Previous solution     9
Chunk 256K          178
Chunk 1M            143
Chunk 18M           127
Chunk 100M          122
Chunk 300M          121
So it seems that increasing the chunk size slightly improves performance, but all these results are very far from the previous solution.
I can't explain such a difference. Do you guys have the same issue?
Thanks in advance for your help.
UPDATE! I added the following instructions in the CopyFile function:
fconfigure $in -buffersize 1000000
fconfigure $out -buffersize 1000000
and now this new copy technique is about 20% slower than the previous one.
Do you guys have any suggestion to reduce such a time difference even more?
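For comparison, here is the same chunked-copy-with-progress idea sketched in Python rather than Tcl (the function name `copy_with_progress` is mine, not from the original code). The point mirrors the buffersize fix above: the per-chunk callback is cheap, so throughput is dominated by the read/write buffer size.

```python
import os

def copy_with_progress(src, dst, chunk=1024 * 1024, on_progress=None):
    """Copy src to dst in fixed-size chunks, reporting percent complete."""
    total = os.path.getsize(src)
    done = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            done += len(buf)
            if on_progress:
                on_progress(done * 100 // total if total else 100)
    return done
```

A larger `chunk` means fewer syscalls and fewer progress callbacks, which is exactly the trade-off observed in the timing table above.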

Related

Randomly pick a region and process it, a number of times

I have a data like this
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to randomly pick a region of 10 letters from it, then count the number of F characters. I want to do that a certain number of times, for example 1000 times or even more.
as an example, I randomly pick
LVPSLTRYLT 0
then
ITNLRSFIHK 1
then again randomly go and pick 10 consecutive letters
AHSRIRKERP 0
This continues until it reaches the number of runs asked for. I want to store all randomly selected regions with their values, because I then want to calculate how many times F is seen.
So I do the following
# first I remove the header
grep -v ">" data.txt > out.txt
then get one random region of 10 letters. I tried to use shuf with no success:
shuf -n1000 data.txt
then I tried to use awk and was not successful either
awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'
then calculate the number of F and save it in the file
grep -i -e [F] |wc -l
Note, we should not pick up the same region twice
I've got to assume some things here and leave some restrictions:
Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there need to be N regions spread out through the file
The file can be a gigabyte in size, so it can't be read whole (that would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after the code
First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::MoreUtils qw(uniq);

my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);

GetOptions(
    'num-regions|n=i' => \$num_regions,
    'region-len|l=i'  => \$region_len,
    'char|c=s'        => \$count_freq_for,
) or usage();
my $file = shift || usage();

# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b }
    map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";

my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0);
my $region;

while (my $line = <$fh>) {
    chomp $line;
    # Total number of characters so far, up to this line and with this line
    $nchars_prev = $nchars;
    $nchars += length $line;
    next if $line =~ /^\s*>sp/;

    # Complete the region if there weren't enough chars on the previous line
    if ($chars_left > 0) {
        $region .= substr $line, 0, $chars_left;
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1;
    }

    # Random positions that happen to be on this line
    my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
    # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;

    for (@pos) {
        my $pos_in_line = $_ - $nchars_prev;
        $region = substr $line, $pos_in_line, $region_len;

        # Don't print if there aren't enough chars left on this line
        last if ( $chars_left =
            ($region_len - (length($line) - $pos_in_line)) ) > 0;

        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
    }
}

sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len + $data_len);
}

sub usage {
    say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
    exit;
}
Uncomment the srand line to have the same run every time, for testing.
Notes follow.
Some corner cases
If the 10-long window doesn't fit on the line from its random position it is completed in the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125 then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).
When an incomplete region is filled up in the next line (the first if in the while loop), it is possible that that line is shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out.
Random numbers are pruned of dupes. This skews the sequence, but only minutely, which should not matter here; and we may end up with fewer numbers than asked for, but only by very little
Handling of issues regarding randomness
"Randomness" here is pretty basic, which seems suitable. We also need to consider the following.
Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But lines starting with >sp are skipped and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than the drawn numbers. Those lines are shorter, thus with a lesser chance of having numbers on them, so not many numbers are lost, but in some runs I saw even 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.
If this is a bother, there are ways to approach it. To not skew the distribution even further they all should involve pre-processing the file.
The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but which should still produce the number of regions close to desired for large enough files.
More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here, but more to the point may simply be unneeded.
In all this you read the big file twice. The extra processing time should only be in the seconds, but if this is unacceptable, change the function fraction_skipped to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.
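If fraction_skipped is changed to sample only a prefix, the logic looks like this sketch (Python for illustration; the '>sp' header convention is taken from the question, and the function name `estimate_skipped_fraction` is mine):

```python
def estimate_skipped_fraction(path, prefix_bytes=1024 * 1024):
    """Estimate the fraction of characters on '>sp' header lines
    by scanning only the first prefix_bytes of the file."""
    skip_len = data_len = 0
    with open(path) as fh:
        while fh.tell() < prefix_bytes:
            line = fh.readline()
            if not line:
                break
            line = line.rstrip("\n")
            if line.lstrip().startswith(">sp"):
                skip_len += len(line)
            else:
                data_len += len(line)
    total = skip_len + data_len
    return skip_len / total if total else 0.0
```

For FASTA-like data where headers and sequence lines alternate fairly uniformly, a 10-20% prefix gives much the same estimate as the full pass.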
Note on a particular test case
With srand(10) (commented-out line near the beginning) we get the random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code to complete the region on the next line.
Here is a simple driver to run the above a given number of times, for statistics.
Doing it using builtin tools (system, qx) is altogether harder and libraries (modules) help. I use IPC::Run here. There are quite a few other options.†
Adjust and add code to process as needed for statistics; output is in files.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use IPC::Run qw(run);

my $outdir = 'rr_output';        # pick a directory name
mkdir $outdir if not -d $outdir;
my $prog  = 'random_regions.pl'; # your name for the program
my $input = 'data_file.txt';     # your name for input file
my $ch    = 'F';
my ($runs, $regions, $len) = (10, 10, 10);

GetOptions(
    'runs|n=i'  => \$runs,
    'regions=i' => \$regions,
    'length=i'  => \$len,
    'char=s'    => \$ch,
    'input=s'   => \$input
) or usage();

my @cmd = ( $prog, $input,
    '--num-regions', $regions,
    '--region-len',  $len,
    '--char',        $ch
);
say "Run: @cmd, $runs times.";

for my $n (1..$runs) {
    my $outfile = "$outdir/regions_r$n.txt";
    say "Run #$n, output in: $outfile";
    run \@cmd, '>', $outfile or die "Error with @cmd: $!";
}

sub usage {
    say STDERR "Usage: $0 [options]", "\n\toptions: ...";
    exit;
}
Please expand on the error checking. See for instance this post and its links for details.
Simplest use: driver_random.pl -n 4, but you can give all of main program's parameters.
The called program (random_regions.pl above) must be executable.
†   Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.
awk to the rescue!
You didn't specify, but there are two random actions going on. I treated them independently, which may not be intended: first picking a line, and second picking a random 10-letter substring from that line.
This assumes the file (or actually half of it) can fit in memory. Otherwise, split the file into equal chunks and run this on the chunks. Doing so will reduce some of the clustering, but I'm not sure how important that is in this case. (If you have one big file, it's possible that all samples are drawn from the first half; with splitting you eliminate this possibility.) For certain cases this is a desired property. I don't know your case.
$ awk 'BEGIN {srand()}
       !/^>/ {a[++n]=$0}
       END   {while(i++<1000) {
                  line=a[int(rand()*n)+1];
                  s=int(rand()*(length(line)-9));
                  print ss=substr(line,s,10), gsub(/F/,"",ss)}}' file
GERISPKDRC 0
QDEPRCLHEW 0
LLYQLFRNLF 2
GTHGAGAMES 0
TKALQDVQIR 0
FCVHTKALQD 1
SNKAQVKPGQ 0
CMECQGHGER 0
TRRFVGHTKD 1
...
Here is one solution using Perl.
It slurps the entire file into memory. Then the lines starting with > are removed.
Here I'm looping 10 times ($i<10); you can increase the count here.
Then the rand function is called with the length of the file, and using that value a substr of 10 is taken. The $s!~/\n/ guard makes sure we don't choose a substring crossing newlines.
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f\n" ;$i++} else { $i-- } } '
random10.txt
ENTQLLETKN 0
LSEGALSPDG 0
LRKARAEAED 0
RLWDLTTGTT 0
KWSGRCGLGY 0
TRRFVGHTKD 1
PVKRPIPHPA 0
GMVQQIQSVC 0
LTHPVLSFGI 1
KVNFPENGFL 2
$
To know the random number generated
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f $x\n" ;$i++} else { $i-- } }
' random10.txt
QLDGSLTMSS 0 1378.61409368207
DLIAKVDELT 0 1703.46689004765
SGGGANGTSF 1 900.269562152326
PEELTLSPKL 0 1368.55540468164
TCLSEGALSP 0 1016.50744004085
NRTWNSSAVP 0 23.7868578293154
VNFPENGFLS 2 363.527933104776
NSGLTWSGND 0 48.656607650744
MILSASRDKT 0 422.67705815168
RRGEDLFMCM 1 290.828530365
AGDGLLTPDA 0 1481.78080339531
$
Since your input file is huge I'd do it in these steps:
select random 10-char strings from each line of your input file
shuffle those to get the number of samples you want in random order
count the Fs
e.g.
$ cat tst.sh
#!/bin/env bash
infile="$1"
sampleSize=10
numSamples=15

awk -v sampleSize="$sampleSize" '
    BEGIN { srand() }
    !/^>/ {
        begPos = int((rand() * sampleSize) + 1)
        endPos = length($0) - sampleSize
        for (i=begPos; i<=endPos; i+=sampleSize) {
            print substr($0,i,sampleSize)
        }
    }
' "$infile" |
shuf -n "$numSamples"
$ ./tst.sh file
HGDIKCVLNE
QDEPRCLHEW
SEVQAIIEST
THDLRVSLEE
SEWVSCVRFS
LTRYLTLNAS
KDGQKITFHG
SNSPEPQKAV
QGGSKATTPA
QLLETKNALN
LLFCDNHKKQ
DETNYGIPQR
IRFQPQLNPD
LQTIRFSPDI
SLKRCGGFLI
$ ./tst.sh file | awk '{print $0, gsub(/F/,"")}'
SPKLQLDGSL 0
IKLFCVHTKA 1
VVSRCRLRHT 0
SPEPQKAVEQ 0
AYNPKNFSND 1
FGESRPELGS 1
AGDGLLTPDA 0
VGHTKDVLSV 0
VTHDLRVSLE 0
PISLGIFPLP 1
ASQITNLRSF 1
LTRPPEELTL 0
FDRYGEEGLK 1
IYIEGQDEPR 0
WNTLGVCKYT 0
Just change the numSamples from 15 to 1000 or whatever you like when run against your real data.
The above relies on shuf -n being able to handle however much input we throw at it, presumably much like sort does by using paging. If it fails in that regard then obviously you'd have to choose/implement a different tool for that part. FWIW I tried seq 100000000 | shuf -n 10000 (i.e. 10 times as many input lines as the OP's posted max file length of 10000000, to account for the awk part generating N lines of output per 1 line of input, and 10 times as many output lines required as the OP's posted 1000) and it worked fine and only took a few secs to complete.
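If shuf -n ever falls over at that scale, the standard replacement is reservoir sampling, which keeps a uniform random sample of k items in O(k) memory no matter how long the input is. A sketch in Python (the function name `reservoir_sample` is mine):

```python
import random

def reservoir_sample(items, k, rng=random):
    """Algorithm R: keep a uniform sample of k items from a stream."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample
```

Feeding it the line-generating awk output instead of piping to shuf would bound memory by the sample size rather than the input size.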

Count subsequences in hundreds of GB of data

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!;  # 'abcdefabcgbacbdebdbbcaebfebfebfeb'

my $seq = <$in>;        # read whole file into string
my $len = length($seq);
my $seq_length = 5;     # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
    my $kmer = substr($seq, $i, $seq_length);
    $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data) {
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at reading with an offset, but can't get my head around how I could incorporate it here.
Is substr the most memory efficient tool for this task?
From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file
I've also used constants for the K-mer size and the chunk size. They are constant after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort
use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
    $length += $size;
    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
        my $kmer = substr $chunk, $offset, SEQ_LENGTH;
        ++$data{$kmer};
    }
    $chunk  = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
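The same overlap-retention technique, paraphrased in Python (a sketch, not a drop-in replacement for the Perl above): after counting a chunk, keep its last k-1 characters and prepend them to the next chunk so that boundary-spanning k-mers are counted exactly once.

```python
from collections import Counter

def count_kmers(path, k=5, chunk_size=1024 * 1024):
    """Count all length-k substrings of a file, reading it in chunks."""
    counts = Counter()
    carry = ""  # last k-1 chars of the previous chunk
    with open(path) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # k-mers starting in the last k-1 positions are incomplete;
            # they are counted on the next iteration via the carry
            for i in range(len(data) - k + 1):
                counts[data[i:i + k]] += 1
            carry = data[-(k - 1):]
    return counts
```

Running it on the question's example string with a deliberately tiny chunk size reproduces the expected top counts (ebfeb 3, febfe 2, bfebf 2), confirming no k-mer falls into a "crack" between chunks.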
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 ≈ 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
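Spelling out that arithmetic as a quick sanity check (not part of the original answer):

```python
# Back-of-envelope check of the memory argument above.
alphabet = 5
file_bytes = 10 * 1024**3            # 10 GiB
seqs_in_file = file_bytes - 20 + 1   # ~11e9 overlapping length-20 sequences

assert alphabet ** 5 == 3125                  # k=5: tiny key space, fits easily
assert alphabet ** 20 == 95_367_431_640_625   # k=20: ~95e12 possible keys
assert seqs_in_file < alphabet ** 20          # every sequence could be unique
```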
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
    use strict;
    use warnings qw( all );

    use constant SEQ_LENGTH => 20;
    use constant CHUNK_SIZE => 1024 * 1024;

    my $buf = "";
    while (1) {
        my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
        die($!) if !defined($size);
        last if !$size;

        for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
            print(substr($buf, $offset, SEQ_LENGTH), "\n");
        }

        substr($buf, 0, -(SEQ_LENGTH-1), "");
    }
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the count along with the sequence in another file.
perl -e'
    use strict;
    use warnings qw( all );

    my $last = "";
    my $count;
    while (<>) {
        chomp;
        if ($_ eq $last) {
            ++$count;
        } else {
            print("$count $last\n") if $count;
            $last = $_;
            $count = 1;
        }
    }
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
    use strict;
    use warnings qw( all );

    my $last_count;
    while (<>) {
        my ($count, $seq) = split;
        last if $. > 5 && $count != $last_count;
        print("$seq $count\n");
        $last_count = $count;
    }
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequences, you could also compress each sequence into ceil(log256(5^20)) = 6 bytes, then base64-encode that into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
Portions of this answer was adapted from content by user:622310 licensed under cc by-sa 3.0.
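The packing idea above, sketched in Python (the 5-letter alphabet "ACGNT" is an assumption based on the nucleotide test data later in this post; `pack`/`unpack` are my names): treat each 20-character sequence as a base-5 number, store it in 6 bytes, then base64-encode to 8 printable bytes.

```python
import base64

ALPHABET = "ACGNT"  # assumption: the 5 symbols used in the generated test data
IDX = {c: i for i, c in enumerate(ALPHABET)}

def pack(seq):
    """20 chars over a 5-letter alphabet -> 6 bytes -> 8 base64 chars."""
    n = 0
    for c in seq:
        n = n * 5 + IDX[c]
    return base64.b64encode(n.to_bytes(6, "big"))  # 5**20 < 256**6

def unpack(b64):
    """Inverse of pack: recover the original 20-character sequence."""
    n = int.from_bytes(base64.b64decode(b64), "big")
    out = []
    for _ in range(20):
        n, r = divmod(n, 5)
        out.append(ALPHABET[r])
    return "".join(reversed(out))
```

Since 5^20 < 256^6, six bytes always suffice, and 6 bytes base64-encode to exactly 8 characters with no padding.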
Generally speaking, Perl is really slow at character-by-character processing solutions like those posted above; it's much faster at something like regular expressions, since essentially your overhead is mainly how many operators you're executing.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and then we pass over it 5 times, generating sequences of 5 characters; after the first pass we start chopping a character off the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but perl cheats for this special case and just sets the index of the string to 1 more than it was before.
I haven't benchmarked this, but I bet something like this is a much more viable way to do it than the algorithms above. You could also do the uniq counting in perl, of course, by incrementing a hash (the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;

for my $pos (0..4) {
    substr($str, 0, 1) = "" if $pos;
    $occur{$_}++ for $str =~ m/(.{5})/gs;
}

for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
    say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
    while ($s = <>) {
        for $i (0 .. length $s) {
            my $kmer = substr($s, $i, 5);
            next unless length $kmer == 5;
            $hash{$kmer}++;
        }
    }
    foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
        print "$k $hash{$k}\n"; $it++; last if $it == 5;
    }' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNot, @Borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; gets checked during each 1048576-byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @Borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;

open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );

my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) {
    foreach my $i (0..length $string) {
        $hash{ substr($string, $i, 5) }++
    }
}

my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1 GB in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
    open (my $fh, ">>", "nucleotide.data");
    for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) }'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';

$/ = \1048575;  # Read a million characters/bytes at a time.
my %hash;

while (my $seq = <>) {
    chomp $seq;
    foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
        next unless length $kmer == 5;
        $hash{$kmer}++;
    }
}

foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100 GB).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000 GB string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that are liable to create very large hashes.
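The serialize-to-disk idea can be sketched like this (Python for illustration; `count_with_spill` and the `spill` callback are my names): flush the in-memory counter to a partial file whenever it grows past a limit, then merge the partials afterwards, much as the sort-based answer above merges sorted runs.

```python
from collections import Counter

def count_with_spill(kmers, limit, spill):
    """Count items, flushing partial counts via spill() whenever the
    in-memory table reaches `limit` distinct keys."""
    counts = Counter()
    for kmer in kmers:
        counts[kmer] += 1
        if len(counts) >= limit:
            spill(counts)   # e.g. write "kmer count" lines to a partial file
            counts.clear()
    if counts:
        spill(counts)       # flush the final partial table
```

Summing the spilled partial counts per key reproduces the exact totals, with peak memory bounded by `limit` instead of the number of distinct k-mers.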
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent, and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In the larger file the differences were of similar magnitude, but because the List::Gen script as written does not correctly handle sequences spanning chunk boundaries, it had the same discrepancy as the shell command line at the beginning of this post: the larger file meant more chunk boundaries, and so a discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.
1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000 of 1% from the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697
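The boundary fix itself is small: carry the last k-1 characters of each chunk into the next read so no window is lost. A minimal sketch (again in Python, as an illustration; the function name is hypothetical):

```python
from collections import Counter

def count_kmers_chunked(chunks, k=5):
    # Carry the final k-1 characters of each chunk into the next one,
    # so windows that straddle a chunk boundary are counted exactly once.
    counts = Counter()
    carry = ""
    for chunk in chunks:
        data = carry + chunk
        for i in range(len(data) - k + 1):
            counts[data[i:i + k]] += 1
        carry = data[len(data) - (k - 1):]  # keep at most k-1 trailing chars
    return counts
```

Because no window fits entirely inside the k-1 carried characters, nothing is double-counted: the chunked totals match a single pass over the whole string.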

In Perl, how can I release memory to the operating system?

I am having some problems with memory in Perl. When I fill up a big hash, I cannot get the memory to be released back to the OS. When I do the same with a scalar and use undef, the memory is given back to the OS.
Here is a test program I wrote.
#!/usr/bin/perl
###### Memory test
######

## Use Commands
use Number::Bytes::Human qw(format_bytes);
use Data::Dumper;
use Devel::Size qw(size total_size);

## Create Variables
my $share_var;
my %share_hash;
my $type_hash   = 1;
my $type_scalar = 1;

## Start Main Loop
while (1) {
    &Memory_Check();
    print "Hit Enter (add to memory): "; <>;
    &Up_Mem(100_000);
    &Memory_Check();
    print "Hit Enter (Set Varable to nothing): "; <>;
    $share_var  = "";
    $share_hash = ();    # assigns to the scalar $share_hash, not %share_hash
    &Memory_Check();
    print "Hit Enter (clean data): "; <>;
    &Clean_Data();
    &Memory_Check();
    print "Hit Enter (start over): "; <>;
}
exit;
exit;
#### Up Memory
sub Up_Mem {
    my $total_loops = shift;
    my $n = 1;
    print "Adding data to shared varable $total_loops times\n";
    until ($n > $total_loops) {
        if ($type_hash) {
            $share_hash{$n} = 'X' x 1111;
        }
        if ($type_scalar) {
            $share_var .= 'X' x 1111;
        }
        $n += 1;
    }
    print "Done Adding Data\n";
}

#### Clean up Data
sub Clean_Data {
    print "Clean Up Data\n";
    if ($type_hash) {
        ## Methods to empty the hash (trying everything I can think of!)
        my $n = 1;
        my $total_loops = 100_000;
        until ($n > $total_loops) {
            undef $share_hash{$n};
            $n += 1;
        }
        %share_hash = ();
        $share_hash = ();
        undef $share_hash;
        undef %share_hash;
    }
    if ($type_scalar) {
        undef $share_var;
    }
}
#### Check Memory Usage
sub Memory_Check {
    ## Get current memory from shell
    my @mem = `ps aux | grep \"$$\"`;
    my ($results) = grep !/grep/, @mem;

    ## Parse Data from Shell
    chomp $results;
    $results =~ s/^\w*\s*\d*\s*\d*\.\d*\s*\d*\.\d*\s*//g;
    $results =~ s/pts.*//g;
    my ($vsz, $rss) = split /\s+/, $results;

    ## Format Numbers to Human Readable
    my $h    = Number::Bytes::Human->new();
    my $virt = $h->format($vsz);
    my $res  = $h->format($rss);
    print "Current Memory Usage: Virt: $virt RES: $res\n";

    if ($type_hash) {
        my $total_size = total_size(\%share_hash);
        my @arr_c = keys %share_hash;
        print "Length of Hash: " . ($#arr_c + 1) . " Hash Mem Total Size: $total_size\n";
    }
    if ($type_scalar) {
        my $total_size = total_size($share_var);
        print "Length of Scalar: " . length($share_var) . " Scalar Mem Total Size: $total_size\n";
    }
}
OUTPUT:
./Memory_Undef_Simple.cgi
Current Memory Usage: Virt: 6.9K RES: 2.7K
Length of Hash: 0 Hash Mem Total Size: 92
Length of Scalar: 0 Scalar Mem Total Size: 12
Hit Enter (add to memory):
Adding data to shared varable 100000 times
Done Adding Data
Current Memory Usage: Virt: 228K RES: 224K
Length of Hash: 100000 Hash Mem Total Size: 116813243
Length of Scalar: 111100000 Scalar Mem Total Size: 111100028
Hit Enter (Set Varable to nothing):
Current Memory Usage: Virt: 228K RES: 224K
Length of Hash: 100000 Hash Mem Total Size: 116813243
Length of Scalar: 0 Scalar Mem Total Size: 111100028
Hit Enter (clean data):
Clean Up Data
Current Memory Usage: Virt: 139K RES: 135K
Length of Hash: 0 Hash Mem Total Size: 92
Length of Scalar: 0 Scalar Mem Total Size: 24
Hit Enter (start over):
So as you can see, the memory goes down, but only by the size of the scalar. Any ideas how to free the memory of the hash?
Also, Devel::Size shows the hash is only taking up 92 bytes even though the program is still using 139K.
Generally, yeah, that's how memory management on UNIX works. If you are using Linux with a recent glibc, and are using that malloc, you can return free'd memory to the OS. I am not sure Perl does this, though.
If you want to work with large datasets, don't load the whole thing into memory, use something like BerkeleyDB:
https://metacpan.org/pod/BerkeleyDB
Example code, stolen verbatim:
use strict;
use BerkeleyDB;

my $filename = "fruit";
unlink $filename;
tie my %h, "BerkeleyDB::Hash",
    -Filename => $filename,
    -Flags    => DB_CREATE
    or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";

# Add a few key/value pairs to the file
$h{apple}  = "red";
$h{orange} = "orange";
$h{banana} = "yellow";
$h{tomato} = "red";

# Check for existence of a key
print "Banana Exists\n\n" if $h{banana};

# Delete a key/value pair.
delete $h{apple};

# Print the contents of the file
while (my ($k, $v) = each %h) {
    print "$k -> $v\n";
}

untie %h;
(OK, not verbatim. Their use of use vars is ... legacy ...)
You can store gigabytes of data in a hash this way, and you will only use a tiny bit of memory. (Basically, whatever BDB's pager decides to keep in memory; this is controllable.)
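For illustration, the same tie-a-hash-to-disk idea using Python's bundled dbm module (a stand-in for BerkeleyDB here, not the module the answer recommends; the file path is hypothetical). The hash lives on disk and only hot pages stay in RAM:

```python
import dbm
import os
import tempfile

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "fruit")

with dbm.open(path, "c") as db:      # "c": create the file if absent
    db["apple"] = "red"              # str keys/values are encoded to bytes
    db["orange"] = "orange"
    db["banana"] = "yellow"
    del db["apple"]                  # delete a key/value pair

with dbm.open(path, "r") as db:      # reopen read-only
    banana = db[b"banana"]           # values come back as bytes
```

As with the tied Perl hash, the working set is whatever the underlying database keeps cached, not the full data.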
In general, you cannot expect perl to release memory to the OS.
See the FAQ: How can I free an array or hash so my program shrinks?.
You usually can't. Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
On most operating systems, memory allocated to a program can never be returned to the system. That's why long-running programs sometimes re- exec themselves. Some operating systems (notably, systems that use mmap(2) for allocating large chunks of memory) can reclaim memory that is no longer used, but on such systems, perl must be configured and compiled to use the OS's malloc, not perl's.
It is always a good idea to read the FAQ list, also installed on your computer, before wasting your time.
For example, How can I make my Perl program take less memory? is probably relevant to your issue.
Why do you want Perl to release the memory to the OS? You could just use a larger swap.
If you really must, do your work in a forked process, then exit.
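The fork-and-exit trick looks like this in sketch form (Python, POSIX-only since it relies on os.fork; the helper name is made up). The big allocation happens in the child, only a tiny summary crosses the pipe, and the child's memory goes back to the OS the moment it exits:

```python
import os

def run_in_child():
    """Build the memory-hungry hash in a child process and return
    only a small result; the child's memory is freed on exit."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                              # child
        os.close(r)
        big = {n: "X" * 1111 for n in range(100_000)}  # big allocation
        os.write(w, str(len(big)).encode())   # send back only the summary
        os._exit(0)                           # all child memory returns to the OS
    os.close(w)                               # parent
    result = int(os.read(r, 64).decode())
    os.close(r)
    os.waitpid(pid, 0)
    return result

result = run_in_child()
```

The parent process never grows, which is exactly what re-exec'ing long-running programs achieves by cruder means.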
Try recompiling perl with the option -Uusemymalloc to use the system malloc and free. You might see different results.

What's the fastest way in Perl to get all lines of file1 that do not appear in file2?

I have two (very large) text files. What is the fastest way - in terms of run time - to create a third file containing all lines of file1 that do not appear in file2?
So if file1 contains:
Sally
Joe
Tom
Suzie
And file2 contains:
Sally
Suzie
Harry
Tom
Then the output file should contain:
Joe
Create a hashmap containing each line from file 2. Then for each line in file 1, if it is not in the hashmap then output it. This will be O(N), which is the best efficiency class you can achieve given that you have to read the input.
Perl implementation:
#!/usr/bin/env perl
use warnings;
use strict;
use Carp ();

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

my %map;
{
    open my $in, '<', $file2 or Carp::croak("Can't open $file2");
    while (<$in>) {
        $map{$_} = 1;
    }
    close($in) or Carp::carp("Error closing $file2");
}
{
    open my $in, '<', $file1 or Carp::croak("Can't open $file1");
    while (<$in>) {
        print $_ if !$map{$_};
    }
    close($in) or Carp::carp("Error closing $file1");
}
If file 2 is so large that the hashmap doesn't fit in memory, then we have a different problem at hand. A possible solution is to apply the above solution to chunks of file 2 (small enough to fit into memory), outputting the results to temporary files. Provided there are sufficient matches between file 1 and file 2, the total output should be of reasonable size. To compute the final result, we take the intersection of the lines in the temporary files, i.e. for a line to be in the final result it must occur in every temporary file.
Not Perl but effective:
diff <(sort file1) <(sort file2)
I'm surprised no-one has yet suggested the most memory-efficient way, which is to sort both files separately, then perform a list merge. If you're on Unix, the following 3 simple commands will do what you need, quickly:
sort < file1 > file1.sorted
sort < file2 > file2.sorted
comm -2 -3 file1.sorted file2.sorted > differences
If the files are so large that Perl has to page VM to load all the lines into a hashtable, this technique will be much faster. Otherwise, a hashtable-based approach should be faster.
If you're on Unix, it's better to use your system's external sort command as it is intelligent about memory usage -- using Perl's sort() entails reading the entire contents of the file into memory.
If you're on Windows, note that the supplied sort command is case-insensitive and this can't be turned off! Also there is no comm command on Windows, so you'll need to roll your own -- replace the third line above with:
perl subtract_sets.pl file1.sorted file2.sorted > differences.txt
subtract_sets.pl
#!/usr/bin/perl
use strict;
use warnings;

open my $f1, '<', $ARGV[0] or die;
open my $f2, '<', $ARGV[1] or die;

my $x = <$f1>;
my $y = <$f2>;

while (defined $x && defined $y) {
    if ($x lt $y) {
        print $x;        # in file1 only: keep it
        $x = <$f1>;
    } elsif ($y lt $x) {
        $y = <$f2>;      # in file2 only: skip it
    } else {
        # Lines match: in both files, skip
        $x = <$f1>;
        $y = <$f2>;
    }
}
# Anything left in file1 cannot be in file2
while (defined $x) {
    print $x;
    $x = <$f1>;
}
#!/usr/bin/perl
use warnings;
use strict;

open(my $alpha, '<', 'file1') || die "Couldn't open file1";
open(my $beta,  '<', 'file2') || die "Couldn't open file2";

my %data;
$data{$_} = 1 for <$beta>;              # remember every line of file2
print grep { !$data{$_} } <$alpha>;     # lines of file1 not in file2
How large is "very large" to you? Larger than your RAM? Your basic answer is to use a hash; if the files are larger than your RAM, then you need to use a hash tied to a dbm.
Just some efficiency benchmarks:
10k lines, 10-characters random strings max per line.
Rate slipset marcog
slipset 47.6/s -- -16%
marcog 56.7/s 19% --
100k lines, 10-characters random strings max per line.
Rate slipset marcog
slipset 3.02/s -- -34%
marcog 4.60/s 52% --
1000k lines, 10-characters random strings max per line.
s/iter slipset marcog
slipset 4.09 -- -33%
marcog 2.75 49% --
1k lines, 100-characters random strings max per line.
Rate slipset marcog
slipset 379/s -- -12%
marcog 431/s 14% --
100k lines, 100-characters random strings max per line
Rate slipset marcog
slipset 2.15/s -- -30%
marcog 3.08/s 44% --
1k lines, 1000-character random strings max per line
Rate slipset marcog
slipset 133/s -- -10%
marcog 148/s 11% --
100k lines, 1000-character random strings max per line
Rate slipset marcog
slipset 1.01/s -- -18%
marcog 1.22/s 22% --
Memory Efficiency
Marcog: 100k lines, 1000-character random strings max per line:
Memory usage summary: heap total: 163_259_635, heap peak: 61_536_800, stack peak: 17_648
total calls total memory failed calls
malloc| 307_425 162_378_090 0
realloc| 1_461 96_878 0 (nomove:1_218, dec:1_026, free:0)
calloc| 12_762 784_667 0
free| 307_598 155_133_460
Slipset: 100k lines, 1000-character random strings max per line:
Memory usage summary: heap total: 647_103_469, heap peak: 118_445_776, stack peak: 17_648
total calls total memory failed calls
malloc| 508_089 186_752_811 0
realloc| 399_907 459_553_775 0 (nomove:334_169, dec:196_380, free:0)
calloc| 12_765 796_883 0
free| 507_584 256_315_688

How do I get the size of a file in megabytes using Perl?

I want to get the size of a file on disk in megabytes. Using the -s operator gives me the size in bytes, but I'm going to assume that simply dividing this by a magic number is a bad idea:
my $size_in_mb = (-s $fh) / (1024 * 1024);
Should I just use a read-only variable to define 1024, or is there a programmatic way to obtain the number of bytes in a kilobyte?
EDIT: Updated the incorrect calculation.
If you'd like to avoid magic numbers, try the CPAN module Number::Bytes::Human.
use Number::Bytes::Human qw(format_bytes);
my $size = format_bytes(-s $file); # 4.5M
This is an old question and has already been answered correctly, but in case your program is constrained to core modules and you cannot use Number::Bytes::Human, here are several other options I have collected over time. I have kept them because each one uses a different Perl approach and makes a nice TIMTOWTDI example:
example 1: uses state to avoid reinitializing the variable on each call (before perl 5.16 you need use feature 'state', or perl -E)
http://kba49.wordpress.com/2013/02/17/format-file-sizes-human-readable-in-perl/
use feature 'state';

sub formatSize {
    my $size = shift;
    my $exp  = 0;

    state $units = [qw(B KB MB GB TB PB)];

    for (@$units) {
        last if $size < 1024;
        $size /= 1024;
        $exp++;
    }

    return wantarray
        ? ($size, $units->[$exp])
        : sprintf("%.2f %s", $size, $units->[$exp]);
}
example 2: using sort and map
sub scaledbytes {
    # http://www.perlmonks.org/?node_id=378580
    (sort { length $a <=> length $b }
     map  { sprintf '%.3g%s', $_[0] / 1024**$_->[1], $_->[0] }
          [" bytes" => 0],
          [KB => 1],
          [MB => 2],
          [GB => 3],
          [TB => 4],
          [PB => 5],
          [EB => 6]
    )[0];
}
example 3: take advantage of the fact that 1 GB = 1024 MB, 1 MB = 1024 KB, and 1024 = 2**10:
# http://www.perlmonks.org/?node_id=378544
my $kb = 1024 * 1024;    # one gigabyte, expressed in kilobytes
my $mb = $kb >> 10;
my $gb = $mb >> 10;
print "$kb kb = $mb mb = $gb gb\n";
__END__
1048576 kb = 1024 mb = 1 gb
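The shift trick works because adjacent units are a factor of 2**10 apart, so `>> 10` is an exact division by 1024. The same arithmetic, sketched in Python:

```python
kb = 1024 * 1024   # one gigabyte, expressed in kilobytes
mb = kb >> 10      # >> 10 divides by 2**10 == 1024
gb = mb >> 10

# the shift and the division are the same operation for powers of two
assert kb >> 10 == kb // 1024
```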
example 4: use of ++$n and ... until ... to obtain an index into the array
# http://www.perlmonks.org/?node_id=378542
#!perl -slw
use strict;

sub scaleIt {
    my ($size, $n) = (shift, 0);
    ++$n and $size /= 1024 until $size < 1024;
    return sprintf "%.2f %s",
        $size, (qw[ bytes KB MB GB ])[$n];
}

my $size = -s $ARGV[0];
print "$ARGV[0]: ", scaleIt $size;
Even if you can not use Number::Bytes::Human, take a look at the source code to see all the things that you need to be aware of.
You could of course create a function for calculating this. That is a better solution than creating constants in this instance.
sub size_in_mb {
    my $size_in_bytes = shift;
    return $size_in_bytes / (1024 * 1024);
}
No need for constants. Changing the 1024 to some kind of variable/constant won't make this code more readable.
Well, there's not 1024 bytes in a meg, there's 1024 bytes in a K, and 1024 K in a meg...
That said, 1024 is a safe "magic" number that will never change in any system you can expect your program to work in.
I would read this into a variable rather than use a magic number. Even if magic numbers are not going to change, like the number of bytes in a megabyte, using a well named constant is a good practice because it makes your code more readable. It makes it immediately apparent to everybody else what your intention is.
1) You don't want 1024. That gives you kilobytes. You want 1024*1024, or 1048576.
2) Why would dividing by a magic number be a bad idea? It's not like the number of bytes in a megabyte will ever change. Don't overthink things too much.
Don't get me wrong, but: I think that declaring 1024 as a Magic Variable goes a bit too far, that's a bit like "$ONE = 1; $TWO = 2;" etc.
A kilobyte has been "falsely" declared as 1024 bytes for more than 20 years, and I seriously doubt that the operating system manufacturers will ever correct that bug and change it to 1000.
What could make sense though is to declare non-obvious stuff, like "$megabyte = 1024 * 1024" since that is more readable than 1048576.
Since the -s operator returns the file size in bytes you should probably be doing something like
my $size_in_mb = (-s $fh) / (1024 * 1024);
and use int() if you need a round figure. It's not like the definition of KB or MB is going to change anytime in the near future :)