I made the following Perl script to handle some file manipulation at work, but it's currently running far too slowly to be put into production.
I don't know Perl very well (not one of my languages), so can someone help me identify and replace parts of this script that would be slow given it's processing ~40 million lines?
Data being piped in is in the format:
col1|^|col2|^|col3|!|
col1|^|col2|^|col3|!|
... 40 million of these.
The date_cols array is calculated before this part of the script and basically holds the index of columns containing dates in the pre-converted format.
Here's the part of the script that will be executed for every input row. I've cleaned it up a little and added comments, but let me know if anything else is needed:
## Read from STDIN until no more lines are available.
while (<STDIN>)
{
    ## Split by field delimiter
    my @fields = split('\|\^\|', $_, -1);

    ## Remove the terminating delimiter from the final field so it doesn't
    ## interfere with date processing.
    $fields[-1] = (split('\|!\|', $fields[-1], -1))[0];

    ## Cycle through all column numbers in date_cols and convert dates
    ## to yyyymmdd
    foreach $col (@date_cols)
    {
        if ($fields[$col] ne "")
        {
            $fields[$col] = formatTime($fields[$col]);
        }
    }

    print(join('This is an unprintable ASCII control code', @fields), "\n");
}
## Format the input time to yyyymmdd from 'Dec 26 2012 12:00AM' like format.
sub formatTime($)
{
    my $col = shift;

    if (substr($col, 4, 1) eq " ") {
        substr($col, 4, 1) = "0";
    }
    return substr($col, 7, 4).$months{substr($col, 0, 3)}.substr($col, 4, 2);
}
If I were writing it purely for efficiency, I'd write your code like this:
sub run_loop {
    local $/ = "|!|\n"; # set the input record separator
                        # to the record terminator of our problem space
    while (<STDIN>) {
        # remove the terminator
        chomp;
        # Split by field delimiter
        my @fields = split m/\|\^\|/, $_, -1;
        # Cycle through all column numbers in date_cols and convert dates
        # to yyyymmdd
        foreach $col (@date_cols) {
            if ($fields[$col] ne "") {
                # $fields[$col] = formatTime($fields[$col]);
                my $temp = $fields[$col];
                if (substr($temp, 4, 1) eq " ") {
                    substr($temp, 4, 1) = "0";
                }
                $fields[$col] = substr($temp, 7, 4).$months{substr($temp, 0, 3)}.substr($temp, 4, 2);
            }
        }
        print join("\022", @fields) . "\n";
    }
}
The optimizations are:
Using chomp to remove the |!|\n string at the end
Inlining the formatTime sub.
Subroutine calls are extremely expensive in Perl. If subs have to be used, they can be called more efficiently by disabling prototype checking with the &subroutine(@args) syntax. If @args is omitted, the current arguments in @_ are visible to the called sub. This can lead to bugs or to additional performance; use wisely. The goto &subroutine; syntax can be used as well, but this meddles with return (it is basically a tail call). Do not use it.
Further optimizations could include removing the hash lookup %months, as hashing is expensive.
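For example, the hash could be replaced with an index() lookup into a packed string of month names (a sketch of mine, assuming the usual English three-letter abbreviations and well-formed input; benchmark to see whether it actually beats the hash):

# drop-in replacement for $months{substr($temp, 0, 3)} inside the loop above
my $names = 'JanFebMarAprMayJunJulAugSepOctNovDec';
# the byte offset of the abbreviation, divided by 3, plus 1, is the month number
my $mon   = sprintf '%02d', index($names, substr($temp, 0, 3)) / 3 + 1;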
You'll have to benchmark on your data set to compare, but you can throw a regex at it. (Made all the worse by your very regex-unfriendly field and record separators!)
my $i = 0;
our %months = map { $_ => sprintf('%02d', ++$i) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
while (<DATA>) {
    s! \|\^\| !\022!xg; # convert field separator
    s/ \| !\| $ //xg;   # strip record terminator
    s/\b(\w{3}) ( \d|\d\d) (\d{4}) \d\d:\d\d[AP]M\b/${3} . $months{$1} . sprintf('%02d', $2)/eg;
    print;
}
Won't do what you want if one of the non-@date_cols fields matches the date regex.
At my work I sometimes need to grep error logs etc. from 350+ frontends. I use a script template I call "SMP grep" ;) It's simple:
stat the file, get the file length
Get the "chunk length" = file_length / num_processors
Adjust the chunk starts and ends so they start/end at "\n". Just read(), find "\n" and calculate the offsets.
fork() to make num_processor workers, each working on its own chunk
This can help if you use regexes in your grep or other CPU-bound operations (as in your case, I think). Admins complain that this script eats disk throughput, but it's only a bottleneck there if the server has 8 CPUs =) Also, obviously, if you need to parse a week's worth of data you can divide it between servers.
A minimal sketch of the approach is below; I can post my full code tomorrow if anyone is interested.
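Something like this (a rough sketch; the file name, pattern, and worker count are made up, and matching lines from different workers may interleave on STDOUT):

use strict;
use warnings;

# hypothetical file name, pattern and worker count
my ($file, $pattern, $workers) = ('error.log', qr/ERROR/, 8);

my $size  = -s $file or die "cannot stat $file";
my $chunk = int($size / $workers);

# adjust chunk boundaries so that every chunk starts right after a "\n"
open my $fh, '<', $file or die $!;
my @offsets = (0);
for my $i (1 .. $workers - 1) {
    seek $fh, $i * $chunk, 0;
    scalar <$fh>;            # discard the partial line at the boundary
    push @offsets, tell $fh;
}
push @offsets, $size;

# fork one worker per chunk
for my $i (0 .. $workers - 1) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    next if $pid;            # parent keeps forking; child does the work
    open my $in, '<', $file or die $!;
    seek $in, $offsets[$i], 0;
    while (tell($in) < $offsets[$i + 1] and defined(my $line = <$in>)) {
        print $line if $line =~ $pattern;
    }
    exit 0;
}
1 while wait != -1;          # reap all workers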
Related
I'm looking for a way to read from a file handle line by line (and then execute a function on each line) with the following twist: what I want to treat as a "line" shall be terminated by varying characters, not just a single character that I define as $/. I know that $INPUT_RECORD_SEPARATOR (or $/) does not support regular expressions or passing a list of characters to be treated as line terminators, and this is where my problem lies.
My file handle comes from stdout of a process. Thus, I cannot seek inside the file handle and the full content is not available immediately but is produced bit by bit as the process is executed. I want to be able to attach things like a timestamp to each "line" the process produces using a function that I called handler in my examples. Each line should be handled as soon as it gets produced by the program.
Unfortunately, I can only come up with a way that either executes the handler function immediately but seems horribly inefficient or a way that uses a buffer but will only lead to "grouped" calls of the handler function and thus, for example, produce wrong timestamps.
In fact, in my specific case, my regex would even be very simple and just read /\n|\r/. So for this particular problem I don't even need full regex support but just the possibility to treat more than one character as the line terminator. But $/ doesn't support this.
Is there an efficient way to solve this problem in Perl?
Here is some quick pseudo-perl code to demonstrate my two approaches:
read the input file handle byte-by-byte
This would look like this:
my $acc = "";
while (read($fd, my $b, 1)) {
    $acc .= $b;
    if ($acc =~ /someregex$/) {
        handler($acc);
        $acc = "";
    }
}
The advantage here is that handler gets dispatched immediately, as soon as enough bytes have been read. The disadvantage is that we do string appending and check the regex for every single byte we read from $fd.
read the input file handle with blocks of X-byte at a time
This would look like this:
my $acc = "";
while (read($fd, my $b, $bufsize)) {
    if ($b =~ /someregex/) {
        my @parts = split /someregex/, $b;
        # for brevity let's assume we always get more than 2 parts...
        my $first = shift @parts;
        handler($acc . $first);
        my $last = pop @parts;
        foreach my $part (@parts) {
            handler($part);
        }
        $acc = $last;
    }
}
The advantage here is that we are more efficient, as we only check the regex every $bufsize bytes. The disadvantage is that the execution of handler has to wait until $bufsize bytes have been read.
Setting $INPUT_RECORD_SEPARATOR to a regex wouldn't help, because Perl's readline uses buffered IO, too. The trick is to use your second approach but with unbuffered sysread instead of read. If you sysread from a pipe, the call will return as soon as data is available, even if the whole buffer couldn't be filled (at least on Unix).
The suggestion by nwellnhof allowed me to implement a solution to this problem:
my $acc = "";
while (1) {
    my $ret = sysread($fh, my $buf, 1000);
    if ($ret == 0) {
        last;
    }
    # we split with a capturing group so that we also retain which line
    # terminator was used
    # a negative limit is used to also produce trailing empty fields if
    # required
    my @parts = split /(\r|\n)/, $buf, -1;
    my $numparts = scalar @parts;
    if ($numparts == 1) {
        # line terminator was not found
        $acc .= $buf;
    } elsif ($numparts >= 3) {
        # first match needs special treatment as it needs to be
        # concatenated with $acc
        my $first = shift @parts;
        my $term  = shift @parts;
        handler($acc . $first . $term);
        my $last = pop @parts;
        for (my $i = 0; $i < $numparts - 3; $i += 2) {
            handler($parts[$i] . $parts[$i+1]);
        }
        # the last part is put into the accumulator. This might
        # just be the empty string if $buf ended in a line
        # terminator
        $acc = $last;
    }
}
# if the output didn't end with a linebreak, handle the rest
if ($acc ne "") {
    handler($acc);
}
My tests show that sysread will indeed return before having read 1000 characters if there is a pause in the input stream. The code above takes care to correctly reassemble messages that arrive across multiple 1000-byte reads, and to split reads that contain several terminators.
Please shout if you see any bug in the above code.
I have written a function that uses regex and prints the required string from a command output.
The script works as expected, but it does not support dynamic output. Currently, I use regexes for "icmp" and "OK" and print the values. But the type, destination and return code could change, and there is a high chance that the command returns no output at all. How do I handle such scenarios?
sub check_summary{
    my ($self) = @_;
    my $type = 0;
    my $return_type = 0;
    my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
    foreach my $line ( $ipsla ) {
        if ( $line =~ m/(icmp)/ ) {
            $type = $1;
        }
        if ( $line =~ m/(OK)/ ) {
            $return_type = $1;
        }
    }
    INFO ($type,$return_type);
}
Command output:
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago
Updated after some clarifications -- we need only the last line.
As is often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array of lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is the output of a command, I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop:
my @lines = split '\n', $ipsla;   # if $ipsla is a multi-line string
# my @lines = @$ipsla;            # if $ipsla is an arrayref

pop @lines while $lines[-1] !~ /\S/;   # remove possible empty lines at end

my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on whitespace, by split ' ', $lines[-1], and take the 2nd and 5th elements (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop @lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop @lines }
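For example, applying that split and slice to the sample line from the question:

my @f = split ' ', '*1 icmp 192.168.25.14 RTT=1 OK 1 second ago';
print "$f[1] $f[4]\n";   # prints "icmp OK"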
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;

foreach (@lines)
{
    if (not $data_start) {
        $data_start = 1 if /^\s* -+ \s*$/x;  # only dashes and optional spaces
    }
    else {
        my ($type, $return_type) = (split)[1,4];
        print "type: $type, return code: $return_type\n";
    }
}
This is a sketch until clarifications come. It also assumes that there is more than one line.
I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
    my ($self) = @_;

    # Put results in a hash, using column names for keys; values start undef.
    my %results = map { ($_, undef) } qw(Code ID Type Destination Stats Return_Code Last_Run);

    my $ipsla = $self->{ssh_obj}->exec('show ip sla');

    foreach my $line (@$ipsla) {
        chomp $line; # Remove newlines from last field
        if ($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
            $results{Code}        = $1; # Code prefixing ID
            $results{ID}          = $2;
            $results{Type}        = $3;
            $results{Destination} = $4;
            $results{Stats}       = $5;
            $results{Return_Code} = $6;
            $results{Last_Run}    = $7;
        }
    }

    # Testing
    use Data::Dumper;
    print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern, which is how we were able to capture all the columns with one expression. The matches go into the numbered variables $1, $2, and so on.
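A small illustration with a made-up input line:

if ('*1 icmp 192.168.25.14' =~ /^([*^~])(\d+)\s+(\S+)\s+(\S+)/) {
    print "code=$1 id=$2 type=$3 dest=$4\n";   # code=* id=1 type=icmp dest=192.168.25.14
}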
I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'
my $seq = <$in>; # read whole file into string
my $len = length($seq);
my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
    my $kmer = substr($seq, $i, $seq_length);
    $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data) {
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at read using an offset, but I can't get my head around how I could incorporate it here.
Is substr the most memory efficient tool for this task?
From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file
I've also use constants for the K-mer size and the chunk size. They are constant after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort
use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
    $length += $size;
    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
        my $kmer = substr $chunk, $offset, SEQ_LENGTH;
        ++$data{$kmer};
    }
    $chunk  = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 ≈ 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
    use strict;
    use warnings qw( all );

    use constant SEQ_LENGTH => 20;
    use constant CHUNK_SIZE => 1024 * 1024;

    my $buf = "";
    while (1) {
        my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
        die($!) if !defined($size);
        last if !$size;

        for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
            print(substr($buf, $offset, SEQ_LENGTH), "\n");
        }

        substr($buf, 0, -(SEQ_LENGTH-1), "");
    }
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the counts along with the sequences in another file.
perl -e'
    use strict;
    use warnings qw( all );

    my $last = "";
    my $count;
    while (<>) {
        chomp;
        if ($_ eq $last) {
            ++$count;
        } else {
            print("$count $last\n") if $count;
            $last = $_;
            $count = 1;
        }
    }
    print("$count $last\n") if $count;  # flush the final group
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
    use strict;
    use warnings qw( all );

    my $last_count;
    while (<>) {
        my ($count, $seq) = split;
        last if $. > 5 && $count != $last_count;
        print("$seq $count\n");
        $last_count = $count;
    }
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequences, you could also compress each sequence into ceil(log256(5^20)) = 6 bytes, then base64-encode that into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
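A sketch of that packing step (my own illustration, not part of the original answer; it assumes a 64-bit Perl, an alphabet of A/C/G/N/T, and 20-character sequences):

use MIME::Base64 qw(encode_base64);

my %digit;
@digit{qw(A C G N T)} = 0 .. 4;                    # assumed 5-letter alphabet

sub pack_seq {
    my ($seq) = @_;                                # a 20-character sequence
    my $n = 0;
    $n = $n * 5 + $digit{$_} for split //, $seq;   # base-5 integer, < 5**20
    my $bytes = substr pack('Q>', $n), 2;          # low 48 bits as 6 bytes
    return encode_base64($bytes, '');              # exactly 8 base64 characters
}

print pack_seq('ACGTNACGTNACGTNACGTN'), "\n";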
Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.
Generally speaking, Perl is really slow at character-by-character processing solutions like those posted above; it's much faster at something like regular expressions, since essentially your overhead is mainly how many operators you're executing.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and then we pass over it 5 times, generating sequences of 5 characters; after the first pass we start chopping a character off the front of the string. In a lot of languages this would be really slow, since you'd have to re-allocate the entire string, but perl cheats in this special case and just sets the index of the string to 1 + what it was before.
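You can see that copy-avoidance for yourself with the core Devel::Peek module (a sketch; whether and how the offset optimisation shows up in the dump varies by Perl version):

use Devel::Peek;

my $str = "abcdefgh";
substr($str, 0, 1) = "";   # chop one character off the front
Dump($str);                # the dump shows an OFFSET field rather than a reallocated string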
I haven't benchmarked this, but I bet something like this is a much more viable approach than the algorithms above. You could also do the uniq counting in perl, of course, by incrementing a hash (a substitution with the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;

for my $pos (0..4) {
    substr($str, 0, 1) = "" if $pos;
    $occur{$_}++ for $str =~ m/(.{5})/gs;
}

for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
    say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
while ($s = <>) { for $i (0..length $s) {
$hash{ substr($s, $i, 5) }++ } }
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @Ævar Arnfjörð Bjarmason, @Vorsprung, @ThisSuitIsBlackNot, @Borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @Borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;

open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );

my %hash;
my $string;
while ( $stream->readto("\n", { Out => \$string }) ) {
    foreach my $i (0..length $string) {
        $hash{ substr($string, $i, 5) }++
    }
}

my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1 GB in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
open (my $fh, ">>", "nucleotide.data");
for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';

$/ = \1048575; # Read a million characters/bytes at a time.
my %hash;

while (my $seq = <>) {
    chomp $seq;
    foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
        next unless length $kmer == 5;
        $hash{$kmer}++;
    }
}

foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWTDI, of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
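One minimal sketch of that idea, using the core DB_File module to keep the tally hash on disk (this assumes Berkeley DB is available; every update becomes a disk operation, so it trades speed for memory):

use strict;
use warnings;
use DB_File;

# tie the tally hash to an on-disk Berkeley DB file so it can outgrow RAM
tie my %counts, 'DB_File', 'kmer_counts.db'
    or die "Cannot tie hash: $!";

$counts{'ebfeb'}++;   # used exactly like an ordinary in-memory hash

print "$_ $counts{$_}\n" for keys %counts;
untie %counts;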
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In the larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command at the beginning of this post. The larger file meant a larger number of chunk boundaries, and a larger discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically"), but I wonder what other CPAN modules or well-worn subroutine-style solutions might exist.
1. Skipping four characters - and thus four 5-character strings - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000ths of 1% of the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697
I am a beginner with programming, not just Perl!
Please let me know what needs to change or how else this can be done.
I need to optimize this Perl code to run faster.
For a test run with around a 500MB file with 3 million rows in it, the runtime is 28 minutes.
I know a tool which processes the 39 million rows in 15 minutes, but I want to achieve this running on the command prompt without resorting to the tool.
Earlier I used Date::Manip and Date::Parse, and moved on to DateTime thinking it should be faster.
My approach was: if the dates are ISO-8601 (i.e., YYYY-MM-DD) and we do not need to validate them,
we can compare them lexicographically (i.e., with the lt and gt operators).
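For example (made-up timestamps), ISO-like strings order correctly under plain string comparison, so no date parsing is needed:

print "newer\n" if '2013-07-18 13:45:49' gt '2013-07-17 00:00:00';   # prints "newer"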
Input File Date Format is 07/18/2013 13:45:49
Input File Size 42GB.
Number of Rows 39 Million.
Column Delimiter : |~|
Platform : GNU/Linux
I have tried ">" and "gt" and did not find any difference in runtime.
Code snippet:
use DateTime::Format::Strptime;

my $idate = "07/17/2013 00:00:00";
my $Strp = DateTime::Format::Strptime->new(
    pattern => '%m/%d/%Y %H:%M:%S',
);
my $inputdt = $Strp->parse_datetime($idate);

open (FILE,"myinputfile.dat") or die "could not input File\n";
while (defined(my $line = <FILE>)) {
    my @chunks = split '[|]~[|]', $line;
    my $fdate = $Strp->parse_datetime($chunks[6]);
    if ( $fdate > $inputdt) {
        open(FILEOUT, ">>myoutputfile.dat") or die "Could not write\n";
        print FILEOUT "$line";
    }
}
close(FILE);
close (FILEOUT);
There are two and a half big performance problems here:
You open the output file in every iteration. Just open it once, before the loop.
The parse_datetime call returns a DateTime object. Object orientation in Perl implies a significant overhead. Because your pattern is well defined, we can do the parsing ourselves and remove all object orientation.
Reading a file in the GB range just takes some time. To speed this up, upgrade your hardware (e.g. to a SSD).
To parse the date string into a sortable representation, we just reorder the various parts to a string:
# %m/%d/%Y %H:%M:%S → %Y/%m/%d %H:%M:%S
$fdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
if ($fdate gt $inputdate) { ... }
This would lead to the code
use strict; use warnings;

use constant DATE_FIELD => shift @ARGV;
my $inputdate = shift @ARGV;
$inputdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;

<>; # remove the header line
while (<>) {
    my $filedate = (split /\|~\|/, $_, DATE_FIELD + 2)[DATE_FIELD];
    $filedate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
    print if $filedate gt $inputdate;
}
The in- and output, as well as the start date, are specified on the command line, e.g.
./script 6 '07/17/2013 00:00:00' myinputfile.dat >>myoutputfile.dat
I am trying to code a Perl script which will take a date in the pattern October 24, 2011 and convert it to 10,24,2011.
In order to do this I have prepared a hash which has the month name as its key and a numerical value representing the month's position as its value.
I will read the input string and use a regular expression to extract the month name from the above format.
Then I replace the month name with the value whose key corresponds to that month.
Here's the script I have coded so far, but it's not working for me.
# @dates array will have every element in this format -> October 24, 2011.
%days=("January",01,"February",02,"March",03,"April",04,"May",05,"June",06,"July",07,"August",08,"September",09,"October",10,"November",11,"December",12);

@output = map {
    $pattern = $_;
    $pattern =~ s/(.*)\s/$days{$1};
} @dates;

foreach $output (@output)
{
    print $output."\n";
}
Here's a little explanation of what I am trying to do with this code.
@output will have the newly formatted array, with the month name replaced by the corresponding numeral as defined in the hash.
The map function is used to transform the elements of the array on the fly.
A sequence of characters followed by a space is the regular expression used to extract the month name from the pattern October 24, 2011.
This will be referenced by $1.
I look up the corresponding value for $1 in the hash using $days{$1}.
I see a few problems here. The first is that there is no use strict;.
A number with a leading zero is assumed to be in octal format (i.e. base 8), so 08 is invalid. You want one of these:
%days = ("January", 1, "February", 2, ...
%days = ("January", "01", "February", "02", ...
%days = ("January" => 1, "February" => 2, ...
%days = ("January" => "01", "February" => "02", ...
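A quick demonstration of why the leading zero bites:

print 010, "\n";   # prints 8: a leading zero means octal
# my $x = 08;      # this doesn't even compile: 8 is not a valid octal digit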
You should also be declaring your variables with my:
my %days = ...
my #output = ...
You're missing the final slash on your substitution, you probably want a comma in there to match your desired output format, and .* will eat up more than you want:
$pattern =~ s/(\S*)\s/$days{$1}, /;
The block for your map needs to return the value you want in @output, but it currently returns 1 (see perldoc perlop to learn why); something like this will serve you better:
my @output = map {
    my $pattern = $_;  # You don't need this, operating on $_ is fine here
    $pattern =~ s/(\S*)\s/$days{$1}, /;
    $pattern
} @dates;
If you really want the spaces removed from the output, then this should do the trick:
my @output = map {
    my $pattern = $_;  # You don't need this, operating on $_ is fine here
    $pattern =~ s/(\S*)\s/$days{$1}, /;
    $pattern =~ s/\s//g;
    $pattern
} @dates;
There are more compact ways to do this map but I don't want to change too much and confuse you.
And, as mentioned in the comments, you might want to save yourself some trouble and have a look at DateTime and related packages.
Leaving aside the fact that you pasted non-compiling code (you forgot the trailing "/", as sarnold said), your regex is wrong.
You used a GREEDY regex: .* - meaning take as many characters as possible while matching. So your regex matched October 24, instead of October.
You need to do \S+\s
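To illustrate the difference on the sample date:

'October 24, 2011' =~ /(.*)\s/;    # greedy: $1 is 'October 24,'
print "$1\n";
'October 24, 2011' =~ /(\S+)\s/;   # $1 is 'October'
print "$1\n";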
Do you want to "substitute array elements with hash values," or do you want to map month names to numbers? If it's the latter, the following will convert month_name day year to month_number day year with less code:
perl -le '$d=$ARGV[0]; for (qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}) { $i++; last if $d =~ s/\b$_[^\s]*/$i/i; }; print $d' "october 24, 2011"
Here's some feedback on your code:
Your pasted code does not compile.
You didn't use strict and warnings.
01 through 09 need to be in double quotes.
You do not need to reassign $_ inside your map statement.
map needs to end with the value you intend to insert, e.g.: map { s/(\w+)/$days{$1}/; $_ }
say for @output looks nicer. =)