Help writing flexible splits, perl

A couple weeks ago I posted a question about trouble I was having parsing an irregularly-formatted data file. Here's a sample of the data:
01-021412 15/02/2007 207,000.00 14,839.00 18 -6 2 6 6 5 16 6 4 4 3 -28 -59 -88 -119
-149 -191 -215 -246
Atraso Promedio ---> 2.88
I need a program that would extract 01-021412 and 18, count and sum all of the values in the subsequent series, and store the atraso promedio, and that could repeat this operation for over 40,000 entries. I received a very helpful response, and from that was able to write the code:
use strict;
use warnings;
#Create an output file
open(OUT, ">outFull.csv");
print OUT "loanID,nPayments,atrasoPromedio,atrasoAlt,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72\n";
open(MYINPUTFILE, "<DATOS HISTORICO ASPIRE2.txt");
my @payments;
my $numberOfPayments;
my $loanNumber;
while (<MYINPUTFILE>)
{
    if (/\b\d{2}-\d{6}\b/)
    {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    }
    elsif (m/---> *(\d*\.\d*)/)
    {
        my (undef, undef, undef, $atrasoPromedio) = split;
        my $N = scalar @payments;
        print "$numberOfPayments,$N,$loanNumber\n";
        if ($N == $numberOfPayments) {
            my $total = 0;
            $total += $_ for @payments;
            my $atrasoAlt = $total/$N;
            print OUT "$loanNumber,$numberOfPayments,$atrasoPromedio,$atrasoAlt,", join(',', @payments), "\n";
        }
    }
    else
    {
        push(@payments, split);
    }
}
This works fine, except that about 50 percent of the entries include an '*', as follows:
* 01-051948 06/03/2009 424,350.00 17,315.00 48 0 6 -2 0 21 10 9 13 10 9 7 13 3 4
12 -3 14 8 6
Atraso Promedio ---> 3.02
The asterisk causes the program to fail because it interrupts the split pattern, causing incorrect variable assignments. Until now I've dealt with this by removing the asterisks from the input data file, but I just realized that by doing this the program actually omits these loans altogether. Is there an economical way to modify my script so that it handles entries with and without asterisks?
As an aside, if an entry does include an asterisk I would like to record this fact in the output data.
Many thanks in advance,
Aaron

Use an intermediate array:
my $has_asterisk;
# ...
if (/\b\d{2}-\d{6}\b/)
{
    my @fields = split;
    $has_asterisk = $fields[0] eq '*';
    shift @fields if $has_asterisk;
    ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = @fields;
}

You could discard the asterisk before doing the split:
while (<MYINPUTFILE>) {
    s/^\s*\*\s*//;
    if (/\b\d{2}-\d{6}\b/) {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
        ...
And, apart from this, you should use the three-argument form of open, lexical filehandles, and test open for failure.
my $file = 'DATOS HISTORICO ASPIRE2.txt';
open my $MYINPUTFILE, '<', $file or die "unable to open '$file' for reading : $!";

It looks like the regex in your first if statement is not accounting for that '*', so how about we modify it. My Perl regex skills are a little rusty, so note that this is untested.
if(/(?:\* )?\b\d{2}-\d{6}\b/)
* is a quantifier meaning "zero or more times", so we need to escape it as \*.
(?: ) means "group this together but don't save it"; I just use that so I can apply the ? to both the space and the * at the same time.

At beginning of the while loop, try this:
...
while(<MYINPUTFILE>)
{
my $asterisk_exists = 0;
if (s/^\* //) {
$asterisk_exists = 1;
}
...
In addition to removing the asterisk by using the s/// function, you also keep track of whether or not the asterisk was there in the first place. With the asterisk removed, the rest of your script should function as normal.
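Putting those pieces together (strip the asterisk, remember it, and emit it as an extra CSV column, which is what the question asks for), a minimal, untested sketch along the lines of the original script might be:
use strict;
use warnings;
open(OUT, ">outFull.csv") or die "Can't write outFull.csv: $!";
open(MYINPUTFILE, "<", "DATOS HISTORICO ASPIRE2.txt") or die "Can't read input file: $!";
my (@payments, $numberOfPayments, $loanNumber, $has_asterisk);
while (<MYINPUTFILE>) {
    if (/\b\d{2}-\d{6}\b/) {
        $has_asterisk = s/^\s*\*\s*// ? 1 : 0;   # strip the flag and remember it
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    }
    elsif (m/---> *(\d*\.\d*)/) {
        my (undef, undef, undef, $atrasoPromedio) = split;
        my $N = scalar @payments;
        if ($N == $numberOfPayments) {
            my $total = 0;
            $total += $_ for @payments;
            my $atrasoAlt = $total / $N;
            # extra "asterisk" column (0 or 1); remember to add it to the CSV header line as well
            print OUT "$loanNumber,$numberOfPayments,$atrasoPromedio,$atrasoAlt,$has_asterisk,",
                      join(',', @payments), "\n";
        }
    }
    else {
        push(@payments, split);
    }
}
close MYINPUTFILE;
close OUT;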


run, save variable state, interrupt execution, and reload variables from the last state before the interrupt [duplicate]

I am researching a way to interrupt the execution of the Perl script below such that I can save the point the iteration has reached, and resume from that point the next time I run it.
Basically, I have seen other people talk about "saving Perl's execution in the state in which it was".
I'm looking at the script and trying to understand how I can extract the iteration number, print it to a file, and then feed it back into the script as the starting index for the next run.
use strict;
use warnings;
sub powerset(&@) {
    my $callback = shift;
    my $bitmask = '';
    my $bytes = @_/8;
    {
        my @indices = grep vec($bitmask, $_, 1), 0..$#_;
        $callback->( @_[@indices] );
        ++vec($bitmask, $_, 8) and last for 0 .. $bytes;
        redo if @indices != @_;
    }
}
powerset { print "[@_]\n" } 1..5;
my $i = 0;
For example: let's say I want the powerset of the set 1..50, and at the time I interrupt the execution the last line printed to the output is [1 2 3 4 6 7 8 11 12 13 15 16 17]. Then, on the second run of the script, I want to start iterating from the index corresponding to the line [1 2 3 4 6 7 8 11 12 13 15 16 17] that was printed; I do not want to redo the iterations whose results were already printed during the first run of the Perl file.
My goal is to be able to start a first run of file.pl and print the output to an out.txt file (perl file.pl > out.txt), then move the out.txt file to another directory (i.e. delete out.txt from the current directory). Then I must be able to continue the iteration from the index corresponding to the line [1 2 3 4 6 7 8 11 12 13 15 16 17] that has already been printed.
Note: I know that "to suspend the program, you'd type ^Z (= Ctrl-Z) while the program is running. Then type bg to send the (stopped) process to the background. To resume it again, type fg (or fg with a job number in case you have several backgrounded jobs - type jobs to list them)." But that's not quite what I am looking for.
Perhaps what you are looking for can be demonstrated with the following code snippet implementing a checkpoint approach.
The code implements two functions to store and restore the execution stage and data_set.
If the code is interrupted in the middle of execution, the following re-run will continue from the stage after the last completed one by restoring the previously saved data_set.
NOTE: for demonstration, put exit after store_checkpoint($data) at the end of STAGE2. Run the script once, remove the added exit, and re-run the script once more.
use strict;
use warnings;
use feature 'say';
use YAML;
use Data::Dumper;
my $checkpoint = 'checkpoint.dat';
my $data;
restore_checkpoint() if -e $checkpoint;
STAGE1:
$data->{data_set} = [ title => 'scientist', age => 27, address => '123 street, NY 12345 USA' ];
$data->{next_stage} = 'STAGE2';
store_checkpoint($data);
STAGE2:
$data->{data_set} = [ title => 'professor', age => 45, address => '234 street, WA 98230 USA' ];
$data->{next_stage} = 'STAGE3';
store_checkpoint($data);
STAGE3:
$data->{data_set} = [ title => 'doctor', age => 53, address => '345 street, OK 56789 USA' ];
$data->{next_stage} = 'STAGE4';
store_checkpoint($data);
STAGE4:
$data->{data_set} = [ title => 'CEO', age => 38, address => '456 street, MA 54321 USA' ];
$data->{next_stage} = 'STAGE5';
store_checkpoint($data);
STAGE5:
say 'Done';
sub store_checkpoint {
my $data = shift;
say 'INFO: save_checkpoint ' . $data->{next_stage};
open my $fh, '>', $checkpoint
or die "Couldn't open $checkpoint";
say $fh Dump($data);
close $fh;
}
sub restore_checkpoint {
say 'INFO: restore_checkpoint';
open my $fh, '<', $checkpoint
or die "Couldn't open $checkpoint";
my $yaml = do { local $/; <$fh> };
close $fh;
$data = Load($yaml);
say Dumper($data);
say 'INFO: continue stage ' . $data->{next_stage};
goto $data->{next_stage};
}
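For the powerset example specifically, the same idea applies if you persist the iterator's state: the loop's entire state is the $bitmask counter, so saving it after each iteration and restoring it at start-up lets a later run pick up where the previous one stopped. A minimal sketch (the bitmask.dat file name is just an assumption, and in a real run you would checkpoint far less often than once per subset):
use strict;
use warnings;
my $state_file = 'bitmask.dat';    # hypothetical checkpoint file
sub powerset(&@) {
    my $callback = shift;
    my $bitmask  = '';
    my $bytes    = @_/8;
    # resume from a previous run if a saved bitmask exists
    if (-e $state_file) {
        open my $fh, '<', $state_file or die "Can't read $state_file: $!";
        binmode $fh;
        local $/;
        $bitmask = <$fh>;
        close $fh;
    }
    {
        my @indices = grep vec($bitmask, $_, 1), 0 .. $#_;
        $callback->( @_[@indices] );
        ++vec($bitmask, $_, 8) and last for 0 .. $bytes;
        # checkpoint: persist the counter so the next run continues from here
        open my $fh, '>', $state_file or die "Can't write $state_file: $!";
        binmode $fh;
        print $fh $bitmask;
        close $fh;
        redo if @indices != @_;
    }
    unlink $state_file;    # finished: the next run starts from scratch
}
powerset { print "[@_]\n" } 1..5;
Interrupt the script, re-run it, and it continues with the first subset that had not yet been printed.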

Count subsequences in hundreds of GB of data

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'
my $seq = <$in>; # read whole file into string
my $len = length($seq);
my $seq_length = 5; # set k-mer length
my %data;
for (my $i = 0; $i <= $len - $seq_length; $i++) {
my $kmer = substr($seq, $i, $seq_length);
$data{$kmer}++;
}
# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
print "$kmer $data{$kmer}\n";
$count++;
last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here.
Is substr the most memory efficient tool for this task?
From your own code it looks like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if the line has one newline character at the end, the selection of the five most frequent subsequences will throw it out, as it occurs only once.
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory.
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one, as it is much clearer.
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file.
I've also used constants for the k-mer size and the chunk size. They are constant, after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort.
use strict;
use warnings 'all';
use constant SEQ_LENGTH => 5; # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB
my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};
my %data;
my $chunk;
my $length = 0;
while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
    $length += $size;
    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
        my $kmer = substr $chunk, $offset, SEQ_LENGTH;
        ++$data{$kmer};
    }
    $chunk  = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}
my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 ≈ 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
use strict;
use warnings qw( all );
use constant SEQ_LENGTH => 20;
use constant CHUNK_SIZE => 1024 * 1024;
my $buf = "";
while (1) {
my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
die($!) if !defined($size);
last if !$size;
for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
print(substr($buf, $offset, SEQ_LENGTH), "\n");
}
substr($buf, 0, -(SEQ_LENGTH-1), "");
}
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the count along with the sequence in another file.
perl -e'
use strict;
use warnings qw( all );
my $last = "";
my $count;
while (<>) {
chomp;
if ($_ eq $last) {
++$count;
} else {
print("$count $last\n") if $count;
$last = $_;
$count = 1;
}
}
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
use strict;
use warnings qw( all );
my $last_count;
while (<>) {
my ($count, $seq) = split;
last if $. > 5 && $count != $last_count;
print("$seq $count\n");
$last_count = $count;
}
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log256(5^20)) = 6 bytes, then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
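As an illustration of that packing idea (a sketch only; it assumes a 64-bit Perl so that 5^20 fits in a native integer, and the A/C/G/T/N alphabet used elsewhere in this thread):
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 );
my %digit = ( A => 0, C => 1, G => 2, T => 3, N => 4 );
sub pack_seq {
    my ($seq) = @_;                                # 20 characters from [ACGTN]
    my $n = 0;
    $n = $n * 5 + $digit{$_} for split //, $seq;   # base-5 value, < 5**20
    my $packed = substr pack('Q>', $n), 2;         # keep the low 6 bytes of the 64-bit integer
    return encode_base64($packed, '');             # 6 bytes -> 8 ASCII characters
}
print pack_seq('NCCANGCTNGGNCGNNANNA'), "\n";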
Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.
Generally speaking, Perl is really slow at character-by-character processing solutions like those posted above; it's much faster at something like regular expressions, since essentially your overhead is mainly how many operators you're executing.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and then we pass over it 5 times, generating sequences of 5 characters; after the first pass we start chopping off a character from the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but Perl cheats for this special case and just sets the index of the string to 1+ what it was before.
I haven't benchmarked this, but I bet something like this is a much more viable way to do this than the algorithms above. You could also do the uniq counting in Perl, of course, by incrementing a hash (probably fastest with the /e regex option), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
substr($str, 0, 1) = "" if $pos;
$occur{$_}++ for $str =~ m/(.{5})/gs;
}
for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
while ($s = <>) { for $i (0..length $s) {
$hash{ substr($s, $i, 5) }++ } }
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNot, @Borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @Borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
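For instance, the simple substr() counting loop from the timing run above can be made boundary-safe by carrying the last four characters of each chunk into the next one. A minimal sketch of that idea (not benchmarked):
use strict;
use warnings;
$/ = \1048576;              # read roughly 1 MiB at a time
my %hash;
my $tail = '';
while (my $chunk = <>) {
    $chunk = $tail . $chunk;        # prepend the leftover from the previous chunk
    for my $i (0 .. length($chunk) - 5) {
        $hash{ substr($chunk, $i, 5) }++;
    }
    # keep the last 4 characters (or everything, if shorter) so no 5-mer is lost at the boundary
    $tail = length($chunk) >= 4 ? substr($chunk, -4) : $chunk;
}
my $it = 0;
for my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
    print "$k $hash{$k}\n";
    last if ++$it == 5;
}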
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;
open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );
my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) {
foreach my $i (0..length $string) {
$hash{ substr($string, $i, 5) }++
}
}
my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1 GB in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
open (my $fh, ">>", "nucleotide.data");
for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';
$/ = \1048575; # Read a million character/bytes at a time.
my %hash;
while (my $seq = <>) {
chomp $seq;
foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
next unless length $kmer == 5;
$hash{$kmer}++;
}
}
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
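One way to keep that hash off the heap (a sketch only, assuming the DB_File module and Berkeley DB are available) is to tie it to an on-disk database; the counting code itself stays unchanged, although the final sort over all keys can still be expensive:
use strict;
use warnings;
use Fcntl;      # for the O_RDWR / O_CREAT flags
use DB_File;    # ties a hash to an on-disk Berkeley DB file
tie my %hash, 'DB_File', 'kmer_counts.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie kmer_counts.db: $!";
# ...count exactly as before, e.g. $hash{$kmer}++ ...
untie %hash;    # flush and close the database when done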
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In a larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command line at the beginning of this post. The larger file meant a larger number of chunk boundaries, and hence a larger discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.
1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000 of 1% from the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697

Parallelization of perl script

I have a perl script I wish to parallelise.
It is composed of a while loop over 11000 lines nested inside another while loop over 3400 lines, which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");
while (<FILE1>)
{
    my @data = split (/ /, $_);
    my $RS = 1;
    open (FILE2, "File2.txt") or die "Can't open File2";
    while (<FILE2>)
    {
        my @value = split (/ /, $_);
        if ($data[$RS] == 1) {print OUT $value[1]; $RS++;}
        elsif ($data[$RS] == 2) {print OUT $value[2]; $RS++;}
        elsif ($data[$RS] == 0) {print OUT $value[3]; $RS++;}
    }
    close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with perl if possible. I tried to insert this code inside of a bash script, but I don't really understand how to insert a language inside another one.
My File1 contains a list of IDs with information in columns. Each column is then related to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
The File2 of yours isn't very large. I'd say we load it into memory. This has the following advantages:
We do our parsing once and only once.
The file isn't obscenly large, so space isn't much of an issue.
We can create a data structure that makes lookups very simple.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my @conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling
my @file2;
{
    open my $file2, "<", "File2.txt";
    while (<$file2>) {
        my (undef, @vals) = split;
        # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
        unshift @vals, pop @vals;
        push @file2, \@vals;
    }
}
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header
while (<$file1>) {
    my ($id, @indices) = split;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    # but I guess you'd want some separator in between
    # If so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but it should have a better constant factor. The output of the above code, given your example input, is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
Where to go from here
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads; # should be 1st module to be loaded
use Thread::Queue;
use constant NUM_THREADS => 4; # number of cores
# parse the File2 data here
my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;
# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions
# wait for threads to complete
$_->join for @threads;
sub worker {
    while (defined(my $_ = $queue->dequeue)) {
        my ($id, @indices) = split;
        print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    }
}
Note that this copies the @file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends used more time than doing the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.
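For reference, profiling with Devel::NYTProf is usually just a matter of loading it with -d and generating the HTML report afterwards (a usage sketch; script.pl stands in for your own script):
perl -d:NYTProf script.pl
nytprofhtml            # writes an HTML report under ./nytprof/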

How do I replace row identifiers with sequential numbers?

I have written a perl script that splits 3 columns into scalars and replaces various values in the second column using regex. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
With code this could look something similar to this (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
# do your thing
$counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
$prev_item_id = $item_id;
print "$item_id\t$counter\t...\n";
}
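Filled in with the split and substitutions from the question, the whole loop might look like this (a sketch in the same untested spirit, printing the counter as a fourth column):
open(DATA, "<", "text_to_be_processed.txt") or die "Can't open file: $!";
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
    chomp;
    my ($item_id, $callnum, $data) = split(/\|/);
    $callnum =~ s/110/\%I/g;
    $callnum =~ s/245/\%T/g;
    $callnum =~ s/260/\%U/g;
    $counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
    $prev_item_id = $item_id;
    # print "$counter\t$callnum\t$data\n" instead if you want to replace the first column
    print "$item_id\t$callnum\t$data\t$counter\n";
}
close DATA;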
This goes a little further than just what you asked...
Use lexical filehandles
autodie makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;
open(my $fh, "<", "text_to_be_processed.txt");
my %Callnum_Map = (
110 => '%I',
245 => '%T',
260 => '%U',
);
my %item_id_count;
while (<$fh>) {
chomp;
my($item_id,$callnum,$data) = split m{\|};
for my $search (keys %Callnum_Map) {
my $replace = $Callnum_Map{$search};
$callnum =~ s{$search}{$replace}g;
}
my $item_count = ++$item_id_count{$item_id};
print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item ID in the whole file. If you want how many times it has been seen consecutively, use Moritz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");
my $counter = 0;
my $prev;
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
++$counter;
$item_id = $counter;
#reset counter if $prev is different than $item_id
$counter = 0 if ($prev ne $item_id );
$prev = $item_id;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;

Strategies to handle a file with multiple fixed formats

This question is not Perl-specific (although the unpack function will most probably figure into my implementation).
I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.
Here's an example (commentary on RHS):
# | Format | Level | Comment
# +--------+-------+---------
**DEVICE 109523.69142 # 1 1 file-specific
.981 561A # 2 1
10/MAY/2010 24.15.30,13.45.03 # 3 2 group of records
05:03:01 AB23X 15.67 101325.72 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
05:03:15 CR22X 16.72 101325.42 # 4 3 new record
* 14 29.16264 11 0 # 5 3
06:23:51 AW41X 15.67 101323.9 # 4 3
* 14 31.26493219 0 # 5 3
11/MAY/2010 24.07.13,13.44.63 # 3 2 group of new records
15:57:14 AB23X 15.67 101327.23 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
15:59:59 CR22X 16.72 101331.88 # 4 3 new record
* 14 29.16264 11 0 # 5
The logic I have at the moment is fragile:
I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
I also know that Formats 4 and 5 always come in pairs, as they correspond to a single record. The number of records may be variable.
I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).
The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.
Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:
while (<>) {
    given ($_) {
        when (@{[ map $pattern{$_}, @expect ]}) {}
        default {
            die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
    }
}
As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regexp objects performs a short-circuiting search to find the first match.
From this, you can infer that @expect is a list of names of formats we expect to find on the current line.
For a while, I was stuck on the case of multiple possible expected formats and how to know which format just matched, but then I remembered (?{ code }) in regular expressions:
This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.
This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
           (?{ $device->{attr1} = $1;
               @expect = qw< fmt2 >;
           })
          /x,
After processing the input from your question, $device contains
{
'attr1' => '109523.69142',
'attr2' => '.981',
'attr3' => '561A',
'groups' => [
{
'date' => '10/MAY/2010',
'nnn' => [ '24.15.30', '13.45.03' ],
'records' => [
[ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474', '13', '0' ],
[ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264', '11', '0' ],
[ '06:23:51', 'AW41X', '15.67', '101323.9', '14', '31.264932', '19', '0' ],
],
},
{
'date' => '11/MAY/2010',
'nnn' => [ '24.07.13', '13.44.63' ],
'records' => [
[ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
[ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
],
}
],
}
I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:
Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.
For completeness, a working program demonstrating the result is below.
#! /usr/bin/perl
use warnings;
use strict;
use feature ':5.10';
use re 'eval';
*ARGV = *DATA;
my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
           (?{ $device->{attr1} = $1;
               @expect = qw< fmt2 >;
           })
          /x,
fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
           (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
               @expect = qw< fmt3 >;
           })
          /x,
# e.g., 10/MAY/2010 24.15.30,13.45.03
fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
           (?{ my($date,$nnns) = ($1,$2);
               push @{ $device->{groups} } =>
                    { nnn  => [ split m|,| => $nnns ],
                      date => $date };
               @expect = qw< fmt4 >;
           })
          /x,
# e.g., 05:03:01 AB23X 15.67 101325.72
fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
             (\S+) \s+ (\S+) \s+ (\S+)
             \s*$
           (?{ push @{ $device->{groups}[-1]{records} } =>
                    [ $1, $2, $3, $4 ];
               @expect = qw< fmt4 fmt5 >;
           })
          /x,
# e.g., * 14 31.30474 13 0
fmt5 => qr/^\* \s+ (\d+) \s+
           # tricky: possibly no whitespace after 9-char float
           ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
           (\d+) \s+ (\d+)
           \s*$
           (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                    $1, $2, $3, $4;
               @expect = qw< fmt4 fmt3 fmt2 >;
           })
          /x,
);
while (<>) {
    given ($_) {
        when (@{[ map $pattern{$_}, @expect ]}) {}
        default {
            die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
    }
}
use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;
__DATA__
**DEVICE 109523.69142
.981 561A
10/MAY/2010 24.15.30,13.45.03
05:03:01 AB23X 15.67 101325.72
* 14 31.30474 13 0
05:03:15 CR22X 16.72 101325.42
* 14 29.16264 11 0
06:23:51 AW41X 15.67 101323.9
* 14 31.26493219 0
11/MAY/2010 24.07.13,13.44.63
15:57:14 AB23X 15.67 101327.23
* 14 31.30474 13 0
15:59:59 CR22X 16.72 101331.88
* 14 29.16264 11 0
This is a good question. Two suggestions occur to me.
(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used it several times and have been happy with the results in most cases.
(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.
First an observation about your data: if a set of formats always occurs as a pair, it effectively represents a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.
And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.
In unbreak_records.pl.
Omitting the shebang and use strict/warnings.
while (<>){
chomp;
print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";
In add_record_types.pl
while (<>){
next unless /\S/;
my $rt = /^\*/ ? 1 :
/^..\// ? 2 : 3;
print $rt, ' ', $_;
}
At the command line.
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
Output:
1 **DEVICE 109523.69142 .981 561A
2 10/MAY/2010 24.15.30,13.45.03
3 05:03:01 AB23X 15.67 101325.72 * 14 31.30474 13 0
3 05:03:15 CR22X 16.72 101325.42 * 14 29.16264 11 0
3 06:23:51 AW41X 15.67 101323.9 * 14 31.26493219 0
2 11/MAY/2010 24.07.13,13.44.63
3 15:57:14 AB23X 15.67 101327.23 * 14 31.30474 13 0
3 15:59:59 CR22X 16.72 101331.88 * 14 29.16264 11 0
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.
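For example, a small follow-up parser over reformatted.dat (just a sketch; the header/date/nnn/records field names are arbitrary) can rebuild a nested structure from the 1/2/3 record types:
use strict;
use warnings;
use Data::Dumper;
my ($device, $group);
while (<>) {
    chomp;
    my ($type, @fields) = split;
    if    ($type == 1) { $device = { header => \@fields, groups => [] } }
    elsif ($type == 2) { $group  = { date => $fields[0], nnn => $fields[1], records => [] };
                         push @{ $device->{groups} }, $group }
    else               { push @{ $group->{records} }, \@fields }
}
print Dumper($device);
Run it as perl parse_reformatted.pl reformatted.dat (the script name is arbitrary).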
Depending on what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser and get a data structure out of it.
This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).
my $state = 'expect_fmt1';
while (defined $state) {
$state = $object->$state();
}
...
sub expect_fmt1 {
my $self = shift;
# read format 1, parse it, store it in object
return 'expect_fmt2';
}
Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:
If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.
If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:
sub get_line {
    my $self = shift;
    my $cache = $self->{line_cache};
    return shift @$cache if @$cache;
    return $self->{filehandle}->getline;
}
sub unget_line { my $self = shift; unshift @{ $self->{line_cache} }, @_ }
Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).
I would keep additional state in one or more variables and update it per row.
Then you know, for example, whether the last line was level 1, or whether the last row was format 4 (and you can expect format 5), which makes your processing more robust.
What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.
You could also have an alternation which combines #2 and #3:
my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|$regex3/;
If #4 immediately follows #3, you'll want to apply regex #4 after #3, and then regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.
while ( <> ) {
given ( $state ) {
when ( 1 ) { my ( $device_num ) = m/$regex1/; $state++; }
when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
when ( 3 ) {
my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
$state += $cap1 ? 1 : 2;
}
}
}
That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.