Read file into chunks and process the chunks in parallel - perl

So basically I want to read a file into a hash, but since the file is huge and doesn't fit into RAM, I split it into chunks, process the data (the search_f2 sub) and read the next chunk. This seems to work, but of course it uses only one core.
Is there an easy way to fork the search_f2 sub?
I've tried a naive way with Parallel::ForkManager but it doesn't work as far as I can see.
Any hint on how to achieve that? I actually don't need to return from the forked sub; it would be sufficient if it printed the result to STDOUT.
The file1 structure is the following (basically the result of a tar -tf command):
tarfile1.tar
gzip1.gz
<skip>
gzipX.gz
<skip>
tarfileX.tar
<some random number of gz files>
file2 is just a plain, line-break-separated list of gzipX.gz files.
The Perl code:
#!/usr/bin/perl

use strict;
use warnings;
use feature qw(say);
use Data::Dumper;
use Parallel::ForkManager;

my $file1 = $ARGV[0] // die "Need a file as argument";
my $file2 = $ARGV[1] // die "Need a file as argument";

my $fd1 = read_f($file1);

my %hdata;
my $tarfile;
my $index = 0;

my $pm = Parallel::ForkManager->new(10);

while (my $line = <$fd1>) {
    chomp $line;
    if ( $line =~ m/^somepattern.*tar$/ ){
        $tarfile = $line;
        $index++;
    }
    if (++$index >= '100') {
        say "\tForking a process";
        my $pid = $pm->start and do {
            $index = 0;
            %hdata = ();
            next;
        };
        search_f2(\%hdata);
        $pm->finish;
    }
    push @{$hdata{$tarfile}}, $line if $line =~ m/.*\.gz$/;
}
close $fd1;

# last search
search_f2(\%hdata);

sub search_f2 {
    my ($h) = @_;
    my %hdata = %$h;

    my $fd2 = read_f($file2);

    while (my $ciffile = <$fd2>) {
        chomp $ciffile;
        foreach my $tarfile (keys %hdata) {
            my $values = $hdata{$tarfile};
            if (grep (/$ciffile/, @$values)) {
                say "$tarfile";
                delete $hdata{$tarfile};
                last;
            }
        }
    }
    close $fd2;
    return;
}

sub read_f {
    my $file = shift;
    die "Can't open file $file: $!\n" if ! -e $file;
    # gzip happily parses plain files as well
    open my $fh, "pigz -fdc $file|" or die "Can't open file $file: $!\n";
    return $fh if $fh;
}

I take the task to be the following: read a certain number of lines from a file and process each such chunk of text in its own fork. I'm not quite sure of some details in the question, so here is a basic demo, hopefully to serve as a template.
The number of processes is kept at 3, and batches ("chunks") of 2 lines are processed in each.
use warnings;
use strict;
use feature qw(say state);
use Parallel::ForkManager;

my $file = shift // die "Usage: $0 filename\n";

my $pm = Parallel::ForkManager->new(3);

open my $fh, '<', $file or die $!;

my ($chunk, $num_lines);

while (my $line = <$fh>) {
    chomp $line;
    say "Processing line: |$line|";

    $chunk .= $line;

    if (++$num_lines >= 2) {
        say "\tForking a process";
        $pm->start and do {
            $num_lines = 0;
            $chunk = '';
            next;
        };
        proc_chunk($chunk);
        $pm->finish;
    }
}
$pm->wait_all_children;

sub proc_chunk {
    my ($chunk) = @_;
    my $line_nos = join ' ', $chunk =~ /#([0-9]+)/g;
    say "\t\tin a fork, processing chunk with lines: $line_nos";
    sleep 10;
    say "\t\t\t... done with fork";
}
In P::FM, that $pm->start and next; forks a process, and the parent immediately jumps to the next iteration of the loop. So any needed resetting of variables needs to be done right there, and I use a do { ... } block for it.†
The sub sleeps so that we can see the group of forks exiting practically together, which happens because the processing here is so quick. (In fact P::FM forks a new process as soon as one finishes, to keep the given number running. It doesn't wait for the whole batch to finish first, unless wait_for_available_procs is set up; see an example and a lot more detail here, and a small sketch after the output below.)
This prints
Processing line: |This is line #1|
Processing line: |This is line #2|
Forking a process
Processing line: |This is line #3|
Processing line: |This is line #4|
Forking a process
Processing line: |This is line #5|
Processing line: |This is line #6|
Forking a process
Processing line: |This is line #7|
Processing line: |This is line #8|
Forking a process
in a fork, processing chunk with lines: 1 2
in a fork, processing chunk with lines: 3 4
in a fork, processing chunk with lines: 5 6
... done with fork
... done with fork
... done with fork
Processing line: |This is line #9|
Processing line: |This is line #10|
Forking a process
Processing line: |This is line #11|
Processing line: |This is line #12|
Forking a process
Processing line: |This is line #13|
Processing line: |This is line #14|
Forking a process
in a fork, processing chunk with lines: 7 8
in a fork, processing chunk with lines: 9 10
in a fork, processing chunk with lines: 11 12
... done with fork
... done with fork
... done with fork
Processing line: |This is line #15|
Processing line: |This is line #16|
Forking a process
Processing line: |This is line #17|
Processing line: |This is line #18|
Forking a process
Processing line: |This is line #19|
in a fork, processing chunk with lines: 13 14
Processing line: |This is line #20|
Forking a process
in a fork, processing chunk with lines: 15 16
in a fork, processing chunk with lines: 17 18
^C
[...etc...]
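For completeness, here is a minimal, untested sketch of the wait_for_available_procs variant mentioned above, which makes each batch of forks start only once the previous batch has fully finished. It assumes a reasonably recent Parallel::ForkManager (the method was added in later releases); the counts of 3 processes and 4 batches are arbitrary demo values.

use warnings;
use strict;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(3);

for my $batch (1 .. 4) {
    # Block until all 3 slots are free, so the next batch of forks
    # only starts once the previous batch has finished completely.
    $pm->wait_for_available_procs(3);

    for my $job (1 .. 3) {
        $pm->start and next;   # parent goes on to launch the next job
        print "batch $batch, job $job in PID $$\n";
        sleep 2;               # stand-in for real work
        $pm->finish;
    }
}
$pm->wait_all_children;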
† The original version of the question didn't have that do { } block but only ... and next, so the parent bailed out directly. (I see that the question has now been edited to include it.)

Related

skipping first x and last y lines of file

I'm doing some simple parsing on text files (which could get up into the 1GB range). How would I go about skipping the first N rows, and more importantly, the last (different) N rows? I'm sure I could open the file and count the rows, and do something with $_ < total_row_count -N, but that seems incredibly inefficient.
I'm pretty much a perl newb, btw.
A file is a sequence of bytes, without the notion of "lines." Some of those bytes are considered as "line" separators (linefeeds), which is how software gives us our "logical" lines to work with. So there is no way to know how many lines there are in a file -- without having read it and counted them, that is.
A simple and naive way is to read line-by-line and count
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
++$cnt while <$fh>;
with a little faster version using $. variable
1 while <$fh>;
my $cnt = $.;
These take between 2.5 and 3 seconds for a 1.1 GB text file on a reasonable desktop.
We can speed this up a lot by reading in larger chunks and counting newline characters
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
NUM_LINES: {
    my $len = 64_000;
    my $buf;

    $cnt += $buf =~ tr/\n//
        while read $fh, $buf, $len;

    seek $fh, 0, 0;
};
This runs in barely over half a second, on the same hardware and Perl version.
I've put it in a block to scope otherwise unneeded variables, but it would better be in a sub, which can check where the filehandle is when it gets it and return it there after counting (so we can count the "rest" of the lines from some point in the file and processing can then continue), etc. It should also check the return of each read call.
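Here is a rough, untested sketch of such a sub; the name count_lines_from_here is made up for illustration.

use strict;
use warnings;

# Count newlines from the filehandle's current position to the end,
# then put the filehandle back where it was. Dies if a read fails.
sub count_lines_from_here {
    my ($fh) = @_;
    my $start = tell $fh;

    my ($cnt, $buf) = (0, '');
    while (1) {
        my $got = read $fh, $buf, 64_000;
        die "Error reading: $!" if not defined $got;
        last if $got == 0;
        $cnt += $buf =~ tr/\n//;
    }

    seek $fh, $start, 0 or die "Can't seek back: $!";
    return $cnt;
}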
I'd think that a half-second overhead on a GB-sized file isn't bad at all.
Still, you can go faster yet, at the expense of it being far messier. Get the file size (metadata, so no reading involved) and seek to a position estimated to be the wanted number of lines before the end (again no reading involved). That most likely won't hit the right spot, so read to the end to count lines and adjust, seeking back further or closer as needed. Repeat until you reach the needed place.
open my $fh, "<", $file or die "Can't open $file: $!";
my $size = -s $file;
my $estimated_line_len = 80;
my $num_last_lines = 100;
my $pos = $size - $num_last_lines*$estimated_line_len;
seek $fh, $pos, 0;
my $cnt;
++$cnt while <$fh>;
say "There are $cnt lines from position $pos to the end";
# likely need to seek back further/closer ...
I'd guess that this should get you there in under 100 ms. Note that $pos is likely inside a line.
Then, once you know the number of lines (or the position of the desired number of lines before the end), do seek $fh, 0, 0 and process away. Or, better, have this in a sub which puts the filehandle back where it was before returning, as mentioned.
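Here is a rough, untested sketch of that idea with the adjustment loop written out; the sub name pos_of_last_n_lines and the strategy of doubling the guess are just illustrative choices.

use strict;
use warnings;

# Return the byte offset at which the last $n lines begin,
# restoring the filehandle's position afterwards.
sub pos_of_last_n_lines {
    my ($fh, $n, $est_line_len) = @_;
    $est_line_len //= 80;

    my $start = tell $fh;
    my $size  = -s $fh;
    my $guess = $n * $est_line_len;

    my $pos_of_nth_last;
    while (1) {
        my $pos = $size - $guess;
        $pos = 0 if $pos < 0;

        seek $fh, $pos, 0 or die "Can't seek: $!";
        scalar <$fh> if $pos > 0;   # discard the (likely partial) line we landed in

        # record the starting offset of every full line from here to the end
        my @starts;
        while (1) {
            my $here = tell $fh;
            last if !defined(scalar <$fh>);
            push @starts, $here;
        }

        if (@starts >= $n) {
            $pos_of_nth_last = $starts[-$n];
            last;
        }
        last if $pos == 0;          # fewer than $n lines in the whole file
        $guess *= 2;                # didn't go back far enough; widen and retry
    }

    seek $fh, $start, 0 or die "Can't seek back: $!";
    return $pos_of_nth_last;
}

Usage would then be along the lines of my $pos = pos_of_last_n_lines($fh, 100); after which you can seek $fh, 0, 0 and stop processing once tell $fh reaches $pos.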
I think you need a circular buffer to avoid reading the entire file into memory.
skip-first-last.pl
#!/usr/bin/perl
use strict;
use warnings;

my ($first, $last) = @ARGV;

my @buf;
while (<STDIN>) {
    my $mod = $. % $last;
    print $buf[$mod] if defined $buf[$mod];
    $buf[$mod] = $_ if $. > $first;
}

1;
Skip first 5 lines and last 2 lines:
$ cat -n skip-first-last.pl | ./skip-first-last.pl 5 2
     6
     7  my @buf;
     8  while (<STDIN>) {
     9      my $mod = $. % $last;
    10      print $buf[$mod] if defined $buf[$mod];
    11      $buf[$mod] = $_ if $. > $first;
    12  }

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in the same way):
Location    Strand  Length  PID        Gene
1..822      +       273     292571599  CDS001
906..1298   +       130     292571600  trxA
I want to split on the Location column, subtract the two numbers (822-1), do the same for every row, and add the results together, so that for these two rows the value would be: (822-1) + (1298-906) = 1213.
How?
My code right now (I don't get any output at all in the terminal, it just continues to process forever):
use warnings;
use strict;

my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";

my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable

while(my $line = $line2){ # reading the file line by line
    # TODO Use split and do the calculations
    my @row = split(/\.\./, $line);
    my @row2 = split(/\D/, $row[1]);
    $coding += $row2[0] - $row[0];
}
print "total amount of protein coding DNA: $coding\n";
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1), but the next number doesn't make any sense to me; it should be (1298-906). What I want in the end, outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of the subtractions from every line, i.e. 1213. But I don't get anything, just a terminal that runs forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

my $total;

while (<>) {
    my ($min, $max) = /(\d+)\.\.(\d+)/;
    next unless $min and $max;
    $total += $max - $min;
}

say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
    # your code here
}

Unexpected behaviour of filehandle in conjunction with subroutine call - why?

I am reading a file line by line and want to process each line through a subroutine. Since I am not interested in the line itself, I put the read from the filehandle directly into the subroutine call. However, this leads to unexpected behaviour I don't quite understand.
I created a minimal example demonstrating the effect:
#!/usr/bin/perl

use strict;
use warnings;
use Carp;
use English qw( -no_match_vars );

print "This works as expected:\n";
open my $TEST1, '<', 'filetest1.txt' or croak "Can't open filetest1.txt - $ERRNO";
my $line1 = <$TEST1>;
print_line( $line1 );
while ( 1 ) {
    last if eof $TEST1;
    $line1 = <$TEST1>;
    print $line1;
}
close $TEST1;

print "\n";

print "Unexpected effect here:\n";
open my $TEST2, '<', 'filetest1.txt' or croak "Can't open filetest1.txt - $ERRNO";
print_line(<$TEST2>); # this works, but just once
while ( 1 ) {
    last if eof $TEST2; # is triggered immediately
    print "This is never printed\n";
    print_line(<$TEST2>);
}
close $TEST2;

sub print_line {
    my $line = shift;
    print $line;
    return 1;
}
Content of filetest1.txt:
test line 1
test line 2
test line 3
test line 4
test line 5
Result of the script:
This works as expected:
test line 1
test line 2
test line 3
test line 4
test line 5
Unexpected effect here:
test line 1
It seems that when I read the next line directly in the subroutine call, it works exactly once and then eof is triggered. I haven't found any explanation for that effect, and it forces me to create the variable $line1 just to pass the line to the subroutine.
I'm looking for an explanation why that happens, and consequently I'd like to understand the limits of what I can or cannot do with a filehandle.
In your code print_line(<$FH>); the read from the filehandle is done in list (wantarray) context, meaning you don't read a single line but the whole file. In your subroutine you then use just the first line and discard the rest. That's why the filehandle is already at EOF when your loop first checks it.
You could just provide the filehandle to the subroutine and read a line there:
open my $FH, '<', 'somefile' or die "Cannot read: $!";

while ( ! eof $FH ) {
    print_line( $FH );
}

sub print_line {
    my ( $fh ) = @_;
    my $line = <$fh>;
    print $line;
}
You have a context problem. The $TEST1 read is in scalar context, so it only reads one line. The $TEST2 read is in list context, so all the lines from the file are read and print_line() is called with a list of them as its arguments. So the second time you try to read from $TEST2 you get EOF, since all the lines were read the first time.
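If you want to keep the read at the call site, a minimal fix (just a sketch of the same point made above) is to force scalar context on the readline:

# read exactly one line per call by forcing scalar context
while ( ! eof $TEST2 ) {
    print_line( scalar <$TEST2> );
}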

Step back by line in file

I'm iterating through a file, and after some condition I have to step back by a line.
When a line matches the regexp, the second while loop kicks in and iterates over the file until its condition fails; after that my code has to STEP BACK by 1 line!
while (my $line = <FL>) {
    if ($line =~ /some regexp/) {
        while ($line =~ /^\+/) {
            $line = <FL>; # Step into next line
        }
        seek(FL, -length($line), 1); # This should get me back the previous line
        # Some tasks with previous line
    }
}
Actually, seek should work, but it doesn't; it returns me the same line... What is the problem?
When you read from a filehandle, it has already advanced to the next line. Therefore if you go back the length of the current line, all you're doing is setting up to read the line over again.
Also, relating the length of a line to its length on disk assumes the encoding is :raw instead of :crlf or some other format. This is a big assumption.
What you need are state variables to keep track of your past values. There is no need to literally roll back a file handle.
The following is a stub of what you might be aiming to do:
use strict;
use warnings;

my @buffer;

while (<DATA>) {
    if (my $range = /some regexp/ ... !/^\+/) {
        if ($range =~ /E/) { # Last line of range
            print @buffer;
        }
    }

    # Save a buffer of the last 4 lines
    push @buffer, $_;
    shift @buffer if @buffer > 4;
}
__DATA__
stuff
more stuff
some regexp
+ a line
+ another line
+ last line
break out
more stuff
ending stuff
Output:
some regexp
+ a line
+ another line
+ last line
What about something like this (as an alternative):
open(my $fh, '<', $file) or die $!; # use three-argument open
my $previous_line = q{}; # initially the previous line is empty

while (my $current_line = <$fh>) {
    chomp $current_line;
    print "$current_line\n";
    print "$previous_line\n";
    # assign current line to previous line before moving to the next line
    $previous_line = $current_line;
}
close($fh);

Storing large file in array is not possible?

There is a process which stores the file in an array. Unfortunately, when the file is too big (let's say 800K lines or more than 60 MB), an error like "Out of memory!" is returned. Is there any solution to this? For example, the following code throws "Out of memory!":
#!/usr/bin/perl
die unless (open (INPUT, "input.txt"));
@file = <INPUT>; # It fails here
print "File stored in array\n"; # It never reaches here

$idx = 0;
while ($idx < @file) {
    $idx++;
}
print "The line count is = $idx\n";
I'd use Tie::File for that:
use Tie::File;

my @file;
tie @file, 'Tie::File', "input.txt";

print "File reflected in array\n";
print "The line count is ", scalar(@file);
Most of the time, you don't need to read in the whole file at once. The readline operator returns only one line at a time when called in scalar context:
1 while <INPUT>; # read a line, and discard it.
say "The line count is = $.";
The $. special variable holds the current line number of the last filehandle that was read from.
Edit: Line counting was just an example
Perl has no problem with large arrays; it just seems that your system doesn't have enough memory. Be aware that Perl arrays use more memory than C arrays, as each scalar allocates additional memory for flags etc., and because arrays grow in increasing steps.
If memory is an issue, you have to transform your algorithm from one that has to load a whole file into memory to one that keeps only one line at a time.
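As a small illustration of that kind of transformation (a made-up task, finding the longest line, processed without ever slurping the file):

use strict;
use warnings;

# Only one line is held in memory at a time.
my ($longest, $longest_len) = ('', 0);
while (my $line = <>) {
    chomp $line;
    if (length($line) > $longest_len) {
        ($longest, $longest_len) = ($line, length $line);
    }
}
print "Longest line has $longest_len characters\n";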
Example: Sorting a multi-gigabyte file. The normal approach print sort <$file> won't work here. Instead, we sort portions of the file, write them to tempfiles, and then switch between the tempfiles in a clever way to produce one sorted output:
use strict; use warnings; use autodie;

my $blocksize = shift @ARGV; # take lines per tempfile as command line arg

mkdir "/tmp/$$"; # $$ is the process ID variable

my $tempcounter = 0;
my @buffer;

my $save_buffer = sub {
    $tempcounter++;
    open my $tempfile, ">", "/tmp/$$/$tempcounter";
    print $tempfile sort @buffer;
    @buffer = ();
};

while (<>) {
    push @buffer, $_;
    $save_buffer->() if $. % $blocksize == 0;
}
$save_buffer->();

# open all files, read 1st line
my @head =
    grep { defined $_->[0] }
    map  { open my $fh, "<", $_; [scalar(<$fh>), $fh] }
    glob "/tmp/$$/*";

# sort the line-file pairs, pick least
while ((my $least, @head) = sort { $a->[0] cmp $b->[0] } @head) {
    my ($line, $fh) = @$least;
    print $line;

    # read next line
    if (defined($line = <$fh>)) {
        push @head, [$line, $fh];
    }
}

# clean up afterwards
END {
    unlink $_ for glob "/tmp/$$/*";
    rmdir "/tmp/$$";
}
Could be called like $ ./sort-large-file 10000 multi-gig-file.txt >sorted.txt.
This general approach can be applied to all kinds of problems. It is a “divide and conquer” strategy: if the problem is too big, solve smaller pieces of it, and then combine the results.