Need serious help optimizing script for memory use - perl

[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
Eliminates certain lines depending on the content of $item.
Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
Performs a reverse natural sort on the resulting array.
Prints the results to an output file.
The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it in to the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.)
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions (
"input|i=s" => \$input_file_name,
"output|o=s" => \$output_file_name,
);
# ENSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
die
"\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
# STRIP EXTENSION FROM INPUT FILE NAME
$in_basename = $input_file_name;
$in_basename =~ s/(.+)\.(.+)/$1/;
# GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
$output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
chomp;
# PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
( $frequency, $item ) = split( /$split_char/, $_, 2 );
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
# PROCESS INPUT LINES
$F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
|| die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno

UPDATE: In my tests a hash takes 2.5 times the memory that its data "alone" would take. However, the program size for me is consistently 3-4 times as large as its variables. This would turn a 6.3 GB data file into a ~15 GB hash, for a ~60 GB program, just as reported in comments.
So 6.3 GB of data becomes 60 GB of process memory, so to speak. This still improved the starting situation enough to work for the current problem, but it is clearly not a solution. See the (updated) "Another approach" below for a way to run this processing without loading the whole hash.
There is nothing obvious that would lead to an order-of-magnitude memory blowup. However, small errors and inefficiencies can add up, so let's first clean up. See other approaches at the end.
Here is a simple re-write of the core of the program, to try first.
# ... set filenames, variables
open my $fh_in, '<:encoding(utf8)', $input_file_name
or die "\nERROR: Can't open input file ($input_file_name): $!";
my %F;
while (<$fh_in>) {
chomp;
s/^\s*//; #/trim leading space
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
# ... increment counters and aggregates and add to hash
# (... any other processing?)
$F{ lc($item) } += $frequency;
}
close $fh_in;
# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name): $!";
foreach my $item ( sort {
$F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b
} keys %F )
{
print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;
A few comments, let me know if more is needed.
Always add $! to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh rather than IN): they are scoped like any other variable and are closed automatically when they go out of scope.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).
The sort here has to at least copy its input, and with multiple conditions more memory is needed.
That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.
Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.
Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.
But even this pushes the limits, due to the greatly increased memory footprint as compared to the data, since a hash takes 2.5 times the data size and the whole program is 3-4 times as large again.
Then find an algorithm to write the data line by line to the output file. That is simple to do here, since by the shown processing we only need to accumulate frequencies for each item:
use feature 'say'; # say (used below) is not enabled by default
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name): $!";
my $cumulative_freq;
while (<$fh_in>) {
chomp;
s/^\s*//; #/ leading only
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
$cumulative_freq += $frequency; # would-be hash value
# Add a sort criterion, $item's length, helpful for later sorting
say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item);
#say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;
Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal
sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3- > result
The command sorts numerically by the first field and then by the second (remaining ties are broken by comparing whole lines), all in reverse order. This is piped into cut, which pulls out the first field and everything from the third field on from STDIN (with tab as the default delimiter), which is the needed result.
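If identical items still need to be merged after an external sort, adjacent lines with the same key can be summed in one streaming pass, so only one item is held in memory at a time. What follows is a minimal sketch, not the code used above: it assumes the intermediate file holds frequency<TAB>item lines that have already been sorted by the item part (in bash, for example, with LC_ALL=C sort -t $'\t' -k2), and sum_sorted is a hypothetical helper name.

```perl
use strict;
use warnings;

# Read "frequency\titem" lines that are already sorted by item and
# write one "total\titem" line per distinct item. Only the current
# item and its running total are kept in memory.
sub sum_sorted {
    my ($in, $out) = @_;
    my ($current, $total);
    while (my $line = <$in>) {
        chomp $line;
        my ($frequency, $item) = split /\t/, $line, 2;
        next unless defined $item;
        # A new item starts: flush the total of the previous one.
        if (defined $current and $item ne $current) {
            print {$out} "$total\t$current\n";
            $total = 0;
        }
        $current = $item;
        $total  += $frequency;
    }
    # Flush the last item, if there was any input at all.
    print {$out} "$total\t$current\n" if defined $current;
    return;
}
```

Called as sum_sorted(\*STDIN, \*STDOUT) in a small script, this can sit between the sort by item and a final sort -nr by total.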
A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
I used Devel::Size to see memory used by variables.

Sorting input requires keeping all input in memory, so you can't do everything in a single process.
However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.
So just keep the program you have, but add something around it:
partition your input into buckets, e.g. by the first character or the first two characters
run your program on each bucket
concatenate the output in the right order
Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.
You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.
A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
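The partitioning step can be sketched as follows. This is only an illustration under assumptions: lines have the frequency<TAB>item layout from the question, the input handle already decodes UTF-8, the bucket is the lowercased first character of the item, and bucket_for, partition and the "_" catch-all bucket are hypothetical names. How a bucket name maps to a file (or a pipe) is left to a caller-supplied callback.

```perl
use strict;
use warnings;

# Bucket name for a "frequency\titem" line: the lowercased first
# character of the item, so case variants share a bucket. Lines
# without an item go to a hypothetical catch-all bucket "_".
sub bucket_for {
    my ($line) = @_;
    chomp(my $copy = $line);
    my (undef, $item) = split /\t/, $copy, 2;
    return '_' unless defined $item and length $item;
    return lc substr($item, 0, 1);
}

# Copy every input line to the handle of its bucket. $open_bucket is
# a callback that returns a writable handle for a bucket name; each
# bucket is opened once and its handle is cached.
sub partition {
    my ($in, $open_bucket) = @_;
    my %handle;
    while (my $line = <$in>) {
        my $bucket = bucket_for($line);
        $handle{$bucket} //= $open_bucket->($bucket);
        print { $handle{$bucket} } $line;
    }
    close $_ for values %handle;
    my @names = sort keys %handle;
    return @names;
}
```

The existing program can then be run once per returned bucket name, and the per-bucket outputs concatenated in the required order.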

Related

Reading lines of a file into a hash parallel in Perl

I have thousands of files. My goal is to insert the lines of those files into a hash (a large share of those lines repeat).
For now, I iterate through an array of files, and for each file I open it and split each row (each row has the format <path>,<number>).
Then I insert into the %paths hash. I also write each line into one main file (trying to save time by combining).
Piece of my code:
open(my $fh_main, '>', "$main_file") or die;
foreach my $dir (@dirs)
{
my $test = $dir."/"."test.csv";
open(my $fh, '<', "$test") or die;
while (my $row = <$fh>)
{
print $fh_main $row;
chomp($row);
my ($path,$counter) = split(",",$row);
my $abs_path = abs_path($path);
$paths{$abs_path} += $counter;
}
close ($fh);
}
close ($fh_main);
Because there are so many files, I would like to split the iteration into at least two halves. I thought of using the Parallel::ForkManager module
to insert the files in parallel into a hash A and a hash B (if possible, more than two hashes).
Then I can combine those two (or more) hashes into one main hash. Memory should not be an issue on the machine I'm running on.
I read the documentation, but every attempt failed and each iteration ran on its own. I would like to see an initial example of how I should solve this.
Also, I would like to hear other opinions on how to implement this in a cleaner and wiser way.
Edit: maybe I didn't understand what exactly the module does. I would like to fork in the script so that one half of the files is collected by process 1 and the other half by process 2. The first one to finish will write to a file and the other one will read from it. Is this possible to implement? Will it reduce the run time?
Try MCE::Map. It will automatically gather the output of the sub-processes into a list, which in your case can be a hash. Here's some untested pseudocode:
use MCE::Map qw[ mce_map ];
# note that MCE passes the argument via $_, not @_
sub process_file {
my $file = $_;
my %result_hash;
# ... fill hash ...
return %result_hash;
}
my %result_hash = mce_map \&process_file, \@list_of_files;
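If no extra module can be installed, the same split-and-merge idea can also be sketched with core Perl only, using fork and pipes. The sketch below rests on assumptions: rows have the <path>,<number> format from the question, paths contain no tab characters, and abs_path is left out for brevity; count_files and count_in_parallel are hypothetical names. Note that the parent adds the partial counts together. Flattening several hashes into one list and assigning it to a hash would instead keep only the last value seen for each duplicate key.

```perl
use strict;
use warnings;
use POSIX ();

# Sum "<path>,<number>" rows from a list of files into one hash.
sub count_files {
    my @files = @_;
    my %count;
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $row = <$fh>) {
            chomp $row;
            my ($path, $n) = split /,/, $row, 2;
            next unless defined $n;
            $count{$path} += $n;
        }
        close $fh;
    }
    return \%count;
}

# Deal the files out to $workers child processes. Each child writes
# its partial counts as "path\tcount" lines to a pipe; the parent
# reads every pipe and adds the partial counts together.
sub count_in_parallel {
    my ($workers, @files) = @_;
    my @chunks;
    push @{ $chunks[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

    my @readers;
    for my $chunk (grep { defined } @chunks) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                 # child process
            close $reader;
            my $partial = count_files(@$chunk);
            print {$writer} "$_\t$partial->{$_}\n" for keys %$partial;
            close $writer;
            POSIX::_exit(0);             # skip the parent's END blocks
        }
        close $writer;                   # parent keeps the read end only
        push @readers, [ $pid, $reader ];
    }

    my %total;
    for my $r (@readers) {
        my ($pid, $reader) = @$r;
        while (my $line = <$reader>) {
            chomp $line;
            my ($path, $n) = split /\t/, $line, 2;
            $total{$path} += $n;
        }
        close $reader;
        waitpid $pid, 0;
        die "worker $pid failed" if $?;
    }
    return \%total;
}
```

Whether this reduces the run time depends on where the time goes: if the disk is the bottleneck, more processes may not help.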

Parsing huge text file in Perl

I have a genome file of about 30 GB, similar to the one below:
>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC
I am trying to parse the file quickly,
reading it character by character with the code below,
but the characters are not getting printed:
open (FH,"<:raw",'genome.txt') or die "cant open the file $!\n";
until ( eof(FH) ) {
$ch = getc(FH);
print "$ch\n";# not printing ch
}
close FH;
Your mistake is forgetting an eof:
until (eof FH) { ... }
But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible, and as much work done inside perl internals as we can get. This means that reading a file character by character is slow.
Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.
Either pick a natural record delimiter (like \n), or read a certain number of bytes:
local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
# do something with the bytes
}
(see perlvar)
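As an illustration of reading fixed-size records, here is a minimal sketch that counts the four bases per chunk with tr///, so the per-character work happens inside perl's internals rather than in a Perl-level loop. The task (counting bases) and the name count_bases are assumptions made for the example; the code counts every byte it sees, so any ">" header lines are not skipped.

```perl
use strict;
use warnings;

# Count A, C, G and T in a filehandle by reading fixed-size records.
# Setting $/ to a reference to an integer makes readline return at
# most that many bytes per call (see perlvar).
sub count_bases {
    my ($fh, $chunk_size) = @_;
    local $/ = \$chunk_size;
    my %count = (A => 0, C => 0, G => 0, T => 0);
    while (my $chunk = <$fh>) {
        # tr/// without a replacement list just counts matches.
        $count{A} += ($chunk =~ tr/A//);
        $count{C} += ($chunk =~ tr/C//);
        $count{G} += ($chunk =~ tr/G//);
        $count{T} += ($chunk =~ tr/T//);
    }
    return \%count;
}
```

With a real file the handle would be opened with '<:raw' and a much larger chunk size (for example 65536) than the 256 bytes shown above.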
You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.
# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.
sysopen FH, "genome.txt", O_RDONLY or die;
my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
print($char .= "\n"); # appending should be better than concatenation.
}
If we have gone that far, using Inline::C is just a small and possibly preferable step.

Sorting Huge Hashes in Perl

I am analyzing the frequency of occurrence of groups of words which occur together in a sentence.
Each group consists of 3 words and we have to calculate their frequency.
Example: This is a good time to party because this is a vacation time.
Expected Output:
this is a - 2
is a good - 1
a good time - 1
and so on.
I have written a script which works well: it prints the frequencies, sorted in descending order.
It works by reading one line at a time from the file. Each line is converted to lowercase and split into an array of individual words.
Then we pick 3 words at a time, starting from the left, and build a hash storing the count of occurrences. Once done, we shift the leftmost element off the array and repeat as long as the array holds more than 3 words.
Question Updated:
The problem is I want to use this script on a file consisting of more than 10 million lines.
After running some tests I observed that it will not work if the number of lines in the input file is more than 400K.
How can I make this script more memory efficient?
Thanks to fxzuz for his suggestions but now I want to make this script work with larger files :)
#!/usr/bin/perl
use strict;
use warnings;
print "Enter the File name: ";
my $input = <STDIN>;
chomp $input;
open INPUT, '<', $input
or die("Couldn't open the file, $input with error: $!\n");
my %c;
while (my $line = <INPUT>) {
chomp $line;
my @x = map lc, split /\W+/, join "", $line;
while (@x>3) {
$c{"@x[0..2]"}++;
shift @x;
}
}
foreach $key (sort {$c{$b} <=> $c{$a}} keys %c) {
if($c{$key} > 20) {
print $key." - ".$c{$key}."\n";
}
}
close INPUT;
This works well and prints the groups of words in descending order of frequency. It only prints those groups of words which occur more than 20 times.
Now, how do I make this work on a file consisting of more than 1 million or 10 million lines?
I also checked the memory and CPU usage of perl while running this script using the top command in Linux, and observed that the CPU usage reaches 100% and the memory usage is close to 90% while the script runs on a file consisting of 400K lines.
So it is not feasible to make it work with a file consisting of 1 million lines, because the perl process will hang.
How can I make this code more memory efficient?
Apparently, your code is written correctly and will work, but only as long as your data set is not VERY big. If you have a lot of input data (and seems like you DO), it is possible that sorting phase may fail due to lack of memory. If you cannot increase your memory, the only solution is to write your data to disk - in text or database format.
Text format: you can simply write your triplets into a text file as you go, one line per triplet. Doing this will increase the output size by a factor of 3, but it should still be manageable. Then you can use the command-line GNU sort and uniq tools to get your desired counts, something like this:
text2triplet.pl <input.txt | sort | uniq -c | sort -rn | head -10000
(you may want to store the output of sort in a file rather than pipe it if it is very big)
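The text2triplet.pl script is not shown in the answer; a minimal sketch of what it might look like follows. The function names triplets and write_triplets are hypothetical, and the word splitting mirrors the question's split /\W+/. Nothing is accumulated in memory: each triplet is printed as soon as it is formed.

```perl
use strict;
use warnings;

# Return every run of three consecutive lowercased words in a line.
sub triplets {
    my ($line) = @_;
    # grep drops the empty leading field that split produces when
    # the line starts with a non-word character.
    my @words = grep { length } map { lc } split /\W+/, $line;
    my @out;
    for my $i (0 .. $#words - 2) {
        push @out, join ' ', @words[ $i .. $i + 2 ];
    }
    return @out;
}

# Stream triplets from one handle to another, one per line.
sub write_triplets {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} "$_\n" for triplets($line);
    }
    return;
}
```

A script would call write_triplets(\*STDIN, \*STDOUT). Unlike a while (@x > 3) loop, the index range also emits the final three words of a line.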
Database format: use DBD::SQLite and create table like this:
CREATE TABLE hash (triplet VARCHAR, count INTEGER DEFAULT 0);
CREATE INDEX idx1 ON hash (triplet);
CREATE INDEX idx2 ON hash (count);
INSERT your triplets into that table as you go, and increase count for duplicates. After data is processed, simply
SELECT * FROM hash
WHERE count > 20
ORDER BY count DESC
and print it out.
Then you can DROP your hash table or simply delete whole SQLite database altogether.
Both of these approaches should allow you to scale to almost the size of your disk, but the database approach may be more flexible.
You have some problems with declaring and using variables. Please add the use strict pragma to your script. Use a lexical (my) variable for the hash key in the for loop and elsewhere. I notice that you have the statement if($c{$key} > 20), but the hash values are <= 2 for the sample input.
#!/usr/bin/perl
use strict;
my %frequency;
while (my $line = <DATA>) {
chomp $line;
my @words = map lc, split /\W+/, $line;
while (@words > 3) {
$frequency{"@words[0,1,2]"}++;
shift @words;
}
}
# sort by values
for my $key (sort {$frequency{$b} <=> $frequency{$a}} keys %frequency) {
printf "%s - %s\n", $key, $frequency{$key};
}
__DATA__
This is a good time to party because this is a vacation time.
OUTPUT
this is a - 2
to party because - 1
is a good - 1
time to party - 1
party because this - 1
because this is - 1
good time to - 1
is a vacation - 1
a good time - 1

get last few lines of file stored in variable

How could I get the last few lines of a file that is stored in a variable? On linux I would use the tail command if it was in a file.
1) How can I do this in perl if the data is in a file?
2) How can I do this if the content of the file is in a variable?
To read the end of a file, seek near the end of the file and begin reading. For example,
open my $fh, '<', $file or die "Can't open $file: $!";
seek $fh, -1000, 2; # 2 = SEEK_END: start 1000 bytes before the end
my @lines = <$fh>;
close $fh;
print "Last 5 lines of $file are: ", @lines[-5 .. -1];
Depending on what is in the file or how many lines you want to look at, you may want to use a different magic number than -1000 above.
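One way to avoid guessing the magic number is to grow the window until it holds enough lines. This is a minimal sketch under assumptions: tail_lines is a hypothetical helper, the starting window of 4096 bytes is arbitrary, and the file is read as raw bytes (decode the returned lines afterwards if the file is UTF-8).

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Return the last $n lines of $file, reading only a window at the
# end. The window doubles until it holds $n complete lines or
# covers the whole file.
sub tail_lines {
    my ($file, $n, $window) = @_;
    $window ||= 4096;
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    my $size = -s $fh;
    while (1) {
        my $start = $size > $window ? $size - $window : 0;
        seek $fh, $start, SEEK_SET or die "Can't seek in $file: $!";
        my @lines = <$fh>;
        # Unless we started at byte 0, the first line may be cut
        # off, so it is discarded.
        shift @lines if $start > 0;
        if (@lines >= $n or $start == 0) {
            close $fh;
            my $from = @lines > $n ? @lines - $n : 0;
            return @lines[ $from .. $#lines ];
        }
        $window *= 2;
    }
}
```

If the file has fewer than $n lines, all of them are returned.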
You could do something similar with a variable, either
open my $fh, '<', \$the_variable;
seek $fh, -1000, 2;
or just
open my $fh, '<', \substr($the_variable, -1000);
will give you an I/O handle that produces the last 1000 characters in $the_variable.
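When an exact number of lines is wanted from a string rather than a fixed number of characters, stepping backwards over newlines with rindex avoids splitting the whole string. A minimal sketch, with tail_of_string as a hypothetical helper name and "\n" assumed as the line ending:

```perl
use strict;
use warnings;

# Return the last $n lines of $str as one string, without copying
# or splitting the part of the string that comes before them.
sub tail_of_string {
    my ($str, $n) = @_;
    return '' if $n <= 0;
    my $pos = length $str;
    # A final newline ends the last line; it does not start a new one.
    $pos-- if $pos > 0 and substr($str, -1) eq "\n";
    for (1 .. $n) {
        return $str if $pos <= 0;          # ran out of string
        $pos = rindex($str, "\n", $pos - 1);
        return $str if $pos < 0;           # fewer than $n lines
    }
    return substr($str, $pos + 1);
}
```

For example, tail_of_string($the_variable, 5) returns the last five lines, or the whole string if it has fewer.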
The File::ReadBackwards module on the CPAN is probably what you want. You can use it thus. This will print the last three lines in the file:
use File::ReadBackwards;
my $bw = File::ReadBackwards->new("some_file");
print reverse map { $bw->readline() } (1 .. 3);
Internally, it seek()s to near the end of the file and looks for line endings, so it should be fairly efficient with memory, even with very big files.
To some extent, that depends how big the file is, and how many lines you want. If it is going to be very big you need to be careful, because reading it all into memory will take a lot longer than just reading the last part of the file.
If it is small, the easiest way is probably to File::Slurp it into memory, split by record delimiters, and keep the last n records. In effect, something like:
# first line if not yet in a string
my $string = File::Slurp::read_file($filename);
my @lines = split(/\n/, $string);
print join("\n", @lines[-10..-1]);
If it is large, too large to fit into memory, you might be better off using file system operations directly. When I did this, I opened the file, used seek() to read the last 4k or so of the file, and repeated backwards until I had enough data to get the number of records I needed.
Not a detailed answer, but the question could be a touch more specific.
I know this is an old question, but I found it while looking for a way to search for a pattern in the first and last k lines of a file.
For the tail part, in addition to seek (if the file is seekable), it saves some memory to use a rotating buffer, as follows (returns the last k lines, or less if fewer than $k are available):
my $i = 0; my @a;
while (<$fh>) {
$a[$i++ % $k] = $_;
}
my @tail = splice @a,0,$i % $k;
splice @a,@a,0,@tail;
return @a;
A lot has already been stated on the file side, but if it's already in a string, you can use the following regex:
my ($lines) = $str =~ /
(
(?:
(?:(?<=^)|(?<=\n)) # match beginning of line (separated due to variable lookbehind limitation)
[^\n]*+ # match the line
(?:\n|$) # match the end of the line
){0,5}+ # match at least 0 and at most 5 lines
)$ # match must be from end of the string
/sx; # s = treat string as single line
# x = allow whitespace and comments
This runs extremely fast. Benchmarking shows it to be between 40% and 90% faster than the split/join method (the range is due to the current load on the machine). This is presumably due to fewer memory manipulations. Something you might want to consider if speed is essential. Otherwise, it's just interesting.

Searching/reading another file from awk based on current file's contents, is it possible?

I'm processing a huge file with (GNU) awk, (other available tools are: Linux shell tools, some old (>5.0) version of Perl, but can't install modules).
My problem: if some field1, field2, field3 contain X, Y, Z I must search for a file in another directory which contains field4, and field5 on one line, and insert some data from the found file to the current output.
E.g.:
Actual file line:
f1 f2 f3 f4 f5
X Y Z A B
Now I need to search for another file (in another directory), which contains e.g.
f1 f2 f3 f4
A U B W
And write to STDOUT $0 from the original file, and f2 and f3 from the found file, then process the next line of the original file.
Is it possible to do it with awk?
Let me start out by saying that your problem description isn't really that helpful. Next time, please just be more specific: You might be missing out on much better solutions.
So from your description, I understand you have two files which contain whitespace-separated data. In the first file, you want to match the first three columns against some search pattern. If found, you want to find all lines in another file which contain the fourth and fifth column of the matching line in the first file. From those lines, you need to extract the second and third column and then print the first column of the first file and the second and third from the second file. Okay, here goes:
#!/usr/bin/env perl -nwa
use strict;
use File::Find 'find';
my @search = qw(X Y Z);
# if you know in advance that the otherfile isn't
# huge, you can cache it in memory as an optimization.
# with any more columns, you want a loop here:
if ($F[0] eq $search[0]
and $F[1] eq $search[1]
and $F[2] eq $search[2])
{
my @files;
find(sub {
return if not -f $_;
# verbatim search for the columns in the file name.
# I'm still not sure what your file-search criteria are, though.
push @files, $File::Find::name if /\Q$F[3]\E/ and /\Q$F[4]\E/;
# alternatively search for the combination:
#push @files, $File::Find::name if /\Q$F[3]\E.*\Q$F[4]\E/;
# or search *all* files in the search path?
#push @files, $File::Find::name;
}, '/search/path'
);
foreach my $file (@files) {
open my $fh, '<', $file or die "Can't open file '$file': $!";
while (defined($_ = <$fh>)) {
chomp;
# order of fields doesn't matter per your requirement.
my @cols = split ' ', $_;
my %seen = map {($_=>1)} @cols;
if ($seen{$F[3]} and $seen{$F[4]}) {
print join(' ', $F[0], @cols[1,2]), "\n";
}
}
close $fh;
}
} # end if matching line
Unlike another poster's solution which contains lots of system calls, this doesn't fall back to the shell at all and thus should be plenty fast.
This is the type of work that got me to move from awk to perl in the first place. If you are going to accomplish this, you may actually find it easier to create a shell script that creates awk script(s) to query and then update in separate steps.
(I've written such a beast for reading/updating windows-ini-style files - it's ugly. I wish I could have used perl.)
I often see the restriction "I can't use any Perl modules", and when it's not a homework question, it's often just due to a lack of information. "Yes, even you can use CPAN" contains the instructions on how to install CPAN modules locally without having root privileges. Another alternative is just to take the source code of a CPAN module and paste it into your program.
None of this helps if there are other, unstated, restrictions, like lack of disk space that prevent installation of (too many) additional files.
This seems to work for some test files I set up matching your examples. Involving perl in this manner (interposed with grep) is probably going to hurt the performance a great deal, though...
## perl code to do some dirty work
for my $line (`grep 'X Y Z' myhugefile`) {
chomp $line;
my ($a, $b, $c, $d, $e) = split(/ /,$line);
my $cmd = 'grep -P "' . $d . ' .+? ' . $e .'" otherfile';
for my $from_otherfile (`$cmd`) {
chomp $from_otherfile;
my ($oa, $ob, $oc, $od) = split(/ /,$from_otherfile);
print "$a $ob $oc\n";
}
}
EDIT: Use tsee's solution (above), it's much more well-thought-out.