process multiple files in parallel - perl

I have a Perl script which reads two files and processes them.
The first file - info file - I store it as a hash (3.5 gb)
The second file - target file - I am processing by using information from the info file and other subroutines as designed. (This file, target, ranges from 30 - 60 gb)
So far working are:
reading the info file into a hash
breaking the target file into chunks
I want to run on all chunks in parallel:
while(chunks){
# do something
sub a {}
sub b {}
}
So basically, I want to read a chunk, write its output, and do this for multiple chunks at the same time. The while loop reads each line of a chunk file and calls various subroutines for processing.
Is there a way that I can read chunks in the background?
I don't want to read the info file for every chunk, as it is 3.5gb long and I am reading it into a hash, which takes up 3.5gb every time.
Right now the script takes 1 - 2hrs to run for 30-60gb.

You can try using Perl threads if parallel tasks are independent.

A 3.5GB hash is very big; you should consider using a database instead. Depending on how you do this, you can keep accessing the database via the hash.
If memory were not an issue, forking would be the easiest solution. However, this duplicates the process, including the hash, and would only result in unnecessary swapping.
If you cannot free some memory, you should consider using threads. Perl threads (ithreads) run inside a single process, and each thread gets its own copy of every variable unless you declare it :shared (you have to use threads::shared for that). Apart from that, these threads have a similar feel to forking.
See the official Perl threading tutorial
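For comparison, here is a minimal sketch of the forking variant mentioned above: load the hash once, then fork one worker per chunk. The names process_chunks_in_parallel and process_one_chunk, the "$chunk.out" output naming, and the $process_line callback are all made up for illustration. It assumes an OS where fork() is copy-on-write, in which case the children initially share the parent's memory pages rather than copying the hash up front; how well that holds for a 3.5GB Perl hash is something you would have to measure.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical worker: reads one chunk file and writes "$chunk.out".
# $process_line is a code ref taking (line, info hashref) and returning text.
sub process_one_chunk {
    my ($info, $process_line, $chunk) = @_;
    open my $in,  '<', $chunk       or die "Cannot read $chunk: $!";
    open my $out, '>', "$chunk.out" or die "Cannot write $chunk.out: $!";
    while (my $line = <$in>) {
        chomp $line;
        print {$out} $process_line->($line, $info), "\n";
    }
    close $out or die "Cannot close $chunk.out: $!";
    return;
}

# Forks one child per chunk AFTER the info hash has been loaded once.
# Returns the number of children that failed.
sub process_chunks_in_parallel {
    my ($info, $process_line, @chunk_files) = @_;
    my @pids;
    for my $chunk (@chunk_files) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child: handle exactly one chunk, then leave
            my $ok = eval { process_one_chunk($info, $process_line, $chunk); 1 };
            warn $@ unless $ok;
            POSIX::_exit($ok ? 0 : 1);    # skip the parent's END blocks
        }
        push @pids, $pid;
    }
    my $failed = 0;
    for my $pid (@pids) {
        waitpid $pid, 0;
        $failed++ if $? != 0;
    }
    return $failed;
}
```

This starts every chunk at once; with many chunks you would probably want to cap the number of children running at the same time.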

What about the File::Map module (memory mapping)? It can easily read big files.
use strict;
use File::Map qw(map_file);
map_file my $map, $ARGV[0]; # $ARGV[0] - path to your file
# Do something with $map

Related

Share hash across Perl scripts

Is it possible to share a hash created by one Perl script with another Perl script on a Linux machine?
./hash_script.pl # Creates a hash after parsing a file
# Takes several minutes and hash consumes 4Gb of memory
./script1.pl # Reads hash
./script2.pl # Reads hash
I want to create the hash once and use it many times, whenever script1.pl and script2.pl are run.
If your hash_script script dumps its hash into a file somewhere (using Data::Dumper or some other means), you can load that hash in a subsequent script with do.
In script1/script2:
our %sharedhash; #whatever name the hash has in the dumped file
do './hash_dump_file.txt' or die "Couldn't read hash: " . ($@ || $!); # explicit ./ because do searches @INC for bare names
print $sharedhash{stuff};
I would recommend using the Perl module Storable. Storable can take any data structure and store it onto a disk.
use Storable; # It automatically imports all functions. Grrr...
...
store \%hash, $file_name;
However, if this is a 4Gb file, it is probably far too big to be used effectively as a Perl hash. This is why other posts recommend using a SQL or NoSQL database. A hash has to keep all of the data in memory, whereas a SQL or NoSQL database can pull up only the records that are required.
However, try Storable, and see how long it takes.
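To make that concrete, here is a minimal sketch of the round trip: one script stores the hash, the others retrieve it. The helper names save_hash and load_hash are made up; store and retrieve are the Storable functions. Only retrieve files you created yourself, since Storable files are not safe to load from untrusted sources.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# What hash_script.pl might do once the hash is built (hypothetical helper).
sub save_hash {
    my ($hash_ref, $file_name) = @_;
    store($hash_ref, $file_name) or die "Cannot store hash in $file_name\n";
    return;
}

# What script1.pl / script2.pl might do instead of re-parsing the input.
sub load_hash {
    my ($file_name) = @_;
    my $hash_ref = retrieve($file_name)
        or die "Cannot retrieve hash from $file_name\n";
    return $hash_ref;    # a reference to a fresh copy of the stored hash
}
```

Note that retrieve still has to build the whole hash in memory, so this saves the parsing time but not the 4Gb.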
You don't say why you need such a big hash in memory, but probably use of some NoSQL database would be more suitable.
Take a look at Redis or MongoDB.
How about creating a package and loading it in the other scripts? But if its size is about 4Gb, that is too much for this approach. The next option could be memcached or something similar.
Could you write other information about your hash? How are you using it in the other two scripts?
dbmopen / dbmclose allows you to have a normal hash implemented by means of a file on your drive. I've never tried reading a DBM from a different script than the one that created it, but I see no reason why it shouldn't work.
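A minimal sketch of that idea, with one helper that writes the DBM file and one that reads it back, as two separate scripts would. The helper names and the $path prefix are made up for illustration. Which DBM library ends up behind dbmopen depends on how your Perl was built, and plain DBM files only hold flat string values (some backends also limit the size of each entry), so nested structures would need serializing first.

```perl
use strict;
use warnings;

# Writer side, e.g. what hash_script.pl might do (hypothetical helper).
sub write_dbm {
    my ($path, %data) = @_;
    my %db;
    dbmopen(%db, $path, 0666) or die "Cannot open DBM file $path: $!";
    $db{$_} = $data{$_} for keys %data;    # each assignment goes to disk
    dbmclose(%db);
    return;
}

# Reader side, e.g. what script1.pl might do (hypothetical helper).
sub read_dbm_value {
    my ($path, $key) = @_;
    my %db;
    dbmopen(%db, $path, 0666) or die "Cannot open DBM file $path: $!";
    my $value = $db{$key};                 # only this entry is fetched
    dbmclose(%db);
    return $value;
}
```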

Print unique lines of a 10GB file

I have a 10GB file with 200 million lines. I need to get unique lines of this file.
My code:
while(<>) {
chomp;
$tmp{$_}=1;
}
#print...
I only have 2GB memory. How can I solve this problem?
As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.
use DBM::Deep;
tie my %lines, 'DBM::Deep', 'data.db';
while(<>) {
chomp;
$lines{$_}=1;
}
This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):
sort -u file
In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.
One thing to try is the Berkeley Database (BDB) that used to be included in Unix. Now, it's apparently owned by Oracle.
Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.
BDB is pretty robust. Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you won't have wasted too much time on it.
The only other thing I can think of is using an SQL database which would be slower and much more complex.
Addendum
Maybe I'm over thinking this...
I decided to try a simple hash. Here's my program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant DIR => "/usr/share/dict";
use constant WORD_LIST => qw(words web2a propernames connectives);
my %word_hash;
for my $count (1..100) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...
It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.
I'm now downloading BDB from Oracle, compiling it, and installing it. I'll try the same program using BDB and see what happens.
Using a BDB Database
After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:
Download Berkeley DB from Oracle, and you need an Oracle account. I didn't remember my password, and told it to email me. Never got the email. I spent 10 minutes trying to remember my email address.
Once downloaded, it has to be compiled. I found directions for compiling on the Mac, and it seemed pretty straightforward.
Running CPAN crashed. It turns out that CPAN looks for /usr/local/BerkeleyDB, and it was installed as /usr/local/BerkeleyDB.5.3. Creating a link fixed the issue.
All told, it took about half an hour to get BerkeleyDB installed. Once installed, modifying my program was fairly straightforward:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use BerkeleyDB;
use constant {
DIR => "/usr/share/dict",
BDB_FILE => "bdb_file",
};
use constant WORD_LIST => qw(words web2a propernames connectives);
unlink BDB_FILE if -f BDB_FILE(); # parentheses needed: -f treats a bareword as a filehandle
our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
-Filename => BDB_FILE,
-Flags => DB_CREATE
or die qq(Cannot create BDB database file ") . BDB_FILE . qq("\n);
for my $count (1..10) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
All I had to do was add a few lines.
Running the program was a disappointment. It wasn't faster, but much, much slower. It took over 2 minutes while using a pure hash took a mere 13 seconds.
However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.
But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes). Now, if your program wants more memory than is available, it simply is given virtual memory.
So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.
Using BDB simply used a BDB file instead of memory. A modern OS will do the same, and is faster. Why do the work when the OS will handle it for you?
The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resources, use a SQL database on another system, and not a BDB database.
You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do. The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen.
For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match. If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line.
Note, I'm saying "position", not "line number". In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than finding your way to line #1392499.
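The steps above could be sketched roughly like this. The function name print_unique_lines is made up, and the hash code here is simply the first 32 bits of an MD5 digest, which is my choice for the sketch; any cheap checksum would fit the description. Keep in mind that the table of positions itself still takes memory, so with 200 million distinct lines this may not fit in 2GB either.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: prints each distinct line of $in_path to $out_fh once,
# keeping only (hash code => byte offsets) in memory, never the lines.
sub print_unique_lines {
    my ($in_path, $out_fh) = @_;
    open my $scan, '<', $in_path or die "Cannot open $in_path: $!";
    # Second handle on the same file, used to re-read earlier candidates.
    open my $peek, '<', $in_path or die "Cannot open $in_path: $!";
    my %offsets_for;    # hash code => [byte offsets of lines already printed]
    while (1) {
        my $pos  = tell $scan;          # position, not line number
        my $line = <$scan>;
        last unless defined $line;
        chomp $line;
        my $code = unpack 'N', md5($line);   # 32 bits; collisions are fine
        my $seen = 0;
        for my $candidate (@{ $offsets_for{$code} || [] }) {
            seek $peek, $candidate, 0 or die "seek failed: $!";
            my $earlier = <$peek>;
            chomp $earlier;
            if ($earlier eq $line) { $seen = 1; last }
        }
        next if $seen;                  # a real duplicate: skip it
        push @{ $offsets_for{$code} }, $pos;
        print {$out_fh} "$line\n";
    }
    close $scan;
    close $peek;
    return;
}
```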
If you don't care about time/IO constraints, nor disk constraints (e.g. you have 10 more GB space), you can do the following dumb algorithm:
1) Read the file (which sounds like it has 50 character lines). While scanning it, remember the longest line length $L.
2) Analyze the first 3 characters (if you know char #1 is identical - say "[" - analyze the 3 characters in position N that is likely to have more diverse ones).
3) For each line with 3 characters $XYZ, append that line to file 3char.$XYZ and keep the count of how many lines in that file in a hash.
4) When your entire file is split up that way, you should have a whole bunch (if the files are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.
5) Move the original file into "Processed" directory.
6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with new files being 6char.$XYZABC
7) Lather, rinse, repeat. You will end up with one of the 2 options eventually:
8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each (due to its size) can be processed individually by standard "stash into a hash" solution in your question.
8b) Or, most of the files are smaller, but you have exhausted all $L characters while repeating step 7 for the >2GB files, and you still have between 1-4 large files. Guess what - since those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method in your question, since they are not going to contain more than a few distinct lines despite their size!
Please note that this may require - at the worst possible distributions - 10GB * L / 3 disk space, but will ONLY require 20GB disk space if you change step #5 from "move" to "delete".
Voila. Done.
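One pass of the splitting step, plus the per-bucket "stash into a hash" step, might look like the sketch below. The function names and the "bucket.<hex>" file naming are made up (the hex encoding is only there to keep odd characters out of file names), and the recursion on oversized buckets is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: appends each line of $in_path to a bucket file in $dir
# chosen by the $width characters starting at $offset.
# Returns a hash ref of bucket path => number of lines written to it.
sub split_by_prefix {
    my ($in_path, $dir, $offset, $width) = @_;
    my %count_for;
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";
    while (my $line = <$in>) {
        chomp $line;
        my $prefix = length($line) > $offset
                   ? substr($line, $offset, $width)
                   : '';
        my $bucket = "$dir/bucket." . unpack('H*', $prefix);
        # Re-opening per line is slow but never runs out of file descriptors.
        open my $out, '>>', $bucket or die "Cannot open $bucket: $!";
        print {$out} "$line\n";
        close $out or die "Cannot close $bucket: $!";
        $count_for{$bucket}++;
    }
    close $in;
    return \%count_for;
}

# Plain "stash into a hash" deduplication for a bucket that fits in memory.
sub unique_lines_in {
    my ($bucket) = @_;
    my (%seen, @unique);
    open my $in, '<', $bucket or die "Cannot open $bucket: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @unique, $line unless $seen{$line}++;
    }
    close $in;
    return @unique;
}
```

Lines that are equal always share a prefix, so they always land in the same bucket, which is why the buckets can be deduplicated independently.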
As an alternate approach, consider hashing your lines. I'm not a hashing expert but you should be able to compress a line into a hash <5 times line size IMHO.
If you want to be fancy about this, you will do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.
If you have several processor cores, at least 15GB of free disk space, and fast enough storage, you could try this (GNU tools assumed). It sorts the pieces in parallel; the final sort -m -u merges the sorted pieces and drops duplicates that occur in different pieces.
split --lines=100000 -a 4 -d input.file
find . -maxdepth 1 -name "x*" -printf '%f\n' | xargs -P10 -I SPLITTED_FILE sh -c 'sort -u "$1" > "unique.$1"' _ SPLITTED_FILE
sort -m -u unique.* > output.file
rm unique.* x*
You could break your file into ten 1 GB files. Read in one file at a time, sort its lines, and write it back out. Then open all 10 sorted files and merge them into one file (making sure you merge them in the correct order). Open an output file to save the unique lines. Then read the merged file one line at a time, keeping the last line for comparison. If the last line and the current line do not match, write out the last line and save the current line as the last line for comparison. Otherwise get the next line from the merged file. Remember to write out the final saved line at the end. That will give you a file which has all of the unique lines.
It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work.
It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
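The final pass over the merged, sorted file is small enough to sketch. The function name is made up, and this variant writes a line as soon as it differs from the previous one, which gives the same result as the "write the last line" wording above without needing a special case at the end of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: copies lines from an already SORTED input handle to
# the output handle, skipping any line equal to the one just before it.
sub write_unique_from_sorted {
    my ($in_fh, $out_fh) = @_;
    my $last;    # the most recently written line
    while (my $line = <$in_fh>) {
        chomp $line;
        next if defined $last && $line eq $last;    # adjacent duplicate
        print {$out_fh} "$line\n";
        $last = $line;
    }
    return;
}
```

Only one line is held in memory at a time, so the size of the file does not matter for this step.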
Why use perl for this at all? posix shell:
sort | uniq
done, let's go drink beers.

Easiest way to parse a single, large text file across multiple client machines?

I've been given the task of writing a webapp that analyzes text files given a single regular expression. The text files I am given range anywhere from 500MB to 3GB. I am currently using Perl as my parsing engine. I've been reading about mapReduce and Hadoop but it seems like the set up is only worth it given very,very large amounts of data, much larger than the amounts I am parsing.
What would be a good way to go about this? Right now a 500MB file takes anywhere from 4 to 6 minutes to parse, which isn't too bad, but the 3GB files take forever, and the webserver usually times out before it can get output from the Perl script and generate a report.
Let's partition your file into 100 chunks, and use seek to let an arbitrary process work on an arbitrary part of the file.
my $chunk = $ARGV[0]; # a user input, from 0 to 99
my $size = -s $THE_FILE;
my $startByte = int($chunk * $size / 100);
my $endByte = int(($chunk + 1) * $size / 100);
open my $fh, '<', $THE_FILE or die "Cannot open $THE_FILE: $!";
if ($startByte > 0) {
seek $fh, $startByte - 1, 0; # arguments are FILEHANDLE, POSITION, WHENCE
scalar <$fh>; # skip to the next line start; the previous chunk handles the line we landed in
}
# each chunk processes exactly the lines that start inside [startByte, endByte)
while (tell($fh) < $endByte and defined(my $line = <$fh>)) {
# ... process $line ...
}
Now run this program 100 times on whatever machines you have available, passing the arguments 0 to 99 once to each program.
Actually, Hadoop is surprisingly easy to install and use (especially if you don't have huge data and don't need to optimize it). I had a similar task a while ago (processing logs in the range of about 5GB) and it took me no more than a couple of hours to install it on 5 machines, just using the tutorial and docs on their site. Then the programming is really easy: just read from STDIN and write to STDOUT!
Probably making your own split and distribute script (even if you make it on top of something like Gearman) will take more than installing hadoop.

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small but with everything under the __DATA__ section, its size reaches 30 MB. In order to share the work with my friends, I've then packed the script into a stand-alone executable using the PP utility of the PAR::Packer module with the highest compression level 9 and now I have a single-file dictionary app of about the size of 17MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (rather weird goal most of the time these days), then the zip/unzip is the way to go.
However if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example indexed by a first letter), and only load needed chunks.
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical end of an approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to only retrieve the data which is actually needed.
In your case, something simple like SQLite or Berkeley DB/DBM files should probably be OK.
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
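As a rough sketch of the "indexed by first letter" idea without a database: build a small table of byte offsets once, then seek straight to the section you need. The function names are made up, and the sketch assumes the dictionary file is sorted so that all entries with the same first letter are contiguous.

```perl
use strict;
use warnings;

# Hypothetical helper: returns { first letter => byte offset of its first entry }.
sub build_letter_index {
    my ($path) = @_;
    my %offset_for;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (1) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        my $letter = lc substr($line, 0, 1);
        $offset_for{$letter} = $pos unless exists $offset_for{$letter};
    }
    close $fh;
    return \%offset_for;
}

# Hypothetical helper: reads only the entries that start with $letter.
sub entries_for_letter {
    my ($path, $index, $letter) = @_;
    $letter = lc $letter;
    return () unless exists $index->{$letter};
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek $fh, $index->{$letter}, 0 or die "seek failed: $!";
    my @entries;
    while (my $line = <$fh>) {
        chomp $line;
        last if lc(substr($line, 0, 1)) ne $letter;   # next section reached
        push @entries, $line;
    }
    close $fh;
    return @entries;
}
```

The index is tiny compared with the data, so it could be built at startup or saved next to the dictionary file.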
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;
for my $i (1 .. 2) {
print "Run $i\n";
while (defined(my $word = Words->next_word)) {
print "\t$word\n";
}
}
Words.pm
package Words;
use strict;
use warnings;
my $start = tell DATA
or die "could not find current position: $!";
sub next_word {
if (eof DATA) {
seek DATA, $start, 0
or die "could not seek: $!";
return undef;
}
chomp(my $word = scalar <DATA>);
return $word;
}
1;
__DATA__
a
b
c

How can I write compressed files on the fly using Perl?

I am generating relatively large files using Perl. The files I am generating are of two kinds:
Table files, i.e. textual files I print line by line (row by row), which contain mainly numbers. A typical line looks like:
126891 126991 14545 12
Serialized objects I create then store into a file using Storable::nstore. These objects usually contain some large hash with numeric values. The values in the object might have been packed to save on space (and the object unpacks each value before using it).
Currently I'm usually doing the following:
use IO::Compress::Gzip qw(gzip $GzipError);
# create normal, uncompressed file ($out_file)
# ...
# compress file using gzip
my $gz_out_file = "$out_file.gz";
gzip $out_file => $gz_out_file or die "gzip failed: $GzipError";
# delete uncompressed file
unlink($out_file) or die "can't unlink file $out_file: $!";
This is quite inefficient since I first write the large file to disk, then gzip reads it again and compresses it. So my questions are as follows:
Can I create a compressed file without first writing a file to disk? Is it possible to create a compressed file sequentially, i.e. printing line-by-line like in scenario (1) described earlier?
Does Gzip sound like an appropriate choice? Are there any other recommended compressors for the kind of data I have described?
Does it make sense to pack values in an object that will later be stored and compressed anyway?
My considerations are mainly saving on disk space and allowing fast decompression later on.
You can use IO::Zlib or PerlIO::gzip to tie a file handle to compress on the fly.
As for what compressors are appropriate, just try several and see how they do on your data. Also keep an eye on how much CPU/memory they use for compression and decompression.
Again, test to see how much pack helps with your data, and how much it affects your performance. In some cases, it may be helpful. In others, it may not. It really depends on your data.
You can also open() a filehandle to a scalar instead of a real file, and use this filehandle with IO::Compress::Gzip. Haven't actually tried it, but it should work. I use something similar with Net::FTP to avoid creating files on disk.
Since v5.8.0, Perl has been built with PerlIO by default. Unless you've changed this (i.e., Configure -Uuseperlio), you can open filehandles directly to Perl scalars via:
open($fh, '>', \$variable) || ..
from open()
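A related sketch that stays entirely in memory: rather than opening a filehandle to a scalar and handing that to the compressor, it passes a scalar reference straight to IO::Compress::Gzip, which the module accepts as an output target. The helper names are made up; the second one only exists to check the round trip.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: compresses the given lines into an in-memory buffer,
# printing them one at a time just as you would to a file.
sub gzip_lines_to_buffer {
    my (@lines) = @_;
    my $buffer = '';
    my $z = IO::Compress::Gzip->new(\$buffer)
        or die "gzip failed: $GzipError";
    $z->print("$_\n") for @lines;
    $z->close or die "gzip close failed: $GzipError";
    return $buffer;    # a complete gzip stream
}

# Hypothetical helper: decompresses such a buffer back into text.
sub gunzip_buffer {
    my ($buffer) = @_;
    my $text;
    gunzip(\$buffer => \$text) or die "gunzip failed: $GunzipError";
    return $text;
}
```

The buffer could then be uploaded or written out in one go, so no uncompressed file ever touches the disk.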
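IO::Compress::Gzip (part of the IO-Compress distribution) has an OO interface that can be used for this.
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);
my $z = IO::Compress::Gzip->new('out.gz')
or die "gzip failed: $GzipError";
$z->print($_, "\n") for 0 .. 10; # lines are compressed as they are printed
$z->close; # flushes the remaining data and writes the gzip trailer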