Is it okay to keep huge data in a Perl data structure? - perl

I am receiving some CSVs from a client. The average size of these CSVs is 20 MB.
The format is:
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
My current approach:
I store all these records temporarily in a table, and then query the table:
where customer='customer1' and product='product1'
where customer='customer1' and product='product2'
where customer='customer2' and product='product1'
Problem: inserting into the DB and then selecting takes too much time. A lot of stuff is happening and it takes 10-12 minutes to process one CSV. I am currently using SQLite and it is quite fast, but I think I'll save some more time if I remove the insertion and selection altogether.
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
The machine generally has 500MB+ free RAM.

If the query you show is the only kind of query you want to perform, then this is rather straightforward.
my $orders; # I guess
while (my $row = <DATA>) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{$fields[0]}->{$fields[1]} }, \@fields; # or as a hashref, but that's larger
}
print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] }; # typo in cuStomer
__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
You just build an index as a hash reference that is several levels deep. The first level is keyed by customer; each value is another hashref keyed by product, which holds the list of rows that match this index. Then you can decide whether you just want the whole row as an array ref, or whether you want to store a hash ref with named keys instead. I went with an array ref because that consumes less memory.
Later you can query it easily. I included that above. Here's the output.
Cutomer1
Product1
cat1
many
other
info
If you don't want to remember indexes but have to code a lot of different queries, you could make variables (or even constants) that represent the magic numbers.
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
# build $orders ...
my $res = $orders->{Cutomer1}->{Product2}->[1];
print "Category: " . $res->[CATEGORY];
The output is:
Category: cat2
To order the result, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that.
for my $res (
    sort { $a->[OTHER] cmp $b->[OTHER] }
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}
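For example, a minimal two-column sort could look like this (using CATEGORY as the secondary column is just my choice for illustration):
for my $res (
    sort {
        $a->[OTHER]    cmp $b->[OTHER]
            ||
        $a->[CATEGORY] cmp $b->[CATEGORY]
    }
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}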
However, you can only search by Customer and Product like this.
If there is more than one type of query, this gets expensive. If you also wanted to group them by category only, you would either have to iterate over all of them every single time you look something up, or build a second index. Doing that is harder than waiting a few extra seconds, so you probably don't want to do that.
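If you did decide to build one, a second index can reuse the same row references, so it only costs the index structure itself; a rough sketch, built in the same loop as above:
my $by_category;
while (my $row = <DATA>) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
    push @{ $by_category->{ $fields[2] } },            \@fields;   # same rows, second way in
}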
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
For this specific purpose, absolutely. 20 megabytes is not a lot.
I've created a test file that is 20004881 bytes and 447848 lines with this code, which is not perfect, but gets the job done.
use strict;
use warnings;
use feature 'say';
use File::stat;
open my $fh, '>', 'test.csv' or die $!;
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}
Here is an excerpt of the file:
$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info
Now let's run our above program with this input file and look at the memory consumption and some statistics of the size of the data structure.
use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
open my $fh, '<', 'test.csv' or die $!;
my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
}
say 'total size of $orders: ' . total_size($orders);
Here it is:
total size of $orders: 185470864
So that variable consumes 185 megabytes. That's a lot more than the 20MB of CSV, but we have an easily searchable index. Using htop I figured out that the actual process consumes 287MB. My machine has 16G of memory, so I don't care about that. And at about 3.6s it's reasonably fast to run this program, but I have an SSD and a newish Core i7 machine.
But it will not eat all your memory if you have 500MB to spare. Likely an SQLite approach would consume less memory, but you have to benchmark the speed of this versus the SQLite approach to decide which one is faster.
I used the method described in this answer to read the file into an SQLite database [1]. I needed to add a header line to the file first, but that's trivial.
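For instance, something like this one-liner could prepend a header (the column names are my assumption):
$ perl -i -pe 'print "customer,product,category,many,other,info\n" if $. == 1' test.csv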
$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test
Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.
sqlite> CREATE INDEX foo ON test ( customer, product );
This felt like it took another one second. Now I could query.
sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info
The result appeared instantaneously (which is not scientific!). Since we didn't measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.
However, the SQLite file size is only 38839296 bytes, which is about 39MB. That's bigger than the CSV file, but not by a lot. It seems like the sqlite3 process only consumes about 30kB of memory, which I find weird given the index.
In conclusion, SQLite seems to be a bit more convenient and eats less memory. There is nothing wrong with doing this in Perl, and it might be the same speed, but using SQL for this type of query feels more natural, so I would go with SQLite.
If I might be so bold, I would assume you didn't set an index on your table when you did it in SQLite, and that is what made it take so long. The number of rows we have here is not that much, even for SQLite. Properly indexed, it's a piece of cake.
If you don't actually know what an index does, think about a phone book. It has the index of first letters on the sides of the pages. To find John Doe, you grab D, then somehow look. Now imagine there was no such thing. You need to randomly poke around a lot more. And then try to find the guy with the phone number 123-555-1234. That's what your database does if there is no index.
[1] If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the DB, then use Perl's DBI to do the querying. As an example, sqlite3 foo.db <<<'.tables\ .tables' (where the backslash \ represents a literal line break) prints the list of tables twice, so importing like this will work, too.
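For the DBI querying step mentioned here, a rough sketch might look like the following (the column names assume the header line added earlier):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT * FROM test WHERE customer = ? AND product = ?');
$sth->execute('Customer23', 'Product1');
while (my $row = $sth->fetchrow_arrayref) {
    print join(',', @$row), "\n";
}
$dbh->disconnect;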

Related

CSV manipulation AWK?

I have two CSV files, one has a long list of reference numbers, the other a daily list of orders.
On a daily basis I need to cut & paste from the reference numbers into the daily orders. Obviously I only cut as many reference numbers as there are orders, so for example if there are 20 orders I need to get 20 reference numbers from the other file and paste them into my orders file. I cut these numbers so we don't get duplicates on the next day's run.
I want to automate this process but I don't know the best way. I am running Windows and have used AWK for some other CSV manipulation, but I'm not very experienced with it and I am not sure if this is possible, so I am just asking if anybody has ideas for the best solution.
Parsing CSV properly is very tricky business. Most of the difficulty comes from handling quotes, double quotes, embedded commas, spaces, etc. in your content.
Rather than reinventing the wheel, I would recommend using a well-tested library. I don't think AWK has one, but Perl does: DBD::CSV.
On Windows, simply install ActivePerl; it already has DBD::CSV installed by default.
Then, use Perl code like this to retrieve your data and convert it to some other format inside the while loop:
use DBI;
my $dbh = DBI->connect("dbi:CSV:f_ext=.csv") or die $DBI::errstr;
my $sth = $dbh->prepare("SELECT * FROM mytable"); # access mytable.csv
$sth->execute();
while (my @row = $sth->fetchrow_array()) {
    print "id: $row[0], name: $row[1]\n";
}
# you can also access columns by name, like this:
# while (my $row = $sth->fetchrow_hashref()) {
#     print "id: $row->{id}, name: $row->{name}\n";
# }
$sth->finish();
$dbh->disconnect();
Since you mention you have two input CSV files, you might even be able to use SQL JOIN statements to get data from both tables properly joined at once.
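For example, a join could be sketched roughly like this (file and column names are assumptions, and DBD::CSV's SQL support is limited, so the exact join syntax may need adjusting):
use strict;
use warnings;
use DBI;

# Assumes orders.csv and refnumbers.csv in the current directory,
# both with an "id" column (taken from their header lines) to join on.
my $dbh = DBI->connect("dbi:CSV:f_ext=.csv") or die $DBI::errstr;
my $sth = $dbh->prepare(
    "SELECT orders.id, orders.item, refnumbers.refnum
       FROM orders INNER JOIN refnumbers
         ON orders.id = refnumbers.id"
);
$sth->execute();
while (my @row = $sth->fetchrow_array()) {
    print join(',', @row), "\n";
}
$sth->finish();
$dbh->disconnect();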

Sorting Huge Hashes in Perl

I am analyzing the frequency of occurrence of groups of words which occur together in a sentence.
Each group consists of 3 words and we have to calculate their frequency.
Example: This is a good time to party because this is a vacation time.
Expected Output:
this is a - 2
is a good - 1
a good time - 1
and so on.
I have written a script which works well; it prints the frequencies sorted in descending order.
It works by reading one line at a time from the file. For each line, it converts the text to lowercase, splits it into individual words and then forms an array out of them.
Then, we pick 3 words at a time starting from the left and keep building a hash storing the count of occurrences. Once done, we shift off the leftmost element of the array and repeat as long as the array contains more than 3 words.
Question Updated:
The problem is I want to use this script on a file consisting of more than 10 million lines.
After running some tests I observed that it will not work if the number of lines in the input file is more than 400K.
How can I make this script more memory efficient?
Thanks to fxzuz for his suggestions but now I want to make this script work with larger files :)
#!/usr/bin/perl
use strict;
use warnings;

print "Enter the File name: ";
my $input = <STDIN>;
chomp $input;

open INPUT, '<', $input
    or die("Couldn't open the file, $input with error: $!\n");

my %c;
while (my $line = <INPUT>) {
    chomp $line;
    my @x = map lc, split /\W+/, $line;
    while (@x > 3) {
        $c{"@x[0..2]"}++;
        shift @x;
    }
}

foreach my $key (sort { $c{$b} <=> $c{$a} } keys %c) {
    if ($c{$key} > 20) {
        print $key . " - " . $c{$key} . "\n";
    }
}

close INPUT;
This works well and prints the groups of words in descending order of frequency. It only prints those groups of words which occur more than 20 times.
Now, how do I make this work on a file consisting of more than 1 million or 10 million lines?
I also checked the memory and CPU usage of perl while running this script using the top command in Linux, and observed that the CPU usage reaches 100% and the memory usage is close to 90% while the script runs on a file of 400K lines.
So it is not feasible to make it work with a file of 1 million lines, because the perl process will hang.
How can I make this code more memory efficient?
Apparently, your code is written correctly and will work, but only as long as your data set is not VERY big. If you have a lot of input data (and it seems like you DO), it is possible that the sorting phase may fail due to lack of memory. If you cannot increase your memory, the only solution is to write your data to disk - in text or database format.
Text format: you can simply write your triplets as you go into a text file, one line per triplet. Doing this will increase output size by a factor of 3, but it should still be manageable. Then you can simply use the command-line GNU sort and uniq tools to get your desired counts, something like this:
text2triplet.pl <input.txt | sort | uniq -c | sort -rn | head -10000
(you may want to store the output of the first sort in a file rather than piping it, if it is very big)
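A text2triplet.pl along these lines would do; this is just a sketch that mirrors the triplet logic of the question (note it uses >= 3 so the last group of each line is emitted as well):
#!/usr/bin/perl
use strict;
use warnings;

# Reads text on STDIN, prints one lowercased three-word group per line.
while (my $line = <STDIN>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words >= 3) {
        print "@words[0..2]\n";
        shift @words;
    }
}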
Database format: use DBD::SQLite and create table like this:
CREATE TABLE hash (triplet VARCHAR, count INTEGER DEFAULT 0);
CREATE INDEX idx1 ON hash (triplet);
CREATE INDEX idx2 ON hash (count);
INSERT your triplets into that table as you go, and increase count for duplicates. After data is processed, simply
SELECT * FROM hash
WHERE count > 20
ORDER BY count DESC
and print it out.
Then you can DROP your hash table or simply delete the whole SQLite database altogether.
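Here is a rough DBD::SQLite sketch of that flow, reading triplets (one per line, e.g. from a converter like the one above) on STDIN; I give triplet a UNIQUE constraint so INSERT OR IGNORE can deduplicate, which is a slight variation on the schema above:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=triplets.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS hash (triplet VARCHAR UNIQUE, count INTEGER DEFAULT 0)');

my $ins = $dbh->prepare('INSERT OR IGNORE INTO hash (triplet, count) VALUES (?, 0)');
my $upd = $dbh->prepare('UPDATE hash SET count = count + 1 WHERE triplet = ?');

$dbh->begin_work;                    # one transaction keeps the inserts fast
while (my $triplet = <STDIN>) {
    chomp $triplet;
    $ins->execute($triplet);
    $upd->execute($triplet);
}
$dbh->commit;

my $sth = $dbh->prepare('SELECT triplet, count FROM hash WHERE count > 20 ORDER BY count DESC');
$sth->execute;
while (my ($triplet, $count) = $sth->fetchrow_array) {
    print "$triplet - $count\n";
}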
Both of these approaches should allow you to scale to almost the size of your disk, but the database approach may be more flexible.
You have some problems with declaring and using variables. Please add the pragma use strict to your script. Use a lexical (my) variable when you work with the hash in the for block, and elsewhere. I notice that you have the statement if ($c{$key} > 20), but with the sample data the hash values are all <= 2.
#!/usr/bin/perl
use strict;
my %frequency;
while (my $line = <DATA>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words > 3) {
        $frequency{"@words[0,1,2]"}++;
        shift @words;
    }
}
# sort by values
for my $key (sort { $frequency{$b} <=> $frequency{$a} } keys %frequency) {
    printf "%s - %s\n", $key, $frequency{$key};
}
__DATA__
This is a good time to party because this is a vacation time.
OUTPUT
this is a - 2
to party because - 1
is a good - 1
time to party - 1
party because this - 1
because this is - 1
good time to - 1
is a vacation - 1
a good time - 1

Parsing multiple files at a time in Perl

I have a large data set (around 90GB) to work with. There are data files (tab delimited) for each hour of each day and I need to perform operations on the entire data set. For example, get the share of OSes, which are given in one of the columns. I tried merging all the files into one huge file and performing the simple count operation, but it was simply too huge for the server memory.
So I guess I need to perform the operation one file at a time and then add things up at the end. I am new to Perl and am especially naive about the performance issues. How do I do such operations in a case like this?
As an example, two columns of the file are:
ID OS
1 Windows
2 Linux
3 Windows
4 Windows
Let's do something simple: counting the share of the OSes in the data set. Each .txt file has millions of these lines, and there are many such files. What would be the most efficient way to operate on all the files?
Unless you're reading the entire file into memory, I don't see why the size of the file should be an issue.
my %osHash;
while (<>)
{
    chomp;
    my ($id, $os) = split("\t", $_);
    if (!exists($osHash{$os}))
    {
        $osHash{$os} = 0;
    }
    $osHash{$os}++;
}
foreach my $key (sort(keys(%osHash)))
{
    print "$key : ", $osHash{$key}, "\n";
}
While Paul Tomblin's answer dealt with filling the hash, here's the same plus opening the files:
use strict;
use warnings;
use 5.010;
use autodie;
my @files = map { "file$_.txt" } 1..10;
my %os_count;
for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        my ($id, $os) = split /\t/;
        ...    # Do something with %os_count and $id/$os here.
    }
}
We just open each file serially -- since you need to read all lines from all files, there isn't much more you can do about it. Once you have the hash, you could store it somewhere and load it when the program starts, then skip all lines up to the last one you read, or simply seek there, if your record layout permits it, which doesn't look like the case here.
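To make the counting concrete, here is a sketch of filling %os_count and printing the OS shares at the end (it assumes a possible "ID OS" header row should be skipped, and reuses the placeholder file names from above):
use strict;
use warnings;
use autodie;

my @files = map { "file$_.txt" } 1..10;   # same placeholder names as above
my %os_count;
my $total = 0;

for my $file (@files) {
    open my $fh, '<', $file;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $os) = split /\t/, $line;
        next if !defined $os || $id eq 'ID';   # skip blank lines and a header row
        $os_count{$os}++;
        $total++;
    }
}

for my $os (sort { $os_count{$b} <=> $os_count{$a} } keys %os_count) {
    printf "%-10s %8d  %6.2f%%\n", $os, $os_count{$os}, 100 * $os_count{$os} / $total;
}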

Fast algorithm to check membership of a pair of numbers in large (x,y) coordinates in Perl

I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:
chr1 10003486 10043713
chr1 10003507 10043106
chr2 10003486 10043713
chr2 10003507 10043162
chr2 10003532 10042759
In reality this file is very, very large, with 10^7 lines.
What I want to do is: given another two-point coordinate, check whether it falls within any of the coordinate ranges in the xycord.txt file.
The current approach I have is super slow, because there are also many other two-point coordinates to check against this large xycord.txt file.
Is there a fast way to do it?
#!/usr/bin/perl -w
my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid            = $ARGV[2] || "chr1";
my %allxycordwithchr;
# skip file opening construct
while (<XYCORD_FILE>) {
    my ($chr, $tx, $ty) = split(/\s+/, $_);
    push @{ $allxycordwithchr{$chr} }, $tx . "-" . $ty;
}
my @chosenchr_cord = @{ $allxycordwithchr{$chrid} };
for my $chro_cord (@chosenchr_cord) {
    my ($repox, $repoy) = split("-", $chro_cord);
    my $stat = is_in_xycoordsfile($repox, $repoy, $point_to_check_x, $point_to_check_y);
    if ($stat eq "IN") {
        print "IN\n";
    }
}
sub is_in_xycoordsfile {
    my ($x, $y, $xp, $yp) = @_;
    if ( $xp >= $x && $yp <= $y ) {
        return "IN";
    }
    else {
        return "OUT";
    }
}
Update: I apologize for correcting this. In my earlier posting I oversimplified the problem.
Actually, there is one more query field (e.g. the chromosome name).
Hence, might the DB/R-tree/SQL approaches be infeasible in this case?
A few suggestions:
You could store your data in a database, such as MySQL or SQLite. You could then use a simple request such as:
"SELECT * FROM coordinates WHERE x<"+xp+" AND y>"+yp
Provided you have indexes on x and y, this should be super fast.
You could also take a look at R-Trees. I used R-trees a few years ago to store tens of thousands of city coordinates, and I could find the closest city from a given point in a fraction of a second. In your example, you are storing 1D ranges but I'm pretty sure R-trees would work fine too. You might find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you need.
You could cache the coordinates in memory: each number looks like it would take 4 bytes to store, so this would lead to about 80 MB of memory usage if you have 10^7 pairs of numbers. That's what Firefox uses on my machine! Of course if you do this, you need to have some sort of daemon running in order to avoid reloading the whole file every time you need to check coordinates.
You can mix solutions 2 & 3.
My preference is for solution 1: it has a good efficiency/complexity ratio.
In addition to Udi Pasmon's good advice, you could also convert your large file to a DBM and then tie the DBM file to a hash for easy look ups.
Convert the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
my $dbfile = "coords.db";
tie my %coords, "DB_File", $dbfile
or die "could not open $dbfile: $!";
while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}
Check to see if arguments are members of the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
my ($x, $y) = @ARGV;
tie my %coords, "DB_File", "coords.db"
or die "could not open coords.db: $!";
print "IN\n" if $coords{"$x-$y"};
Try a binary search rather than a sequential search. There are two apparent options to do this:
Split the file into smaller files (xycord001.txt, xycord002.txt and so on). Now you can easily determine which file to search, and the search is rather faster. The big con here is that if you need to add data to a file it might get messy.
Make a binary search over the file: start at the middle, splitting the file into two logical parts. Decide in which part your coordinates might be, and look at the middle of that part. You'll rapidly (exponentially) reduce the size of the region you're searching, until you're searching one line only. Read more about seeking into files; there is a Perl example about binary searching a file here.
EDIT: Generally, using a database or DB file is preferred; however, binary file search is the quick-and-dirty way, especially if the script should run on different files on different machines (thanks @MiniQuark, @Chas. Owens).
If both inputs, or at least the large one, are sorted, you can try a variation of a merge join between them.
If the lookup file (the smaller file) isn't too large, then the easiest approach is to just read it in and put it in a hash keyed by the name, with sorted arrays of start-end pairs as values.
Then go through each row in the large file and look up the array of lookup values that could match it by its name. Go through each pair in the lookup array: if the lookup start is less than the input pair's start, discard that value, as it can no longer match anything. If the lookup start is past the input end, break the loop, as no further lookup values can match. If the lookup end is not past the input end, you have a match and can add the input and lookup to the list of matches.
My Perl is rusty, so no Perl example code, but I threw together a quick and dirty Python implementation. On my arbitrary randomly generated dataset matching 10M rows to 10k lookup rows for 14k matches took 22s, matching to 100k lookups for 145k matches took 24s and matching to 1M lookups for 1.47M matches took 35s.
If the smaller file is too big to fit in memory at once, it can be loaded in batches of keys as the keys are encountered in the input file.
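Since no Perl code is given, here is a rough Perl sketch of the approach described above (the file names, column layout and the sortedness of the large file are assumptions):
use strict;
use warnings;

# Small file of query points: "chr x y" per line.
my %lookup;
open my $qfh, '<', 'queries.txt' or die $!;
while (<$qfh>) {
    my ($chr, $x, $y) = split;
    push @{ $lookup{$chr} }, [ $x, $y ];
}
for my $pairs (values %lookup) {
    @$pairs = sort { $a->[0] <=> $b->[0] } @$pairs;   # sort queries by start
}

# Large range file, assumed sorted by chromosome and then by start.
open my $rfh, '<', 'xycord.txt' or die $!;
while (<$rfh>) {
    my ($chr, $tx, $ty) = split;
    my $queries = $lookup{$chr} or next;
    # Queries starting left of this range can never match a later range either.
    shift @$queries while @$queries && $queries->[0][0] < $tx;
    for my $q (@$queries) {
        last if $q->[0] > $ty;                        # later queries start even further right
        print "IN: @$q lies within $chr $tx-$ty\n" if $q->[1] <= $ty;
    }
}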
Restating your question: do you want to print all ranges in the file that contain the (x, y) pair and also have the same id? If that's the case, you don't need to parse the file and store it in memory.
while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
        && $point_to_check_x >= $tx
        && $point_to_check_y <= $ty;
}
OK, so let me clarify the problem, based on my understanding of your code. You have a file with a very large number of entries in it. Each entry includes a label "chr1", "chr2", etc. and two numbers, the first of which is less than the second. You then have a query which comprises a label and a number, and you wish to know whether there is a record in the large file which has the same label as the query, and has the two values such that one is less than the query number and the other is greater than it. Essentially, whether the number in the query is within the interval specified by the two numbers in the record.
Provided my understanding is correct, the first thing to notice is that any of the records which do not have the same label as the query have no part to play in the problem. So you can ignore them. Your program reads them all in, puts them in a hash and then doesn't look at most of the data. Sure, if you have several queries to do, you're going to need to keep data for each of the labels you're interested in, but you can throw the rest away as early as possible. This will keep your memory requirement down.
I would be tempted to take this a little further. Is there a possibility of breaking the huge file up into smaller files? It would seem to be a good idea to break it into files which have only certain labels in them. For example, you might have one file per label, or one file for all data with labels beginning with "a", or so on. This way you can open only those files which you know you're going to be interested in, and save yourself a lot of time and effort.
These changes alone may make enough difference for you. If you still need more performance, I would start thinking about how you are storing the records you're going to need. Storing them ordered on the lower (or higher) of the two numbers should cut down a bit the time taken to find what you're after, particularly if you store them in a binary search tree or a trie.
That should give you enough to work on.
PDL for genomic data processing
We processed a lot of files in the same format as you show in your question and found that PDL (documentation) is a very good tool for doing this. You need some time to learn it --- but it's definitely worth the effort (if you do genomics data processing): PDL can process huge files a few thousand times faster than MySQL.
Here are some hints where to go:
First of all, PDL is a language somewhat like Matlab --- but fully integrated with Perl. Read the documentation, do some examples. Consult a mathematician for advice on which features to use for what purpose.
PDL stores its data in plain C arrays. Learn about Inline::C and access this data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like a perfect match: PDL for high-level operations, Inline::C for anything missing. Still, PDL is as fast as your best C, because it does its work in C.
Use PDL::IO::FastRaw --- store and access data in files on disk. I often write the files "by hand" (see below) and read them as memory-mapped files (using PDL::IO::FastRaw::mapfraw, often with the flag ReadOnly=>1). This is the most efficient way to read data from disk on Linux.
The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with print FileHandle pack "i*", @data; -- check perldoc -f pack.
In my experience, just reading the input files line by line and printing them out in binary format is the slowest part of processing: once you have them ready for PDL to mmap, processing is much faster.
I hope this advice helps --- even though not much code is given.
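To make it slightly more concrete, here is a tiny sketch of the write-then-mmap workflow (the file name is arbitrary; writefraw also writes a small .hdr file that mapfraw reads back for dimensions and type):
use strict;
use warnings;
use PDL;
use PDL::IO::FastRaw;

my $coords = long(10003486, 10043713, 10003507, 10043106);   # toy data
writefraw($coords, 'coords.raw');                            # raw C longs + coords.raw.hdr

my $mapped = mapfraw('coords.raw', { ReadOnly => 1 });       # memory-mapped, not slurped
print $mapped->nelem, " values, max = ", $mapped->max, "\n";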

How can I merge lines in a large, unsorted file without running out of memory in Perl?

I have a very large column-delimited file coming out of a database report, in a format something like this:
field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2
I want the new file to combine lines like these, so it would look something like this:
field1,field2,field3,value1,value2
I'm able to do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a certain order to be the value. After I've read in the file, I just print out the hash table's keys and values into another file. It works fine.
However, I have some concerns since my file is going to be very large. About 8 GB per file.
Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank in terms of a solution that would work but wouldn't shove everything into, ultimately, a very large hash.
For full disclosure, I'm using ActiveState Perl on Windows.
If your rows are sorted on the key, or for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines and if the fields are the same as the previous line, emit both values.
Otherwise, at least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value -- this should substantially limit your memory usage.
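A quick sketch of that delete-on-second-value idea (column layout as in the question; the metric names metricA/metricB are assumptions):
use strict;
use warnings;

my %pending;
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (my $first = delete $pending{$key}) {
        # Second value for this key: emit the merged line and free the memory.
        my ($va, $vb) = $first->[0] eq 'metricA'
                      ? ($first->[1], $value)
                      : ($value, $first->[1]);
        print "$key,$va,$vb\n";
    }
    else {
        $pending{$key} = [ $metric, $value ];
    }
}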
If you had other Unix-like tools available (for example via Cygwin) then you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the sorted format.
Once the file is sorted, doing this sort of merge is then easy - iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the keys change.
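Here is a rough sketch of that merge over a pre-sorted file (it assumes the lines were sorted on the first three fields, e.g. with the sort command, so that values for a key are adjacent):
use strict;
use warnings;

my ($prev_key, @values);
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (defined $prev_key && $key ne $prev_key) {
        print join(',', $prev_key, @values), "\n";   # key changed: flush the previous group
        @values = ();
    }
    push @values, $value;
    $prev_key = $key;
}
print join(',', $prev_key, @values), "\n" if defined $prev_key;   # flush the last group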
If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:
use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data';
while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly
    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";
    if ($columns[3] eq 'A') {
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'B') {
        $data{$b_key} = $columns[4];
    }
    if (exists $data{$a_key} && exists $data{$b_key}) {
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
Would it not be better to make another export directly from the database into your new file, instead of reworking the file you have already output? If this is an option, then I would go that route.
You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.