CSV manipulation with AWK or Perl?

I have two CSV files: one has a long list of reference numbers, the other a daily list of orders.
On a daily basis I need to cut and paste from the reference numbers into the daily orders. I only cut as many reference numbers as there are orders, so for example if there are 20 orders I need to take 20 reference numbers from the other file and paste them into my orders file. I cut these numbers out so we don't get duplicates on the next day's run.
I want to automate this process but I don't know the best way. I am running Windows and have used AWK for some other CSV manipulation, but I'm not very experienced with AWK and not sure if this is possible, so I am just asking if anybody has any ideas of the best solution.

Parsing CSV properly is tricky business. Most of the difficulty comes from mistakes in handling quotes, double quotes, commas, and spaces in your content.
Rather than reinventing the wheel, I would recommend using a well-tested library. I don't think awk has one, but Perl does: DBD::CSV.
On Windows, simply install ActivePerl; it already has DBD::CSV installed by default.
Then use Perl code like this to retrieve your data; you can convert it to other formats inside the while loop:
use DBI;

my $dbh = DBI->connect("dbi:CSV:f_ext=.csv") or die $DBI::errstr;
my $sth = $dbh->prepare("SELECT * FROM mytable");   # access mytable.csv
$sth->execute();
while (my @row = $sth->fetchrow_array()) {
    print "id: $row[0], name: $row[1]\n";
}
# you can also access columns by name, like this:
# while (my $row = $sth->fetchrow_hashref()) {
#     print "id: $row->{id}, name: $row->{name}\n";
# }
$sth->finish();
$dbh->disconnect();
Since you mention you have two input CSV files, you might even be able to use SQL JOIN statements to get data from both tables at once.
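Something like this untested sketch shows the idea; the file names orders.csv and refs.csv and the order_id/refnum columns are made up, and DBD::CSV takes the column names from the first line of each file by default:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:CSV:f_ext=.csv") or die $DBI::errstr;

# Hypothetical layout: both files share an order_id column to join on.
my $sth = $dbh->prepare(
    "SELECT orders.order_id, refs.refnum
       FROM orders JOIN refs ON orders.order_id = refs.order_id"
);
$sth->execute();
while (my @row = $sth->fetchrow_array()) {
    print join(',', @row), "\n";
}
$sth->finish();
$dbh->disconnect();

Whether a keyed join actually fits your workflow (you pair the first N unused reference numbers with N orders) is a separate question; if there is no shared key you would still read both files and pair rows positionally in Perl.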

Related

Is it okay to keep huge data in Perl data structure

I am receiving some CSVs from client. The average size of these CSVs is 20 MB.
The format is:
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
My current approach:
I store all these records temporarily in a table, and then run queries against that table:
where customer='customer1' and product='product1'
where customer='customer1' and product='product2'
where customer='customer2' and product='product1'
Problem: inserting into the DB and then selecting takes too much time. A lot of stuff is happening, and it takes 10-12 minutes to process one CSV. I am currently using SQLite and it is quite fast, but I think I'll save some more time if I remove the insertion and selection altogether.
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
The machine generally has 500MB+ free RAM.
If the query you show is the only kind of query you want to perform, then this is rather straightforward.
my $orders;   # I guess
while (my $row = <DATA>) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{$fields[0]}->{$fields[1]} }, \@fields;   # or as a hashref, but that's larger
}
print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] };   # the 'Cutomer' typo matches the sample data
__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
You just build an index into a hash reference that is several levels deep. The first level is keyed by customer. It contains another hashref keyed by product, which holds the list of rows that match this index. Then you can decide if you just want the whole thing as an array ref, or if you want to put a hash ref with keys there. I went with an array ref because that consumes less memory.
Later you can query it easily. I included that above. Here's the output.
Cutomer1
Product1
cat1
many
other
info
If you don't want to remember indexes but have to code a lot of different queries, you could make variables (or even constants) that represent the magic numbers.
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
# build $orders ...
my $res = $orders->{Cutomer1}->{Product2}->[1];
print "Category: " . $res->[CATEGORY];
The output is:
Category: cat2
To order the result, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that; a short sketch follows the example below.
for my $res (
    sort { $a->[OTHER] cmp $b->[OTHER] }
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}
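If you do need the two-column sort, the usual idiom is to chain the comparisons with or; a small sketch, using OTHER and then INFO as an arbitrary example:

for my $res (
    sort {
        $a->[OTHER] cmp $b->[OTHER]
            or
        $a->[INFO] cmp $b->[INFO]
    } @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}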
However, you can only search by Customer and Product like this.
If there is more than one type of query, this gets expensive. If you also wanted to group them by category only, you would either have to iterate over all of them every single time you look something up, or build a second index, as sketched below. Doing that is harder than waiting a few extra seconds, so you probably don't want to do that.
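For illustration only, a second index keyed on category could be built in the same read loop, at the cost of extra memory; this is a sketch, not something the timings below include:

my $by_category;
while ( my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
    push @{ $by_category->{ $fields[2] } }, \@fields;   # second index: all rows per category
}
# later: every row in cat1, regardless of customer or product
# my @cat1_rows = @{ $by_category->{cat1} };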
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
For this specific purpose, absolutely. 20 megabytes is not a lot.
I've created a test file that is 20004881 bytes and 447848 lines with this code, which is not perfect, but gets the job done.
use strict;
use warnings;
use feature 'say';
use File::stat;
open my $fh, '>', 'test.csv' or die $!;
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}
Here is an excerpt of the file:
$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info
Now let's run our above program with this input file and look at the memory consumption and some statistics of the size of the data structure.
use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};

open my $fh, '<', 'test.csv' or die $!;

my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;
    $orders->{ $fields[0] }->{ $fields[1] } = \@fields;
}
say 'total size of $orders: ' . total_size($orders);
Here it is:
total size of $orders: 185470864
So that variable consumes 185 megabytes. That's a lot more than the 20MB of CSV, but we have an easily searchable index. Using htop I figured out that the actual process consumes 287MB. My machine has 16GB of memory, so I don't care about that. And at about 3.6 seconds the program runs reasonably fast, but I have an SSD and a newish Core i7 machine.
Still, it will not eat all your memory if you have 500MB to spare. An SQLite approach would likely consume less memory, but you have to benchmark the speed of this versus the SQLite approach to decide which one is faster.
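If you want to compare them properly, the core Benchmark module can do the timing; a sketch, where load_and_query_hash and query_sqlite are placeholders for your own subs:

use Benchmark qw(cmpthese);

# Each sub should perform one representative run of the approach being timed.
cmpthese(10, {
    perl_hash => sub { load_and_query_hash() },
    sqlite    => sub { query_sqlite() },
});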
I used the method described in this answer to read the file into an SQLite database.[1] I needed to add a header line to the file first, but that's trivial.
$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test
Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.
sqlite> CREATE INDEX foo ON test ( customer, product );
This felt like it took another one second. Now I could query.
sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info
The result appeared instantaneously (which is not scientific!). Since we didn't measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.
However, the SQLite file size is only 38839296, which is about 39MB. That's bigger than the CSV file, but not by a lot. It seems like the sqlite3 process only consumes about 30kB of memory, which I find weird given the index.
In conclusion, SQLite seems to be a bit more convenient and to eat less memory. There is nothing wrong with doing this in Perl, and it might be the same speed, but using SQL for this type of query feels more natural, so I would go with that.
If I might be so bold I would assume you didn't set an index on your table when you did it in SQLite and that made it take longer. The amount of rows we have here is not that much, even for SQLite. Properly indexed it's a piece of cake.
If you don't actually know what an index does, think about a phone book. It has the index of first letters on the sides of the pages. To find John Doe, you grab D, then somehow look. Now imagine there was no such thing. You need to randomly poke around a lot more. And then try to find the guy with the phone number 123-555-1234. That's what your database does if there is no index.
[1] If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the DB, then use Perl's DBI to do the querying. As an example, sqlite3 foo.db <<<$'.tables\n.tables' (where \n is a literal line break) prints the list of tables twice, so importing like this will work, too.
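For the querying side, a minimal DBI sketch might look like this, assuming DBD::SQLite is installed and the test.db/test table created above:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT * FROM test WHERE customer = ? AND product = ?');
$sth->execute('Customer23', 'Product1');
while ( my @row = $sth->fetchrow_array ) {
    print join(',', @row), "\n";
}
$dbh->disconnect;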

What's the best way to read a huge CSV file using Perl?

Requirements
I have a very large CSV file to read. (about 3 GB)
I won't need all records, I mean, there are some conditionals that we can use, for example, if the 3rd CSV column content has 'XXXX' and 4th column has '999'.
Question:
Can I use these conditionals to improve the read process? If so, how can I do that using Perl?
I need an example (Perl Script) in your answer.
Here's a solution:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;
use autodie;
my $csv = Text::CSV_XS->new();
open my $FH, "<", "file.txt";
while (<$FH>) {
    $csv->parse($_);
    my @fields = $csv->fields;
    next unless $fields[1] =~ /something I want/;
    # do the stuff to the fields you want here
}
Use Text::CSV
The first part of your question has been answered a few times over already, but the second part has not yet been addressed:
I won't need all records, I mean, there are some conditionals that we can use, for example, if the 3rd CSV column content has 'XXXX' and 4th column has '999'. Can I use these conditionals to improve the read process?
No. How would you know whether the 3rd CSV column contains 'XXXX' or the 4th is '999' without reading the line first? (DBD::CSV lets you hide this behind an SQL WHERE clause, but, because CSV is unindexed data, it still needs to read in every line to determine which lines match the condition(s) and which don't.)
Pretty much the only way the content of a line could be used to let you skip reading parts of the file is if it contained information telling you 1) "skip the section following this line" and 2) "continue reading at byte offset nnn".
The Text::CSV module is a great solution for this. Another option is the DBD::CSV module, which provides a slightly different interface. The DBI interface is really useful if you're developing applications that have to access data from different forms of databases, including relational databases and comma-separated text files.
Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect("DBI:CSV:f_dir=/home/joe/csvdb")
    or die "Cannot connect: $DBI::errstr";
my $sth = $dbh->prepare("SELECT id, name FROM info.txt WHERE id > 1 ORDER BY id");
$sth->execute;

my ($id, $name);
$sth->bind_columns(\$id, \$name);
while ($sth->fetch) {
    print "Found result row: id = $id, name = $name\n";
}
$sth->finish;
I'd use Text::CSV for this task unless you're planning on talking to other types of databases, but in Perl TIMTOWTDI and it helps to know your options.
Use a module like Text::CSV. However, if you know that your data will not have embedded commas and is in a simple CSV format, then a plain while loop iterating over the file will suffice:
while (<>) {
    chomp;
    @s = split /,/;
    if ( $s[2] eq "XXXX" && $s[3] eq "999" ) {
        # do something;
    }
}

How can I combine files into one CSV file?

If I have one file FOO_1.txt that contains:
FOOA
FOOB
FOOC
FOOD
...
and lots of other files FOO_files.txt. Each of them contains:
1110000000...
one line containing a 0 or 1 for each of the FOO_1 values (fooa, foob, ...)
Now I want to combine them to one file FOO_RES.csv that will have the following format:
FOOA,1,0,0,0,0,0,0...
FOOB,1,0,0,0,0,0,0...
FOOC,1,0,0,0,1,0,0...
FOOD,0,0,0,0,0,0,0...
...
What is a simple and elegant way to do that
(with hashes and arrays -> $hash{$key} = \@data)?
Thanks a lot for any help !
Yohad
If you can't describe your data and your desired result clearly, there is no way that you will be able to code it. Taking on a simple project is a good way to get started using a new language.
Allow me to present a simple method you can use to churn out code in any language, whether you know it or not. This method only works for smallish projects. You'll need to actually plan ahead for larger projects.
How to write a program:
Open up your text editor and write down what data you have. Make each line a comment
Describe your desired results.
Start describing the steps needed to change your data into the desired form.
Numbers 1 & 2 completed:
#!/usr/bin/perl
use strict;
use warnings;
# Read data from multiple files and combine it into one file.
# Source files:
# Field definitions: has a list of field names, one per line.
# Data files:
# * Each data file has a string of digits.
# * There is a one-to-one relationship between the digits in the data file and the fields in the field defs file.
#
# Results File:
# * The results file is a CSV file.
# * Each field will have one row in the CSV file.
# * The first column will contain the name of the field represented by the row.
# * Subsequent values in the row will be derived from the data files.
# * The order of subsequent fields will be based on the order files are read.
# * However, each column (2-X) must represent the data from one data file.
Now that you know what you have, and where you need to go, you can flesh out what the program needs to do to get you there - this is step 3:
You know you need to have the list of fields, so get that first:
# Get a list of fields.
# Read the field definitions file into an array.
Since it is easiest to write CSV in a row oriented fashion, you will need to process all your files before generating each row. So you'll need someplace to store the data.
# Create a variable to store the data structure.
Now we read the data files:
# Get a list of data files to parse
# Iterate over list
# For each data file:
# Read the string of digits.
# Assign each digit to its field.
# Store data for later use.
We've got all the data in memory, now write the output:
# Write the CSV file.
# Open a file handle.
# Iterate over list of fields
# For each field
# Get field name and list of values.
# Create a string - comma separated string with field name and values
# Write string to file handle
# close file handle.
Now you can start converting comments into code. You could have anywhere from 1 to 100 lines of code for each comment. You may find that something you need to do is very complex and you don't want to take it on at the moment. Make a dummy subroutine to handle the complex task, and ignore it until you have everything else done. Then you can solve that complex, thorny sub-problem on its own.
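For example, the first couple of comments might turn into something like this; a sketch, where the fields.txt file name is just a placeholder:

# Get a list of fields.
# Read the field definitions file into an array.
open my $field_fh, '<', 'fields.txt' or die "Can't open fields.txt: $!";
chomp( my @field_names = <$field_fh> );
close $field_fh;

# Create a variable to store the data structure.
my %data;   # field name => array ref of values, one per data file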
Since you are just learning Perl, you'll need to hit the docs to find out how to do each of the subtasks represented by the comments you've written. The best resource for this kind of work is the list of functions by category in perlfunc. The Perl syntax guide will come in handy too. Since you'll need to work with a complex data structure, you'll also want to read from the Data Structures Cookbook.
You may be wondering how the heck you should know which perldoc pages you should be reading for a given problem. An article on Perlmonks titled How to RTFM provides a nice introduction to the documentation and how to use it.
The great thing is that if you get stuck, you have some code to share when you ask for help.
If I understand correctly, your first file is your key order file, and the remaining files each contain one byte per key, in the same order. You want a composite file of those keys with each of their data bytes listed together.
In this case you should open all the files simultaneously. Read one key from the key order file and one byte from each of the data files, and output everything as you read it to your final file. Repeat for each key. A sketch of that approach is below.
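A rough sketch of that approach; the file names are assumptions (keys in FOO_1.txt, data files passed on the command line), and each data file is assumed to hold one digit per key:

use strict;
use warnings;

my @data_names = @ARGV;   # the FOO data files, in the column order you want

open my $keys_fh, '<', 'FOO_1.txt' or die "FOO_1.txt: $!";
my @data_fhs = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @data_names;
open my $out_fh, '>', 'FOO_RES.csv' or die "FOO_RES.csv: $!";

while ( my $key = <$keys_fh> ) {
    chomp $key;
    # one digit from each data file belongs to this key
    my @digits = map { my $c = getc($_); defined $c ? $c : '' } @data_fhs;
    print {$out_fh} join( ',', $key, @digits ), "\n";
}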
It looks like you have many foo_files that have 1 line in them, something like:
1110000000
Which stands for
fooa=1
foob=1
fooc=1
food=0
fooe=0
foof=0
foog=0
fooh=0
fooi=0
fooj=0
And it looks like your foo_res is just a summation of those values? In that case, you don't need a hash of arrays, but just a hash.
my @foo_files = ();   # NOT SURE HOW YOU POPULATE THIS ONE
my @foo_keys  = qw(a b c d e f g h i j);
my %foo_hash  = map { ( $_, 0 ) } @foo_keys;   # initialize hash

foreach my $foo_file ( @foo_files ) {
    open( my $FOO, "<", $foo_file ) || die "Cannot open $foo_file\n";
    my $line = <$FOO>;
    close( $FOO );

    chomp($line);
    my @foo_values = split( //, $line );
    foreach my $indx ( 0 .. $#foo_keys ) {
        last if !defined $foo_values[$indx];   # or some kind of error checking if the input file doesn't have all the values
        $foo_hash{ $foo_keys[$indx] } += $foo_values[$indx];
    }
}
}
It's pretty hard to understand what you are asking for, but maybe this helps?
Your specifications aren't clear. You couldn't have "lots of other files" all named FOO_files.txt, because that's only one name. So I'm going to take this as the files-with-data + filelist pattern. In this case, there are files named FOO*.txt, each containing "[01]+\n".
Thus the idea is to process all the files in the filelist file and to insert them all into a result file FOO_RES.csv, comma-delimited.
use strict;
use warnings;
use English qw<$OS_ERROR>;
use IO::Handle;

open my $foos, '<', 'FOO_1.txt'
    or die "I'm dead: $OS_ERROR";

@ARGV = sort map { chomp; "$_.txt" } <$foos>;
$foos->close;

open my $foo_csv, '>', 'FOO_RES.csv'
    or die "I'm dead: $OS_ERROR";

while ( my $line = <> ) {
    chomp $line;
    my ( $foo_name ) = ( $ARGV =~ /(.*)\.txt$/ );
    $foo_csv->print( join( ',', $foo_name, split //, $line ), "\n" );
}
$foo_csv->close;
You don't really need to use a hash. My Perl is a little rusty, so the syntax may be off a bit, but basically do this:
open KEYFILE , "foo_1.txt" or die "cannot open foo_1 for reading";
open VALFILE , "foo_files.txt" or die "cannot open foo_files for reading";
open OUTFILE , ">foo_out.txt" or die "cannot open foo_out for writing";
my %output;
while (<KEYFILE>) {
    my $key = $_;
    chomp $key;
    my $val = <VALFILE>;
    chomp $val;
    my @arrVal = split(//, $val);
    $output{$key} = \@arrVal;
    print OUTFILE $key . "," . join(",", @arrVal) . "\n";
}
Edit: Syntax check OK
Comment by Sinan: @Byron, it really bothers me that your first sentence says the OP does not need a hash, yet your code has %output, which seems to serve no purpose. For reference, the following is a less verbose way of doing the same thing.
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:file :io);

open my $KEYFILE, '<', "foo_1.txt";
open my $VALFILE, '<', "foo_files.txt";
open my $OUTFILE, '>', "foo_out.txt";

while (my $key = <$KEYFILE>) {
    chomp $key;
    chomp( my $val = <$VALFILE> );
    print $OUTFILE join(q{,}, $key, split //, $val), "\n";
}
__END__

How can I filter out specific column from a CSV file in Perl?

I am just a beginner in Perl and need some help filtering columns using a Perl script.
I have about 10 comma-separated columns in a file, and I need to keep 5 of those columns and get rid of every other column in the file. How do I achieve this?
Thanks a lot for anybody's assistance.
cheers,
Neel
Have a look at Text::CSV (or Text::CSV_XS) to parse CSV files in Perl. It's available on CPAN or you can probably get it through your package manager if you're using Linux or another Unix-like OS. In Ubuntu the package is called libtext-csv-perl.
It can handle cases like fields that are quoted because they contain a comma, something that a simple split command can't handle.
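A small sketch of that for the keep-five-columns case; the column indices 0, 2, 4, 6, 8 and the file names are only placeholders:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

open my $in,  '<', 'input.csv'  or die "input.csv: $!";
open my $out, '>', 'output.csv' or die "output.csv: $!";

while ( my $row = $csv->getline($in) ) {
    # keep only the wanted columns; quoting is handled for you
    $csv->print( $out, [ @{$row}[0, 2, 4, 6, 8] ] );
}
close $in;
close $out;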
CSV is an ill-defined, complex format (with weird issues around quoting, commas, and spaces). Look for a library that can handle the nuances for you and also give you conveniences like indexing by column names.
Of course, if you're just looking to split a text file by commas, look no further than @Pax's solution.
Use split to pull the line apart, then output the fields you want (say, every second column). Create the following xx.pl file:
while (<STDIN>) {
    chomp;
    @fields = split (",", $_);
    print "$fields[1],$fields[3],$fields[5],$fields[7],$fields[9]\n";
}
then execute:
$ echo 1,2,3,4,5,6,7,8,9,10 | perl xx.pl
2,4,6,8,10
If you are talking about CSV files on Windows (e.g., generated from Excel), you will need to be careful with fields that contain commas themselves but are enclosed by quotation marks.
In this case, a simple split won't work.
Alternatively, you could use Text::ParseWords, which is in the standard library. Add
use Text::ParseWords;
to the top of Pax's example above, and then substitute
my @fields = parse_line(q{,}, 0, $_);
for the split.
You can use some of Perl's built in runtime options to do this on the command line:
$ echo "1,2,3,4,5" | perl -a -F, -n -e 'print join(q{,}, $F[0], $F[3]).qq{\n}'
1,4
The above will -a(utosplit) using the -F(ield separator) of a comma. It will then join the fields you are interested in and print them back out (with a line separator). This assumes simple data without nested commas. I was doing this with an unprintable field separator (\x1d) so this wasn't an issue for me.
See http://perldoc.perl.org/perlrun.html#Command-Switches for more details.
I went looking and didn't find a nice CSV-compliant filter program flexible enough to be useful for more than just a one-off, so I wrote one. Enjoy.
Basic usage is:
bash$ csvfilter [-r <columnTitle>]* [-quote] <csv.file>
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Text::CSV;

my $always_quote = 0;
my @remove;
if ( ! GetOptions( 'remove:s'     => \@remove,
                   'quote-always' => sub { $always_quote = 1; } ) ) {
    die "$0: invalid option (use --remove [--quote-always])";
}
my @cols2remove;

# Return the fields with the (sorted) column indices in @cols2remove removed.
sub filter(@)
{
    my @fields = @_;
    my @r;
    my $i = 0;
    for my $c (@cols2remove) {
        # copy the run of kept columns before this removed column
        if ( $c > $i ) {
            push( @r, @fields[ $i .. $c - 1 ] );
        }
        $i = $c + 1;
    }
    # copy anything left after the last removed column
    if ( $#fields >= $i ) {
        push( @r, @fields[ $i .. $#fields ] );
    }
    return @r;
}
# create just one of these
my $csvOut = Text::CSV->new( { always_quote => $always_quote } );

# Re-assemble the filtered fields into a CSV line and print it.
sub printLine(@)
{
    my @fields   = @_;
    my $combined = $csvOut->combine( filter(@fields) );
    my $str      = $csvOut->string();
    if ( length($str) ) {
        print "$str\n";
    }
}
my $csv = Text::CSV->new();
my $od;
open( $od, "| cat" ) || die "output:$!";

while (<>) {
    $csv->parse($_);
    if ( $. == 1 ) {
        # header line: map each --remove column title to its index
        my @cols = $csv->fields;
        for my $rm (@remove) {
            for ( my $c = 0; $c <= $#cols; $c++ ) {
                if ( $cols[$c] eq $rm ) {
                    push( @cols2remove, $c );
                    last;
                }
            }
        }
        @cols2remove = sort { $a <=> $b } @cols2remove;
    }
    printLine( $csv->fields );
}
exit(0);
In addition to what people here have said about processing comma-separated files, I'd like to note that one can extract the even (or odd) array elements using an array slice and/or map:
@myarray[map { $_ * 2 } (0 .. 4)]
Hope it helps.
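For instance, a small made-up example:

my @myarray = split /,/, "1,2,3,4,5,6,7,8,9,10";
my @kept = @myarray[ map { $_ * 2 } (0 .. 4) ];   # indices 0, 2, 4, 6, 8
print join( ',', @kept ), "\n";                   # prints 1,3,5,7,9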
My personal favorite way to do CSV is using the AnyData module. It seems to make things pretty simple, and removing a named column can be done rather easily. Take a look on CPAN.
This answers a much larger question, but it seems like a relevant bit of information.
The Unix cut command can do what you want (and a whole lot more); it has also been reimplemented in Perl.
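For a quick comma-separated case, the invocation might look like this (the field numbers are placeholders):
$ cut -d, -f1,3,5,7,9 input.csv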

How can I merge lines in a large, unsorted file without running out of memory in Perl?

I have a very large column-delimited file coming out of a database report, something like this:
field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2
I want the new file to combine lines like those so it would look something like this:
field1,field2,field3,value1,value2
I'm able to do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a certain order to be the value. After I've read in the file, I just print out the hash table's keys and values into another file. It works fine.
However, I have some concerns since my file is going to be very large, about 8 GB per file.
Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank in terms of a solution that would work but wouldn't shove everything into, ultimately, a very large hash.
For full disclosure, I'm using ActiveState Perl on Windows.
If your rows are sorted on the key, or if for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines and, if the fields are the same as the previous line, emit both values.
Otherwise, at the very least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value; this should substantially limit your memory usage. A sketch of that is below.
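A sketch of that second idea; the field layout follows the question, and the metricA/metricB labels are taken from the sample lines:

use strict;
use warnings;

my %pending;   # key => first value seen; deleted once the pair is printed

while ( my $line = <> ) {
    chomp $line;
    my ( $f1, $f2, $f3, $metric, $value ) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;

    if ( exists $pending{$key} ) {
        # second value for this key: emit the merged row and free the memory
        my ( $value_a, $value_b ) = $metric eq 'metricA'
                                  ? ( $value, $pending{$key} )
                                  : ( $pending{$key}, $value );
        print "$key,$value_a,$value_b\n";
        delete $pending{$key};
    }
    else {
        $pending{$key} = $value;
    }
}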
If you have other Unix-like tools available (for example via Cygwin), then you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the sorted format.
Once the file is sorted, doing this sort of merge is easy: iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the keys change. A sketch is below.
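A sketch of that merge, assuming the input has been sorted so that matching keys are adjacent and each key appears exactly twice:

use strict;
use warnings;

my ( $prev_key, $prev_value );

while ( my $line = <> ) {
    chomp $line;
    my @f     = split /,/, $line;
    my $key   = join ',', @f[0..2];
    my $value = $f[4];

    if ( defined $prev_key && $prev_key eq $key ) {
        # keys match: emit the merged row and reset
        print "$key,$prev_value,$value\n";
        ( $prev_key, $prev_value ) = ( undef, undef );
    }
    else {
        ( $prev_key, $prev_value ) = ( $key, $value );
    }
}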
If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:
use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data';

while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line;   # or use Text::CSV_XS to parse this correctly
    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";
    if ($columns[3] eq 'metricA') {
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'metricB') {
        $data{$b_key} = $columns[4];
    }
    if (exists $data{$a_key} && exists $data{$b_key}) {
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
Would it not be better to make another export directly from the database into your new file, instead of reworking the file you have already output? If this is an option, then I would go that route.
You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.