Sorting Huge Hashes in Perl

I am analyzing the frequency of occurrence of groups of words which occur together in a sentence.
Each group consists of 3 words and we have to calculate their frequency.
Example: This is a good time to party because this is a vacation time.
Expected Output:
this is a - 2
is a good - 1
a good time - 1
and so on.
I have written a script which works well: it prints the frequencies and sorts them in descending order.
It works by reading one line at a time from the file. For each line, it converts it to lowercase, splits it into individual words and forms an array out of them.
Then we pick 3 words at a time starting from the left and keep a hash storing the count of occurrences. After counting a group, we shift the leftmost element off the array and repeat as long as the array contains more than 3 words.
Question Updated:
The problem is I want to use this script on a file consisting of more than 10 million lines.
After running some tests I observed that it will not work if the number of lines in the input file is more than 400K.
How can I make this script more memory efficient?
Thanks to fxzuz for his suggestions but now I want to make this script work with larger files :)
#!/usr/bin/perl
use strict;
use warnings;
print "Enter the File name: ";
my $input = <STDIN>;
chomp $input;
open INPUT, '<', $input
or die("Couldn't open the file, $input with error: $!\n");
my %c;
while (my $line = <INPUT>) {
    chomp $line;
    my @x = map lc, split /\W+/, $line;
    while (@x > 3) {
        $c{"@x[0..2]"}++;
        shift @x;
    }
}
foreach my $key (sort { $c{$b} <=> $c{$a} } keys %c) {
    if ($c{$key} > 20) {
        print $key . " - " . $c{$key} . "\n";
    }
}
close INPUT;
This works well and it will print the groups of words in descending order of frequency. It will only print those groups of words which occur more than 20 times.
Now, how do I make this work on a file consisting of more than 1 million or 10 million lines?
I also checked the memory and CPU usage of perl while running this script with the top command in Linux: CPU usage reaches 100% and memory usage is close to 90% on a file of 400K lines.
So it is not feasible to run it on a file of 1 million lines, because the perl process will hang.
How can I make this code more memory efficient?

Apparently, your code is written correctly and will work, but only as long as your data set is not VERY big. If you have a lot of input data (and it seems like you DO), it is possible that the sorting phase will fail due to lack of memory. If you cannot increase your memory, the only solution is to write your data to disk, in either text or database format.
Text format: you can simply write your triplets to a text file as you go, one triplet per line. Doing this will increase the output size by a factor of about 3, but it should still be manageable. Then you can use the command-line GNU sort and uniq tools to get the counts you want, something like this:
text2triplet.pl <input.txt | sort | uniq -c | sort -r | head -10000
(if the intermediate output is very big, you may want to write the output of the first sort to a file rather than pipe it)
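For reference, here is a minimal sketch of what that text2triplet.pl filter could look like, reusing the splitting logic from the question (the exact tokenizing is an assumption):
#!/usr/bin/perl
# text2triplet.pl - read sentences on STDIN, print one triplet per line
use strict;
use warnings;
while (my $line = <STDIN>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words > 3) {
        print "@words[0..2]\n";
        shift @words;
    }
}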
Database format: use DBD::SQLite and create table like this:
CREATE TABLE hash (triplet VARCHAR, count INTEGER DEFAULT 0);
CREATE INDEX idx1 ON hash (triplet);
CREATE INDEX idx2 ON hash (count);
INSERT your triplets into that table as you go, and increase count for duplicates. After data is processed, simply
SELECT * FROM hash
WHERE count > 20
ORDER BY count DESC
and print it out.
Then you can DROP your hash table or simply delete the whole SQLite database file altogether.
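For illustration, a minimal sketch of that flow with DBI (my own code, not a drop-in solution; it assumes the hash table above already exists in a triplets.db file and reads sentences on STDIN):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=triplets.db', '', '',
    { RaiseError => 1, AutoCommit => 0 });    # one transaction is much faster than autocommit
my $update = $dbh->prepare('UPDATE hash SET count = count + 1 WHERE triplet = ?');
my $insert = $dbh->prepare('INSERT INTO hash (triplet, count) VALUES (?, 1)');
while (my $line = <STDIN>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words > 3) {
        my $triplet = "@words[0..2]";
        # try to bump an existing row; insert it on first sight
        $insert->execute($triplet) if $update->execute($triplet) == 0;
        shift @words;
    }
}
$dbh->commit;
my $rows = $dbh->selectall_arrayref(
    'SELECT triplet, count FROM hash WHERE count > 20 ORDER BY count DESC');
print "$_->[0] - $_->[1]\n" for @$rows;
For very large inputs you would probably commit in batches rather than once at the end.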
Both of these approaches should allow you to scale to almost the size of your disk, but the database approach may be more flexible.

You have some problems with declaring and using variables. Please add the use strict pragma to your script. Use a lexical (my) variable when you work with the hash in the for block and elsewhere. I also notice that you have the statement if ($c{$key} > 20), but the hash values in your example are all <= 2.
#!/usr/bin/perl
use strict;
my %frequency;
while (my $line = <DATA>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words > 3) {
        $frequency{"@words[0,1,2]"}++;
        shift @words;
    }
}
# sort by values
for my $key (sort { $frequency{$b} <=> $frequency{$a} } keys %frequency) {
    printf "%s - %s\n", $key, $frequency{$key};
}
__DATA__
This is a good time to party because this is a vacation time.
OUTPUT
this is a - 2
to party because - 1
is a good - 1
time to party - 1
party because this - 1
because this is - 1
good time to - 1
is a vacation - 1
a good time - 1

Need serious help optimizing script for memory use

[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
Eliminates certain lines depending on the content of $item.
Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
Performs a reverse natural sort on the resulting array.
Prints the results to an output file.
The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it into the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.)
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions(
    "input|i=s"  => \$input_file_name,
    "output|o=s" => \$output_file_name,
);
# ENSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
    die
      "\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
    # STRIP EXTENSION FROM INPUT FILE NAME
    $in_basename = $input_file_name;
    $in_basename =~ s/(.+)\.(.+)/$1/;
    # GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
    $output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
  or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
    chomp;
    # PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
    ( $frequency, $item ) = split( /$split_char/, $_, 2 );
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
    # PROCESS INPUT LINES
    $F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
  || die "\nERROR: The output file ($output_file_name) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
    print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno
UPDATE: In my tests a hash takes about 2.5 times the memory that its data alone would take. However, the program's total size is consistently 3-4 times as large as its variables. This would turn a 6.3 GB data file into a ~15 GB hash, for a ~60 GB program, just as reported in the comments.
So 6.3 GB of data ends up as roughly 60 GB of process memory, so to say. This still improved the starting situation enough to work for the current problem, but it is clearly not a general solution. See the (updated) "Another approach" below for a way to run this processing without loading the whole hash.
There is nothing obvious here to cause an order-of-magnitude memory blowup. However, small errors and inefficiencies can add up, so let's clean up first. See other approaches at the end.
Here is a simple re-write of the core of the program, to try first.
# ... set filenames, variables
# ... set filenames, variables
open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "\nERROR: Can't open input file ($input_file_name): $!";
my %F;
while (<$fh_in>) {
    chomp;
    s/^\s*//;    # trim leading space
    my ($frequency, $item) = split /$split_char/, $_, 2;
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq ''
         or not defined $item or $item =~ /^\s*$/;
    # ... increment counters and aggregates and add to hash
    # (... any other processing?)
    $F{ lc($item) } += $frequency;
}
close $fh_in;
# Sort and print to file
# (Or better, write "value key-length key" and sort later. See comments.)
open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "\nERROR: Can't open output file ($output_file_name): $!";
foreach my $item ( sort {
        $F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b
    } keys %F )
{
    print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;
A few comments, let me know if more is needed.
Always add $! to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh rather than IN), it's better.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).
The sort here has to at least copy its input, and with multiple conditions more memory is needed.
That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.
Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.
Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.
But even this pushes the limits, due to the greatly increased memory footprint as compared to the data, since a hash takes 2.5 times the data size and the whole program is yet another 3-4 times as large.
Then find an algorithm to write the data line-by-line to the output file. That is simple to do here, since per the shown processing we only need to accumulate frequencies for each item:
use feature 'say';    # for say() below
open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "\nERROR: Can't open output file ($output_file_name): $!";
my $cumulative_freq;
while (<$fh_in>) {
    chomp;
    s/^\s*//;    # trim leading space only
    my ($frequency, $item) = split /$split_char/, $_, 2;
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq ''
         or not defined $item or $item =~ /^\s*$/;
    $cumulative_freq += $frequency;    # would-be hash value
    # Add a sort criterion, $item's length, helpful for later sorting
    say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item);
    #say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;
Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal
sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3- > result
The command sorts numerically in reverse order by the first field and then by the second (falling back to comparing the whole line after that). This is piped into cut, which keeps the first field and everything from the third field onward (with tab as the default delimiter), which is the needed result.
A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
I used Devel::Size to see memory used by variables.
Sorting input requires keeping all input in memory, so you can't do everything in a single process.
However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.
So just keep the program you have, but add something around it:
partition your input into buckets, e.g. by the first character or the first two characters
run your program on each bucket
concatenate the output in the right order
Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.
You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.
A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
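For illustration, a rough sketch of the partitioning step by first character (my own code, not the answerer's; the input and bucket file names are made up):
#!/usr/bin/perl
use strict;
use warnings;
my %bucket_fh;
open my $in, '<:encoding(utf8)', 'input.txt' or die "Can't open input.txt: $!";
while (my $line = <$in>) {
    my ($frequency, $item) = split /\t/, $line, 2;
    next unless defined $item and $item =~ /\S/;
    my $bucket = lc substr $item, 0, 1;    # crude partition key: first character of the item
    $bucket =~ s/\W/_/g;                   # keep bucket file names safe
    unless ($bucket_fh{$bucket}) {
        open $bucket_fh{$bucket}, '>:encoding(utf8)', "bucket_$bucket.txt"
            or die "Can't open bucket_$bucket.txt: $!";
    }
    print { $bucket_fh{$bucket} } $line;
}
close $in;
close $_ for values %bucket_fh;
You would then run the aggregation script on each bucket_*.txt and concatenate the per-bucket outputs in the desired order.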

Is it okay to keep huge data in Perl data structure

I am receiving some CSVs from client. The average size of these CSVs is 20 MB.
The format is:
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
My current approach:
I store all these records temporarily in a table, and then query in table:
where customer='customer1' and product='product1'
where customer='customer1' and product='product2'
where customer='customer2' and product='product1'
Problem : inserting in DB and then selecting takes too much time. A lot of stuff is happening and it takes 10-12 minutes to process one CSV. I am currently using SQLite and it is quite fast. But I think I'll save some more time if I remove the insertion and selection altogether.
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
The machine generally has 500MB+ free RAM.
If the query you show is the only kind of query you want to perform then this is rather straight-forward.
my $orders; # I guess
while (my $row = <DATA>) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{$fields[0]}->{$fields[1]} }, \@fields; # or as a hashref, but that's larger
}
print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] }; # typo in cuStomer
__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
You just build an index into a hash reference that is several levels deep. The first level has the customer. It contains another hashref, which has the list of rows that match this index. Then you can decide if you just want the whole thing as an array ref, or if you want to put a hash ref with keys there. I went with an array ref because that consumes less memory.
Later you can query it easily. I included that above. Here's the output.
Cutomer1
Product1
cat1
many
other
info
If you don't want to remember indexes but have to code a lot of different queries, you could make variables (or even constants) that represent the magic numbers.
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
# build $orders ...
my $res = $orders->{Cutomer1}->{Product2}->[0];
print "Category: " . $res->[CATEGORY];
The output is:
Category: cat2
To order the result, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that.
for my $res (
    sort { $a->[OTHER] cmp $b->[OTHER] }
    @{ $orders->{Customer2}->{Product1} }
) {
    # do stuff with $res ...
}
However, you can only search by Customer and Product like this.
If there is more than one type of query, this gets expensive. If you would also group them by category only, you would either have to iterate all of them every single time you look one up, or build a second index. Doing that is harder than waiting a few extra seconds, so you probably don't want to do that.
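For illustration, a minimal sketch of what such a second index could look like, built alongside $orders while reading the same __DATA__ section as above (my own illustration, not part of the original answer):
my ($orders, $by_category);
while (my $row = <DATA>) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }{ $fields[1] } }, \@fields;
    push @{ $by_category->{ $fields[2] } }, \@fields;    # same row ref, no extra copy
}
# all rows in cat1, regardless of customer or product
for my $row ( @{ $by_category->{cat1} || [] } ) {
    print join(',', @$row), "\n";
}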
I was wondering if it is okay to store this complete CSV in some complex Perl data structure?
For this specific purpose, absolutely. 20 Megabytes are not a lot.
I've created a test file that is 20004881 bytes and 447848 lines with this code, which is not perfect, but gets the job done.
use strict;
use warnings;
use feature 'say';
use File::stat;
open my $fh, '>', 'test.csv' or die $!;
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}
Here is an excerpt of the file:
$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info
Now let's run our above program with this input file and look at the memory consumption and some statistics of the size of the data structure.
use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
open my $fh, '<', 'test.csv' or die $!;
my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;
    $orders->{ $fields[0] }->{ $fields[1] } = \@fields;
}
say 'total size of $orders: ' . total_size($orders);
Here it is:
total size of $orders: 185470864
So that variable consumes 185 Megabytes. That's a lot more than the 20MB of CSV, but we have an easily searchable index. Using htop I figured out that the actual process consumes 287MB. My machine has 16G of memory, so I don't care about that. And with about 3.6s it's reasonably fast to run this program, but I have an SSD and a newish Core i7 machine.
But it will not eat all your memory if you have 500MB to spare. Likely an SQLite approach would consume less memory, but you have to benchmark the speed of this versus the SQLite approach to decide which one is faster.
I used the method described in this answer to read the file into an SQLite database1. I needed to add a header line to the file first, but that's trivial.
$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test
Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.
sqlite> CREATE INDEX foo ON test ( customer, product );
This felt like it took another one second. Now I could query.
sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info
The result appeared instantaneously (which is not scientific!). Since we didn't measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.
However, the SQLite file size is only 38839296, which is about 39MB. That's bigger than the CSV file, but not by a lot. It seems like the sqlite3 process only consumes about 30kB of memory, which I find weird given the index.
In conclusion, the SQLite approach seems to be a bit more convenient and eats less memory. There is nothing wrong with doing this in Perl, and it might be the same speed, but using SQL for this type of query feels more natural, so I would go with that.
If I might be so bold I would assume you didn't set an index on your table when you did it in SQLite and that made it take longer. The amount of rows we have here is not that much, even for SQLite. Properly indexed it's a piece of cake.
If you don't actually know what an index does, think about a phone book. It has the index of first letters on the sides of the pages. To find John Doe, you grab D, then somehow look. Now imagine there was no such thing. You need to randomly poke around a lot more. And then try to find the guy with the phone number 123-555-1234. That's what your database does if there is no index.
1) If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the DB, then use Perl's DBI to do the querying. As an example, sqlite3 foo.db <<<'.tables\ .tables' (where the backslash \ represents a literal linebreak) prints the list of tables twice, so importing like this will work, too.
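If you go that route, a minimal sketch of the querying side with DBI could look like this (assuming the test.db file and test table created above):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT * FROM test WHERE customer = ? AND product = ?');
$sth->execute('Customer23', 'Product1');
while (my $row = $sth->fetchrow_arrayref) {
    print join(',', @$row), "\n";
}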

Find doublet data in csv file

I'm trying to write a Perl script that can check if a csv file has doublet data in the last two columns. If doublet data is found, an additional column with the word "doublet" should be added:
Example, the original file looks like this:
cat,111,dog,555
cat,444,dog,222
mouse,333,dog,555
mouse,555,cat,555
The final output file should look like this:
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555
I'm very much a newbie to Perl scripting, so I won't expose myself with what I've written so far. I tried to read through the file, splitting the data into two arrays: one with the first two columns, and the other with the last two columns.
The idea was then to check for doublets in the second array, add (push?) the additional column with the "doublet" information to that array, and then afterwards merge the two arrays back together again(?)
Unfortunately my brain has now collapsed, and I need help from someone smarter than me, to guide me in the right direction.
Any help would be very much appreciated, thanks.
This is not the most efficient way, but here is something to get you started. The script assumes that your input data is comma separated and can contain any number of columns.
#!/usr/bin/env perl
use strict;
use warnings;
my %h;
my @lines;
while (<>) {
    chomp;
    push @lines, $_;    # save each line
    my @fields = split /,/, $_;
    if (@fields > 1) {
        $h{ join "", @fields[-2,-1] }++;    # keep track of how many times a doublet appears
    }
}
# go back through the lines. If a doublet appears 2 or more times, append ',doublet' to the output.
foreach (@lines) {
    my $d = "";
    my @fields = split /,/, $_;
    if (@fields > 1 && $h{ join "", @fields[-2,-1] } >= 2) {
        $d = ",doublet";
    }
    print $_, $d, $/;
}
The syntax @fields[-2,-1] is an array slice that returns an array with the last two column values. Then join("",...) concatenates them together, and this becomes the key to the hash. $/ is the input record separator, which is newline by default and is quicker to write than "\n".
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1, 2, 3, 4. Can anyone point me towards how I can begin to produce strings like that for all the ID numbers? My ideal output would be a new csv file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution; I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.
It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
    $csv->parse($line) or die "Invalid data line";
    my ($key, $val) = $csv->fields;
    push @{ $data{$key} }, $val;
}
for my $id (sort keys %data) {
    printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly props for seeking an approach not a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be;
use strict; # will save you big time in the long run
my %ids; # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle) {
    split line into ID and property variable # google the split function
    append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
    deduplicate properties
    print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the de-duplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out this question for a discussion on how to do the deduplication.
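For illustration, a minimal sketch of the hash-of-hashes idea for de-duplicating as you read (my own code; it assumes the same data.txt file name as the earlier answer):
use strict;
use warnings;
my %ids;
open my $fh, '<', 'data.txt' or die "Can't open data.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($id, $property) = split /,/, $line;
    $ids{$id}{$property} = 1;    # inner hash keys de-duplicate automatically
}
close $fh;
for my $id (sort keys %ids) {
    print "$id," . join(';', sort { $a <=> $b } keys %{ $ids{$id} }) . "\n";
}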
Well, open the file as stdin in perl, assume each row has two columns, then iterate over all lines using the left column as the hash key and gathering the right column into an array referenced by that key. At the end of the input file you'll have a hash of arrays; iterate over it, printing each hash key and its array elements separated by ";" or any other separator you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
    chomp;
    my ($key, $value) = split /,/;
    push @{$hash{$key}}, $value;
}
foreach my $key (sort keys %hash)
{
    print $key . "," . join(";", @{$hash{$key}}) . "\n";
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks if we have more than zero fields and, if so, appends $2 to the entry keyed on the ID ($1). This is done for every line.
The END statement will print these pairs out in an unspecified order. If you want to sort them, you have the option of the asorti gnu awk function or piping the output of this snippet to sort -t, -k1n,1n.

How can I merge lines in a large, unsorted file without running out of memory in Perl?

I have a very large column-delimited file coming out of a database report, in a format something like this:
field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2
I want the new file to combine lines like these so it would look something like this:
field1,field2,field3,value1,value2
I'm able to do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a certain order to be the value. After I've read in the file, I just print out the hash table's keys and values into another file. Works fine.
However, I have some concerns since my file is going to be very large. About 8 GB per file.
Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank in terms of a solution that would work but wouldn't shove everything into, ultimately, a very large hash.
For full-disclosure, I'm using ActiveState Perl on Windows.
If your rows are sorted on the key, or for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines and if the fields are the same as the previous line, emit both values.
Otherwise, at least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value -- this should substantially limit your memory usage.
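A minimal sketch of that delete-on-second-value idea (my own illustration; the column layout and metric names are taken from the question's example):
use strict;
use warnings;
my %pending;    # key => the first value seen for that key
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (exists $pending{$key}) {
        # second value for this key: emit the pair and forget it, keeping the hash small
        my ($a, $b) = $metric eq 'metricB'
            ? ($pending{$key}, $value)
            : ($value, $pending{$key});
        print "$key,$a,$b\n";
        delete $pending{$key};
    }
    else {
        $pending{$key} = $value;
    }
}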
If you had other Unix like tools available (for example via cygwin) then you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the sorted format.
Once the file is sorted, doing this sort of merge is then easy - iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the keys change.
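A rough sketch of that merge pass (my own illustration; it assumes the file is already sorted on the first three fields, so the metricA line comes before the metricB line for each key):
use strict;
use warnings;
my ($prev_key, @values);
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (defined $prev_key and $key ne $prev_key) {
        print join(',', $prev_key, @values), "\n";    # key changed: flush the previous group
        @values = ();
    }
    $prev_key = $key;
    push @values, $value;
}
print join(',', $prev_key, @values), "\n" if defined $prev_key;    # flush the last group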
If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:
use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data';
while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly
    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";
    if ($columns[3] eq 'A') {
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'B') {
        $data{$b_key} = $columns[4];
    }
    if (exists $data{$a_key} && exists $data{$b_key}) {
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
Would it not be better to make another export directly from the database into your new file instead of reworking the file you have already output? If this is an option then I would go that route.
You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.