I keep learning about hashes and the various things you can do with them.
Today I have this question: how do I sort a hash by value when it has two keys, and how do I print it out?
I have a CSV file. I'm trying to store values in a hash and sort it by value, so that I can print the biggest and the smallest value; I also need the date on which that value occurred.
So far I can print the hash, but I can't sort it.
#!/usr/bin/perl
#find openMin and openMax.
use warnings;
use strict;
my %pick;
my $key1;
my $key2;
my $value;
my $file= 'msft2.csv';
my $lines = 0;
my $date;
my $mm;
my $mOld = "";
my $open;
my $openMin;
my $openMax;
open (my $fh,'<', $file) or die "Couldnt open the $file:$!\n";
while (my $line=<$fh>)
{
my @columns = split(',',$line);
$date = $columns[0];
$open = $columns[1];
$mm = substr ($date,5,2);
if ($lines>=1) { #first line of the file contains column names which I
$key1 = $date;  #don't need; the data itself begins with the second line
$key2 = "open";
$value = $open;
$pick{$key1}{"open"}=$value;
}
$lines++;
}
foreach $key1 (sort keys %pick) {
foreach $key2 (keys %{$pick{$key1}}) {
$value = $pick{$key1}{$key2};
print "$key1 $key2 $value \n";
}
}
exit;
1. Use a real CSV parser
Parsing a CSV with split /,/ works fine...unless one of your fields contains a comma. If you are absolutely, positively, 100% sure that your code will never, ever have to parse a CSV with a comma in one of the fields, feel free to ignore this. If not, I'd recommend using Text::CSV. Example usage:
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, "<", $file or die "Failed to open $file: $!";
while (my $line = $csv->getline($fh)) {
print @$line, "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
2. Sorting
I only see one secondary key in your hash: open. If you're trying to sort based on the value of open, do something like this:
my %hash = (
foo => { open => "date1" },
bar => { open => "date2" },
);
foreach my $key ( sort { $hash{$a}{open} cmp $hash{$b}{open} } keys %hash ) {
print "$key $hash{$key}{open}\n";
}
(This assumes that the values you're sorting are not numeric. If they are numeric (e.g. 3, -17.57), use the spaceship operator <=> instead of the string comparison operator cmp. See perldoc -f sort for details and examples.)
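Applied to the %pick hash from the question, where the values stored under "open" are numeric prices, a minimal sketch of such a numeric sort (assuming %pick has already been filled as in your loop) looks like this:
# Sort the dates by their numeric open price, then pick the extremes.
my @by_open = sort { $pick{$a}{open} <=> $pick{$b}{open} } keys %pick;
my ($min_date, $max_date) = ($by_open[0], $by_open[-1]);
print "lowest open:  $min_date $pick{$min_date}{open}\n";
print "highest open: $max_date $pick{$max_date}{open}\n";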
EDIT: You haven't explained what format your dates are in. If they are in YYYY-MM-DD format, sorting as above will work, but if they're in MM-DD-YYYY format, for example, 01-01-2014 would come before 12-01-2013. The easiest way to take care of this is to reorder the components of your date from most to least significant (i.e. year followed by month followed by day). You can do this using Time::Piece like this:
use Time::Piece;
my $date = "09-26-2013";
my $t = Time::Piece->strptime($date, "%m-%d-%Y");
print $t->strftime("%Y-%m-%d");
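Putting the two together, here is a sketch (assuming your CSV dates really are in MM-DD-YYYY form; adjust the format string otherwise) that normalizes each date before using it as a hash key, so a plain string sort of the keys is also a chronological sort:
use Time::Piece;
# Sketch: convert MM-DD-YYYY to YYYY-MM-DD before storing.
my $t   = Time::Piece->strptime($date, "%m-%d-%Y");
my $key = $t->strftime("%Y-%m-%d");   # most to least significant
$pick{$key}{open} = $open;            # now "sort keys %pick" is chronological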
Another tidbit: in general you should only declare variables right before you use them. You gain nothing by declaring everything at the top of your program except decreased readability.
You could concatenate key1 and key2 into a single key as:
$key = "$key1 key2";
$pick{$key} = $value;
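A minimal sketch of how that flattened hash could then be sorted by value and split back into its two parts (assuming neither part of the key contains a space):
# Sort the flattened hash numerically by value.
for my $key ( sort { $pick{$a} <=> $pick{$b} } keys %pick ) {
    my ($date, $field) = split ' ', $key, 2;   # recover the two original keys
    print "$date $field $pick{$key}\n";
}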
I am currently writing a Perl script in which I have a reference to an array (students) of hash references. I read the records, add the hash references to the array of students, and then ask the user how to sort them. This is where it gets confusing: I do not know how to dereference the sorted array. Using Dumper I can get the sorted array, but in an unorganized output. How can I dereference the array of hash references after sorting?
#!bin/usr/perl
use strict;
use warnings;
use Data::Dumper;
use 5.010;
# reference to a scalar: $r = \$var ; dereferencing: $$r
# reference to an array: $r = \@var ; dereferencing: @$r
# reference to a hash:   $r = \%var ; dereferencing: %$r
my $filename = $ARGV[0];
my $students = [];
open ( INPUT_FILE , '<', "$filename" ) or die "Could not open to read \n ";
sub readLines{
while(my $currentLine = <INPUT_FILE>){
chomp($currentLine);
my @myLine = split(/\s+/,$currentLine);
my %temphash = (
name => "$myLine[0]",
age => "$myLine[1]",
GPA => "$myLine[2]",
MA => "$myLine[3]"
);
pushToStudents(\%temphash);
}
}
sub pushToStudents{
my $data = shift;
push $students ,$data;
}
sub printData{
my $COMMAND = shift;
if($COMMAND eq "sort up"){
my #sortup = sort{ $a->{name} cmp $b->{name} } @$students;
print Dumper @sortup;
}elsif($COMMAND eq "sort down"){
my @sortdown = sort{ $b->{name} cmp $a->{name} } @$students;
print Dumper @sortdown;
# find a way to dereference so as to make a more organized, user-friendly read.
}else{
print "\n quit";
}
}
readLines();
# Output is in random order; the ordering of each user's data is random
printf"please choose display order : ";
my $response = <STDIN>;
chomp $response;
printData($response);
The problem here is that you're expecting Dumper to provide an organised output. It doesn't. It dumps a data structure to make debugging easier. The key problem is that hashes are explicitly unordered data structures - they're key-value mappings, and they don't produce any output order.
With reference to perldata:
Note that just because a hash is initialized in that order doesn't mean that it comes out in that order.
And specifically the keys function:
Hash entries are returned in an apparently random order. The actual random order is specific to a given hash; the exact same series of operations on two hashes may result in a different order for each hash.
There is a whole section in perlsec which explains this in more detail, but suffice to say - hashes are in random order, which means that whilst you're sorting your students by name, the key-value pairs for each student aren't sorted.
I would suggest instead of:
my @sortdown = sort{ $b->{name} cmp $a->{name} } @$students;
print Dumper @sortdown;
You'd be better off with using a slice:
my @field_order = qw ( name age GPA MA );
foreach my $student ( sort { $b -> {name} cmp $a -> {name} } @$students ) {
print @{$student}{@field_order}, "\n";
}
Arrays (@field_order) are explicitly ordered, so you will always print your student fields in the same sequence. (Haven't fully tested for your example I'm afraid, because I don't have your source data, but this approach works with a sample data snippet).
If you do need to print the keys as well, then you may need a foreach loop instead:
foreach my $field ( #field_order ) {
print "$field => ", $student->{$field},"\n";
}
Or perhaps the more terse:
print "$_ => ", $student -> {$_},"\n" for #field_order;
I'm not sure I like that as much though, but that's perhaps a matter of taste.
The essence of your mistake is to assume that hashes will have a specific ordering. As @Sobrique explains, that assumption is wrong.
I assume you are trying to learn Perl, and therefore, some guidance on the basics will be useful:
#!bin/usr/perl
Your shebang line is wrong: On Windows, or if you run your script with perl script.pl, it will not matter, but you want to make sure the interpreter that is specified in that line uses an absolute path.
Also, you may not always want to use the perl interpreter that came with the system, in which case #!/usr/bin/env perl may be helpful for one-off scripts.
use strict;
use warnings;
use Data::Dumper;
use 5.010;
I tend to prefer version constraints before pragmata (except in the case of utf8). Data::Dumper is a debugging aid, not something you use for human readable reports.
my $filename = $ARGV[0];
You should check if you were indeed given an argument on the command line as in:
@ARGV or die "Need filename\n";
my $filename = $ARGV[0];
open ( INPUT_FILE , '<', "$filename" ) or die "Could not open to read \n ";
File handles such as INPUT_FILE are called bareword filehandles. These have package scope. Instead, use lexical filehandles whose scope you can restrict to the smallest appropriate block.
There is no need to interpolate $filename in the third argument to open.
Always include the name of the file and the error message when dying from an error in open. Surrounding the filename with ' ' helps you identify any otherwise hard to detect characters that might be causing the problem (e.g. a newline or a space).
open my $input_fh, '<', $filename
or die "Could not open '$filename' for reading: $!";
sub readLines{
This is reading into an array you defined in global scope. What if you want to use the same subroutine to read records from two different files into two separate arrays? readLines should receive a filename as an argument, and return an arrayref as its output (see below).
while(my $currentLine = <INPUT_FILE>){
chomp($currentLine);
In most cases, you want all trailing whitespace removed, not just the line terminator.
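For example (a sketch, not taken from the original code), a single substitution removes the line terminator and any other trailing whitespace:
# Strip the newline and any other trailing whitespace in one go.
$currentLine =~ s/\s+\z//;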
my @myLine = split(/\s+/,$currentLine);
split on /\s+/ is different than split ' '. In most cases, the latter is infinitely more useful. Read about the differences in perldoc -f split.
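A small illustration of the difference, assuming a line with leading whitespace:
my $line = "  Alice 20 3.9 Math";
my @a = split /\s+/, $line;  # ("", "Alice", "20", "3.9", "Math") - leading empty field
my @b = split ' ',   $line;  # ("Alice", "20", "3.9", "Math")     - leading whitespace ignored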
my %temphash = (
name => "$myLine[0]",
age => "$myLine[1]",
GPA => "$myLine[2]",
MA => "$myLine[3]"
);
Again with the useless interpolation. There is no need to interpolate those values into fresh strings (except maybe in the case where they might be objects that overload stringification, but in this case you know they are just plain strings).
pushToStudents(\%temphash);
No need for the extra pushToStudents subroutine in this case, unless this is a stub for a method that will later be able to load the data to a database or something. Even in that case, it would be better to provide a callback to the function.
sub pushToStudents{
my $data = shift;
push $students ,$data;
}
You are pushing data to a global variable. A program where there can only ever be a single array of student records is not useful.
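As a sketch of the callback idea mentioned above (read_lines and its callback are hypothetical names, not part of the revised program below):
# Hypothetical sketch: the reader takes a filehandle and a callback,
# so the caller decides where each record goes.
sub read_lines {
    my ($fh, $callback) = @_;
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;
        $callback->({ name => $fields[0], age => $fields[1],
                      GPA  => $fields[2], MA  => $fields[3] });
    }
}
my @students;
read_lines($input_fh, sub { push @students, shift });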
sub printData{
my $COMMAND = shift;
if($COMMAND eq "sort up"){
Don't do this. Every subroutine should have one clear purpose.
Here is a revised version of your program.
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Carp qw( croak );
run(\@ARGV);
sub run {
my $argv = $_[0];
@$argv
or die "Need name of student records file\n";
open my $input_fh, '<', $argv->[0]
or croak "Cannot open '$argv->[0]' for reading: $!";
print_records(
read_student_records($input_fh),
prompt_sort_order(),
);
return;
}
sub read_student_records {
my $fh = shift;
my @records;
while (my $line = <$fh>) {
last unless $line =~ /\S/;
my @fields = split ' ', $line;
push @records, {
name => $fields[0],
age => $fields[1],
gpa => $fields[2],
ma => $fields[3],
};
}
return \@records;
}
sub print_records {
my $records = shift;
my $sorter = shift;
if ($sorter) {
$records = [ sort $sorter @$records ];
}
say "@{ $_ }{ qw( age name gpa ma )}" for @$records;
return;
}
sub prompt_sort_order {
my @sorters = (
[ "Input order", undef ],
[ "by name in ascending order", sub { $a->{name} cmp $b->{name} } ],
[ "by name in descending order", sub { $b->{name} cmp $a->{name} } ],
[ "by GPA in ascending order", sub { $a->{gpa} <=> $b->{gpa} } ],
[ "by GPA in descending order", sub { $b->{gpa} <=> $a->{gpa} } ],
);
while (1) {
print "Please choose the order in which you want to print the records\n";
print "[ $_ ] $sorters[$_ - 1][0]\n" for 1 .. #sorters;
printf "\n\t(%s)\n", join('/', 1 .. #sorters);
my ($response) = (<STDIN> =~ /\A \s*? ([1-9][0-9]*?) \s+ \z/x);
if (
$response and
($response >= 1) and
($response <= @sorters)
) {
return $sorters[ $response - 1][1];
}
}
# should not be reached
return;
}
I need to sort hash keys using Perl, and I also need to allow duplicates in the keys. So I planned to check with exists: if the key already exists, I increment the last digit and then store it into the hash.
I tried the following code:
use strict;
use warnings;
use iPerl::Basic qw(_save_file _open_file);
my $xml = $ARGV[0];
my ($xmlcnt,$backcnt,$refcnt,$name,$year) = "";
my %sort = ();
if(($#ARGV != 0) or(not -f "$xml") or($xml!~ m{\.xml$}i)){
print_exit("\t\tSYSTAX ERROR: <EXE> <xml File>\n\n")
};
$xmlcnt=_open_file($xml);
$xmlcnt =~ s{<back(?: [^>]+)?>(?:(?!</?back[ >]).)*</back>}{
$backcnt = $&;
while($backcnt =~ m{<ref(?: [^>]+)?>(?:(?!<ref[ >]).)*</ref>}igs){
$refcnt = $&;
$name = $1 if($refcnt =~ m{<person-group(?: [^>]+)?>((?:(?!</?person-group[ >]).)*)</person-group>}is);
$year = $1 if($refcnt =~ m{<year>((?:(?!</?year[ >]).)*)</year>}is);
$name =~ s{</?(?:string-name|surname|given-names)>}{}ig;
my $count = 1;
my $keys="$name $year\E$count";
if(exists ($sort{$keys})){
$keys =~ s{(\d)$}{my $icr=$1;$icr++;qq($icr)}e;
#print"$keys\n";
$sort{$keys}="$refcnt";
}
else
{
$sort{$keys}="$refcnt";
}
print join("\n",keys %sort);
}
qq($backcnt)
}igse;
my @keys = sort {
$sort{$a} <=> $sort{$b}
# or
# "\L$a" cmp "\L$b"
} keys %sort;
# print join("\n",@keys);
sub print_exit {
my $msg = shift;
#print "\n$msg";
exit;
}
Please can anyone tell me what went wrong here?
input:
thieooieroh
apple
apple
highefhfe
bufghifeh
output:
apple
apple
bufghifeh
highefhfe
thieooieroh
Thanks in advance.
From a very brief look at your code, it appears that you want to store refcounts as the values in your hash, with the ability to have multiple counts for a single key. This is easily doable by using a hash of arrays (commonly abbreviated to HoA). Each key must, by definition, be unique, but the associated value can be a reference, allowing you to store multiple items under that key, or to build even more complex data structures.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
my %hash;
while (my $line = <DATA>) {
chomp $line;
my ($key, $count) = split ',', $line;
push @{$hash{$key}}, $count;
}
for my $key (sort keys %hash) {
my $values = $hash{$key};
for (@$values) {
say "$key ($_)";
}
}
__DATA__
thieooieroh,1
apple,2
apple,3
highefhfe,4
bufghifeh,5
Output:
apple (2)
apple (3)
bufghifeh (5)
highefhfe (4)
thieooieroh (1)
If you're not actually concerned with storing multiple data items with each key, but only with the number of times each key appears, it's even simpler. Change the two loops in the above code to:
while (my $line = <DATA>) {
chomp $line;
$hash{$line}++;
}
for my $key (sort keys %hash) {
say $key for 1 .. $hash{$key};
}
and you get the output
apple
apple
bufghifeh
highefhfe
thieooieroh
As for the rest of your posted code, don't try to parse XML with regexes. Arbitrary XML cannot be parsed beyond a very crude first approximation by regular expressions because XML is not structurally "regular". There are many fine XML-parsing modules on CPAN which will parse your XML correctly for you, while also requiring far less effort from you than trying to write your own parser. Use one of them. Not regexes.
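For instance, here is a sketch using XML::LibXML (the element names come from the regexes in your code; the exact structure of your XML is an assumption):
use XML::LibXML;
# Sketch: parse the document and walk the <ref> elements inside <back>.
my $doc = XML::LibXML->load_xml( location => $xml );
for my $ref ( $doc->findnodes('//back//ref') ) {
    my $name = $ref->findvalue('.//person-group');  # text content, tags already stripped
    my $year = $ref->findvalue('.//year');
    print "$name $year\n";
}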
I have two files.
One is composed of a unique list of names, while the other one is a redundant list of names with ages.
For example:
File1:      File2:
Gaia        Gaia 3
Matt        Matt 12
Jane        Gaia 89
            Reuben 4
My aim is to match File1 and File2 and to retrieve the highest age for each name.
So far I have written the below code.
The bit that does not work quite well is: when the same key is already in the hash, keeping only the bigger value.
Any suggestion/comment is welcome!
Thanks!!
#!/usr/bin/perl -w
use strict;
open (FILE1, $ARGV[0] )|| die "unable to open arg1\n"; #Opens first file for comparison
open (FILE2, $ARGV[1])|| die "unable to open arg2\n"; #2nd for comparison
my @not_red = <FILE1>;
my @exonslength = <FILE2>;
#2) Produce a hash of File2. If the key is already in the hash, keep the key-value pair with the highest value. Otherwise, next.
my %hash_doc2;
my @split_exons;
my $key;
my $value;
foreach my $line (#exonslength) {
@split_exons = split "\t", $line;
@hash_doc2 {$split_exons[0]} = ($split_exons[1]);
if (exists $hash_doc2{$split_exons[0]}) {
if ( $hash_doc2{$split_exons[0]} > values %hash_doc2) {
$hash_doc2{$split_exons[0]} = ($split_exons[1]);
} else {next;}
}
}
#3) grep the non redundant list of gene from the hash with the corresponding value
my @a = grep (@not_red,%hash_doc2);
print "@a\n";
Do you need to keep all the values? If not, you can keep only the max value:
@split_exons = split "\t", $line;
if (not exists $hash_doc2{$split_exons[0]}
or $hash_doc2{$split_exons[0]} < $split_exons[1]) {
$hash_doc2{$split_exons[0]} = $split_exons[1];
}
Your code does not keep all the values, either. You cannot store an array into a hash value, you have to store a reference. Adding a new value to an array can be done by push:
push @{ $hash_doc2{$split_exons[0]} }, $split_exons[1];
Your use of numeric comparison against values is also not doing what you think. The < operator imposes a scalar context, so values returns the number of values. Another option would be to store the values sorted and always ask for the highest value:
$hash_doc2{$split_exons[0]} = [ sort { $a <=> $b } @{ $hash_doc2{$split_exons[0]} }, $split_exons[1] ];
# max for $x is at $hash_doc2{$x}[-1]
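Alternatively (not part of the original suggestion), List::Util's max can pick the highest stored value without keeping the array sorted:
use List::Util qw(max);
# Sketch: keep every value, ask for the maximum only when you need it.
push @{ $hash_doc2{$split_exons[0]} }, $split_exons[1];
my $highest = max @{ $hash_doc2{$split_exons[0]} };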
Instead of reading in the whole of file2 into an array (which will be bad if it's big), you could loop through and process the data file line by line:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open( my $nameFh, '<', $ARGV[0]);
open( my $dataFh, '<', $ARGV[1]);
my $dataHash = {};
my $processedHash = {};
while(<$dataFh>){
chomp;
my ( $name, $age ) = split /\s+/, $_;
if(! defined($dataHash->{$name}) or $dataHash->{$name} < $age ){
$dataHash->{$name} = $age
}
}
while(<$nameFh>){
chomp;
$processedHash->{$_} = $dataHash->{$_} if defined $dataHash->{$_};
}
print Dumper($processedHash);
I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is that there are multiple values for a key, but only one is being returned. Is there a way to loop through the hash to get the 1-10 additional values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
The reason I put that last one in is that the second file has a lot of values I don't want, but I want all the values from file 1. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while(<file2>){
my($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
$value4 = "$value1|$value2|$value3";
$file2Hash{$key} = $value4;
}
while(<file1>){
my ($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~/(.+)\|(.+)\|(.+)\|(.+)/;
if (exists $file2Hash{$key}) {
print $line."|".$file2Hash{$key}."\n";
}
else {
print $line."\n";
}
}
Thank you for any help you may provide,
Your overall idea is sound. However in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push @{$file2Hash{$key}}, $value4;
The @{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}){
foreach my $second_value (@{$file2Hash{$key}}) {
print "$line|$second_value\n";
}
} else {
print $line."\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need, you should just use FILE1.txt to create the hash from (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
my ($key, $rest) = split /\|/, $_, 2;
chomp $rest;
$file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
my ($key, $rest) = split /\|/, $_, 2;
if (exists $file1hash{$key}) {
chomp $rest;
printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
}
}
close file2;
exit 0;
Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line:
"one,,three,,42"
where the 0-indexed $position = 4 holds the primary key, 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file and check if the primary key is in the hash; if it is, I get the line from the file1 array, add the columns I need from the first array to the second array, and then concatenate it to the end of the output string. Then I delete the hash entry, and at the very end dump the entire thing to file. (I am using an SSD so I want to minimise file writes.)
It is probably best explained with the code:
sub merge_file2{
my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my %line_for;
my @file1array;
open FILE1, "<$file1";
print "$file1 opened\n";
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
$file1array[$.] = $_; #store line in file1array.
}
close FILE1;
print "$file2 opened - merging..\n";
open FILE2, "<", $file2;
my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
while (<FILE2>){
print "$.\n" if ($.%1000) == 0;
chomp;
my @array1 = ();
my @array2 = ();
my @array2 = split /,/, $_; #split 2nd csv line by commas
my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
# ^ ^ ^
# prev line lookup line in 1st file,lookup hash, pos of key
#my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
foreach(@from1to2){
$array2[$_] = $array1[$_];
}
my $outstring = join ",", @array2;
$OUTSTRING.=$outstring."\n";
delete $line_for{$array2[$position]};
}
close FILE2;
print "adding rest of lines\n";
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
print "writing file $out\n\n\n";
write_line($out,$OUTSTRING);
}
The first while loop is fine and takes less than 1 minute; however, the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think a lot of speedup is possible? :) Thanks in advance.
Solution:
sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;
my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
if ($.==1){
$header = $_;
next;
}
print "$.\n" if ($.%100000) == 0;
chomp;
push @file1, [split ',', $_];
}
close FILE1;
open FILE2, "<$file2" or die;
while (<FILE2>){
next if $.==1;
print "$.\n" if ($.%100000) == 0;
chomp;
push @file2, [split ',', $_];
}
close FILE2;
print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
print "sorted\n";
@file1 = undef;
@file2 = undef;
#foreach my $line (@file1){print "\t [ @$line ],\n"; }
my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
my $key1 = $sortedf1[$i][$position];
my $key2 = $sortedf2[$j][$position];
if ($key1 eq $key2){
foreach(0..$hsize){ #header size.
$sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
}
$i++;
$j++;
}
elsif ( $key1 < $key2){
push(@sortedf2,[@{$sortedf1[$i]}]);
$i++;
}
elsif ( $key1 > $key2){
$j++;
}
}
#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }
print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(@sortedf2){
print OUT (join ",", @{$_})."\n";
}
close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
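As a sketch of the first technique, using DBI with DBD::SQLite (the table layout and file names here are assumptions for illustration; your real files have more columns):
use DBI;
use Text::CSV;
# Sketch: load both CSVs into SQLite, then let the database do the join.
my $dbh = DBI->connect("dbi:SQLite:dbname=merge.db", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE f1 (c0, c1, c2, c3, k)");
$dbh->do("CREATE TABLE f2 (c0, c1, c2, c3, k)");
my $csv = Text::CSV->new({ binary => 1 });
for my $spec ([ f1 => "file1.csv" ], [ f2 => "file2.csv" ]) {
    my ($table, $file) = @$spec;
    my $sth = $dbh->prepare("INSERT INTO $table VALUES (?,?,?,?,?)");
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $row = $csv->getline($fh)) {
        $sth->execute(@$row);        # assumes five fields per line
    }
    close $fh;
}
# An index on the join key lets SQLite optimise the join.
$dbh->do("CREATE INDEX idx_f2_k ON f2 (k)");
my $rows = $dbh->selectall_arrayref(
    "SELECT f1.c0, f1.c1, f2.c2, f2.c3, f1.k FROM f1 JOIN f2 ON f1.k = f2.k"
);
print join(',', @$_), "\n" for @$rows;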
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;
my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;
open my $fh, '>', 'output.txt' or die $!;
if ($method eq 'concat'){ # Dog slow. Gets slower as @data gets big.
my $outstring;
for my $i (0 .. $#data){
print STDERR $i, "\n" if $i % 1000 == 0;
$outstring .= $data[$i];
}
print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
print $fh join('', @data);
}
else { # Fast
print $fh @data;
}
If you want to merge, you should really merge. First of all, you have to sort your data by key and then merge! You will beat even MySQL in performance; I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;
die "Insufficient number of parameters" if #ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
my $row;
if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) { # merge rows
$row = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
$row1 = $csv->getline($file1);
$row2 = $csv->getline($file2);
}
elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
$row = $row1;
$row1 = $csv->getline($file1);
}
else {
$row = $row2;
$row2 = $csv->getline($file2);
}
$csv->print( *STDOUT, $row );
}
# flush possible tail
while ( $row1 ) {
$csv->print( *STDOUT, $row1 );
$row1 = $csv->getline($file1);
}
while ( $row2 ) {
$csv->print( *STDOUT, $row2 );
$row2 = $csv->getline($file2);
}
close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
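A minimal sketch of that change to the question's second loop (variable and sub names taken from the original code; the body of the loop is elided):
my @outlines;
while (<FILE2>) {
    # ... build $outstring exactly as before ...
    push @outlines, $outstring . "\n";
}
# One join at the very end instead of millions of concatenations.
write_line($out, join('', @outlines));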
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using a hash, your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line, and this bit is slow. Removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not too much I think; there is always going to be a read-ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
my($path,$record) = @_;
open my $fh, "<", $path or croak "$0: open $path: $!";
my $csv = Text::CSV->new;
local $_;
while (<$fh>) {
if ($csv->parse($_)) {
my @f = map length($_) ? $_ : undef, $csv->fields;
next unless @f >= 1;
my $primary = pop @f;
if ($record->{$primary}) {
$record->{$primary}[$_] //= $f[$_]
for 0 .. $#{ $record->{$primary} };
}
else {
$record->{$primary} = \@f;
}
}
else {
warn "$0: $path:$.: parse failed; skipping...\n";
next;
}
}
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};