I have the following file:
a b 5
c d 6
e f 7
g h 4
i j 3
k l 10
and I want to find the line that holds the minimum value in the third column and erase it from the initial file. After this, I want to repeat the process: find the line with the new minimum and remove it in the same way, two more times.
The output file should be
c d 6
e f 7
k l 10
I tried to write the following code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $file1 = "try.dat";
open my $fg, "<", $file1 or die "Error during $file1 opening: $!";

my @vector;
while (<$fg>) {
    push @vector, [ split ];
}

my $minimum = $vector[0][2];
my @blacklist;
for my $i (0 .. $#vector) {
    if ($minimum > $vector[$i][2]) {
        $minimum = $vector[$i][2];
        push @blacklist, @vector[$i+1];
    }
}

#say "my minimum is $minimum";
#say "the blacklist is composed by @blacklist";
I don't know how to erase the elements contained in @blacklist (which in the first case should be i j 3), or how to iterate the whole thing.
Any help with the iteration?
This sort of thing is what Tie::File was made for. It allows you to modify the file in place by modifying a tied array.
This program does what you want. The helper function minidx returns the first index of the element of the passed array that holds the smallest value.
The program works by copying the third field of the file records into array @field3 and finding the index of the smallest value in there. The element at that index is then deleted from both the file and @field3 using splice.
use strict;
use warnings;

use Tie::File;

tie my @file, 'Tie::File', 'file.txt' or die $!;

my @field3 = map { (split)[2] } @file;

for (1 .. 3) {
    my $i = minidx(\@field3);
    splice @file, $i, 1;
    splice @field3, $i, 1;
}

sub minidx {
    my ($arr) = @_;
    my ($i, $v);
    for (0 .. $#$arr) {
        ($i, $v) = ($_, $arr->[$_]) unless defined $v and $arr->[$_] >= $v;
    }
    return $i;
}
output
c d 6
e f 7
k l 10
When you say @blacklist = @vector you are adding the entire @vector array to the blacklist. You probably want to do a push @blacklist, $vector[$i]. That will push the array reference into @blacklist.
Now @blacklist has an array ref in it, so you have to dereference it to print it:
say "the blacklist is composed by @{$blacklist[0]}";
Edit: For iterating and writing:
I would skip the @blacklist array (unless you need it for something else) and remove the min values from @vector. Then you can write @vector to some file.
my $num_elts_to_remove = 3;
for (my $j = 0; $j < $num_elts_to_remove; $j++) {
    my $minimum = $vector[0][2];
    my $min_idx = 0;
    for my $i (0 .. $#vector) {
        if ($minimum > $vector[$i][2]) {
            $minimum = $vector[$i][2];
            $min_idx = $i;
        }
    }
    push @blacklist, $vector[$min_idx];
    splice @vector, $min_idx, 1;    # remove the row with the smallest value
}
Now write the array to a file
open OUT, ">", $outfile or die "Error: $!";
foreach (@vector) {
    print OUT join " ", @$_;
    print OUT "\n";
}
close(OUT);
Prints:
c d 6
e f 7
k l 10
Taking Borodin's Tie::File suggestion even further: I have written a cute module called Tie::Array::CSV which allows you to treat a delimited file as an array (and because it uses Tie::File underneath, it is both readable and writable). Because of this I can use Perlish operations like map and sort (and a Schwartzian transform!) to perform this task:
#!/usr/bin/env perl
use strict;
use warnings;

use Tie::Array::CSV;

tie my @data, 'Tie::Array::CSV', 'data', sep_char => ' ';

# get a list of row ids sorted by last value (ascending)
my $i = 0;
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $i++, $_->[-1] ] }
    @data;

# splice out the rows with the three lowest values (from bottom to top)
splice @data, $_, 1 for reverse sort { $a <=> $b } @sorted[0..2];
Note that at the end you want to remove rows from the bottom up, so that the earlier indices stay valid and you don't have to reindex after every splice.
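To illustrate why the bottom-up order matters, here is a tiny standalone sketch (made-up rows, not tied to a file): deleting the higher index first means the lower index still points at the element you intended to remove.

use strict;
use warnings;

my @rows    = ('a', 'b', 'c', 'd', 'e', 'f');
my @to_drop = (1, 4);    # indices we want to remove

# Removing from the highest index down leaves the earlier indices valid.
splice @rows, $_, 1 for reverse sort { $a <=> $b } @to_drop;

print "@rows\n";    # prints: a c d f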
I have 3 (or more) files I need to merge; the data looks like this:
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4,
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried the following, but the data in the 4th file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";

foreach $file (@ARGV)
{
    print "$file\n";
    open (IN, "$file") || die "cannot open file $!";
    open (OUT, ">>$outFile") || die "cannot open file $!";
    while ( <IN> )
    {
        chomp $_;
        ($timestamp,$data) = split (/\./,$_);
        $hash{$timeStamp}{'data'} = $data;
        if (defined $hash{$timeStamp})
        {
            print "$_\n";
            print OUT "$_\n";
        }
    }
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but unix utilities should be able to handle this just fine.
cat the 3 files together.
use sort to sort the merged file.
However, using Perl, you could just do the following:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
push @data, $_ while (<>);

# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient both in memory and speed if you're able to take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next one in order each loop of the while:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;

# Initialize file handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;

# First line of each file
my @data = map {scalar <$_>} @fhs;

# Loop while a next line exists
while (@data) {
    # Pull out the next entry.
    my $index = (sort {$data[$a] cmp $data[$b]} (0 .. $#data))[0];
    print $data[$index];

    # Fill in next data at index.
    if (! defined($data[$index] = readline $fhs[$index])) {
        # End of that file
        splice @fhs, $index, 1;
        splice @data, $index, 1;
    }
}
Using Miller's idea in a more reusable way,
use strict;
use warnings;

sub get_sort_iterator {
    my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
    my @d;

    return sub {
        for my $i (0 .. $#fhs) {
            # skip to the next file handle if it doesn't exist or we already have a value in $d[$i]
            next if !$fhs[$i] or defined $d[$i];

            # was reading from the $fhs[$i] file handle a success?
            if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
            # file handle at EOF, not needed any more
            else { undef $fhs[$i] }
        }
        # compare as numbers, return undef if no more data
        my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0 .. $#d
            or return;

        # return value from $d[$index], and set it to undef
        return delete $d[$index];
    };
}

my $iter = get_sort_iterator(@ARGV);
while (defined(my $x = $iter->())) {
    print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Assuming every input file is already in ascending order and has at least one line, this script merges them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;

use List::Util 'reduce';

sub min_index {
    reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}

my @fhs  = map { open my $fh, '<', $_; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;

while (@data) {
    my $idx = min_index(@data);
    print "$data[$idx]";
    if (! defined($data[$idx] = readline $fhs[$idx])) {
        splice @data, $idx, 1;
        splice @fhs, $idx, 1;
    }
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes, each hash corresponding to an input file and containing a file handle fh, the last line read line, and the timestamp extracted from that line timestamp.
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;

my @data;

for my $file (@ARGV) {
    my $item;
    open $item->{fh}, '<', $file;
    insert_item($item, \@data);
}

while (@data) {
    my $item = pop @data;
    print $item->{line};
    insert_item($item, \@data);
}

sub insert_item {
    my ($item, $array) = @_;
    return if eof $item->{fh};
    $item->{line} = readline $item->{fh};
    ($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
    my $i = 0;
    ++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
    splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764
I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it; for example, take out column A and compare it to another column in another file:
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I print what the subroutine returns, it only prints the first element of the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
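To see concretely what indexing with @_ does, here is a tiny made-up example (not the poster's data):

use strict;
use warnings;
use feature 'say';

sub pick {
    my @fields = ('A', 'B', 'C');
    # In the index, @_ is evaluated in scalar context, giving its element
    # count. Called with one argument, that count is 1, so this is $fields[1].
    return $fields[@_];
}

say pick('anything');    # prints: B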
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;

open my $fh, '<', 'myfile.txt' or die $!;

my @column2 = routine1($fh, 1);
print "@column2\n";

sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2; # to skip header of file
        my @fields = split /\t/;
        my $field  = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
    chomp;
    ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;

my $file = 'blah.txt';
open my $fh, '<', $file;

my @firstcol;
while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}

use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's #ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution to create a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;

my @colVals = getFileCol( 'File.txt', 0 );

print "@colVals\n";

sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;

    <>;    # skip file header
    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }
    return @arr;
}
Output on your dataset:
A A
Hope this helps!
I have these 2 text files and I would like to find any mismatches in the 2nd column between the files. The mismatch to be identified is based on the types F, P and N, regardless of which lines they occur on. I have 1 F and 3 P in the first file, while the second file has 2 P, 1 N and 1 F. When the comparison is done, both files should have equal occurrences of each type: 1 F, 3 P and 1 N.
Text1:
f0x11 F
f0x34 P
drx99
dex67 P
edx43 P
sdx33
Text2:
1 P
2 N
4
5 F
6
7 P
Expected Output:
Text 1 has missing type of N
Text 2 has missing type of P
What I have tried so far does not produce the desired output. My code:
use strict;
my %ref_data;
my %ref_data2;

open my $fh, '<', 'Text1' or die "Could not open file to read:$!";
while (<$fh>) {
    chomp;
    my ($res, $type) = split;
    if (defined $type){
        $ref_data{$type} = "$type";
    }
}

our ($data, $data2);
open $fh, '<', 'Text2' or die "Could not open file to read:$!";
while (<$fh>) {
    chomp;
    my ($res, $type) = split;
    if (defined $type){
        $ref_data2{$type} = "$type";
        $data2 = $ref_data2{$type};
        $data  = $ref_data{$type};
        print "File 2 has missing type of $type\n" unless $data;
    }
}

foreach ($data){
    print "File 1 has missing type of $_\n" if $data ne $data2;
}
You appear to want to keep track of how many times the values in Column 2 occur within each file -- for example, in a comment you write, "I have 1F, 3P in first file while 2P, 1N and 1F in second file". If that's the case, you need a better data structure.
Specifically, one that counts occurrences of the values in Column 2, and you need those counts to be tracked separately for each file. That suggests a hash-of-hashes.
use strict;
use warnings;

# Example usage:
#   perl YOUR_SCRIPT.pl a.txt b.txt
my @files = @ARGV;

# Count the values in Column 2, organizing the tallies like this:
#   $tallies{COL_2}{FILE_NAME} = N
my %tallies;
while (<>) {
    my @cols = split;
    $tallies{$cols[1]}{$ARGV} ++ if @cols > 1;
}

# Print discrepancies.
for my $c2 (keys %tallies) {
    my @t = map { $tallies{$c2}{$_} || 0 } @files;
    next if $t[0] == $t[1];
    print "$c2: $files[0] has $t[0]; $files[1] has $t[1]\n";
}
Example output:
N: a.txt has 0; b.txt has 1
P: a.txt has 3; b.txt has 2
Also worth noting: this code does not have to open any files explicitly, and file names are not hard-coded in the program. Instead, we pass input file names as command-line arguments, get those arguments via @ARGV, process lines in those files via <>, and know which file we're currently processing via $ARGV.
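As a minimal standalone illustration of that <> / $ARGV behaviour (the file names here are just examples):

#!/usr/bin/perl
# Run as: perl show_argv.pl a.txt b.txt
use strict;
use warnings;

# The empty <> reads every file named in @ARGV in turn;
# $ARGV always holds the name of the file the current line came from.
while (<>) {
    print "$ARGV: $_";
}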
I've refactored your code where you seem to be duplicating the same behavior.
The output isn't to spec, but should be clear enough for you to understand and finish up yourself.
I added a close $fh; and use warnings; as well
#!/usr/bin/perl
use strict;
use warnings;

# run
my %max; # hash of combined data
my $file_data_1 = parse_file_into_hash("text1", \%max);
my $file_data_2 = parse_file_into_hash("text2", \%max);
diff_hashes(\%max, $file_data_1, $file_data_2);

# diff_hashes($max, $h1, $h2)
#
# diffs 2 hash refs against a combined $max hash and prints results
sub diff_hashes {
    my ($max, $h1, $h2) = @_;

    # TODO - do all the comparisons and some error checking (if keys exist etc...) here
    for my $key (keys %$max) {
        print "max/combined: $key = $max->{$key}\n";
        my $h1_print = exists $h1->{$key} ? $h1->{$key} : "0";
        my $h2_print = exists $h2->{$key} ? $h2->{$key} : "0";
        print "h1: $key = $h1_print\n";
        print "h2: $key = $h2_print\n";
    }
}

# parse_file_into_hash($file, $max)
#
# $max is a hash reference (passed by reference) so you can count occurrences over
# multiple files...
# returns a reference to a hash of ( $type => $count )
sub parse_file_into_hash {
    my ($file, $max) = @_;

    my %ref_data;
    open my $fh, '<', $file or die "Could not open file to read:$!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($res, $type) = split /\s+/, $line;
        if ($type) {
            $ref_data{$type}++;
            if (!exists $max->{$type} || $ref_data{$type} > $max->{$type}) {
                $max->{$type} = $ref_data{$type};
            }
        }
    }
    close $fh;

    return \%ref_data;
}
Output ran against your example files:
$ ./example.pl
max/combined: F = 1
h1: F = 1
h2: F = 1
max/combined: N = 1
h1: N = 0
h2: N = 1
max/combined: P = 3
h1: P = 3
h2: P = 2
I'm still having trouble with Perl programming and need a push to make a script work.
I have two files, and I want to use the list file to "extract" rows from the data file. The problem is that the list file is formatted as follows:
X1 A B
X2 C D
X3 E F
And my data looks like this:
A X1 2 5
B X1 3 7
C X2 1 4
D X2 1 5
I need to obtain the element pairs from the list file and use them to select rows in the data file. At the same time I would like to write an output like this:
X1 A B 2 5 3 7
X2 C D 1 4 1 5
I'm trying to write some Perl code, but I'm not able to produce anything useful. I'm at this point:
open (LIST, "< $fils_list") || die "impossible to open the list";
@list = <LIST>;
close (LIST);

open (HAN, "< $data") || die "impossible to open the data";
@r = <HAN>;
close (HAN);

for ($p=0; $p<=$#list; $p++){
    chomp ($list[$p]);
    ($x, $id1, $id2) = split (/\t/, $list[$p]);
    $pair_one = $id1."\t".$x;
    $pair_two = $id2."\t".$x;

    for ($i=0; $i<=$#r; $i++){
        chomp ($r[$i]);
        ($a, $b, $value1, $value2) = split (/\t/, $r[$i]);
        $bench = $a."\t".$b;
        if (($pair_one eq $bench) || ($pair_two eq $bench)){
            print "I don't know what this script should print!\n";
        }
    }
}
I'm not able to work out what to print.
Any kind of suggestion is very welcome!
A few general recommendations:
Indent your code to show the structure of your program.
Use meaningful variable names, not $a or $value1 (if I do so below, this is due to my lack of domain knowledge).
Use data structures that suit your program.
Don't do operations like parsing a line more than once.
In Perl, every program should use strict; use warnings;.
use autodie for automatic error handling.
Also, use the open function like open my $fh, "<", $filename as this is safer.
Remember what I said about data structures? In the second file, you have entries like
A X1 2 5
This looks like a secondary key, a primary key, and some data columns. Key-value relationships are best expressed through a hash table.
use strict; use warnings; use autodie;
use feature 'say'; # available since 5.010

open my $data_fh, "<", $data;
my %data;
while (<$data_fh>) {
    chomp; # remove newlines
    my ($id2, $id1, @data) = split /\t/;
    $data{$id1}{$id2} = \@data;
}
Now %data is a nested hash which we can use for easy lookups:
open my $list_fh, "<", $fils_list;
LINE: while (<$list_fh>) {
    chomp;
    my ($id1, @id2s) = split /\t/;
    my $data_id1 = $data{$id1};
    defined $data_id1 or next LINE; # maybe there isn't anything here. Then skip
    my @values = map @{ $data_id1->{$_} }, @id2s; # map the 2nd level ids to their values and flatten the list
    # now print everything out:
    say join "\t", $id1, @id2s, @values;
}
The map function is a bit like a foreach loop, and builds a list of values. We need the @{ ... } here because the data structure doesn't hold arrays, but references to arrays. The @{ ... } is a dereference operator.
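For illustration, a tiny standalone example of the same map-plus-dereference idiom (the hash contents here are made up):

use strict;
use warnings;
use feature 'say';

# Each value is a reference to an array, just like in %data above.
my %h = ( X1 => [2, 5], X2 => [1, 4] );

# map visits each key; @{ ... } dereferences the stored array ref,
# so the results are flattened into one list.
my @values = map @{ $h{$_} }, sort keys %h;

say "@values";    # prints: 2 5 1 4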
This is how I would do it, mostly using hashes and hash/array references (test1.txt and test2.txt contain the data you provided in your example):
use strict;
use warnings;

open(my $f1, '<', 'test1.txt') or die "Cannot open file1: $!\n";
open(my $f2, '<', 'test2.txt') or die "Cannot open file2: $!\n";

my @data1 = <$f1>;
my @data2 = <$f2>;

close($f1);
close($f2);

chomp @data1;
chomp @data2;

my %result;

foreach my $line1 (@data1) {
    my @fields1 = split(' ', $line1);
    $result{$fields1[0]}->{$fields1[1]} = [];
    $result{$fields1[0]}->{$fields1[2]} = [];
}

foreach my $line2 (@data2) {
    my @fields2 = split(' ', $line2);
    push @{$result{$fields2[1]}->{$fields2[0]}}, $fields2[2];
    push @{$result{$fields2[1]}->{$fields2[0]}}, $fields2[3];
}

foreach my $res (sort keys %result) {
    foreach (sort keys %{$result{$res}}) {
        print $res . " " . $_ . " " . join(" ", sort @{$result{$res}->{$_}}) . "\n";
    }
}
Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where, 0-indexed, $position = 4 holds the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file and check if the primary key is in the hash. If it is, I get the line from the file1 array, add the columns I need from the first array to the second array, and concatenate the result onto the end of the output string. Then I delete that hash entry, and at the very end I dump the entire thing to a file. (I am using an SSD, so I want to minimise file writes.)
It is probably best explained with code:
sub merge_file2{
    my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;

    my %line_for;
    my @file1array;
    open FILE1, "<$file1";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)} = $.;  # reads csv line at current position (of key)
        $file1array[$.] = $_;                           # store line in file1array.
    }
    close FILE1;
    print "$file2 opened - merging..\n";

    open FILE2, "<", $file2;
    my @from1to2 = qw( 2 4 8 17 18 19);  # which columns from file 1 to be added into cols. of file 2.
    while (<FILE2>){
        print "$.\n" if ($.%1000) == 0;
        chomp;
        my @array1 = ();
        my @array2 = ();
        my @array2 = split /,/, $_;      # split 2nd csv line by commas
        my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
        #                              ^                 ^                ^
        # prev line: lookup line in 1st file, lookup hash, pos of key

        #my @output = &merge_string(\@array1,\@array2);  # merge 2 csv strings (old fn.)
        foreach(@from1to2){
            $array2[$_] = $array1[$_];
        }
        my $outstring = join ",", @array2;
        $OUTSTRING .= $outstring."\n";
        delete $line_for{$array2[$position]};
    }
    close FILE2;
    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING .= $file1array[$line_for{$key}]."\n";
    }

    print "writing file $out\n\n\n";
    write_line($out,$OUTSTRING);
}
The first while loop is fine and takes less than 1 minute; however, the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think a lot of speedup is possible? :) Thanks in advance.
Solution:
sub merge_file3{
    my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;
    my $header;

    my (@file1,@file2);
    open FILE1, "<$file1" or die;
    while (<FILE1>){
        if ($.==1){
            $header = $_;
            next;
        }
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;

    open FILE2, "<$file2" or die;
    while (<FILE2>){
        next if $.==1;
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;

    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = undef;
    @file2 = undef;
    #foreach my $line (@file1){print "\t [ @$line ],\n"; }

    my ($i,$j) = (0,0);
    while ($i < $#sortedf1 and $j < $#sortedf2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 eq $key2){
            foreach(0..$hsize){  # header size.
                $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
            }
            $i++;
            $j++;
        }
        elsif ( $key1 < $key2){
            push(@sortedf2,[@{$sortedf1[$i]}]);
            $i++;
        }
        elsif ( $key1 > $key2){
            $j++;
        }
    }
    #foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

    print "outputting to file\n";
    open OUT, ">$out";
    print OUT $header;
    foreach(@sortedf2){
        print OUT (join ",", @{$_})."\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join (a sketch of this approach follows below).
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
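For the first (SQLite) suggestion, here is a minimal sketch of how it might look. This is my own illustration rather than code from the thread: it assumes DBI, DBD::SQLite and Text::CSV are installed, that every row has exactly five fields, and that the primary key is the last field (0-indexed position 4, as in the question); the table and column names (t1, t2, c0..c3, pk) are made up.

#!/usr/bin/perl
# Sketch: load both CSV files into SQLite, then merge with a join.
use strict;
use warnings;
use DBI;
use Text::CSV;

my ($file1, $file2) = @ARGV;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
my $csv = Text::CSV->new({ binary => 1 });

for my $table (qw(t1 t2)) {
    $dbh->do("CREATE TABLE $table (c0, c1, c2, c3, pk)");
    $dbh->do("CREATE INDEX idx_$table ON $table (pk)");   # index used by the join
}
load_table($file1, 't1');
load_table($file2, 't2');

sub load_table {
    my ($file, $table) = @_;
    open my $fh, '<', $file or die "$file: $!";
    my $sth = $dbh->prepare("INSERT INTO $table VALUES (?, ?, ?, ?, ?)");
    $dbh->begin_work;                       # one transaction keeps bulk inserts fast
    while (my $row = $csv->getline($fh)) {
        $sth->execute(@$row);
    }
    $dbh->commit;
    close $fh;
}

# Emulate a full outer join: every t1 row merged with its t2 match (if any),
# plus the t2 rows whose key never appears in t1. NULLIF turns empty CSV
# fields into NULL so COALESCE falls through to the other file's value.
my $sth = $dbh->prepare(q{
    SELECT COALESCE(NULLIF(t1.c0, ''), t2.c0),
           COALESCE(NULLIF(t1.c1, ''), t2.c1),
           COALESCE(NULLIF(t1.c2, ''), t2.c2),
           COALESCE(NULLIF(t1.c3, ''), t2.c3),
           t1.pk
    FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk
    UNION ALL
    SELECT t2.c0, t2.c1, t2.c2, t2.c3, t2.pk
    FROM t2 LEFT JOIN t1 ON t2.pk = t1.pk
    WHERE t1.pk IS NULL
});
$sth->execute;

my $out = Text::CSV->new({ binary => 1, eol => "\n" });
while (my @row = $sth->fetchrow_array) {
    $out->print(\*STDOUT, \@row);
}

The index on pk is what lets SQLite resolve each key lookup without scanning the other table, which is the same idea as the hash in the original script.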
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){            # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){           # Fast.
    print $fh join('', @data);
}
else {                               # Fast.
    print $fh @data;
}
If you want a merge you should really merge. First of all you have to sort your data by key and then merge! You will beat even MySQL in performance. I have a lot of experience with it.
You can write something along these lines:
#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;

my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);

while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}

close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
    chomp;
    $line_for{read_csv_string($_,$position)} = $_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
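A tiny self-contained sketch of that accumulate-then-write pattern (the sample rows and the output file name are made up):

use strict;
use warnings;

# Toy stand-ins for the merged CSV rows (hypothetical values).
my @merged_rows = ("one,two,three,four,42", "five,six,seven,eight,43");

# Accumulate lines in an array...
my @outlines;
push @outlines, "$_\n" for @merged_rows;

# ...and write them out once at the end with a single join.
open my $out, '>', 'merged.csv' or die $!;
print {$out} join('', @outlines);
close $out;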
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using hash your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line; this bit is slow. Removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help. But not too much I think, there is always going to be a read ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0;    # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
    my ($path, $record) = @_;

    open my $fh, "<", $path or croak "$0: open $path: $!";

    my $csv = Text::CSV->new;
    local $_;
    while (<$fh>) {
        if ($csv->parse($_)) {
            my @f = map length($_) ? $_ : undef, $csv->fields;
            next unless @f >= 1;

            my $primary = pop @f;
            if ($record->{$primary}) {
                $record->{$primary}[$_] //= $f[$_]
                    for 0 .. $#{ $record->{$primary} };
            }
            else {
                $record->{$primary} = \@f;
            }
        }
        else {
            warn "$0: $path:$.: parse failed; skipping...\n";
            next;
        }
    }
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
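Presumably the dump shown below was produced by continuing that main program with something along these lines (my guess, not part of the answer):

use Data::Dumper;
print Dumper(\%rec);    # dump the merged hash built by merge_csv above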
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};