Calculate the average of small parts of a column, grouped by key, in Perl

This question is quite similar to "How can I get the average and standard deviations grouped by key?", but I haven't managed to adapt that solution to my problem.
I have a lot of CSV files with 7 columns; the last three columns look like this:
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
1409,2,345
1409,3,456
1409,4,567
1409,5,678
1409,6,789
...
N,0,123
N,1,234
N,2,345
N,3,456
N,4,567
N,5,678
N,6,789
What I want to do is calculate the average of the last column (col7) over all rows that share the same value in column 5 (col5): 1408, 1409, 1410, and so on up to N, where N is not known in advance. I want to print this average as a new column (col8) on the line whose column 6 (col6) value is 3. Note that col6 runs from 0 to 6, but the first group in a file does not always start at 0. So what I want is:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
...
bla,bla,bla,bla,N,3,456,456
I already have a script that can calculate the average, but it needs the values in an array first. Below is what I tried, but it doesn't work. Also, I'm just trying to learn Perl on my own, so if it looks like crap, I'm just trying!
open (FILE, "<", $dir.$file) or die;
my @lines = <FILE>;
foreach my $line (@lines) {
    my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
    push @arrays5, $col5;
}
foreach my $array5 (@arrays5) {
    foreach my $line (@lines) {
        my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
        if ($array5 == $col5) {
            push @arrays7, $col7;
        }
    }
}
close(FILE);

One way is to use the Text::CSV_XS module. It is not a core module, so it has to be installed from CPAN or a similar tool.
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;
my ($offset, $col_total, $row3, $rows_processed);

## Check arguments to the script.
die qq[Usage: perl $0 <input-file>\n] unless @ARGV == 1;

## Open input file.
open my $fh, q[<], shift or die qq[Open error: $!\n];

## Create the CSV object.
my $csv = Text::CSV_XS->new or
    die qq[ERROR: ] . Text::CSV_XS->error_diag();

## Read file content seven lines each time.
while ( my $rows = $csv->getline_all( $fh, $offset, 7 ) ) {

    ## End when there are no more rows.
    last unless @$rows;

    ## For each row in the group of seven...
    for my $row ( 0 .. $#{$rows} ) {

        ## Get value of last column.
        my $last_col_value = $rows->[ $row ][ $#{ $rows->[ $row ] } ];

        ## If the last column is not a number it is the header, so print it
        ## appending the eighth column and read the next one.
        unless ( $last_col_value =~ m/\A\d+\Z/ ) {
            $csv->print( \*STDOUT, $rows->[ $row ] );
            printf qq[,%s\n], q[col8];
            next;
        }

        ## Accumulate total amount for last column.
        $col_total += $last_col_value;

        ## Get third row. The output will be this row with the
        ## average appended.
        if ( $rows->[ $row ][-2] == 3 ) {
            $row3 = [ @{ $rows->[ $row ] } ];
        }

        ## Count processed rows.
        ++$rows_processed;
    }

    ## Print row with its average.
    if ( $rows_processed > 0 && ref $row3 ) {
        $csv->print( \*STDOUT, $row3 );
        printf qq[,%g\n], $col_total / $rows_processed;
    }

    ## Initialize variables.
    $col_total = $rows_processed = 0;
    undef $row3;
}
Content of infile:
col1,col2,col3,col4,col5,col6,col7
bla,bla,bla,bla,1408,1,123
bla,bla,bla,bla,1408,2,234
bla,bla,bla,bla,1408,3,345
bla,bla,bla,bla,1408,4,456
bla,bla,bla,bla,1408,5,567
bla,bla,bla,bla,1408,6,678
bla,bla,bla,bla,1409,0,123
bla,bla,bla,bla,1409,1,234
bla,bla,bla,bla,1409,2,345
bla,bla,bla,bla,1409,3,456
bla,bla,bla,bla,1409,4,567
bla,bla,bla,bla,1409,5,678
bla,bla,bla,bla,1409,6,789
Run it like:
perl script.pl infile
With the following output:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456

Before we try to complete the answer, would you try this and tell me how close it comes to what you want?
#!/usr/bin/perl
use warnings;
use strict;

my $target = 3;
my %summary;

while (<>) {
    chomp;
    my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split /\,/;
    $summary{$col5}{total} += $col7;
    ++$summary{$col5}{count};
    $summary{$col5}{line} = $_ if $col6 == $target;
}

$summary{$_}{average} = $summary{$_}{total} / $summary{$_}{count}
    for keys %summary;

print "${summary{$_}{line}},${summary{$_}{average}}\n"
    for sort keys %summary;
If close enough, then you may wish to finish on your own. If not, then we can discuss the matter further.
Note that you can replace the <> with <FILE> if you prefer to read from your data file rather than from standard input.
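For instance, a minimal sketch of the same loop reading from a named file through a lexical filehandle (the file name data.csv is just a placeholder):

open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # ... same per-line processing as in the loop above ...
}
close $fh;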
IMPLEMENTATION NOTES
The code relies on Perl's autovivification feature. Observe for instance the line ++$summary{$col5}{count};, which seems initially to increment a nonexistent counter. However, this is actually standard Perl idiom. If you try to do something arithmetical (like incrementation) to an object that does not exist, Perl implicitly creates the object, initializes it to zero, and then does the thing you wanted (like incrementation) to it.
It would probably be unwise for a more sober programming language like C++ to autovivify, but years of experience suggest that autovivification strikes the right balance between order and convenience in a slightly less sober language like Perl.
On a more elementary level, the code will probably make sense only to those used to Perl's hashes. However, if you've not used Perl's hashes before, this would be as good a chance as any to learn them. The hash is a central pillar of the language, and the above makes a fairly typical example of its use.
In this case, we have a hash of hashes, which again is fairly typical.
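As a tiny standalone illustration of autovivification (not part of the answer's script; the key and numbers are arbitrary):

use strict;
use warnings;
use Data::Dumper;

my %summary;                   # start with a completely empty hash

++$summary{1408}{count};       # Perl creates $summary{1408} (a hash ref) and its
                               # 'count' slot on the fly, then increments it to 1
$summary{1408}{total} += 345;  # same idea for the running total

print Dumper \%summary;
# prints something like:
# $VAR1 = { '1408' => { 'count' => 1, 'total' => 345 } };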

This should do the trick. Replace the Cols[index] indices appropriately for your file.
use Data::Dumper;

open (FILE, "<", '/tmp/myfile') or die;
my @lines;
my (%Sum, %Count);
chomp(@lines = <FILE>);

foreach my $line (@lines) {
    next if $line =~ /col/;
    my @Cols = split /,/, $line;
    $Sum{$Cols[0]}   += $Cols[2];
    $Count{$Cols[0]}++;
}

foreach my $line (@lines) {
    if ($line =~ /col/) {
        print "$line,colX\n";
        next;
    }
    my @Cols = split /,/, $line;
    if ($Cols[1] == 3) {
        print "$line,", $Sum{$Cols[0]} / $Count{$Cols[0]}, "\n";
    } else {
        print "$line,-1\n";
    }
}
Sample input /tmp/myfile
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
Sample output
col5,col6,col7,colX
1408,1,123,-1
1408,2,234,-1
1408,3,345,400.5
1408,4,456,-1
1408,5,567,-1
1408,6,678,-1
1409,0,123,-1
1409,1,234,-1

Related

How to use perl to parse a text file

I am new to Perl.
I have a text file that contains 2 columns:
lib1 cell1
lib1 cell2
lib2 cell3
lib2 cell1
I would like to use Perl to find whether a name is duplicated in column 2, and if so print the corresponding names from column 1.
In this text, cell1 is repeated 2 times.
I would like to have a report something like:
cell1 found in lib1 lib2
I use the code below to open and read the file:
#!/usr/bin/env perl
use strict;
use warnings;

for my $file ( @ARGV ) {
    open my $in_fh, '<', $file or die "could not open $file: $!\n";
    while ( my $line = <$in_fh> ) {
        chomp( $line );
        print "$line\n";
    }
}
But I don't know how to find the duplicated names in the second column and print the corresponding first column.
There are a few things here that Perl can do for you.
First, Perl will handle opening and reading the files you specify on the command line with the empty diamond operator (and here I'm using the safer double-diamond version introduced in v5.22):
use v5.22;

while( <<>> ) {
    ...
}
Then, you can track what you've seen with a hash. Extract the columns, and use the interesting column as the key in the hash. Here I post-increment its value. On the first go around, the post-increment returns 0 (then increases the value by 1), so the conditional is false the first time. The next time it sees that same key, the value is true, so it warns:
use v5.22;

my %Seen;
while( <<>> ) {
    chomp;
    my( $first, $second ) = split;
    if( $Seen{$second}++ ) {
        warn "Duplicated second column! Line $.\n";
    }
}
The hash is a great way to track things that are strings instead of positions.
Now, you want to know which values in the first column appear with each value in the second. You could get a bit more fancy with that hash and make another level in the hash to store the first column. Perl automatically takes care of the details for you (and we have extended examples of this in Intermediate Perl).
First, accumulate the data in the hash:
use v5.22;

my %Seen;
while( <<>> ) {
    chomp;
    my( $first, $second ) = split;
    $Seen{$second}{$first}++;
}
Once you have the hash, you move on to the second step of reporting the data. All the values of the second column are the top level keys for the hash. With that key, get the second level of the hash, and get those keys, which are the first column:
foreach my $second ( keys %Seen ) {
    my @firsts = keys %{ $Seen{$second} };
    say "$second found in @firsts";
}
With v5.24's postfix dereferencing, that's slightly cleaner since the dereference reads left to right rather than inside out:
use v5.24;

foreach my $second ( keys %Seen ) {
    my @firsts = keys $Seen{$second}->%*;
    say "$second found in @firsts";
}
And, since the hash keys in the second level only appear once per value, you don't have duplicates.
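Putting those fragments together, here is a minimal self-contained sketch; the only addition of mine is the @firsts > 1 filter, so that only genuinely duplicated second-column values are reported:

#!/usr/bin/env perl
use v5.24;
use strict;
use warnings;

# Accumulate: for each second-column value, record which first-column
# values it appeared with.
my %Seen;
while ( <<>> ) {
    chomp;
    my ( $first, $second ) = split;
    $Seen{$second}{$first}++;
}

# Report only the second-column values seen with more than one first-column value.
foreach my $second ( sort keys %Seen ) {
    my @firsts = sort keys $Seen{$second}->%*;
    say "$second found in @firsts" if @firsts > 1;
}

Run it as, say, perl dupes.pl input.txt (the script name is arbitrary); for the sample data above it prints "cell1 found in lib1 lib2".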

How to avoid loading a whole file into memory in Perl

I have tried to write a script that sorts an input text file in descending order and prints only the top-usage customers.
The input text file contains:
NAME,USAGE,IP
example:
Abc,556,10.2.3.5
bbc,126,14.2.5.6
and so on. This is a very large file and I am trying to avoid loading it into memory.
I have tried the following script.
use warnings;
use strict;

my %hash = ();
my $file = $ARGV[0];

open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (my $line = <$fh>)
{
    chomp ($line);
    my ($name, $key, $ip) = split /,/, $line;
    $hash{$key} = [ $name, $ip ];
}

my $count = 0;
foreach ( sort { $b <=> $a } keys %hash ) {
    my $value = $hash{$_};
    print "$_ @{$value} \n";
    last if (++$count == 5);
}
The output should be sorted based on usage, and it should show the name and IP for each usage value.
I think you want to print the five lines of the file that have the highest value in the second column.
That could be done with a sort of insertion sort that checks each line of the file to see if it comes higher than the lowest of the five lines found so far, but it's easier to just accumulate a sensible subset of the data, sort it, and discard all but the top five.
Here, the array @top contains lines from the file. When there are 100 lines in the array, it is sorted and reduced to the five maximal entries. The while loop then continues to add lines to the array until it reaches the limit again or the end of the file has been reached, at which point the process is repeated. That way, no more than 100 lines from the file are ever held in memory.
I have generated a 1,000-line data file to test this, with random values between 100 and 2,000 in column 2. The output below is the result.
use strict;
use warnings 'all';

open my $fh, '<', 'usage.txt' or die $!;

my @top;

while ( <$fh> ) {
    push @top, $_;
    if ( @top >= 100 or eof ) {
        @top = sort {
            my ($aa, $bb) = map { (split /,/)[1] } ($a, $b);
            $bb <=> $aa;
        } @top;
        @top = @top[0..4];
    }
}

print @top;
output
qcmmt,2000,10.2.3.5
ciumt,1999,10.2.3.5
eweae,1998,10.2.3.5
gvhwv,1998,10.2.3.5
wonmd,1993,10.2.3.5
The standard way to do this is to create a priority queue that contains k items, where k is the number of items you want to return. So if you want the five lines that have the highest value, you'd do the following:
pq = new priority_queue
add the first five items in the file to the priority queue
for each remaining line in the file
    if value > lowest value on pq
        remove lowest value on the pq
        add new value to pq
When you're done going through the file, pq will contain the five items with the highest value.
To do this in Perl, use the Heap::Priority module.
This will be faster and use less memory than the other suggestions.
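Here is a rough sketch of that bounded top-k idea in plain Perl, using a small sorted array rather than Heap::Priority (whose exact API is not shown in this answer); the file name usage.txt is assumed from the earlier example:

#!/usr/bin/perl
use strict;
use warnings;

my $k = 5;       # how many top-usage lines to keep
my @top;         # [ usage, line ] pairs, kept sorted descending by usage

open my $fh, '<', 'usage.txt' or die "Cannot open usage.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my $usage = ( split /,/, $line )[1];

    # Skip early if this line cannot beat the current minimum.
    next if @top == $k and $usage <= $top[-1][0];

    push @top, [ $usage, $line ];
    @top = sort { $b->[0] <=> $a->[0] } @top;   # tiny array, cheap to sort
    pop @top if @top > $k;                       # drop the smallest
}
close $fh;

print "$_->[1]\n" for @top;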
This algorithm remembers the 5 biggest rows seen so far.
For each row, its key is compared with the smallest remembered value; if it is bigger, the new row is inserted at the right position in the sorted arrays and the smallest entry is dropped.
use warnings;
use strict;

my $file = $ARGV[0];
my @keys = (0,0,0,0,0);
my @res;

open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (<$fh>)
{
    my ($name, $key, $ip) = split /,/;
    next if ($key < $keys[0]);
    for (0..4) {
        if ($_ == 4 || $key < $keys[$_+1]) {
            @keys[0..$_-1] = @keys[1..$_] if ($_ > 0);
            $keys[$_] = $key;
            $res[$_]  = [ $name, $ip ];
            last;
        }
    }
}
for (0..4) {
    print "$keys[4-$_] @{$res[4-$_]}";
}
Test on a file of 1M random rows (20 MB):
Last items (this algorithm):
Start 1472567980.91183
End 1472567981.94729 (duration 1.03546 seconds)
Full sort in memory (algorithm of @Rishi):
Start 1472568441.00438
End 1472568443.43829 (duration 2.43391 seconds)
Sort by parts of 100 rows (algorithm of @Borodin):
Start 1472568185.21896
End 1472568195.59322 (duration 10.37426 seconds)

When trying to print an array from a sub, only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example, take out column A and compare it to another column in another file:
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I print what the subroutine returns, it only prints the first element of the column (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;

open my $fh, '<', 'myfile.txt' or die $!;

my @column2 = routine1($fh, 1);
print "@column2\n";

sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2; # to skip header of file
        my @fields = split /\t/;
        my $field = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
    chomp;
    ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;

my $file = 'blah.txt';
open my $fh, '<', $file;

my @firstcol;

while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}

use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here's are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file I/O.
Borodin's idea of a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;

my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";

sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;
    <>; # skip file header
    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }
    return @arr;
}
Output on your dataset:
A A
Hope this helps!

Using Perl hashes to handle tab-delimited files

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand the suggested solutions well.
Here is an example of my code. It is only meant to illustrate my question and so far doesn't do anything, so think of it as "pseudocode".
#!/usr/bin/perl
use strict;
use warnings;

my (%peaks, %X81_05);
my @array;

# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
    die "Could not open X81_05.txt";
}

# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
    chomp $_;
    next if (m/Chromosome/); # Skip the header
    @array = split("\t", $_);
    ($chr1, $pos, $sample) = @array;
    $X81_05{'$array[0]'} = (
        'position' => '$array[1]'
    )
}
close (FIRST_SAMPLE);

# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
    die "could not open peaks.txt";
}

my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
    chomp $_;
    next if (m/Chromosome/); # Skip header
    ($chr, $peak_start, $peak_end) = split(/\t/);
    $peaks{$chr}{'peak_start'} = $peak_start;
    $peaks{$chr}{'peak_end'}   = $peak_end;
}
close (PEAKS);

for my $chr1 (keys %X81_05) {
    my $val = $X81_05{$chr1}{'position'};
    for my $chr (keys %peaks) {
        my $min = $peaks{$chr}{'peak_start'};
        my $max = $peaks{$chr}{'peak_end'};
        if (($val > $min) and ($val < $max)) {
            #print $val, " ", "lies between", " ", $min, " ", "and", " ", $max, "\n";
        }
        else {
            #print $val, " ", "does not lie between", " ", $min, " ", "and", " ", $max, "\n";
        }
    }
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Speaking of standard Perl, you should use the three-argument form of open, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalar variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab-delimited) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;   # No need to check for file open failure

use constant {
    PEAKS_FILE    => "peak.txt",
    POSITION_FILE => "X81_05.txt",
};

open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
    chomp $line;
    next if $line =~ /Chromosome/;   # Skip Header
    my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
    $chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
    $chromosome_hash{$chromosome}->{PEAK_END}   = $peak_end;
}
close $peak_fh;

open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
    chomp $line;
    my ( $chromosome, $position, $snp ) = split ( "\t", $line );
    next unless exists $chromosome_hash{$chromosome};
    if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
            and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
        $chromosome_hash{$chromosome}->{SNP}      = $snp;
        $chromosome_hash{$chromosome}->{POSITION} = $position;
    }
}
close $position_fh;

#
# Now Print
#
say join ("\t", qw(Chromosome SNP POSITION PEAK-START PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
    next unless exists $chromosome_hash{$chromosome}->{SNP};
    say join ("\t",
        $chromosome,
        $chromosome_hash{$chromosome}->{SNP},
        $chromosome_hash{$chromosome}->{POSITION},
        $chromosome_hash{$chromosome}->{PEAK_START},
        $chromosome_hash{$chromosome}->{PEAK_END},
    );
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object-oriented Perl to hide the structure of the hash of hashes, as the sketch below illustrates. This prevents errors such as thinking that the peak start is stored in START_PEAK rather than PEAK_START; Perl won't detect this kind of miskeyed error. Therefore, I prefer to use objects whenever I am working with arrays of arrays or hashes of hashes.
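As a rough illustration of that last point (hypothetical code, not taken from the answer; the class name and fields are made up), a tiny class can wrap the inner hash so that a mistyped key shows up as a loud error instead of a silent new hash slot:

package Peak;
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    # Only these two fields exist; a typo such as start_peak cannot
    # silently create a new slot elsewhere in the program.
    return bless { peak_start => $args{peak_start},
                   peak_end   => $args{peak_end} }, $class;
}

sub peak_start { $_[0]->{peak_start} }
sub peak_end   { $_[0]->{peak_end} }

package main;
use strict;
use warnings;
use feature qw(say);

my %chromosome_hash;
$chromosome_hash{7} = Peak->new( peak_start => 1000, peak_end => 2000 );
say $chromosome_hash{7}->peak_start;   # a misspelled method call dies loudly instead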
You only need one for loop, because you are expecting to find some of the SNPs in the second set. Hence, loop through your %X81_05 hash and check whether each entry matches one in %peaks. Something like:
for my $chr1 (keys %X81_05)
{
    if (defined $peaks{$chr1})
    {
        if (   $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
            && $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
        {
            print YOUROUTPUTFILEHANDLE $chr1 . "\t"
                . $peaks{$chr1}{'peak_start'} . "\t"
                . $peaks{$chr1}{'peak_end'};
        }
        else
        {
            print YOUROUTPUTFILEHANDLE $chr1
                . "\tDoes not fall between "
                . $peaks{$chr1}{'peak_start'} . " and "
                . $peaks{$chr1}{'peak_end'};
        }
    }
}
Note: I have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by @David are good; try to incorporate those in your programs. (I have borrowed most of the code from @David's post.)
One thing I didn't understand is why you load both the peak values and the positions into hashes, when loading just one would suffice. As each chromosome has more than one record, use a hash of arrays (HoA); my solution is based on that. You might need to change the columns and their positions.
use strict;
use warnings;

our $Sep = "\t";

open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
    chomp $line;
    next if $line =~ /Chromosome/; # Skip Header
    my ($chromosome) = (split($Sep, $line))[0];
    push @{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;

open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
    chomp $line;
    my ($chromosome, $snp, $position) = split ($Sep, $line);
    next unless exists $chromosome_hash{$chromosome};
    foreach my $peak_line (@{$chromosome_hash{$chromosome}}) {
        my ($start, $end) = (split($Sep, $peak_line))[1,2];
        if ($position >= $start and $position <= $end) {
            print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
        }
        else {
            print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
        }
    }
}
close $position_fh;
I used @tuxuday and @David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);

# Read in peaks and sample files from command line
my $usage  = "Usage: $0 <peaks_file> <sample_file>";
my $peaks  = shift @ARGV or die "$usage \n";
my $sample = shift @ARGV or die "$usage \n";

our $Sep = "\t";

open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
    chomp $line;
    next if $line =~ /Chromosome/; # Skip Header
    my ($chromosome) = (split($Sep, $line))[0];
    push @{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;

open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
    chomp $line;
    next if $line =~ /Marker/; # Skip Header
    my ($snp, $chromosome, $position) = split ($Sep, $line);
    # Check if chromosome in peaks_file matches chromosome in sample_file
    next unless exists $chromosome_hash{$chromosome};
    foreach my $peak_line (@{$chromosome_hash{$chromosome}}) {
        my ($start, $end, $peak_no) = (split( $Sep, $peak_line ))[1,2,3];
        if ( $position >= $start and $position <= $end) {
            # Print output
            say join ("\t",
                $snp,
                $chromosome,
                $position,
                $start,
                $end,
                $peak_no,
            );
        }
        else {
            next; # Go to next chromosome
        }
    }
}
close $position_fh;

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where, with 0-indexed columns, $position = 4 holds the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file and check whether the primary key is in the hash; if it is, I get the line from file1array, add the columns I need from the first array to the second array, and concatenate the result onto the end of the output string. Then I delete that hash entry, and at the very end I dump the entire thing to a file. (I am using an SSD, so I want to minimise file writes.)
It is probably best explained with code:
sub merge_file2 {
    my ($file1, $file2, $out, $position) = ($_[0], $_[1], $_[2], $_[3]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;

    my %line_for;
    my @file1array;
    open FILE1, "<$file1";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)} = $.;  # reads csv line at current position (of key)
        $file1array[$.] = $_;                           # store line in file1array.
    }
    close FILE1;

    print "$file2 opened - merging..\n";
    open FILE2, "<", $file2;
    my @from1to2 = qw( 2 4 8 17 18 19); # which columns from file 1 to be added into cols. of file 2.
    while (<FILE2>){
        print "$.\n" if ($. % 1000) == 0;
        chomp;
        my @array1 = ();
        my @array2 = ();
        my @array2 = split /,/, $_;  # split 2nd csv line by commas
        my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
        # prev line: look up the line in the 1st file via the hash, using the key at $position

        #my @output = &merge_string(\@array1,\@array2); # merge 2 csv strings (old fn.)
        foreach (@from1to2){
            $array2[$_] = $array1[$_];
        }
        my $outstring = join ",", @array2;
        $OUTSTRING .= $outstring."\n";
        delete $line_for{$array2[$position]};
    }
    close FILE2;

    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING .= $file1array[$line_for{$key}]."\n";
    }

    print "writing file $out\n\n\n";
    write_line($out, $OUTSTRING);
}
The first while loop is fine and takes less than a minute, but the second while loop takes about an hour to run, and I am wondering whether I have taken the right approach. I think a lot of speedup is possible? :) Thanks in advance.
Solution:
sub merge_file3 {
    my ($file1, $file2, $out, $position, $hsize) = ($_[0], $_[1], $_[2], $_[3], $_[4]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;
    my $header;

    my (@file1, @file2);
    open FILE1, "<$file1" or die;
    while (<FILE1>){
        if ($. == 1){
            $header = $_;
            next;
        }
        print "$.\n" if ($. % 100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;

    open FILE2, "<$file2" or die;
    while (<FILE2>){
        next if $. == 1;
        print "$.\n" if ($. % 100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;

    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = undef;
    @file2 = undef;
    #foreach my $line (@file1){ print "\t [ @$line ],\n"; }

    my ($i, $j) = (0, 0);
    while ($i < $#sortedf1 and $j < $#sortedf2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 eq $key2){
            foreach (0..$hsize){ # header size.
                $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
            }
            $i++;
            $j++;
        }
        elsif ( $key1 < $key2){
            push(@sortedf2, [@{$sortedf1[$i]}]);
            $i++;
        }
        elsif ( $key1 > $key2){
            $j++;
        }
    }
    #foreach my $line (@sortedf2){ print "\t [ @$line ],\n"; }

    print "outputting to file\n";
    open OUT, ">$out";
    print OUT $header;
    foreach (@sortedf2){
        print OUT (join ",", @{$_})."\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
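A sketch of the first technique using DBI with an in-memory SQLite database (DBD::SQLite must be installed; the file names, five-column layout, and key position are assumptions based on the question's example):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# In-memory database; use a real file name instead of :memory: to persist it.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
                        { RaiseError => 1, AutoCommit => 1 } );

# Assume five columns with the primary key in the last one.
$dbh->do("CREATE TABLE t1 (c1, c2, c3, c4, pk PRIMARY KEY)");
$dbh->do("CREATE TABLE t2 (c1, c2, c3, c4, pk PRIMARY KEY)");

# blank_is_undef turns empty CSV fields into NULLs, so COALESCE can merge them.
my $csv = Text::CSV_XS->new( { binary => 1, blank_is_undef => 1 } );

for my $spec ( [ 'file1.csv', 't1' ], [ 'file2.csv', 't2' ] ) {
    my ( $file, $table ) = @$spec;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $sth = $dbh->prepare("INSERT INTO $table VALUES (?,?,?,?,?)");
    while ( my $row = $csv->getline($fh) ) {
        $sth->execute(@$row);
    }
    close $fh;
}

# Emulate a full outer join: t1 rows merged with matching t2 rows,
# plus the t2-only rows.
my $sql = q{
    SELECT COALESCE(t1.c1, t2.c1), COALESCE(t1.c2, t2.c2),
           COALESCE(t1.c3, t2.c3), COALESCE(t1.c4, t2.c4), t1.pk
    FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk
    UNION ALL
    SELECT c1, c2, c3, c4, pk FROM t2
    WHERE pk NOT IN (SELECT pk FROM t1)
};
for my $row ( @{ $dbh->selectall_arrayref($sql) } ) {
    print join( ',', map { defined $_ ? $_ : '' } @$row ), "\n";
}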
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){          # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){         # Fast
    print $fh join('', @data);
}
else {                             # Fast
    print $fh @data;
}
If you want a merge you should really merge. First you have to sort your data by key, and then merge! You will beat even MySQL in performance; I have a lot of experience with this.
You can write something along these lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;

my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";

my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}

close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
    chomp;
    $line_for{read_csv_string($_,$position)} = $_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
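For instance, a sketch of that suggestion (the file names are placeholders, the per-line merging is elided, and write_line is the poster's own sub):

my @outlines;                                    # accumulate output lines instead of one big string

open my $in_fh,  '<', 'file2.csv'  or die $!;
open my $out_fh, '>', 'merged.csv' or die $!;

while ( my $line = <$in_fh> ) {
    chomp $line;
    # ... build the merged line for this row as in the original sub ...
    push @outlines, "$line\n";
}

print {$out_fh} @outlines;                       # one write of the whole array
# or, if a single string is really needed:
# write_line('merged.csv', join('', @outlines));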
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using a hash, your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line; this bit is slow. Removing it should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not too much, I think; there is always going to be read-ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
    my ($path, $record) = @_;

    open my $fh, "<", $path or croak "$0: open $path: $!";

    my $csv = Text::CSV->new;
    local $_;
    while (<$fh>) {
        if ($csv->parse($_)) {
            my @f = map length($_) ? $_ : undef, $csv->fields;
            next unless @f >= 1;

            my $primary = pop @f;
            if ($record->{$primary}) {
                $record->{$primary}[$_] //= $f[$_]
                    for 0 .. $#{ $record->{$primary} };
            }
            else {
                $record->{$primary} = \@f;
            }
        }
        else {
            warn "$0: $path:$.: parse failed; skipping...\n";
            next;
        }
    }
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
    '42' => [
        'one',
        'two',
        'three',
        'four'
    ]
};