Remove duplicated lines and sort the table in Perl

I have a table like this:
+ chr13 25017807 6
+ chr10 128074490 1
- chr7 140968671 1
+ chr10 79171976 3
- chr7 140968671 1
+ chr12 4054997  6
+ chr13 25017807 6
+ chr15 99504255 6
- chr8 91568709 5
It has already been read into Perl as a string variable (the return value of an external shell script). I need to remove the duplicated lines, sort the table by the last column, and then print it out. How should I do it in Perl? Thanks!

Assuming the data is contained in the string $string, this solution would work:
my %seen; # just needed to remove duplicates
my $deduped_string =
join "\n", # 6. join the lines to a single string
map { join(" ", @$_) } # 5. join the fields of each line to a string
sort { $a->[-1] <=> $b->[-1] } # 4. sort arrayrefs by last field, numerically
map { [split] } # 3. split line into fields, store in anon arrayref
grep { not $seen{$_}++ } # 2. dedupe the lines
split /\n/, $string; # 1. split string into lines
This gargantuan expression executes from bottom towards top (or right to left). It consists of multiple composable transformers and filters:
map {BLOCK} LIST applies the code in the block to each value of the list. It transforms the list element-wise.
grep {BLOCK} LIST selects those elements from the list where the block returns true. It therefore filters the list and only outputs elements that satisfy a certain condition.
sort {BLOCK} LIST resorts the list. The block must return -1 if $a is less than $b, 1 if it is greater, or zero when equal. The <=> operator compares scalars numerically in this fashion. If the sort function is omitted, string comparison is used.
join STRING, LIST concatenates the elements of the list with the string in between.
split REGEX, STRING splits the string into pieces. The regex matches the delimiter (not usually returned). split and join can be considered inverse operations. If the string is omitted, $_ is used. When the regex is omitted, it works similarly to split /\s+/, $_, i.e. splits at every whitespace character.
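For instance, a tiny round-trip using one line from the question's data (split with no pattern, or with ' ', splits on runs of whitespace):
my $line = "+ chr13 25017807 6";
my @fields = split ' ', $line;  # ("+", "chr13", "25017807", "6")
my $again = join " ", @fields;  # back to "+ chr13 25017807 6"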
This solution uses at its heart the Schwartzian Transform, a technique/idiom that enables cheap sorting by expensive-to-calculate keys. In its general form, it is
my @sorted_data =
map { $_->[0] } # 3. map back to the original value
sort { $a->[1] <=> $b->[1] } # 2. sort by the special key
map { [$_, create_the_key($_)] } # 1. annotate each value with a key
@data;
In my specific case, the special key is the last column of each record; to obtain the original data (or an equivalent form) from the annotated data, I join the fields together. As mpapec points out, I could also have carried the original line through the transform; this would preserve the original alignment of the lines, as sketched below.
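A sketch of that variant (untested, same assumptions as above): each line is paired with its numeric last field, sorted by that key, and then recovered unchanged.
my %seen;
my $deduped_string =
join "\n",                       # join the lines back together
map { $_->[0] }                  # recover the original line
sort { $a->[1] <=> $b->[1] }     # sort by the extracted key
map { [$_, (split)[-1]] }        # pair each line with its last field
grep { not $seen{$_}++ }         # dedupe the lines
split /\n/, $string;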

For a beginner, I'd do it like this:
use strict; use warnings;
my $file = "table.txt";
open(my $fh, "<", $file) || die "Can't open $file: $!\n";
my @lines;
# read the file and save a transformed version to @lines
while (my $line = <$fh>) {
chomp($line); # remove final newline
$line =~ s/ +/:/g; # make ":" the new separator
my @fields = split(/:/,$line); # split at the separator
my $newline = "$fields[3]:$fields[0]:$fields[1]:$fields[2]"; # reorder fields: count first
push(@lines, $newline); # save the new line
}
@lines = sort(@lines); # sort lines alphabetically:
# duplicate lines are now consecutive
my $uniqline=""; # the last unique line
foreach my $line (@lines) {
# do this if the current line isn't string-equal to the last line
# (i.e. skip all lines that are equal to the previous line)
if ($uniqline ne $line) {
$uniqline = $line; # remember the last line
# print fields in original order
my @fields = split(/:/,$line);
printf(" %s %7s %11s %s\n",$fields[1],$fields[2],$fields[3],$fields[0]);
}
}
I got a slightly different result...
+ chr10 128074490 1
- chr7 140968671 1
+ chr10 79171976 3
- chr8 91568709 5
+ chr12 4054997 6
+ chr13 25017807 6
+ chr15 99504255 6

This filters out duplicate lines and sorts by the last column at the end:
perl -ane 'next if $s{$_}++; push @r,[$_,@F]}{ print $$_[0] for sort { $$a[-1] <=> $$b[-1] } @r' file
The }{ trick closes the implicit while loop that -n wraps around the code and opens a bare block that runs once after all input has been read, like an END block. Written out in full, it is almost the same as:
use strict;
use warnings;
open my $fh, '<', 'file' or die $!;
my (%seen_line, @result_unique_lines);
while (<$fh>) {
# $_ => content of current line
# skip current if it's duplicate
next if $seen_line{$_}++;
my @line_values = split;
push @result_unique_lines, [$_, @line_values];
}
close $fh;
# sort lines
@result_unique_lines = sort { $a->[-1] <=> $b->[-1] } @result_unique_lines;
for my $aref (@result_unique_lines) {
my $line = $aref->[0];
print $line;
}

Related

Sort comma-delimited file by three columns with custom criteria in Perl

I have a comma-delimited, text file. I want to sort the file by the 3rd column first, then the 2nd column, then the 1st column.
However, I want the 3rd column to be sorted alphabetically, with the longest value first.
For example, AAA, then AA, then A, then BBB, then BB, then B, then CCC, then CC, and so on.
Input (alpha-sort-test2.txt):
JOHN,1,A
MARY,3,AA
FRED,5,BBB
SAM,7,A
JOHN,3,AAA
JOHN,2,AAA
BETTY,2,AAA
JARROD,7,AAA
JOANNE,2,BB
AMANDA,2,DD
AMY,5,B
PETE,7,CC
MATT,4,B
SARAH,3,CCC
GEORGE,3,CC
AMANDA,3,AAA
The Perl code that I have so far is as follows:
$infile = "alpha-sort-test2.txt";
$outfile = "alpha-sort-test-sorted2.txt";
open (INFILE, "<$infile") or die "Could not open file $infile $!";
open (OUTFILE, ">$outfile");
my @array = sort howtosort <INFILE>;
foreach (@array)
{
chomp;
print "$_\n";
print OUTFILE "$_\n";
}
sub howtosort
{
my @flds_a = split(/,/, $a);
my @flds_b = split(/,/, $b);
$flds_a[2] cmp $flds_b[2];
}
close INFILE;
close OUTFILE;
Current output (alpha-sort-test-sorted2.txt):
JOHN,1,A
SAM,7,A
MARY,3,AA
AMANDA,3,AAA
JOHN,3,AAA
JOHN,2,AAA
BETTY,2,AAA
JARROD,7,AAA
AMY,5,B
MATT,4,B
JOANNE,2,BB
FRED,5,BBB
PETE,7,CC
GEORGE,3,CC
SARAH,3,CCC
AMANDA,2,DD
Desired output:
BETTY,2,AAA
JOHN,2,AAA
AMANDA,3,AAA
JOHN,3,AAA
JARROD,7,AAA
MARY,3,AA
JOHN,1,A
SAM,7,A
FRED,5,BBB
JOANNE,2,BB
MATT,4,B
AMY,5,B
SARAH,3,CCC
GEORGE,3,CC
PETE,7,CC
AMANDA,2,DD
Thanks in advance.
There's a little complication with that criterion for the third field.
Lexicographic comparison goes character by character, so abc is less than ax; but when one string is a prefix of the other, the longer one is greater. So ab is less than b, but ab is greater than a.
Thus that requirement for the third field mixes these two things and breaks cmp right down the middle. If we were to use cmp then ab comes before b (correct) but aa comes after a (not wanted). I don't see how to make use of cmp at all for that requirement.
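A quick demonstration of the two conflicting behaviours (cmp returns -1, 0, or 1):
say "ab" cmp "b";   # -1: 'a' sorts before 'b', so "ab" comes first (wanted)
say "aa" cmp "a";   #  1: "a" is a prefix of "aa", so the longer string comes last (not wanted)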
So here's a very basic implementation of it, for these criteria
use warnings;
use strict;
use feature 'say';
use Path::Tiny qw(path); # convenience
my $file = shift // die "Usage: $0 file\n";
my @lines = path($file)->lines({ chomp => 1 });
my @sorted =
map { $_->[0] }
sort { custom_sort($a, $b) }
map { [$_, split /,/] }
@lines;
say for @sorted;
sub custom_sort {
my ($aa, $bb) = @_;
# Last field for both terms, their lengths
my ($af, $bf) = map { $_->[-1] } $aa, $bb;
my ($len_a, $len_b) = map { length } $af, $bf;
# Strip and return first characters and compare them lexicographically
# Then compare lengths of original strings if needed
# Keep going until difference is found or one string is depleted
while (
(my $ca = substr $af, 0, 1, "") and
(my $cb = substr $bf, 0, 1, "") )
{
if ($ca gt $cb) {
return 1
}
elsif ($ca lt $cb) {
return -1;
}
elsif ($len_a < $len_b) {
return 1
}
elsif ($len_a > $len_b) {
return -1
}
}
# Still here, so third field was the same; use other two criteria
return
$aa->[2] <=> $bb->[2]
||
$aa->[1] cmp $bb->[1];
}
This prints out the desired list.
Some comments
Before invoking sort we first form an arrayref, with the whole string and its individual fields, so that the string need not be split later on every single comparison; this is the Schwartzian transform
Criterion for the third field: compare character by character alphabetically until a difference is found; if one string is contained in the other then the longer one wins. So the char-by-char comparison of abc and ab stops at b and abc 'wins'
The (optional) fourth argument in substr is the replacement for the returned substring, found per the second and third arguments. So here an empty string replaces the one-character substring that starts at 0 -- it removes and returns the first character. This is quite like using shift on an array (see the short demo after this list)
If the third fields are exactly the same then the second fields are compared numerically and if they are the same then the first fields are compared alphabetically
After the comparison we retrieve the original string from the sorted arrayrefs
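To see the four-argument substr in isolation (a standalone snippet, not part of the answer's code):
my $s = "abc";
my $c = substr $s, 0, 1, "";   # returns "a"; $s is now "bc"
# repeating this consumes the string one character at a time,
# much like shift consumes an array from the front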

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
@ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an # or $
in a double-quoted string immediately before interpolating a variable,
for example in "user #$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal #
symbol followed by the contents of $twitter_id: "user \#$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
@data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6 columns it would have created 7, because the last field, 'Mounted on', also contains a space and would have been incorrectly split into 2 columns.
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array (created by [....]). The split is on 2 or more spaces, \s{2,}. If that weren't specified, then the data 'snap reserve' with 1 space would have been incorrectly split.
I hope this makes what's going on clearer.
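A standalone illustration of the two split calls (assuming, as the answer does, that the real data separates fields with runs of two or more spaces):
my $hdr = "Filesystem total used avail capacity Mounted on";
my @cols = split ' ', $hdr, 6;   # 6 columns; "Mounted on" stays one field
my $row = "snap reserve  0TB  0TB  0TB  ---%  /vol/vfiler_PROD1_SF_isci_15K01/..";
my @f = split /\s{2,}/, $row;    # "snap reserve" (single space) survives intact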
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that.
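For illustration only (this is not part of the answer): one plausible way to extract that field, assuming the capacity is the only run of digits followed by a % sign on each raw line, is a regex capture inside the comparator:
my @sorted = sort {
my ($pa) = $a =~ /(\d+)%/;   # numeric capacity of line $a, if present
my ($pb) = $b =~ /(\d+)%/;
($pb || 0) <=> ($pa || 0);   # descending; lines without a match sort last
} @data;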
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
chomp $d;
my @fields = split /\s{2,}/, $d;
if( scalar @fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push @data, [ @fields ];
}
}
foreach my $i ( @data ) {
print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
$i->[4] .= '%';
print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
If using the $ notation, it is customary to write file variables in lower case, such as $fd. It is also typical to name them after "fd" (file descriptor).
You define but do not use the first three variables. If you don't apply chomp to them, they will keep their trailing newline. I have not done it as they are not used.
You are defining a list with a line in each element. But then you need a list ref inside to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list @data using push and turning the list to add into a list ref with [ @list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

Perl sort on numbers

I have a Perl script to process a file (containing lots of numbers) line by line.
File content (sample data; the first 3 numbers are separated by spaces, and the separator between the 3rd and 4th numbers is a tab):
1 2 3 15
2 9 8 30
100 106 321 92
9 8 2 59
300 302 69 88
....
Script content:
# snippet of script
open(INF, "$infile") || die "Unable to open file $infile: $!\n";
@content = <INF>;
close(INF);
foreach $line (@content) {
# blah blah, script to handle math here
# Now the numbers are stored in separate variables
# $n1 stores the 1st number, i.e.: 1
# $n2 stores the 2nd number, i.e.: 2
# $n3 stores the 3rd number, i.e.: 3
# $n4 stores the 4th number, i.e.: 15
# Solution code to be inserted here
}
I would like to:
Sort the variables $n1, $n2, $n3 and output them in ascending order.
At the end of foreach, get rid of duplicates
My approach:
# Insert below code to foreach
$numbers{$n1} = 1;
$numbers{$n2} = 1;
$numbers{$n3} = 1;
@keys = sort { $numbers{$b} <=> $numbers{$a} } keys %numbers;
# push @numbers, "$keys[0] $keys[1] $keys[2]";
$numbers2{"$keys[0] $keys[1] $keys[2]"} = 1;
This defines two hashes: 1st hash is for sorting, 2nd hash is for removing duplicates after sorting.
Is there any better approach?
Thanks,
Updated with another solution -- it is the lines that may be duplicate, not numbers on a line.
In order to remove duplicate lines it is easiest if we have all sorted lines of three numbers in an array. Then we post-process that by running them through uniq. There are (at least) two ways to do this.
Store lines in an array, each being a reference to a sorted array with the three numbers. Then for comparison construct a string of each on the fly. This is better if there is yet other processing of numbers somewhere along, as they are in arrays.
Build a string out of each sorted line and store those in an array. Then it's easier to compare.
Below I use the first approach, assuming that there is other processing of numbers.
use warnings;
use strict;
use feature qw(say);
use List::MoreUtils qw(uniq);
my $file = 'sort_nums.txt';
my @content = do {
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my @linerefs_all;
foreach my $line (@content) {
# Calculations ... numbers stored in ($n1, $n2, $n3, $n4)
my ($n1, $n2, $n3) = split ' ', $line; # FOR TESTING
# Add to @linerefs_all a reference to the sorted array with first three
push @linerefs_all, [ sort { $a <=> $b } ($n1, $n2, $n3) ];
}
# Remove dupes by comparing line-arrays as strings, then remake arrayrefs
my @linerefs = map { [ split ] } uniq map { join ' ', @$_ } @linerefs_all;
say "@$_" for @linerefs;
Using the posted lines in the file sort_nums.txt, the code above prints
1 2 3
2 8 9
100 106 321
69 300 302
Explanation of the post-processing line, read from the right.
The map on the right processes a list of arrayrefs. It dereferences each and joins its elements with a space, forming a string for the line. It returns a list of such strings, one per line.
That list is pruned of duplicates by uniq, which itself returns a list, fed into the map on the left.
In that map's block each string is split by (the default) white space into a list (of the numbers on the line), and then a reference to that is taken by [ ]. This map thus returns a list of references to arrays, one for each line, which is assigned to @linerefs.
This is then printed.
If this is too much to stomach for one statement break the process into steps, generating intermediate arrays. Or switch to the second method above.
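A sketch of that second method (untested, same setup and imports as the code above): keep each sorted line as a plain string so uniq can compare directly.
my @lines_as_strings;
foreach my $line (@content) {
my ($n1, $n2, $n3) = split ' ', $line;   # FOR TESTING, as above
push @lines_as_strings, join ' ', sort { $a <=> $b } ($n1, $n2, $n3);
}
say for uniq @lines_as_strings;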
Initial post, assuming that numbers on each line may be duplicates
I take the objective to be: sort three variables and keep only unique ones, for each line.
use List::MoreUtils qw(uniq);
foreach my $line (@content) {
# Calculations, numbers stored in ($n1, $n2, $n3, $n4)
my @nums = uniq sort { $a <=> $b } ($n1, $n2, $n3);
say "@nums";
}
Remember that after this you don't know which one(s) of $n1, $n2, $n3 may have been dropped.
If, for some reason, a non-core module is not suitable, see this in perlfaq4. For example,
my %seen = ();
my @nums = sort { $a <=> $b } grep { ! $seen{$_}++ } ($n1, $n2, $n3);
or, if you need it without an extra hash around
my @nums = do { my %seen; sort { $a <=> $b } grep { !$seen{$_}++ } ($n1, $n2, $n3) };

Need to extract value from a file and subtract from variable in Perl

FILE:
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
code:
use IPC::System::Simple qw( capture capturex );
use POSIX;
my $tb1_file = '/var/egridmanage_pl/daily_pl/egrid-csv/test.csv';
open my $fh1, '<', $tb1_file or die qq{Unable to open "$tb1_file" for input: $!};
my @t1_temp_12 = map {
chomp;
my @t1_ft_12 = split /,/;
sprintf "%.0f", $t1_ft_12[6] if $t1_ft_12[2] eq '12:00:00';
} <$fh1>;
print "TEMP @t1_temp_12\n";
my $result = @t1_temp_12 - 273.14;
print "$result should equal something closer to 24 ";
The $result value prints out -265.14, making me think the @t1_temp_12 is hashed.
So I tried to do awk
my $12temp = capture("awk -F"," '$3 == "12:00:00" {print $7 - 273-.15}' test.csv");
I've tried using ``, qx, open, and system, all giving the same error with the awk command.
But this errors out. When executing awk at the command line I get the favoured results.
This looks like there's some cargo cult programming going on here. It looks like all you're trying to do is find the line for 12:00:00 and print the temperature in degrees C rather than K.
Which can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
my #fields = split /,/;
print $fields[6] - 273.15 if $fields[2] eq "12:00:00";
}
__DATA__
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
Prints:
25.72
You don't really need to do map sprintf etc. (Although you could do a printf on that output if you do want to format it).
Edit: From the comments, it seems one of the sources of confusion is extracting an element from an array. An array is zero or more scalar elements - you can't just assign one to the other, because .... well, what should happen if there isn't just one element (which is the usual case).
Given an array, we can:
pop @array will return the last element (and remove it from the array) so you could my $result = pop @array;
$array[0] is the first element of the array, so we can my $result = $array[0];
Or we can assign one array to another: my ( $result ) = @array; - because on the left hand side we have a list now, and it's a single element - the first element of @array goes into $result. (The rest isn't used in this scenario - but you could do my ( $result, @anything_else ) = @array;)
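A compact illustration of those three options on a throwaway array:
my @array = (10, 20, 30);
my $last = pop @array;    # 30; @array is now (10, 20)
my $first = $array[0];    # 10; @array unchanged
my ($one) = @array;       # 10; list assignment takes the first element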
So in your example - if what you're trying to do is retrieve a value matching a criteria - the normal tool for the job would be grep - which filters an array by applying a conditional test to each element.
So:
my @lines = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print "@lines";
print $lines[0];
Which we can reduce to:
my ( $firstresult ) = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $firstresult;
But as we want to want to transform our array - map is the tool for the job.
my ( $result ) = map { (split /,/)[6] - 273.15 } grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $result;
First we:
use grep to extract the matching elements. (one in this case, but doesn't necessarily have to be!)
use map to transform the list, so that we turn each element into just its 6th field (index 6), and subtract 273.15
assign the whole lot to a list containing a single element - in effect just taking the first result, and throwing the rest away.
But personally, I think that's getting a bit complicated and may be hard to understand - and would suggest instead:
my $result;
while (<DATA>) {
my #fields = split /,/;
if ( $fields[2] eq "12:00:00" ) {
$result = $fields[6] - 273.15;
last;
}
}
print $result;
Iterate your data, split - and test - each line, and when you find one that matches the criteria - set $result and bail out of the loop.
@t1_temp_12 is an array. Why are you trying to subtract a single value from it?
my $result = @t1_temp_12 - 273.14;
Did you want to do this instead?
@t1_temp_12 = map {$_ - 273.14} @t1_temp_12;
As a shell one-liner, you could write your entire script as:
perl -F, -lanE 'say $F[6]-273.14 if $F[2] eq "12:00:00"' <<DATA
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
DATA
25.73

How to compare two text files by column and output the number of times the columns match

I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (which comprise a single line) are present in each.
Here is an example of input file format:
HWI-AT555:86:D0:6:2208:13551:55125 122 chr1 77028 255 94M555N7M * 0 0 GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&& NM:i:3 XS:A:+ NH:i:1
The important part is the sequence read id, which is the first column (ie HWI-....55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.
Here is what I have so far:
unless (@ARGV == 2) {
print "Use as follows: perl program.pl in1.file in2.file\n";
die;
}
my $in1 = $ARGV[0];
my $in2 = $ARGV[1];
open ONE, $in1;
open TWO, $in2;
my %hash1;
my @hit;
while (<ONE>){
chomp;
my @hit = split(/\t/, $_);
$hash1{$hit[0]}=1;
}
close ONE;
my @col;
while (<TWO>){
chomp;
my @col = split(/\t/, $_);
if ($col[0] =~ /^H/){ #only valid sequence read lines start with "H"
print "$col[0]\n" if defined($hash1{$_});
}
}
close TWO;
So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.
I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn't make something clear enough.
Initialize your %hash1 with zeros instead of ones:
while (<ONE>){
chomp;
my @hit = split(/\t/, $_);
# Start them as "0" for "no duplicates".
$hash1{$hit[0]} = 0;
}
Then, in your second loop, you can increment $hash1{$col[0]}:
while (<TWO>){
chomp;
my @col = split(/\t/, $_);
# Increment the counter if %hash1 has what we're looking for.
++$hash1{$col[0]} if(exists($hash1{$col[0]}));
}
There's no need to check $col[0] =~ /^H/ since %hash1 will only have entries for valid sequences, so you can just do an exists check on the hash. And you want to look at $hash1{$col[0]} rather than $hash1{$_}: you stored only the first field of each line in your first loop, while $_ holds the whole line. Furthermore, if you're just grabbing the first field of each line you don't need the chomp calls, but they do no harm so you can keep them if you want.
This leaves you with the all the repeated entries in %hash1 as entries with non-zero values and you can grep those out:
my @dups = grep { $hash1{$_} > 0 } keys %hash1;
And then display them with their counts:
for my $k (sort @dups) {
print "$k\t$hash1{$k}\n";
}
You could also check the counts while displaying the matches:
for my $k (sort keys %hash1) {
print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
}