summarize a table based on certain features in one column - perl

I tried to summarize the following table based a the same features in column 1:
infile:
A m
A m
A n
A n
A m
A c
A m
A i
A n
A n
B n
B n
B n
B n
B n
B n
C o
C i
C q
I wrote the following code, but I do not know why it does not report the last feature
perl code.pl 1 2 infile
use warnings;
use strict;
my $col_feature = $ARGV[0];
my $col_to_be_collapsed = $ARGV[1];
my $infile = $ARGV[2];
open( my $fh1, "<$infile" );
my $temp;
my $line_count = 0;
my %count = ();
my #array = ();
while ( my $line = <$fh1> ) {
chomp($line);
my #line = split( "\t| ", $line );
my $to_be_collapsed = $line[ $col_to_be_collapsed - 1 ];
my $feature = $line[ $col_feature - 1 ];
if ( $line_count >= 1 && $temp ne '' ) {
my #temp = split( "\t| ", $temp );
my $to_be_collapsed_temp = $temp[ $col_to_be_collapsed - 1 ];
my $feature_temp = $temp[ $col_feature - 1 ];
if ( $feature_temp eq $feature ) {
push( #array, $to_be_collapsed );
}
else {
map { $count{$_}++ } #array;
print "$feature_temp:\t";
print "$_:$count{$_}\t" foreach sort { $a cmp $b } keys %count;
%count = ();
#array = ();
$temp = $line;
push( #array, $to_be_collapsed );
print "\n";
}
}
else {
$temp = $line;
push( #array, $to_be_collapsed );
}
$line_count++;
}
#print $temp,"\n";
output:
A: c:1 i:1 m:4 n:4
B: n:6
But there is no any report for C in the first column!!
Thanks

It will be alot easier to use a hash in this particular case as you just need to keep a counter.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
#open my $fh, '<', 'infile'; # Uncomment for live file.
my $fh = \*DATA; # For testing only.
my %counter;
while (<$fh>) {
my ( $outerkey, $innerkey ) = split;
$counter{$outerkey}{$innerkey}++;
}
for my $outerkey ( sort keys %counter ) {
print "$outerkey:";
print "\t$_:$counter{$outerkey}{$_}" for sort keys %{ $counter{$outerkey} };
print "\n";
}
__DATA__
A m
A m
A n
A n
A m
A c
A m
A i
A n
A n
B n
B n
B n
B n
B n
B n
C o
C i
C q
Output:
A: c:1 i:1 m:4 n:4
B: n:6
C: i:1 o:1 q:1

Related

Using perl to calculate mean or median value of a column in a matrix

Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
A_1 18.66530716 0 10.45969216 52.71893547 40.04726048 32.16758825 38.27754435
A_2 25.19816467 0 12.5516306 37.95763354 28.39714834 25.7340706 37.581589
A_3 61.5006053 0 6.807664053 4.57493135 23.69514333 9.304974679 29.44245014
A_4 46.71317515 4.988346264 21.47872616 36.08568845 7.47600779 18.34871344 75.02919728
A_5 38.12488272 0 0 28.71499464 19.82997811 19.46785483 66.33787183
A_6 44.16019386 3.313750449 10.70121259 38.35466425 8.691025042 13.40792311 42.72152213
B_1 38.39720331 13.32601073 0 19.28006783 9.985810405 9.803455466 95.44530538
B_2 46.53021582 1.899838598 24.54086634 13.74342921 24.20186228 6.988206544 47.62545788
B_3 48.42890507 0 6.0308135 20.26433556 20.99119304 10.30393217 64.20344867
A_7 32.10687649 0 20.56239825 23.03079775 9.542753971 10.5395511 44.46513374
B_4 34.82673166 0 6.122746633 39.08916191 8.524472297 14.64540603 54.99744731
B_5 32.49685303 2.910517165 15.66506159 35.79294964 8.723952928 10.7058016 52.11522135
B_6 30.38974634 0 0 30.51870034 10.53778987 17.24225836 50.36058827
B_7 59.60856159 0 8.097826192 19.0468412 2.818575518 11.06841746 10.77608287
A_8 36.07790915 6.260541956 0 31.70212496 14.07396097 4.605650219 67.26011453
C_1 0 17.27445836 0 382.0309737 1.849224149 0 0
C_2 344.0389416 119.4010562 32.13217433 0 22.36821531 285.4766232 21.37974841
C_3 235.5547989 37.86357293 22.23167043 2.490045661 2.579360621 30.38709443 14.79226135
C_4 0 2.801263518 0 334.3615367 0 0 0
C_5 9.397916894 128.2900334 187.2504332 25.16745451 22.81140838 14.39668285 0
Here is the data matrix. Row is variable and column is sample ID.
A_1 - A_8 is clusterA, B_1 - B_7 is clusterB, C_1 - C_5 is clusterC.
Now I wanna calculate the mean or median of A_1 - A_8 as the value of clusterA, getting the median result as:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.10139593 0 10.58045238 33.89390671 16.95196954 15.87831827 43.59332793
Could anyone help me solve this problem using perl script?
Calculate both mean and median:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use List::Util qw(sum);
use POSIX qw(floor ceil);
my %data = ();
my %avg = ();
my %median = ();
while (<>) {
next if $. == 1;
my #fields = split;
my $cluster = substr($fields[0],0,1);
$data{$cluster} = [] unless exists($data{$cluster});
push #{$data{$cluster}}, [ #fields[1..$#fields] ];
}
for my $cluster (keys(%data)) {
for my $sampleNo (0..scalar(#{$data{$cluster}[0]})-1) {
my #samples = map { $_->[$sampleNo] } #{$data{$cluster}};
my $cnt = #samples;
$avg{$cluster}[$sampleNo] = sum(#samples)/$cnt;
my #sorted = sort #samples;
$median{$cluster}[$sampleNo] = (#sorted[floor(($cnt+1)/2)-1] +
#sorted[ceil(($cnt+1)/2)-1])/2;
}
}
print "Mean\n";
for my $cluster (sort keys (%data)) {
print join("\t", ($cluster,map {sprintf "%15.9f",$_ } #{$avg{$cluster}})),"\n";
}
print "Median\n";
for my $cluster (sort keys (%data)) {
print join("\t", ($cluster,map {sprintf "%15.9f",$_ } #{$median{$cluster}})),"\n";
}
Output:
perl test.pl <sample.txt
Mean
A 37.818389312 1.820329834 10.320165477 31.642471301 18.969159754 16.697040778 50.139427875
B 41.525459546 2.590909499 8.636759179 25.390783670 12.254808048 11.536782519 53.646221676
C 117.798331479 61.126076882 48.322855592 148.810002114 9.921641692 66.052080096 7.234401952
Median
A 37.101395935 0.000000000 11.626421595 37.021660995 34.222204410 22.600962715 43.593327935
B 38.397203310 0.000000000 24.540866340 20.264335560 24.201862280 14.645406030 52.115221350
C 235.554798900 17.274458360 187.250433200 25.167454510 2.579360621 14.396682850 0.000000000
Here is an example of how you can calculate the medians of the clusters:
use feature qw(say);
use strict;
use warnings;
my $fn = 'data.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %clusters;
while (my $line = <$fh>) {
chomp $line;
my ($id, #cols) = split " ", $line;
die "Bad format" if !#cols;
if ( $id =~ /^([A-Za-z]+)_/ ) {
$id = $1;
}
else {
die "Bad ID";
}
if (!exists $clusters{$id} ) {
$clusters{$id} = [];
}
my $samples = $clusters{$id};
for my $i (0..$#cols) {
push #{ $samples->[$i] }, $cols[$i];
}
}
close $fh;
print $header;
for my $id (sort keys %clusters) {
my $samples = $clusters{$id};
my #items;
push #items, sprintf "cluster%s", $id;
for my $sample (#$samples) {
my $median = calculate_median( $sample );
push #items, $median;
}
say join "\t", #items;
}
sub calculate_median {
my ( $sample ) = #_;
my #sorted = sort {$a <=> $b} #$sample;
my $N = scalar #sorted;
my $i = int ($N/2);
if ( $N % 2 == 0 ) {
my $m1 = $sorted[$i-1];
my $m2 = $sorted[$i];
return ($m1 + $m2)/2;
}
else {
return $sorted[$i];
}
}
Output:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.101395935 0 10.580452375 33.893906705 16.95196954 15.878318275 43.593327935
clusterB 38.39720331 0 6.122746633 20.26433556 9.985810405 10.7058016 52.11522135
clusterC 9.397916894 37.86357293 22.23167043 25.16745451 2.579360621 14.39668285 0

Is it possible to count the number of duplicates from two column by using single hash?

My input data as follow. From below data I want to unique the p1 p2 .. p5 and the first column and get the count of those.
ID M N
cc1 1 p1
cc1 10 p2
cc1 10 p2
cc2 1 p1
cc2 2 p5
cc3 2 p1
cc3 2 p4
I expected the result was
ID M p1 p2 p3 p4 p5
cc1 3 1 2 0 0 0
cc3 2 1 0 0 1 0
cc2 2 1 0 0 0 1
For this I tried the hash of hash and hash I'm getting output what I expect. But my doubt is it is possible to do this by using single hash.? Because the same data was stored into the two different hash.
my (%hash,$hash2);
<$fh>;
while (<$fh>)
{
my($first,$second,$thrid) = split("\t");
$hash{$first}{$thrid}++; #I tried $hash{$first}++{$thrid}++ It throws syntax error
$hash2{$first}++; #it is possible to reduce this hash
}
my #ar = qw(p1 p2 p3 p4 p5);
$, = "\t";
print #ar,"\n";
foreach (keys %hash)
{
print "$_\t$hash2{$_}\t";
foreach my $ary(#ar)
{
if(!$hash{$_}{$ary})
{
print "0\t";
}
else
{
print "$hash{$_}{$ary}\t";
}
}
print "\n";
}
No need to use 2 hashes. you can use only hash of hash. I've just modified your code. see that below code.
use strict;
use warnings;
my %hash;
<DATA>;
while (<DATA>)
{
chomp;
my($first,$second,$thrid) = split("\t");
$hash{$first}{$thrid}++; #I tried $hash{$first}++{$thrid}++ It throws syntax error
}
my #ar = qw(p1 p2 p3 p4 p5);
$, = "\t";
print #ar,"\n";
foreach (keys %hash)
{
# print "$_\t$hash2{$_}\t";
my #in = values $hash{$_};
my $cnt = eval(join("+",#in));
print "$_\t$cnt\t";
foreach my $ary(#ar)
{
if(!$hash{$_}{$ary})
{
print "0\t";
}
else
{
print "$hash{$_}{$ary}\t";
}
}
print "\n";
}
You have hash of hash to store data. first keys are id and second keys are N. Simply count values of the id, it gives the total values what you want.
I'd probably do it like that:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %count_of;
#read the header row
chomp( my #header = split ' ', <DATA> );
while (<DATA>) {
my ( $ID, $M, $N ) = split;
$count_of{ $ID }{ $N }++;
}
#print Dumper \%count_of;
#setup the output headers. We could autodetect, but some of these (p3) are entirely empty.
my #p_headers = qw ( p1 p2 p3 p4 p5 );
#if you did want to:
#my #p_headers = sort keys %{{map { $_ => 1 } map { keys %{$count_of{$_}} } keys %count_of }};
#will give p1 p2 p4 p5.
print join "\t", qw ( ID M ), #p_headers, "\n";
foreach my $ID ( sort keys %count_of ) {
my $total = 0;
$total += $_ for values %{ $count_of{$ID} };
print join "\t",
$ID,
$total,
( map { $count_of{$ID}{$_} // 0 } #p_headers ),
"\n";
}
__DATA__
ID M N
cc1 1 p1
cc1 10 p2
cc1 10 p2
cc2 1 p1
cc2 2 p5
cc3 2 p1
cc3 2 p4

Merging several variables together in perl for RNA alignment

My aligned RNA sequence is
gccuucgggc
gacuucgguc
ggcuucggcc
For which I have done the following coding
open(RNAalign, $ARGV[0]) || "Can't open $ARGV[0]: $!\n";
while ($line = <RNAalign>) {
chomp ($line);
push (#line, $line);
}
#covariences=();
foreach $i (#line) {
foreach $j (#line) {
unless ($i eq $j) {
#search1=split("",$i);
#search2=split("",$j);
$k=0;
while($k<scalar(#search1)) {
if (#search1[$k] ne #search2[$k]) {
$string="";
$string="$k: #search1[$k] #search2[$k]\n";
push (#covariences, $string);
}
$k++;
}
}
}
}
This gives me, when printing:
1: c a
8: g u
1: c g
8: g c
1: a c
8: u g
1: a g
8: u c
1: g c
8: c g
1: g a
8: c u
What I want to do is to merge all the similar position while at the same time keeping any different character that they might have. As following (the characters does not have to be in that exact order):
1: c a g
8: g u c
You can create a hash of hashes (HoH) where the keys are the positions and the values are references to hashes whose keys are the characters at each position. A partial structure of your dataset would be the following:
'8' => {
'c' => 1,
'u' => 1,
'g' => 1
},
'1' => {
'c' => 1,
'a' => 1,
'g' => 1
},
'4' => {
'u' => 1
},
Here's code that produces this HoH structure:
use strict;
use warnings;
my ( %hash, $stringNum );
while (<DATA>) {
chomp;
my $i = 0;
$stringNum++;
$hash{ $i++ }{ lc $_ } = 1 for split //;
}
for my $position ( sort { $a <=> $b } keys %hash ) {
if ( keys %{ $hash{$position} } == $stringNum ) {
my #chars = keys %{ $hash{$position} };
print "$position: #chars\n";
}
}
__DATA__
gccuucgggc
gacuucgguc
ggcuucggcc
Output:
1: c a g
8: c u g
In the while loop, the number of strings is counted, and each string is split into its characters to create a HoH. In the for loop, if the number of keys (e.g., "c") equals the number of total strings, each string varies at that position, so the position and those are printed as an instance of covariance.
Hope this helps!
use strict;
use warnings;
use Data::Dumper;
my $s = "gccuucgggc
gacuucgguc
ggcuucggcc";
print "$s\n\n";
my $data = [];
my #lines = split(/\n/,$s);
chomp(#lines);
my $row=0;
my $col=0;
foreach my $line (#lines){
my #chars = split("",$line);
$col = 0;
foreach my $char (#chars){
$data->[$row]->[$col] = $char;
$col++;
}
$row++;
}
#print Dumper($data,$col,$row);
for(my $i=$col-1;$i>=0;$i--){
my $no_diff = 0;
my $result='';my $prev='';
#print "i: $i\n";
for(my $j=$row-1;$j>=0;$j--){
#print Dumper([$i,$j,$prev,$result,$data->[$j]->[$i]]);
if ($prev eq $data->[$j]->[$i]){
$no_diff++;
}
$result .= $data->[$j]->[$i];
$prev = $data->[$j]->[$i];
}
print "$i: $result\n" if !$no_diff;
}

Perl Mismatch among arrays

I have two arrays:
#array1 = (A,B,C,D,E,F);
#array2 = (A,C,H,D,E,G);
The arrays could be of different size. I want to find how many mismatches are there between the arrays. The indexes should be the same. In this case there are three mismatch :b->c,c->h and F->G.(i.e , The 'C' in $array[2] should not be considered a match to 'C' in $array[1]) I would like to get the number of mismatches as well as the mismatch.
foreach my $a1 ( 0 .. $#array1) {
foreach my $a2( 0 .. $#array2)
if($array1[$a1] ne $array2[$a2]) {
}
}
}
my %array_one = map {$_, 1} #array1;
my #difference = grep {!$array_one {$_}} #array1;
print "#difference\n";
Ans: gives me H, G but not C.
with my little Perl knowledge I tried this, with no result. Could you suggest me how I should deal this? Your suggestions and pointers would be very helpful.
You shouldn't have nested loops. You only need to go through the indexes once.
use List::Util qw( max );
my #mismatches;
for my $i (0..max($#array1, $#array2)) {
push #mismatches, $i
if $i >= #array1
|| $i >= #array2
|| $array1[$i] ne $array2[$i];
}
}
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Since you mentioned grep, this is how you'd replace the for with grep:
use List::Util qw( max );
my #mismatches =
grep { $_ >= #array1
|| $_ >= #array2
|| array1[$_] ne $array2[$_] }
0 .. max($#array1, $#array2);
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Here's an example using each_arrayref from List::MoreUtils.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my #out;
my $iter = each_arrayref(#_);
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}
This version should be faster if you are going to use this for determining the number of differences often. The output is exactly the same. It just doesn't have to work as hard when returning a number.
Read about wantarray for more information.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my $iter = each_arrayref(#_);
if( wantarray ){
# return structure
my #out;
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}else{
# only return a count of differences
my $out = 0;
while( my #current = $iter->() ){
++$out unless all_same #current;
}
return $out;
}
}
diff_array uses the subroutine all_same to determine if all of the current list of elements are the same.
sub all_same{
my $head = shift;
return undef unless #_; # not enough arguments
for( #_ ){
return 0 if $_ ne $head; # at least one mismatch
}
return 1; # all are the same
}
To get just the number of differences:
print scalar diff_array \#array1, \#array2;
my $count = diff_array \#array1, \#array2;
To get a list of differences:
my #list = diff_array \#array1, \#array2;
To get both:
my $count = my #list = diff_array \#array1, \#array2;
The output for the input you provided:
(
[ 1, 'B', 'C' ],
[ 2, 'C', 'H' ],
[ 5, 'F', 'G' ]
)
Example usage
my #a1 = qw'A B C D E F';
my #a2 = qw'A C H D E G';
my $count = my #list = diff_array \#a1, \#a2;
print "There were $count differences\n\n";
for my $group (#list){
my $index = shift #$group;
print " At index $index\n";
print " $_\n" for #$group;
print "\n";
}
You're iterating over both arrays when you don't want to be doing so.
#array1 = ("A","B","C","D","E","F");
#array2 = ("A","C","H","D","E","G");
foreach my $index (0 .. $#array1) {
if ($array1[$index] ne $array2[$index]) {
print "Arrays differ at index $index: $array1[$index] and $array2[$index]\n";
}
}
Output:
Arrays differ at index 1: B and C
Arrays differ at index 2: C and H
Arrays differ at index 5: F and G
Well, first, you're going to want to go over each element of one of the arrays, and compare it to the same element of the other array. List::MoreUtils provides an easy way to do this:
use v5.14;
use List::MoreUtils qw(each_array);
my #a = qw(a b c d);
my #b = qw(1 2 3);
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
say "a = $a, b = $b, idx = ", $ea->('index');
}
You can extend that to find where there is a non-match by checking inside that while loop (note: this assumes your arrays don't have undefs at the end, or that if they do, undef is the same as having a shorter array):
my #mismatch;
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
if (defined $a != defined $b || $a ne $b) {
push #mismatch, $ea->('index');
}
}
and then:
say "Mismatched count = ", scalar(#mismatch), " items are: ", join(q{, }, #mismatch);
The following code builds a list of mismatched pairs, then prints them out.
#a1 = (A,B,C,D,E,F);
#a2 = (A,C,H,D,E,G);
#diff = map { [$a1[$_] => $a2[$_]] }
grep { $a1[$_] ne $a2[$_] }
(0..($#a1 < $#a2 ? $#a1 : $#a2));
print "$_->[0]->$_->[1]\n" for #diff
You have the right idea, but you only need a single loop, since you are looking at each index and comparing entries between the arrays:
foreach my $a1 ( 0 .. $#array1) {
if($array1[$a1] ne $array2[$a1]) {
print "$a1: $array1[$a1] <-> $array2[$a1]\n";
}
}

Majority Voting in perl?

I have 5 files containing the same words. I want to read each word in all the files and decide the winning word by detecting the following characters in a word (*, #, $, &) separated by tabs. Then, I want to generate an output file. Ii can only have 2 winners. For example:
file1
we$
are*
...
file2
we$
are#
...
file3
we&
are*
...
file4
we$
are#
...
file5
we$
are&
...
output file:
we$
are*#
Here is how I started:
#!/usr/local/bin/perl -w
sub read_file_line {
my $fh = shift;
if ($fh and my $line = <$fh>) {
chomp($line);
return $line;
}
return;
}
open(my $f1, "words1.txt") or die "Can't";
open(my $f2, "words2.txt") or die "Can't";
open(my $f3, "words3.txt") or die "Can't";
open(my $f4, "words4.txt") or die "Can't";
open(my $f5, "words5.txt") or die "Can't";
my $r1 = read_file_line($f1);
my $r2 = read_file_line($f2);
my $r3 = read_file_line($f3);
my $r4 = read_file_line($f4);
my $r5 = read_file_line($f5);
while ($f5) {
#What can I do here to decide and write the winning word in the output file?
$r1 = read_file_line($f1);
$r2 = read_file_line($f2);
$r3 = read_file_line($f3);
$r4 = read_file_line($f4);
$r5 = read_file_line($f5);
}
Test Data Generator
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $i (1..5)
{
my $file = "words$i.txt";
open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
foreach my $w (qw (we are the people in charge and what we say goes))
{
my $suffix = substr('*#$&', rand(4), 1);
print $fh "$w$suffix\n";
}
}
Majority Voting Code
#!/usr/bin/env perl
use strict;
use warnings;
my #files = ( "words1.txt", "words2.txt", "words3.txt",
"words4.txt", "words5.txt"
);
my #fh;
{
my $n = 0;
foreach my $file (#files)
{
open my $f, '<', $file or die "Can't open $file for reading ($!)";
$fh[$n++] = $f;
}
}
while (my $r = process_line(#fh))
{
print "$r\n";
}
sub process_line
{
my(#fhlist) = #_;
my %words = ();
foreach my $fh (#fhlist)
{
my $line = <$fh>;
return unless defined $line;
chomp $line;
$words{$line}++;
}
my $combo = '';
foreach my $word (keys %words)
{
return $word if ($words{$word} > 2);
$combo .= $word if ($words{$word} == 2);
}
$combo =~ s/(\W)\w+(\W)/$1$2/;
return $combo;
}
Example Data and Results
$ perl datagenerator.pl
$ perl majorityvoter.pl > results.txt
$ paste words?.txt results.txt
we* we$ we& we# we# we#
are* are# are# are* are$ are*#
the* the& the# the# the& the&#
people& people& people$ people# people# people&#
in# in* in$ in* in* in*
charge* charge# charge& charge* charge# charge#*
and$ and* and$ and& and$ and$
what& what& what$ what& what# what&
we# we* we* we& we* we*
say$ say& say$ say$ say$ say$
goes$ goes& goes# goes# goes# goes#
$
This seems to be correct for the test data in the files generated.
Revised requirements - example output
The 'revised requirements' replaced the '*#$&' markers after the words with a tab and one of the letters 'ABCD'. After some swift negotiation, the question is restored to its original form. This output is from a suitably adapted version of the answer above - 3 code lines changed, 2 in the data generator, 1 in the majority voter. Those changes are not shown - they are trivial.
we C we D we C we C we D we C
are C are D are C are B are A are C
the B the D the A the A the D the A|D
people D people B people A people B people D people B|D
in D in B in C in B in D in D|B
charge C charge D charge D charge D charge A charge D
and A and B and C and C and B and B|C
what B what B what B what C what C what B
we D we B we D we B we A we B|D
say D say D say B say D say D say D
goes A goes C goes A goes C goes A goes A
Revised test generator - for configurable number of files
Now that the poster has worked out how to handle the revised scenario, this is the data generator code I used - with 5 tags (A-E). Clearly, it would not take a huge amount of work to configure the number of tags on the command line.
#!/usr/bin/env perl
use strict;
use warnings;
my $fmax = scalar(#ARGV) > 0 ? $ARGV[0] : 5;
my $tags = 'ABCDE';
my $ntags = length($tags);
my $fmt = sprintf "words$fmax-%%0%0dd.txt", length($fmax);
foreach my $fnum (1..$fmax)
{
my $file = sprintf $fmt, $fnum;
open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
foreach my $w (qw(We Are The People In Charge And What We Say Goes))
{
my $suffix = substr($tags, rand($ntags), 1);
print $fh "$w\t$suffix\n";
}
}
Revised Majority Voting Code - for arbitrary number of files
This code works with basically arbitrary numbers of files. As noted in one of the (many) comments, it does not check that the word is the same in each file as required by the question; you could get quirky results if the words are not the same.
#!/usr/bin/env perl
use strict;
use warnings;
my #files = scalar #ARGV > 0 ? #ARGV :
( "words1.txt", "words2.txt", "words3.txt",
"words4.txt", "words5.txt"
);
my $voters = scalar(#files);
my #fh;
{
my $n = 0;
foreach my $file (#files)
{
open my $f, '<', $file or die "Can't open $file for reading ($!)";
$fh[$n++] = $f;
}
}
while (my $r = process_line(#fh))
{
print "$r\n";
}
sub process_line
{
my(#fhlist) = #_;
my %words = ();
foreach my $fh (#fhlist)
{
my $line = <$fh>;
return unless defined $line;
chomp $line;
$words{$line}++;
}
return winner(%words);
}
# Get tag X from entry "word\tX".
sub get_tag_from_word
{
my($word) = #_;
return (split /\s/, $word)[1];
}
sub winner
{
my(%words) = #_;
my $maxscore = 0;
my $winscore = ($voters / 2) + 1;
my $winner = '';
my $taglist = '';
foreach my $word (sort keys %words)
{
return "$word\t$words{$word}" if ($words{$word} >= $winscore);
if ($words{$word} > $maxscore)
{
$winner = $word;
$winner =~ s/\t.//;
$taglist = get_tag_from_word($word);
$maxscore = $words{$word};
}
elsif ($words{$word} == $maxscore)
{
my $newtag = get_tag_from_word($word);
$taglist .= "|$newtag";
}
}
return "$winner\t$taglist\t$maxscore";
}
One Example Run
After considerable experimentation on the data presentation, one particular set of data I generated gave the result:
We A|B|C|D|E 2 B C C E D A D A E B
Are D 4 C D B A D B D D B E
The A 5 D A B B A A B E A A
People D 4 E D C D B E D D B C
In D 3 E C D D D B C A A B
Charge A|E 3 E E D A D A B A E B
And E 3 C E D D C A B E B E
What A 5 B C C A A A B A D A
We A 4 C A A E A E C D A E
Say A|D 4 A C A A D E D A D D
Goes A 3 D B A C C A A E E B
The first column is the word; the second is the winning tag or tags; the third (numeric) column is the maximum score; the remaining 10 columns are the tags from the 10 data files. As you can see, there two each of 'We A', 'We B', ... 'We E' in the first row. I've also generated (but not preserved) one result set where the maximum score was 7. Given enough repetition, these sorts of variations are findable.
Sounds like the job for a hash of hashes. Untested code:
use strict;
use warnings;
use 5.010;
use autodie;
use List::Util qw( sum reduce );
my %totals;
my #files = map "words$_.txt", 1..5;
for my $file (#files) {
open my $fh, '<', $file;
while (<$fh>) {
chomp;
my ($word, $sign) = /(\w+)(\W)/;
$totals{$word}{$sign}++;
}
}
open my $totals_fh, '>', 'outfile.txt';
my #sorted_words = sort { sum values %{$totals{$a}} <=> sum values %{$totals{$b}} } keys %totals; #Probably something fancier here.
for my $word (#sorted_words[0, 1]) {
#say {$totals_fh} $word, join('', keys %{$totals{$word}} ), "\t- ", function_to_decide_text($totals{$word});
say {$totals_fh} $word, reduce {
$totals{$word}{ substr $a, 0, 1 } == $totals{$word}{$b} ? $a . $b
: $totals{$word}{ substr $a, 0, 1 } > $totals{$word}{$b} ? $a
: $b;
} keys %{ $totals{$word} };
}
EDIT: Forgot about the only two winners part. Fixed, somewhat.
EDIT2: Fixed as per comments.
#!/usr/bin/perl
use strict;
use warnings;
my #files = qw(file1 file2 file3 file4 file5);
my $symbols = '*#$&'; # no need to escape them as they'll be in a character class
my %words;
foreach my $file (#files) {
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>) {
if (/^(\w+[$symbols])$/) {
$words{$1} ++; # count the occurrences of each word
}
}
close $fh;
}
my $counter = 0;
my $previous = -1;
foreach my $word (sort {$words{$b} <=> $words{$a}} keys %words) {
# make sure you don't exit if two words at the top of the list
# have the same number of occurrences
if ($previous != $words{$word}) {
last if $counter > 1;
}
$counter ++; # count the output
$previous = $words{$word};
print "$word occurred $words{$word} times.\n";
}
Worked when I tried it out...