Parsing File Delimited Vertically in Perl - perl

I have a file that looks like this:
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
Each record chunk has different number of lines, (e.g. CX entry does not always present).
But if CX exists, in only appear as 1 entry only.
We want to get a Hash that takes "MH" as keys and "CX" as values.
Hence parsing the above data we hope to get this structure:
$VAR = { "Urinary Bladder" => ["CYST-" , "VESIC-"]};
What's the right way to parse it?
I'm stuck with this, that doesn't give me result as I want.
use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {
chomp;
if (/^MH = (\w+/)) {
$key = $1;
push #{$bighash{$key}}, " ";
}
elsif ( /^CX = (\w+/)) {
$cx = $1;
}
else {
push #{$bighash{$key}}, $cx;
}
}

This becomes simpler if you use $/ to read the data a paragraph at a time. I'm surprised that no-one else has suggested that.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %bighash;
$/ = '';
while (<DATA>) {
if (my ($k) = /^MH = (.*?)$/m and my ($v) = /^CX = (.*?)$/m) {
$bighash{$k} = [ $v =~ /([A-Z]+-)/g ];
}
}
say Dumper \%bighash;
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
The output looks like this:
$VAR1 = {
'Urinary Bladder' => [
'CYST-',
'VESIC-'
]
};

Try the following. And it's probably a good idea to examine the changes (or listen to Aki):
use strict;
use warnings;
use Data::Dumper;
my %bighash;
my $current_key;
while ( <DATA> ) {
chomp;
if ( m/^MH = (.+)/ ) {
$current_key = $1;
} elsif ( /^CX = (.+)/ ) {
my $text = $1;
$bighash{ $current_key } = [ $text =~ /([A-Z]+-)/g ];
}
}
print Dumper ( \%bighash );
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
Update: Used Regex-Captures instead of split and grep

Haven't practiced my perl kung fu lately but the last else statement looks fishy.
Try dropping the last else statement and add the 'push' statement straight after the second elsif. Basically do the push operation straight after matching the CX.
Also, you know that MH must always appear before a CX otherwise the logic breaks.

Fix the regular expressions
/^MH = (\w+/) should be /^MH (\w+)/. You may want to use \s+ or \s* instead of space
Remove push from the if block
Remove else block
In the elsif block Push $cx into hash using the key $key
List item
Add use strict; and use warnings; to your code
Try these and if you have difficulty i will help you with the code

It might be simpler to use Config::Tiny or Config::YAML to do an initial pass over the file and then loop through each record individually. Although if your file is like a gigabyte or more this might suck up all your memory.

Here is something I quickly did, I hope it gives you an idea to start from:
use Data::Dumper;
# Set your record separator
{
local $/="*NEWRECORD\n";
while(<DATA>) {
# Get rid of your separator
chomp($_);
print "Parsing record # $.\n";
push #records, $_ if ( $_ );
}
}
foreach (#records) {
# Get your sub records
#lines = split(/\n/,$_);
my %h = ();
my %result = ();
# Create a hash from your sub records
foreach (#lines) {
($k, $v) = split(/\s*=\s*/, $_);
$h{$k} = $v;
}
# Parse the CX and strip the lower case comments
$h{ 'CX' } =~ s/[a-z]//g;
$h{ 'CX' } =~ s/^\s+//g;
# Have the upper case values as an array ref in the result hash
$result{ $h{ 'MH' } } = [ split( /\s+/, $h{ 'CX' } ) ] if ( $h{ 'CX' } );
print Dumper( \%h );
print "Result:\n";
print Dumper( \%result );
}
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

Related

Cant get Weighted Cosine Similarity to work

I'm trying to get the Weighted Cosine Similarity of two documents. I'm using Text::Document and Text::DocumentCollection. My code seems to work but it isn't returning a number as I expected.
Here is my code
use strict;
use warnings;
use Text::Document;
use Text::DocumentCollection;
my $newfile = shift #ARGV;
my $newfile2 = shift #ARGV;
##This is in another file.
my $t1 = countFreq($newfile);
my $t2 = countFreq($newfile2);
my $collection = Text::DocumentCollection->new(file => 'coll.db');
$collection->Add("One", $t1);
$collection->Add("Two", $t2);
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
print "\nWeighted Cosine Sim is: $wSim\n";
All this returns is Weighted Cosine Sim is: without anything following the colon.
Here is the code for countFreq:
sub countFreq{
my ($file) = #_;
my $t1 = Text::Document->new();
open (my $info, $file) or die "Could not open file.";
while (my $line = <$info>) {
chomp $line;
$line =~ s/[[:punct:]]//g;
foreach my $str (split /\s+/, $line) {
if (!defined $sp{lc($str)}) {
$t1 -> AddContent ($str);
}
}
}
return $t1;
}
###Update
Here's an example program that works fine. It's based on looking at the test code in the distribution for inspiration
I was expecting the test to be much less sensitive, so I was getting zeroes from two wildly different text sources. This example adds three short sentences $d1, $d1, and $d3, to a collection $c, and then compares each of the three documents to $d1
Comparing $d1 to itself produces 1 -- an exact match, as expected, while comparing $d2and $d3 gives 0.087 and 0 respectively -- a partial match and no match at all
I hope this helps you to resolve your specific issue?
use strict;
use warnings 'all';
use Text::Document;
use Text::DocumentCollection;
my $d1 = Text::Document->new;
$d1->AddContent( 'my heart belongs to sally webster' );
my $d2 = Text::Document->new;
$d2->AddContent( 'my heart belongs to the girl next door' );
my $d3 = Text::Document->new;
$d3->AddContent( 'I want nothing to do with my neighbours' );
my $c = Text::DocumentCollection->new( file => 'coll2.db' );
$c->Add('one', $d1);
$c->Add('two', $d2);
$c->Add('three', $d3);
for my $doc ( $d1, $d2, $d3 ) {
my $wcs = $d1->WeightedCosineSimilarity(
$doc,
\&Text::DocumentCollection::IDF,
$c
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wcs;
print $wcs, "\n";
}
###output
1
0.0874311036726221
0
This is the code for Text::Document::WeightedCosineSimilarity
# this is rather rough
sub WeightedCosineSimilarity
{
my $self = shift;
my ($e,$weightFunction,$rock) = #_;
my ($Dv,$Ev) = ($self->{terms}, $e->{terms});
# compute union
my %union = %{$self->{terms}};
my #keyse = keys %{$e->{terms}};
#union{#keyse} = #keyse;
my #allkeys = keys %union;
# weighted D
my #Dw = map(( defined( $Dv->{$_} )?
&{$weightFunction}( $rock, $_ )*$Dv->{$_} : 0.0 ),
#allkeys
);
# weighted E
my #Ew = map(( defined( $Ev->{$_} )?
&{$weightFunction}( $rock, $_ )*$Ev->{$_} : 0.0 ),
#allkeys
);
# dot product of D and E
my $dotProduct = 0.0;
map( $dotProduct += $Dw[$_] * $Ew[$_] , 0..$#Dw );
# norm of D
my $nD = 0.0;
map( $nD += $Dw[$_] * $Dw[$_] , 0..$#Dw );
$nD = sqrt( $nD );
# norm of E
my $nE = 0.0;
map( $nE += $Ew[$_] * $Ew[$_] , 0..$#Ew );
$nE = sqrt( $nE );
# dot product scaled by norm
if( ($nD==0) || ($nE==0) ){
return undef;
} else {
return $dotProduct / $nD / $nE;
}
}
I'm afraid I don't understand the theory behind what it is doing, but it looks like your problem is that either $nD ("norm of D") or $nE ("norm of D") is zero
All I can suggest is that your two text samples may be too similar/different, or perhaps they are too long/short?
Either way, your code should look like this so as to catch an invalid return value from the cosine function:
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wSim;
print "\nWeighted Cosine Sim is: $wSim\n";

what does "$K2ko{$D}{$C} = 1" do in perl?

cat inputfile
A<b>Metabolism</b>
B
B <b>Overview</b>
C 01200 Carbon metabolism [PATH:ko01200]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
...
#
open KO,'<',"inputfile" or die $!;
my ($A,$B,$C,$D,$path_DESC,$KO_DESC);
my %K2ko; my %K2DESC; my %ko2desc;
while (<KO>) {
if (/^A<b>(.+)<\/b>/) {$A=$1;}
elsif (/^B\s+<b>(.+)<\/b>/) {$B=$1;}
elsif (/^C\s+\d+\s+(.+)\s+\[PATH:(ko\d+)\]/) {
$path_DESC=$1;
$C=$2;
$ko2desc{$C} = "$A\t$B\t$path_DESC";
}
elsif (/^D\s+(K\d+)\s+(.*)/) {
$D=$1;
$KO_DESC=$2;
$K2ko{$D}{$C} = 1;
$K2DESC{$D} = $KO_DESC;
}
}
close KO;
#
Could anyone would like to tell me what does "$K2ko{$D}{$C} = 1" do in the perl script?
Thank you for any advice.
This is called a hash of hashes, which gives you a multidimensional hash. Here, "1" is the value for the above mentioned hash key.
Try to use Data::Dumper for know the structure of your data.
use Data::Dumper;
my %K2ko;
my $D = "val1";
my $C = "val2";
$K2ko{$D}{$C} = 1;
print Dumper \%K2ko;
Output
$VAR1 = {
'val1' => {
'val2' => 1
}
};
Using your sample data:
if (/^A<b>(.+)<\/b>/) {$A=$1;}
sets $A to 'Metabolism'
elsif (/^B\s+<b>(.+)<\/b>/) {$B=$1;}
sets $B to 'Overview'
elsif (/^C\s+\d+\s+(.+)\s+\[PATH:(ko\d+)\]/) {...}
sets $path_DESC to 'Carbon metabolism', $C to 'ko01200' and the hash
$ko2desc{'ko01200'} = "Metabolism\tOverview\tCarbon metabolism"
and finally
elsif (/^D\s+(K\d+)\s+(.*)/) {...}
will set
$D='K12407';
$KO_DESC='GCK; glucokinase [EC:2.7.1.2]';
$K2ko{'K12407'}{'ko01200'} = 1;
$K2DESC{'K12407'} = 'GCK; glucokinase [EC:2.7.1.2]';
$K2ko is an hash of hashes, setting it to 1 you can easily see where the component K12407 is used:
print join ',', keys %{$K2ko{'K12407'}};

perl array from csv file creating newline in unexpected place

Hi I have a few scripts that convert an xlsx file to a tab seperated file, which then remove any commas, duplicates and then splits it by commas. (i do this to make sure users have not put any commas in a colomn)
I then do some stuff. and then convert it back to an xlsx file. This has always worked fine. But instead of opening and closing files all the time i thought i would push the file to an array and then convert it to an xlsx at the end. Unfortunatly when i try and convert back to an xlsx file it is creating a newline in the space between the name. If i OUTPUT to a csv file then Open it and convert to an xlsx file it works fine.
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::BasicRead;
use Excel::Writer::XLSX;
local $" = "'\n'";
open( STDERR, ">&STDOUT" );
#covert to csv
my $xlsx_WSD = ( "C:\\Temp\\testing_file.xlsx"),, 1;
my #csvtemp;
if ( -e $xlsx_WSD ) {
my $ss = new Spreadsheet::BasicRead($xlsx_WSD) or die;
my $col = '';
my $row = 0;
while ( my $data = $ss->getNextRow() ) {
$row++;
$col= join( "\t", #$data );
push #csvtemp, $col . "\n" if ( $col ne "" );
}
}
else {
print " C:\\Temp\\testing_file.xlsx file EXISTS ...!!\n";
print " please investigate and use the restore option if required !..\n";
exit;
}
;
my #arraynew;
my %seen;
our $Header_row = shift (#csvtemp);
foreach (#csvtemp){
chomp;
$_ =~ s/,//g;
$_ =~ s/\t/,/g;
# print $_ . "\n" if !$seen{$_}++ ;
push #arraynew, $_ . "\n" if !$seen{$_}++ ; #remove any dupes
}
#covert back to xlsx
my $workbook = Excel::Writer::XLSX->new("C:\\Temp\\testing_filet.xlsx");
my $worksheet = $workbook->add_worksheet();
my ( $x, $y ) = ( 0, 0 );
while (<#arraynew>) {
my #list = split /,/;
foreach my $c (#list) {
$worksheet->write( $x, $y++, $c );
}
$x++;
$y = 0;
}
__DATA__
Animal keeper M/F Years START DATE FRH FSM
GIRAFFE JAMES LE M 5 10/12/2007 Y
HIPPO JACKIE LEAN F 6 11/12/2007 Y
ZEBRA JAMES LEHERN M 7 12/12/2007 Y
GIRAFFE AMIE CAHORT M 5 13/12/2012 Y
GIRAFFE MICKY JAMES M 5 14/06/2007 Y
MEERKAT JOHN JONES M 9 15/12/2007 v v
LEOPPARD JIM LEE M 8 16/12/2002
unexpected result
GIRAFFE JAMES
LE M 5 10/12/2007 Y
"
HIPPO" JACKIE
LEAN F 6 11/12/2007 Y
"
ZEBRA" JAMES
LEHERN M 7 12/12/2007 Y
"
GIRAFFE" AMIE
CAHORT M 5 13/12/2012 Y
"
GIRAFFE" MICKY
JAMES M 5 14/06/2007 Y
"
MEERKAT" JOHN
JONES M 9 15/12/2007 v v
"
LEOPPARD" JIM
LEE M 8 16/12/2002
Since you are running this on Windows, have you considered using Win32::OLE instead?
use strict;
use Win32::OLE;
my $app = Win32::OLE->GetActiveObject('Excel.Application')
|| Win32::OLE->new('Excel.Application', 'Quit');
my $wb = $app->Workbooks->Open("C:/Temp/testing_file.xlsx");
my $ws = $wb->ActiveSheet;
my $max_row = $ws->UsedRange->Rows->Count;
my $max_col = $ws->UsedRange->Columns->Count;
my ($row, %already) = (1);
while ($row <= $max_row) {
my ($col, #output) = (1);
while ($col <= $max_col) {
my $val = $ws->Cells($row, $col)->{Text};
if ($val =~ /[,\t]/) {
$val =~ tr/,//d;
$val =~ tr/\t/,/;
$ws->Cells($row, $col)->{Value} = $val;
}
#output[$col - 1] = $val;
$col++;
}
if ($already{join "|", #output}++) {
$ws->Rows($row)->EntireRow->Delete;
$max_row--;
} else {
$row++;
}
}
$wb->SaveAs("C:\\temp\\testing_filet.xlsx");
This is an issue with end of line characters.
There are three conventions for marking the end of a line: with \n on Unix, \r\n on Windows and \r on Mac. It looks as though your script is assuming the Mac convention while input and output use the Windows convention.
So after reading the input, a leading \n appears on all lines except the first. As long as this is also the case with the output lines prior to composing them with \r, you end up with an output file with perfectly \r\n-delimited lines. Clearly it's better to make your script wary of what line ending convention the input is using and ensure it uses the same for splitting the lines and composing the output.

summarize a table based on certain features in one column

I tried to summarize the following table based a the same features in column 1:
infile:
A m
A m
A n
A n
A m
A c
A m
A i
A n
A n
B n
B n
B n
B n
B n
B n
C o
C i
C q
I wrote the following code, but I do not know why it does not report the last feature
perl code.pl 1 2 infile
use warnings;
use strict;
my $col_feature = $ARGV[0];
my $col_to_be_collapsed = $ARGV[1];
my $infile = $ARGV[2];
open( my $fh1, "<$infile" );
my $temp;
my $line_count = 0;
my %count = ();
my #array = ();
while ( my $line = <$fh1> ) {
chomp($line);
my #line = split( "\t| ", $line );
my $to_be_collapsed = $line[ $col_to_be_collapsed - 1 ];
my $feature = $line[ $col_feature - 1 ];
if ( $line_count >= 1 && $temp ne '' ) {
my #temp = split( "\t| ", $temp );
my $to_be_collapsed_temp = $temp[ $col_to_be_collapsed - 1 ];
my $feature_temp = $temp[ $col_feature - 1 ];
if ( $feature_temp eq $feature ) {
push( #array, $to_be_collapsed );
}
else {
map { $count{$_}++ } #array;
print "$feature_temp:\t";
print "$_:$count{$_}\t" foreach sort { $a cmp $b } keys %count;
%count = ();
#array = ();
$temp = $line;
push( #array, $to_be_collapsed );
print "\n";
}
}
else {
$temp = $line;
push( #array, $to_be_collapsed );
}
$line_count++;
}
#print $temp,"\n";
output:
A: c:1 i:1 m:4 n:4
B: n:6
But there is no any report for C in the first column!!
Thanks
It will be alot easier to use a hash in this particular case as you just need to keep a counter.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
#open my $fh, '<', 'infile'; # Uncomment for live file.
my $fh = \*DATA; # For testing only.
my %counter;
while (<$fh>) {
my ( $outerkey, $innerkey ) = split;
$counter{$outerkey}{$innerkey}++;
}
for my $outerkey ( sort keys %counter ) {
print "$outerkey:";
print "\t$_:$counter{$outerkey}{$_}" for sort keys %{ $counter{$outerkey} };
print "\n";
}
__DATA__
A m
A m
A n
A n
A m
A c
A m
A i
A n
A n
B n
B n
B n
B n
B n
B n
C o
C i
C q
Output:
A: c:1 i:1 m:4 n:4
B: n:6
C: i:1 o:1 q:1

Majority Voting in perl?

I have 5 files containing the same words. I want to read each word in all the files and decide the winning word by detecting the following characters in a word (*, #, $, &) separated by tabs. Then, I want to generate an output file. Ii can only have 2 winners. For example:
file1
we$
are*
...
file2
we$
are#
...
file3
we&
are*
...
file4
we$
are#
...
file5
we$
are&
...
output file:
we$
are*#
Here is how I started:
#!/usr/local/bin/perl -w
sub read_file_line {
my $fh = shift;
if ($fh and my $line = <$fh>) {
chomp($line);
return $line;
}
return;
}
open(my $f1, "words1.txt") or die "Can't";
open(my $f2, "words2.txt") or die "Can't";
open(my $f3, "words3.txt") or die "Can't";
open(my $f4, "words4.txt") or die "Can't";
open(my $f5, "words5.txt") or die "Can't";
my $r1 = read_file_line($f1);
my $r2 = read_file_line($f2);
my $r3 = read_file_line($f3);
my $r4 = read_file_line($f4);
my $r5 = read_file_line($f5);
while ($f5) {
#What can I do here to decide and write the winning word in the output file?
$r1 = read_file_line($f1);
$r2 = read_file_line($f2);
$r3 = read_file_line($f3);
$r4 = read_file_line($f4);
$r5 = read_file_line($f5);
}
Test Data Generator
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $i (1..5)
{
my $file = "words$i.txt";
open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
foreach my $w (qw (we are the people in charge and what we say goes))
{
my $suffix = substr('*#$&', rand(4), 1);
print $fh "$w$suffix\n";
}
}
Majority Voting Code
#!/usr/bin/env perl
use strict;
use warnings;
my #files = ( "words1.txt", "words2.txt", "words3.txt",
"words4.txt", "words5.txt"
);
my #fh;
{
my $n = 0;
foreach my $file (#files)
{
open my $f, '<', $file or die "Can't open $file for reading ($!)";
$fh[$n++] = $f;
}
}
while (my $r = process_line(#fh))
{
print "$r\n";
}
sub process_line
{
my(#fhlist) = #_;
my %words = ();
foreach my $fh (#fhlist)
{
my $line = <$fh>;
return unless defined $line;
chomp $line;
$words{$line}++;
}
my $combo = '';
foreach my $word (keys %words)
{
return $word if ($words{$word} > 2);
$combo .= $word if ($words{$word} == 2);
}
$combo =~ s/(\W)\w+(\W)/$1$2/;
return $combo;
}
Example Data and Results
$ perl datagenerator.pl
$ perl majorityvoter.pl > results.txt
$ paste words?.txt results.txt
we* we$ we& we# we# we#
are* are# are# are* are$ are*#
the* the& the# the# the& the&#
people& people& people$ people# people# people&#
in# in* in$ in* in* in*
charge* charge# charge& charge* charge# charge#*
and$ and* and$ and& and$ and$
what& what& what$ what& what# what&
we# we* we* we& we* we*
say$ say& say$ say$ say$ say$
goes$ goes& goes# goes# goes# goes#
$
This seems to be correct for the test data in the files generated.
Revised requirements - example output
The 'revised requirements' replaced the '*#$&' markers after the words with a tab and one of the letters 'ABCD'. After some swift negotiation, the question is restored to its original form. This output is from a suitably adapted version of the answer above - 3 code lines changed, 2 in the data generator, 1 in the majority voter. Those changes are not shown - they are trivial.
we C we D we C we C we D we C
are C are D are C are B are A are C
the B the D the A the A the D the A|D
people D people B people A people B people D people B|D
in D in B in C in B in D in D|B
charge C charge D charge D charge D charge A charge D
and A and B and C and C and B and B|C
what B what B what B what C what C what B
we D we B we D we B we A we B|D
say D say D say B say D say D say D
goes A goes C goes A goes C goes A goes A
Revised test generator - for configurable number of files
Now that the poster has worked out how to handle the revised scenario, this is the data generator code I used - with 5 tags (A-E). Clearly, it would not take a huge amount of work to configure the number of tags on the command line.
#!/usr/bin/env perl
use strict;
use warnings;
my $fmax = scalar(#ARGV) > 0 ? $ARGV[0] : 5;
my $tags = 'ABCDE';
my $ntags = length($tags);
my $fmt = sprintf "words$fmax-%%0%0dd.txt", length($fmax);
foreach my $fnum (1..$fmax)
{
my $file = sprintf $fmt, $fnum;
open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
foreach my $w (qw(We Are The People In Charge And What We Say Goes))
{
my $suffix = substr($tags, rand($ntags), 1);
print $fh "$w\t$suffix\n";
}
}
Revised Majority Voting Code - for arbitrary number of files
This code works with basically arbitrary numbers of files. As noted in one of the (many) comments, it does not check that the word is the same in each file as required by the question; you could get quirky results if the words are not the same.
#!/usr/bin/env perl
use strict;
use warnings;
my #files = scalar #ARGV > 0 ? #ARGV :
( "words1.txt", "words2.txt", "words3.txt",
"words4.txt", "words5.txt"
);
my $voters = scalar(#files);
my #fh;
{
my $n = 0;
foreach my $file (#files)
{
open my $f, '<', $file or die "Can't open $file for reading ($!)";
$fh[$n++] = $f;
}
}
while (my $r = process_line(#fh))
{
print "$r\n";
}
sub process_line
{
my(#fhlist) = #_;
my %words = ();
foreach my $fh (#fhlist)
{
my $line = <$fh>;
return unless defined $line;
chomp $line;
$words{$line}++;
}
return winner(%words);
}
# Get tag X from entry "word\tX".
sub get_tag_from_word
{
my($word) = #_;
return (split /\s/, $word)[1];
}
sub winner
{
my(%words) = #_;
my $maxscore = 0;
my $winscore = ($voters / 2) + 1;
my $winner = '';
my $taglist = '';
foreach my $word (sort keys %words)
{
return "$word\t$words{$word}" if ($words{$word} >= $winscore);
if ($words{$word} > $maxscore)
{
$winner = $word;
$winner =~ s/\t.//;
$taglist = get_tag_from_word($word);
$maxscore = $words{$word};
}
elsif ($words{$word} == $maxscore)
{
my $newtag = get_tag_from_word($word);
$taglist .= "|$newtag";
}
}
return "$winner\t$taglist\t$maxscore";
}
One Example Run
After considerable experimentation on the data presentation, one particular set of data I generated gave the result:
We A|B|C|D|E 2 B C C E D A D A E B
Are D 4 C D B A D B D D B E
The A 5 D A B B A A B E A A
People D 4 E D C D B E D D B C
In D 3 E C D D D B C A A B
Charge A|E 3 E E D A D A B A E B
And E 3 C E D D C A B E B E
What A 5 B C C A A A B A D A
We A 4 C A A E A E C D A E
Say A|D 4 A C A A D E D A D D
Goes A 3 D B A C C A A E E B
The first column is the word; the second is the winning tag or tags; the third (numeric) column is the maximum score; the remaining 10 columns are the tags from the 10 data files. As you can see, there two each of 'We A', 'We B', ... 'We E' in the first row. I've also generated (but not preserved) one result set where the maximum score was 7. Given enough repetition, these sorts of variations are findable.
Sounds like the job for a hash of hashes. Untested code:
use strict;
use warnings;
use 5.010;
use autodie;
use List::Util qw( sum reduce );
my %totals;
my #files = map "words$_.txt", 1..5;
for my $file (#files) {
open my $fh, '<', $file;
while (<$fh>) {
chomp;
my ($word, $sign) = /(\w+)(\W)/;
$totals{$word}{$sign}++;
}
}
open my $totals_fh, '>', 'outfile.txt';
my #sorted_words = sort { sum values %{$totals{$a}} <=> sum values %{$totals{$b}} } keys %totals; #Probably something fancier here.
for my $word (#sorted_words[0, 1]) {
#say {$totals_fh} $word, join('', keys %{$totals{$word}} ), "\t- ", function_to_decide_text($totals{$word});
say {$totals_fh} $word, reduce {
$totals{$word}{ substr $a, 0, 1 } == $totals{$word}{$b} ? $a . $b
: $totals{$word}{ substr $a, 0, 1 } > $totals{$word}{$b} ? $a
: $b;
} keys %{ $totals{$word} };
}
EDIT: Forgot about the only two winners part. Fixed, somewhat.
EDIT2: Fixed as per comments.
#!/usr/bin/perl
use strict;
use warnings;
my #files = qw(file1 file2 file3 file4 file5);
my $symbols = '*#$&'; # no need to escape them as they'll be in a character class
my %words;
foreach my $file (#files) {
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>) {
if (/^(\w+[$symbols])$/) {
$words{$1} ++; # count the occurrences of each word
}
}
close $fh;
}
my $counter = 0;
my $previous = -1;
foreach my $word (sort {$words{$b} <=> $words{$a}} keys %words) {
# make sure you don't exit if two words at the top of the list
# have the same number of occurrences
if ($previous != $words{$word}) {
last if $counter > 1;
}
$counter ++; # count the output
$previous = $words{$word};
print "$word occurred $words{$word} times.\n";
}
Worked when I tried it out...