I have these two text files and I would like to find any mismatch in the second column between the files. The mismatch to be identified is based on the types F, P and N, regardless of which lines they occur on. I have 1 F and 3 P in the first file, while the second file has 2 P, 1 N and 1 F. When the comparison is done, both files should have equal occurrences of each type: 1 F, 3 P and 1 N.
Text1:
f0x11 F
f0x34 P
drx99
dex67 P
edx43 P
sdx33
Text2:
1 P
2 N
4
5 F
6
7 P
Expected Output:
Text 1 has missing type of N
Text 2 has missing type of P
What I have tried so far does not produce the desired output.
code:
use strict;
my %ref_data;
my %ref_data2;
open my $fh, '<', 'Text1' or die "Could not open file to read:$!";
while (<$fh>) {
    chomp;
    my ($res, $type) = split;
    if (defined $type){
        $ref_data{$type} = "$type";
    }
}
our ($data,$data2);
open $fh, '<', 'Text2' or die "Could not open file to read:$!";
while (<$fh>) {
    chomp;
    my ($res, $type) = split;
    if (defined $type){
        $ref_data2{$type} = "$type";
        $data2 = $ref_data2{$type};
        $data  = $ref_data{$type};
        print "File 2 has missing type of $type\n" unless $data;
    }
}

foreach ($data){
    print "File 1 has missing type of $_\n" if $data ne $data2;
}
You appear to want to keep track of how many times the values in Column 2 occur within each file -- for example, in a comment you write, "I have 1F, 3P in first file while 2P, 1N and 1F in second file". If that's the case, you need a better data structure.
Specifically, one that counts occurrences of the values in Column 2, and you need those counts to be tracked separately for each file. That suggests a hash-of-hashes.
use strict;
use warnings;
# Example usage:
# perl YOUR_SCRIPT.pl a.txt b.txt
my @files = @ARGV;
# Count the values in Column 2, organizing the tallies like this:
# $tallies{COL_2}{FILE_NAME} = N
my %tallies;
while (<>) {
    my @cols = split;
    $tallies{$cols[1]}{$ARGV}++ if @cols > 1;
}
# Print discrepancies.
for my $c2 (keys %tallies) {
    my @t = map { $tallies{$c2}{$_} || 0 } @files;
    next if $t[0] == $t[1];
    print "$c2: $files[0] has $t[0]; $files[1] has $t[1]\n";
}
Example output:
N: a.txt has 0; b.txt has 1
P: a.txt has 3; b.txt has 2
Also worth noting: this code does not have to open any files explicitly, and file names are not hard-coded in the program. Instead, we pass input file names as command-line arguments, get those arguments via @ARGV, process lines in those files via <>, and know which file we're currently processing via $ARGV.
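For illustration, here is a minimal sketch of that idiom (the file names are just whatever you pass on the command line, e.g. perl sketch.pl a.txt b.txt):
use strict;
use warnings;
# <> reads every line of every file named in @ARGV, in order;
# $ARGV holds the name of the file the current line came from.
while (<>) {
    chomp;
    print "$ARGV: $_\n";
}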
I've refactored your code where you seem to be duplicating the same behavior.
The output isn't to spec, but should be clear enough for you to understand and finish up yourself.
I added a close $fh; and use warnings; as well.
#!/usr/bin/perl
use strict;
use warnings;
#run
my %max; # hash of combined data
my $file_data_1 = parse_file_into_hash("text1", \%max);
my $file_data_2 = parse_file_into_hash("text2", \%max);
diff_hashes(\%max, $file_data_1, $file_data_2);
# diff_hashes($max, $h1, $h2)
#
# diffs 2 hash refs against a combined $max hash and prints results
sub diff_hashes {
    my ($max, $h1, $h2) = @_;
    # TODO - do all the comparisons and some error checking (if keys exist etc...) here
    for my $key (keys %$max) {
        print "max/combined: $key = $max->{$key}\n";
        my $h1_print = exists $h1->{$key} ? $h1->{$key} : "0";
        my $h2_print = exists $h2->{$key} ? $h2->{$key} : "0";
        print "h1: $key = $h1_print\n";
        print "h2: $key = $h2_print\n";
    }
}
# parse_file_into_hash($file, $max)
#
# $max is a hash reference (passed by reference) so you can count occurrences over
# multiple files...
# returns reference of hash ( $line_number => $data_value )
sub parse_file_into_hash {
    my ($file, $max) = @_;
    my %ref_data;

    open my $fh, '<', $file or die "Could not open file to read:$!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($res, $type) = split /\s+/, $line;
        if ($type) {
            $ref_data{$type}++;
            if (!exists $max->{$type} || $ref_data{$type} > $max->{$type}) {
                $max->{$type} = $ref_data{$type};
            }
        }
    }
    close $fh;

    return \%ref_data;
}
Output ran against your example files:
$ ./example.pl
max/combined: F = 1
h1: F = 1
h2: F = 1
max/combined: N = 1
h1: N = 0
h2: N = 1
max/combined: P = 3
h1: P = 3
h2: P = 2
Related
I'm new to Perl. I have two text files and I need to check for matching strings in both lists.
For example matching strings are:
line - file 1: fe/bla/blablabla/abcdefg
line - file 2: blablabla/abcdefg
There is a match!
In addition, the location (line number) of the matching strings is not the same in both files.
I tried putting the lists in arrays and comparing the arrays with a nested loop, but the running time of the program is huge (the lists contain thousands of lines) and I believe there is another way, less naïve and more productive.
This is the way I put the data in the array:
my $list1 = '/path/to/the/file';
open (FILE , '<' , $list1) or die ("Could not open the file");
while ( my $line = <FILE> ) {
    chomp ($line);
    $list_1[$i] = $line;
    $i = $i + 1;
}
close FILE;
I did it to the other list as well.
And this is my nested loop.
for ( $k = 0 ; $k < @list_1 ; $k = $k+1 ) {
    for ( $i = 0 ; $i < @list_2 ; $i = $i+1 ) {
        if (index($list_1[$k] , $list_2[$i]) != -1) {
            splice (@list_2 , $i , 1);
            last;
        }
    }
}
As long as file2 isn't enormous, the simplest way is to build a regular expression pattern from its contents and check each line in file1 against the pattern.
You don't say what output you want, so I have printed all lines in file1 that have a match in file2.
use strict;
use warnings;
use 5.010;
use autodie;
my ($list1, $list2) = qw( /path/to/list1 /path/to/list2 );
open my $fh, '<', $list2;
my $re = join '|', map { chomp; quotemeta; } <$fh>;   # one alternation built from every line of file2
$re = qr/$re/;

open $fh, '<', $list1;                                # now scan file1 against the pattern
while ( <$fh> ) {
    print if /$re/;
}
I am writing a Perl script that takes two files as input: the first input is a tab-separated table with an identifier of interest in the second column; the second input is a list of identifiers that match the second column of the first file.
The goal is to print only those lines of the table which contain an identifier from the list in the second column, and to print each line only once. I have written three versions of this program and each prints a different number of lines.
Version 1:
# TAB-SEPARATED TABLE FILE
open (FILE, $file);
while (<FILE>) {
    my $line = $_;
    chomp $line;
    # ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
    foreach (@refs) {
        my $ref = $_;
        chomp $ref;
        if ( $line =~ $ref ) { print "$line\n"; next; }
    }
}
Version 2:
# ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE LINE
foreach (@doc) {
    my $full = $_;
    # IF LOOP FOR PRINTING THE HEADER BUT NOT COMPARING IT TO ARRAY BELOW
    if ( $counter == 0 ) {
        print "$full\n";
        $counter++;
        next;
    }
    # EXTRACT IDENTIFIER FROM LINE
    my @cells = split('\t', $full);
    my $gene = $cells[1];
    foreach (@refs) {
        my $text = $_;
        if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER
            print "$full\n";
            next;
        }
    }
    $counter--;
}
Version 3:
# LIST OF IDENTIFIERS
foreach (@refs) {
    my $ref = $_;
    # LIST OF EACH ROW OF THE TABLE
    foreach (@doc) {
        my $line = $_;
        my @cells = split('\t', $line);
        my $gene = $cells[1];
        if ( $gene =~ $ref ) { print "$line\n"; next; }
    }
}
Each of these approaches gives me different output and I do not understand why. I also do not understand if I can trust any of them to give me the right output. The right output should not contain any duplicate lines but more than one row might match any identifier from the list.
Sample Input File:
Position Symbol Name REF ALT
chr1:887801 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
chr1:888639 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:888659 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:897325 KLHL17 kelch-like 17 (Drosophila) G C
chr1:909238 PLEKHN1 pleckstrin homology domain containing, family N member 1 G C
chr1:982994 AGRN agrin T C
chr1:1254841 CPSF3L cleavage and polyadenylation specific factor 3-like C G
chr1:3301721 PRDM16 PR domain containing 16 C T
chr1:3328358 PRDM16 PR domain containing 16 T C
List is pulled from a file that looks like this:
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
It's put into an array using this code:
open (REF, $ref_file);
while (<REF>) {
    my $line = $_;
    chomp $line;
    push(@refs, $line);
}
close REF;
Whenever you hear "I need to look up something", think hashes.
What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use autodie; # This way, I don't have to check my open for failures
use constant {
    TABLE_FILE  => "file1.txt",
    LOOKUP_FILE => "file2.txt",
};
open my $lookup_fh, "<", LOOKUP_FILE;
my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
    chomp $symbol;
    $lookup_table{$symbol} = 1;
}
close $lookup_fh;
open my $table_file, "<", TABLE_FILE;
my %is_printed;
while ( my $line = <$table_file> ) {
    chomp $line;
    my @line_array = split /\s+/, $line;
    my $symbol = $line_array[1];
    if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
        say $line;
        $is_printed{$symbol} = 1;
    }
}
Two loops, but much more efficient. In yours, if you had 100 items in the first file and 1,000 items in the second file, you would have to loop 100 * 1000 = 100,000 times. In this, you only loop the total number of lines in both files.
I use the three-parameter method of the open command, which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles, which makes it easier to pass the file handle to a subroutine if so desired.
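A quick illustration of the difference, using a hypothetical, awkwardly named file:
my $file = '>important.txt';           # hypothetical file whose name starts with '>'
# Two-argument open parses the mode out of the string, so this would
# actually open "important.txt" for writing and clobber it:
#   open my $fh, $file or die $!;
# Three-argument open keeps mode and name separate, so this reads the
# file literally named ">important.txt":
open my $fh, '<', $file or die "Cannot open $file: $!";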
I use autodie, which handles issues such as the file failing to open. Without it, your program would just continue on its merry way. If you don't want to use autodie, you need to do this:
open $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading);
I use two hashes. The first is %lookup_table, which stores the Symbols you want to print. When I go through the first file, I can simply check whether $lookup_table{$symbol} exists. If it doesn't, I don't print the line; if it does, I print it.
The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.
Even though you said the table is tab-separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also catch two consecutive tabs (used to keep things looking nice) or a space accidentally typed before the tab.
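A small illustration of the difference (the sample line is made up):
use strict;
use warnings;

my $line = "chr1:887801\t\tNOC2L\tA\tG";    # note the accidental double tab
my @by_tab   = split /\t/,  $line;          # ('chr1:887801', '', 'NOC2L', 'A', 'G') -- empty field at index 1
my @by_space = split /\s+/, $line;          # ('chr1:887801', 'NOC2L', 'A', 'G')     -- symbol is at index 1
print "by tab: '$by_tab[1]'  by whitespace: '$by_space[1]'\n";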
I'm pretty sure this should work:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' identifiers_file.txt data_file.txt
Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L
then your output will be:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' idents.txt data.txt
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
If that's correct and you want a Perl version, you can just:
$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
| a2p
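If a2p isn't available on your system, a hand-written Perl equivalent of that awk logic might look roughly like this (a sketch, not actual a2p output; the script name is hypothetical):
#!/usr/bin/env perl
# Usage: perl filter.pl identifiers_file.txt data_file.txt
use strict;
use warnings;

my $ident_file = shift @ARGV;
open my $ifh, '<', $ident_file or die "Cannot open $ident_file: $!";
my %identifiers;
while (<$ifh>) {
    my ($id) = split;
    $identifiers{$id} = 1 if defined $id;
}
close $ifh;

my %printed;
while (<>) {                              # the remaining argument is the data file
    my @fields = split;
    next unless @fields > 1 and exists $identifiers{$fields[1]};
    shift @fields;                        # drop column 1, like awk's $1 = ""
    my $rest = join ' ', @fields;
    print "$rest\n" unless $printed{$rest}++;
}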
I suggest you mix the first version and the second, and add hashes to them.
The first version is good because it parses your data file line by line in a clear way.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
    my $line = $_;
    chomp $line;
    $refs{$line} = 0;
}
close REF;
#for head printing
$refs{'Symbol'} = 0;
open (FILE, $ARGV[1]);
while (<FILE>) {
    my $line = $_;
    my @cells = split('\t', $line);
    my $gene = $cells[1];
    #print $line, "\n" if exists $refs{$gene};
    if (exists $refs{$gene} and $refs{$gene} == 0)
    {
        $refs{$gene}++;
        print $line;
    }
}
close FILE;
I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both files, and only if they are equal, column B of FILE 2 must be printed.
So far I have the following code, but all I get is an infinite loop printing # from column B.
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my @firstcol;
my @secondcol;
my @thirdcol;

while ( <FILE> )
{
    next if $. < 2;
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
    push @secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}

my @firstcol2;
my @secondcol2;
my @thirdcol2;

while ( <FILE2> )
{
    next if $. < 2;
    my @cols2 = split(/\t/, $_);
    push @firstcol2, $cols2[0];
    push @secondcol2, $cols2[4]."\n";
}

my $size = @firstcol;
my $size2 = @firstcol2;

for (my $i = 0; $i <= @firstcol ; $i++) {
    for (my $j = 0; $j <= @firstcol2; $j++) {
        if ( $firstcol[$i] eq $firstcol2[$j] )
        {
            print $secondcol2[$i];
        }
    }
}
my (@first, @second);

# assumes the 'first' and 'second' filehandles are already open on the two files
while (<first>) {
    chomp;
    push @first, split / /, $_;    # push column A and column B as a key/value pair
}

while (<second>) {
    chomp;
    push @second, split / /, $_;
}

my %first  = @first;
my %second = @second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for (keys %first)
{
    print $second{$_} if exists $second{$_};
}
I couldn't check it as I am on mobile. Hope that gives you an idea.
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have a nested loop that you don't need. Simplify your last loop as such:
for my $i (0..$#firstcol) {
    if ( $firstcol[$i] eq $firstcol2[$i] )
    {
        print $secondcol2[$i];
    }
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional note: you aren't chomping your data in the second file loop, while ( <FILE2> ). That might introduce a bug later.
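For instance, without the chomp the last field keeps its newline, which can silently break string comparisons (illustrative values):
my $raw = "3\t#\n";                           # a line as read from FILE2
my @cols2 = split /\t/, $raw;
print "no match\n" unless $cols2[1] eq '#';   # fails: $cols2[1] is "#\n", not "#"
chomp @cols2;                                 # strip the trailing newline
print "match\n" if $cols2[1] eq '#';          # now it compares as expected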
If your files are called file1.txt and file2.txt, the following:
use Modern::Perl;
use Path::Class;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);

for my $line1 (@{$files->{1}}) {
    my $line2 = shift @{$files->{2}};
    say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
because column 1 is equal only on the lines with A and 5
Without the CPAN modules - this produces the same result:
use strict;
use warnings;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(@ARGV) = "file$_.txt"; <> } for (1..2);

for my $line1 (@{$files->{1}}) {
    my $line2 = shift @{$files->{2}};
    print $line2->[1], "\n" if ($line1->[0] eq $line2->[0]);
}
I have the following file:
a b 5
c d 6
e f 7
g h 4
i j 3
k l 10
and I want to find which line has the minimum value in the third column and erase it from the initial file. After this, I want to run the program again, find the new minimum, and do the same thing two more times.
The output file should be
c d 6
e f 7
k l 10
I tried to write the following code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file1 = "try.dat";
open my $fg, "<", $file1 or die "Error during $file1 opening: $!";
my @vector;

while (<$fg>) {
    push @vector, [ split ];
}

my $minimum = $vector[0][2];
my @blacklist;

for my $i (0 .. $#vector) {
    if ($minimum > $vector[$i][2]){
        $minimum = $vector[$i][2];
        push @blacklist, @vector[$i+1];
    }
}

#say "my minimum is $minimum";
#say "the blacklist is composed by @blacklist";
I don't know how to erase the elements contained in @blacklist (which in the first case should be i j 3) or how to iterate everything.
Any help with the iteration?
This sort of thing is what Tie::File was made for. It allows you to modify the file in place by modifying a tied array.
This program does what you want. The helper function minidx returns the first index of the element of the passed array that holds the smallest value.
The program works by copying the third field of the file records into array @field3, and finding the index of the smallest value in there. The element at that index is then deleted from both the file and @field3 using splice.
use strict;
use warnings;
use Tie::File;
tie my @file, 'Tie::File', 'file.txt' or die $!;

my @field3 = map { (split)[2] } @file;

for (1 .. 3) {
    my $i = minidx(\@field3);
    splice @file, $i, 1;
    splice @field3, $i, 1;
}

sub minidx {
    my ($arr) = @_;
    my ($i, $v);
    for (0 .. $#$arr) {
        ($i, $v) = ($_, $arr->[$_]) unless defined $v and $arr->[$_] >= $v;
    }
    return $i;
}
output
c d 6
e f 7
k l 10
When you say @blacklist = @vector you are adding the entire @vector array to the blacklist. You probably want to do a push @blacklist, $vector[$i]. That will push the array reference into @blacklist.
Now, @blacklist has an array ref in it, so you have to dereference it to print it.
say "the blacklist is composed by @{$blacklist[0]}";
Edit: For iterating and writing:
I would skip the @blacklist array (unless you need it for something else) and remove the min values from @vector. Then you can write @vector to some file.
my $num_elts_to_remove = 3;
for (my $j = 0; $j < $num_elts_to_remove; $j++) {
    my $minimum = $vector[0][2];
    my $min_idx = 0;
    for my $i (0 .. $#vector) {
        if ($minimum > $vector[$i][2]) {
            $minimum = $vector[$i][2];
            $min_idx = $i;
        }
    }
    push @blacklist, $vector[$min_idx];
    splice @vector, $min_idx, 1;    # remove array with smallest value
}
Now write the array to a file
open OUT, ">", $outfile or die "Error: $!";
foreach (@vector) {
    print OUT join " ", @$_;
    print OUT "\n";
}
close(OUT);
Prints:
c d 6
e f 7
k l 10
Taking Borodin's Tie::File suggestion even further: I have written a cute module called Tie::Array::CSV which allows you to treat a delimited file as an array (and because it uses Tie::File underneath, it is both readable and writable). Because of this I can use Perlish operations like map and sort (and a Schwartzian transform!) to perform this task:
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::Array::CSV;
tie my @data, 'Tie::Array::CSV', 'data', sep_char => ' ';

# get a list of row ids sorted by last value (inc)
my $i = 0;
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [$i++, $_->[-1]] }
    @data;

# splice the rows by index of the lowest three (from bottom to top)
splice @data, $_, 1 for reverse sort @sorted[0..2];
Note that in the end you want to remove rows from the bottom up, so that the indices of the rows still to be deleted don't shift before you get to them.
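A quick illustration of why the order matters, with a plain array and made-up indices:
use strict;
use warnings;

my @rows      = ('a', 'b', 'c', 'd');
my @to_remove = (1, 3);                    # indices we want to delete

# Deleting index 1 first leaves ('a', 'c', 'd'), so a later splice at
# index 3 falls off the end and 'd' never gets removed.
# Deleting from the bottom up keeps the earlier indices valid:
splice @rows, $_, 1 for reverse sort { $a <=> $b } @to_remove;
print "@rows\n";                           # prints: a c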
I have 5 files containing the same words. I want to read each word in all the files and decide the winning word by detecting the following characters in a word (*, #, $, &), separated by tabs. Then, I want to generate an output file. It can only have 2 winners. For example:
file1
we$
are*
...
file2
we$
are#
...
file3
we&
are*
...
file4
we$
are#
...
file5
we$
are&
...
output file:
we$
are*#
Here is how I started:
#!/usr/local/bin/perl -w
sub read_file_line {
    my $fh = shift;
    if ($fh and my $line = <$fh>) {
        chomp($line);
        return $line;
    }
    return;
}
open(my $f1, "words1.txt") or die "Can't";
open(my $f2, "words2.txt") or die "Can't";
open(my $f3, "words3.txt") or die "Can't";
open(my $f4, "words4.txt") or die "Can't";
open(my $f5, "words5.txt") or die "Can't";
my $r1 = read_file_line($f1);
my $r2 = read_file_line($f2);
my $r3 = read_file_line($f3);
my $r4 = read_file_line($f4);
my $r5 = read_file_line($f5);
while ($f5) {
    # What can I do here to decide and write the winning word in the output file?
    $r1 = read_file_line($f1);
    $r2 = read_file_line($f2);
    $r3 = read_file_line($f3);
    $r4 = read_file_line($f4);
    $r5 = read_file_line($f5);
}
Test Data Generator
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $i (1..5)
{
    my $file = "words$i.txt";
    open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
    foreach my $w (qw(we are the people in charge and what we say goes))
    {
        my $suffix = substr('*#$&', rand(4), 1);
        print $fh "$w$suffix\n";
    }
}
Majority Voting Code
#!/usr/bin/env perl
use strict;
use warnings;
my @files = ( "words1.txt", "words2.txt", "words3.txt",
              "words4.txt", "words5.txt"
            );
my @fh;

{
    my $n = 0;
    foreach my $file (@files)
    {
        open my $f, '<', $file or die "Can't open $file for reading ($!)";
        $fh[$n++] = $f;
    }
}

while (my $r = process_line(@fh))
{
    print "$r\n";
}
sub process_line
{
    my (@fhlist) = @_;
    my %words = ();
    foreach my $fh (@fhlist)
    {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        $words{$line}++;
    }
    my $combo = '';
    foreach my $word (keys %words)
    {
        return $word if ($words{$word} > 2);     # outright winner: 3 or more votes out of 5
        $combo .= $word if ($words{$word} == 2);
    }
    # Collapse e.g. "are*are#" into "are*#" by removing the repeated word between the two markers.
    $combo =~ s/(\W)\w+(\W)/$1$2/;
    return $combo;
}
Example Data and Results
$ perl datagenerator.pl
$ perl majorityvoter.pl > results.txt
$ paste words?.txt results.txt
we* we$ we& we# we# we#
are* are# are# are* are$ are*#
the* the& the# the# the& the&#
people& people& people$ people# people# people&#
in# in* in$ in* in* in*
charge* charge# charge& charge* charge# charge#*
and$ and* and$ and& and$ and$
what& what& what$ what& what# what&
we# we* we* we& we* we*
say$ say& say$ say$ say$ say$
goes$ goes& goes# goes# goes# goes#
$
This seems to be correct for the test data in the files generated.
Revised requirements - example output
The 'revised requirements' replaced the '*#$&' markers after the words with a tab and one of the letters 'ABCD'. After some swift negotiation, the question was restored to its original form. This output is from a suitably adapted version of the answer above - 3 code lines changed, 2 in the data generator, 1 in the majority voter. Those changes are not shown - they are trivial.
we C we D we C we C we D we C
are C are D are C are B are A are C
the B the D the A the A the D the A|D
people D people B people A people B people D people B|D
in D in B in C in B in D in D|B
charge C charge D charge D charge D charge A charge D
and A and B and C and C and B and B|C
what B what B what B what C what C what B
we D we B we D we B we A we B|D
say D say D say B say D say D say D
goes A goes C goes A goes C goes A goes A
Revised test generator - for configurable number of files
Now that the poster has worked out how to handle the revised scenario, this is the data generator code I used - with 5 tags (A-E). Clearly, it would not take a huge amount of work to configure the number of tags on the command line.
#!/usr/bin/env perl
use strict;
use warnings;
my $fmax = scalar(@ARGV) > 0 ? $ARGV[0] : 5;
my $tags = 'ABCDE';
my $ntags = length($tags);
my $fmt = sprintf "words$fmax-%%0%0dd.txt", length($fmax);
foreach my $fnum (1..$fmax)
{
    my $file = sprintf $fmt, $fnum;
    open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
    foreach my $w (qw(We Are The People In Charge And What We Say Goes))
    {
        my $suffix = substr($tags, rand($ntags), 1);
        print $fh "$w\t$suffix\n";
    }
}
Revised Majority Voting Code - for arbitrary number of files
This code works with basically arbitrary numbers of files. As noted in one of the (many) comments, it does not check that the word is the same in each file as required by the question; you could get quirky results if the words are not the same (a possible check is sketched after the code).
#!/usr/bin/env perl
use strict;
use warnings;
my @files = scalar @ARGV > 0 ? @ARGV :
            ( "words1.txt", "words2.txt", "words3.txt",
              "words4.txt", "words5.txt"
            );
my $voters = scalar(@files);
my @fh;

{
    my $n = 0;
    foreach my $file (@files)
    {
        open my $f, '<', $file or die "Can't open $file for reading ($!)";
        $fh[$n++] = $f;
    }
}

while (my $r = process_line(@fh))
{
    print "$r\n";
}

sub process_line
{
    my (@fhlist) = @_;
    my %words = ();
    foreach my $fh (@fhlist)
    {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        $words{$line}++;
    }
    return winner(%words);
}
# Get tag X from entry "word\tX".
sub get_tag_from_word
{
    my ($word) = @_;
    return (split /\s/, $word)[1];
}

sub winner
{
    my (%words) = @_;
    my $maxscore = 0;
    my $winscore = ($voters / 2) + 1;
    my $winner = '';
    my $taglist = '';
    foreach my $word (sort keys %words)
    {
        return "$word\t$words{$word}" if ($words{$word} >= $winscore);
        if ($words{$word} > $maxscore)
        {
            $winner = $word;
            $winner =~ s/\t.//;
            $taglist = get_tag_from_word($word);
            $maxscore = $words{$word};
        }
        elsif ($words{$word} == $maxscore)
        {
            my $newtag = get_tag_from_word($word);
            $taglist .= "|$newtag";
        }
    }
    return "$winner\t$taglist\t$maxscore";
}
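If you did want to guard against mismatched words, one possible sanity check (a sketch, not part of the code above; the helper name is made up) would compare the part before the tab across the lines read in each round of process_line:
# Hypothetical helper: returns the common bare word, or dies if the files disagree.
sub check_same_word
{
    my (@lines) = @_;                                  # the chomped lines read from each file
    my %seen = map { (split /\t/)[0] => 1 } @lines;
    die "Files disagree on word: ", join(', ', sort keys %seen), "\n"
        if keys %seen > 1;
    return (keys %seen)[0];
}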
One Example Run
After considerable experimentation on the data presentation, one particular set of data I generated gave the result:
We A|B|C|D|E 2 B C C E D A D A E B
Are D 4 C D B A D B D D B E
The A 5 D A B B A A B E A A
People D 4 E D C D B E D D B C
In D 3 E C D D D B C A A B
Charge A|E 3 E E D A D A B A E B
And E 3 C E D D C A B E B E
What A 5 B C C A A A B A D A
We A 4 C A A E A E C D A E
Say A|D 4 A C A A D E D A D D
Goes A 3 D B A C C A A E E B
The first column is the word; the second is the winning tag or tags; the third (numeric) column is the maximum score; the remaining 10 columns are the tags from the 10 data files. As you can see, there are two each of 'We A', 'We B', ... 'We E' in the first row. I've also generated (but not preserved) one result set where the maximum score was 7. Given enough repetition, these sorts of variations are findable.
Sounds like the job for a hash of hashes. Untested code:
use strict;
use warnings;
use 5.010;
use autodie;
use List::Util qw( sum reduce );
my %totals;
my @files = map "words$_.txt", 1..5;

for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        chomp;
        my ($word, $sign) = /(\w+)(\W)/;
        $totals{$word}{$sign}++;
    }
}

open my $totals_fh, '>', 'outfile.txt';

my @sorted_words = sort { sum values %{$totals{$a}} <=> sum values %{$totals{$b}} } keys %totals; # Probably something fancier here.

for my $word (@sorted_words[0, 1]) {
    #say {$totals_fh} $word, join('', keys %{$totals{$word}} ), "\t- ", function_to_decide_text($totals{$word});
    say {$totals_fh} $word, reduce {
        $totals{$word}{ substr $a, 0, 1 } == $totals{$word}{$b} ? $a . $b
      : $totals{$word}{ substr $a, 0, 1 } >  $totals{$word}{$b} ? $a
      :                                                           $b;
    } keys %{ $totals{$word} };
}
EDIT: Forgot about the only two winners part. Fixed, somewhat.
EDIT2: Fixed as per comments.
#!/usr/bin/perl
use strict;
use warnings;
my @files = qw(file1 file2 file3 file4 file5);
my $symbols = '*#$&'; # no need to escape them as they'll be in a character class
my %words;

foreach my $file (@files) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (<$fh>) {
        if (/^(\w+[$symbols])$/) {
            $words{$1}++;   # count the occurrences of each word
        }
    }
    close $fh;
}

my $counter = 0;
my $previous = -1;
foreach my $word (sort {$words{$b} <=> $words{$a}} keys %words) {
    # make sure you don't exit if two words at the top of the list
    # have the same number of occurrences
    if ($previous != $words{$word}) {
        last if $counter > 1;
    }
    $counter++;   # count the output
    $previous = $words{$word};
    print "$word occurred $words{$word} times.\n";
}
Worked when I tried it out...