How can I use map to clean up this Perl code? - perl

The code below does what I want it to. It prints the list and adds an asterisk at the end of lines that are not sequential, e.g. if you skip from 1 to 3 or 3 to 5.
use strict;
use warnings;
#note: thanks to all who helped with formatting issues.
#note: I recognize a hash would be a much better option for what I want to do.
my #printy = ("1 -> this",
"5 -> that",
"3 -> the other",
"6 -> thus and such");
#printy = sort {num($a) <=> num($b)} #printy;
my $thisID = 0;
my $lastID = 0;
#print out (line)* if initial number is >1 more than previous, or just (line) otherwise
for (#printy)
{
$thisID = $_; $thisID =~s/ .*//g;
if ($thisID - $lastID != 1) { $_ =~ s/$/ \*/; }
$lastID = $thisID;
}
print join("\n", #printy) . "\n";
sub num
{
my $x = $_[0];
$x =~ s/ .*//;
return $x;
}
But I think I can do better. It feels tangled, and my intuition tells me I'm missing something powerful that could do the job more easily, one that takes maybe two lines.
Now I've used the map() command before, but only to look at/modify an element, not how it compares to a previous element. Can anyone recommend a way to make this more succinct? Thanks!

Since Perl promotes TIMTOWTDI, map may seem like an attractive option at first. Let's see how it fares for this task:
Schwartzian thought process
Since access to neighboring elements is necessary, it's convenient to work with the indices. Since for n elements, there are n-1 pairs of neighbors, you don't have to loop n times. In this case, let's start with 1 instead of the usual 0:
1 .. $#printy
One can access neighboring elements by calling the relevant indices inside map.
map { my $prev = $printy[$_-1]; my $curr = $printy[$_] } 1 .. $#printy;
An array slice expresses this more succinctly:
map { my ( $prev, $curr ) = #printy[$_-1,$_]; } 1 .. $#printy;
Let's introduce the real logic related to comparing numbers using the num subroutine:
map {
my ( $prev, $curr ) = #printy[$_-1,$_];
if ( num($curr) - num($prev) > 1 ) {
"$curr *";
}
else {
$curr;
}
} 1 .. $#printy;
Which is equivalent to:
map {
my ( $prev, $curr ) = #printy[$_-1,$_];
$curr .= " *" if num($curr) - num($prev) > 1;
$curr
} 1 .. $#printy;
Remember not to forget the first element:
#printy = ( $printy[0],
map {
my ( $prev, $curr ) = #printy[$_-1,$_];
$curr .= " *" if num($curr) - num($prev) > 1;
$curr
} 1 .. $#printy
);
Given the final result, I'm not so sure I'd use map for this:
It's hard to read
There's a lot going on
The next person working on your code will love you

No map needed, just add some spaces here and there, and remove stuff that's not needed ($_, join, etc.). Also, reuse num() inside the loop, no need to repeat the regex.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #printy = sort { num($a) <=> num($b) }
'1 -> this', '5 -> that', '3 -> the other', '6 -> thus and such';
my $thisID = my $lastID = 0;
for (#printy) {
$thisID = num($_);
$_ .= ' *' if $thisID - $lastID != 1;
$lastID = $thisID;
}
say for #printy;
sub num {
my ($x) = #_;
$x =~ s/ .*//;
return $x;
}
Also, reimplementing num using /(\d+)/ instead of substitution might tell its purpose more clearly.

I agree with choroba that there is no need for a map here. But I'd refactor a little bit anyway.
use strict;
use warnings;
use feature 'say';
my #printy = ( "1 -> this", "5 -> that", "3 -> the other", "6 -> thus and such" );
my $last_id = 0;
foreach my $line ( sort { num($a) <=> num($b) } #printy ) {
my $current_id = num($line);
$line .= ' *' unless $current_id - $last_id == 1;
$last_id = $current_id;
}
say for #printy;
# returns the number at the start of a string
sub num {
$_[0] =~ m/^(\d+)/;
return $1;
}
I moved the sort down into the foreach, because you shouldn't rely on the fact that your input is sorted.
I changed the variable names to go with the convention that there should be no capital letters in variable names, and I used say, which is like print with a system-specific newline at the end.
I also moved the $current_id into the loop. That doesn't need to be visible outside because it's lexical to that loop. Always declare variables in the smallest possible scope.
You already had that nice num function, but you're not using it inside of the loop to get the $current_id. Use it.
I think if the input gets very long, it might make sense to go with a map construct because sorting will be very expensive at some point. Look at the Schwartzian transform for caching the calculation before sorting. You could then do everything at once. But it won't be readable for a beginner any more.

Your data yells "Use a hash!" to me.
If we had a hash,
my %printy =
map { split / -> / }
"1 -> this", "5 -> that", "3 -> the other", "6 -> thus and such";
The solution would simply be:
my #order = sort { $a <=> $b } keys(%printy);
for my $i (#order[1..$#order]) {
$printy{$i} .= ' *'
if !exists($printy{$i-1});
}
print "$_ -> $printy{$_}\n"
for #order;
This can be golfed down, though I'm not sure it's worth it.
my $count;
print "$_ -> $printy{$_}".( !$count++ || exists($printy{$_-1}) ? "" : " *" )."\n"
for
sort { $a <=> $b }
keys(%printy);
That for can be converted into a map, but it just makes it less efficient.
my $count;
print
map { "$_ -> $printy{$_}".( !$count++ || exists($printy{$_-1}) ? "" : " *" )."\n" }
sort { $a <=> $b }
keys(%printy);

I'd also advise to clean up the code and keep the loop. However, here is a map based way.
The code uses your sorted #printy and the num sub.
my #nums = map { num($_) } #printy;
my #res = map {
$nums[$_] == $nums[$_-1] + 1 # invariably false for $_ == 0
? $printy[$_] : $printy[$_] .= ' *';
}
(0..$#printy);
say for #res;
This works for the first element since it does not come after the last, given that we're sorted. That may be a bit diabolical though and it needs a comment in code. So perhaps better spell it out
my #res = map {
($nums[$_] == $nums[$_-1] + 1) ? $printy[$_] : $printy[$_] .= ' *';
}
(1..$#printy);
unshift #res, $printy[0];
Not as clean but clear.
All this does extra work when compared to a straight loop, of course.

I'm sorry, but your code is a shambles, and you need to do much more than use map to clean up this code
You have no indentation and multiple statements on a single line, and you haven't thought through your logic. Your code is unmaintainable
Here's how I would write this. It builds a parallel array of IDs, and then sorts a list of indices so that both the IDs and the original data are in order
If it makes you happier, it does include map
use strict;
use warnings 'all';
my #data = ( '1 -> this', '5 -> that', '3 -> the other', '6 -> thus and such' );
my #ids = map { /(\d+)/ } #data;
my #indexes = sort { $ids[$a] <=> $ids[$b] } 0 .. $#ids;
my $last_id;
for my $i ( #indexes ) {
print $data[$i];
print ' *' if defined $last_id and $ids[$i] > $last_id + 1;
print "\n";
$last_id = $ids[$i];
}
output
1 -> this
3 -> the other *
5 -> that *
6 -> thus and such

Related

Passing strings as array to subroutine and return count of specific char

I was trying to think in the right way to tackle this:
-I would to pass say, n elements array as argument to a subroutine. And for each element match two char types S and T and print for each element, the count of these letters. So far I did this but I am locked and found some infinite loops in my code.
use strict;
use warnings;
sub main {
my #array = #_;
while (#array) {
my $s = ($_ = tr/S//);
my $t = ($_ = tr/T//);
print "ST are in total $s + $t\n";
}
}
my #bunchOfdata = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
main(#bunchOfdata);
I would like the output to be:
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Any clue how to solve this?
while (#array) will be an infinite loop since #array never gets smaller. You can't read into the default variable $_ this way. For this to work, use for (#array) which will read the array items into $_ one at a time until all have been read.
The tr transliteration operator is the right tool for your task.
The code needed to get your results could be:
#!/usr/bin/perl
use strict;
use warnings;
my #data = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
my $i = 1;
for (#data) {
my $count = tr/ST//;
print "Element $i Counts of ST = $count\n";
$i++;
}
Also, note that my $count = tr/ST//; doesn't require the binding of the transliteration operator with $_. Perl assumes this when $_ holds the value to be counted here. Your code tried my $s = ($_ = tr/S//); which will give the results but the shorter way I've shown is the preferred way.
(Just noticed you had = instead of =~ in your statement. That is an error. Has to be $s = ($_ =~ tr/S//);)
You can combine the 2 sought letters as in my code. Its not necessary to do them separately.
I got the output you want.
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Also, you can't perform math operations in a quoted string like you had.
print "ST are in total $s + $t\n";
Instead, you would need to do:
print "ST are in total ", $s + $t, "\n";
where the operation is performed outside of the string.
Don't use while to traverse an array - your array gets no smaller, so the condition is always true and you get an infinite loop. You should use for (or foreach) instead.
for (#array) {
my $s = tr/S//; # No need for =~ as tr/// works on $_ by default
my $t = tr/T//;
print "ST are in total $s + $t\n";
}
Why tr///??
sub main {
my #array = #_;
while (#array) {
my $s = split(/S/, $_, -1) - 1;
my $t = split(/T/, $_, -1) - 1;
print "ST are in total $s + $t\n";
}
}

Compare 2 sets of numbers contained in strings

I have 2 scalars as below:
$a = '100 105 010';
$b = '010 105 100';
How do I compare if both has same set of values? order doesn't matter.
one way is to arrange it in ascending order and compare, is there a better way?
You can split each string into an array and sort and compare arrays. By hand:
use warnings;
use strict;
my $x = '100 105 1 010';
my $y = '010 105 100 2';
my #xs = sort { $a <=> $b } split ' ', $x;
my #ys = sort { $a <=> $b } split ' ', $y;
if (#xs != #ys) {
print "Differ in number of elements.\n";
}
else {
for (0..$#xs) {
print "Differ: $xs[$_] vs $ys[$_]\n" if $xs[$_] != $ys[$_];
}
}
# For boolean comparison put it in a sub
print arr_num_eq(\#xs, \#ys), "\n";
sub arr_num_eq {
my ($ra, $rb) = #_;
return 0 if #$ra != #$rb;
$ra->[$_] != $rb->[$_] && return 0 for 0..$#$ra;
return 1;
}
The sorting can be moved to the sub as well, which would then take strings. The way it stands it can be used for comparison of existing arrays as well. Please add argument checking.
There is a number of modules that have this capability. The perm from Array::Compare hides the sorting above, but internally joins sorted arrays into strings thus duplicating the work here since we started with strings. The List::AllUtils certainly offers this as well with its long list of utilities.
See this post, for example, for a few methods (just not the smart match ~~), and for benchmarks if efficiency is a concern.
Using the mentioned implementation idea from Array::Compare, per comment by ysth
sub str_num_eq {
return join(' ', sort { $a <=> $b } split / /, $_[0])
eq join(' ', sort { $a <=> $b } split / /, $_[1])
}
What the most suitable method is depends on what this is for and how it is used. Is it only a boolean comparison, or will more be done if they are found to differ? How does it come about in your program flow? What are typical sizes of strings, how often is it run? Are the strings most often the same or different, do they typically differ a lot or a little? Etc.
Without modules, you can use hashes:
#!/usr/bin/perl
use warnings;
use strict;
my $x = '100 105 010 2';
my $y = '010 105 100 100 1';
my (%hx, %hy);
$hx{$_}++ for split ' ', $x;
$hy{$_}++ for split ' ', $y;
for my $k (keys %hx) {
if (! exists $hy{$k}) {
print "$k missing in y\n";
} elsif ($hy{$k} != $hx{$k}) {
print "$k has different number of occurences\n";
}
delete $hy{$k};
}
print "$_ missing in x\n" for keys %hy;
$a and $b are special variables used in sort, so I renamed them to $x and $y.
split turns the strings into lists. Each hash counts how many times a member occurs in the list.
See also Perl FAQ 4.
Something else try with pattern matching,
This is not a straight forward but it will work.
Construct the pattern by anyone of your scalar value. Then check the another string by the constructed pattern.
my $a = '100 100 105';
my $b = '100 105 100';
my #b_ary = split(" ",$b);
my $regex = join'\b|\b', #b_ary;
my $word_length = #b_ary * 2 - 1; #Count the number of words and space.
my $rgx = qr"^(?:\b$regex\b|\s){$word_length}$"; #`{n}` match word exact n times
if($a=~m/$rgx/)
{
print "same values\n";
}
else
{
print "Not a same values\n";
}
The answer is already posted above. This is just in case you want to remove the white spaces and compare each number.
$x = '100 105 010';
$y = '010 105 100';
join("",sort split "",join("",split " ",$x)) eq join("",sort split "",join("",split " ",$y));

Find the word with most letters in common with other words

I want Perl (5.8.8) to find out what word has the most letters in common with the other words in an array - but only letters that are in the same place. (And preferably without using libs.)
Take this list of words as an example:
BAKER
SALER
BALER
CARER
RUFFR
Her BALER is the word that has the most letters in common with the others. It matches BAxER in BAKER, xALER in SALER, xAxER in CARER, and xxxxR in RUFFR.
I want Perl to find this word for me in an arbitrary list of words with the same length and case. Seems I've hit the wall here, so help is much appreciated!
What I've tried until now
Don't really have much of a script at the moment:
use strict;
use warnings;
my #wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (#wordlist) {
my #letters = split(//, $word);
# now trip trough each iteration and work magic...
}
Where the comment is, I've tried several kinds of code, heavy with for-loops and ++ varables. Thus far, none of my attempts have done what I need it to do.
So, to better explain: What I need is to test word for word against the list, for each letterposition, to find the word that has the most letters in common with the others in the list, at that letter's position.
One possible way could be to first check which word(s) has the most in common at letter-position 0, then test letter-position 1, and so on, until you find the word that in sum has the most letters in common with the other words in the list. Then I'd like to print the list like a matrix with scores for each letterposition plus a total score for each word, not unlike what DavidO suggest.
What you'd in effect end up with is a matrix for each words, with the score for each letter position, and the sum total score fore each word in the matrix.
Purpose of the Program
Hehe, I might as well say it: The program is for hacking terminals in the game Fallout 3. :D My thinking is that it's a great way to learn Perl while also having fun gaming.
Here's one of the Fallout 3 terminal hacking tutorials I've used for research: FALLOUT 3: Hacking FAQ v1.2, and I've already made a program to shorten the list of words, like this:
#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings;
my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my #checkletters = split(//, $checkword); #/
my #wordlist = qw(
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
);
print "$checkword has $match letters in common with:\n";
foreach my $word (#wordlist) {
next if $word eq $checkword;
my #letters = split(//, $word);
my $length = #letters; # determine length of array (how many letters to check)
my $eq_letters = 0; # reset to 0 for every new word to be tested
for (my $i = 0; $i < $length; $i++) {
if ($letters[$i] eq $checkletters[$i]) {
$eq_letters++;
}
}
if ($eq_letters == $match) {
print "$word\n";
}
}
# Now to make a script on to find the best word to check in the first place...
This script will yield CONSTRUCTION and TRANSMISSION as its result, just as in the game FAQ. The trick to the original question, though (and the thing I didn't manage to find out on my own), is how to find the best word to try in the first place, i.e. APPRECIATION.
OK, I've now supplied my own solution based on your help, and consider this thread closed. Many, many thanks to all the contributers. You've helped tremendously, and on the way I've also learned a lot. :D
Here's one way. Having re-read your spec a couple of times I think it's what you're looking for.
It's worth mentioning that it's possible there will be more than one word with an equal top score. From your list there's only one winner, but it's possible that in longer lists, there will be several equally winning words. This solution deals with that. Also, as I understand it, you count letter matches only if they occur in the same column per word. If that's the case, here's a working solution:
use 5.012;
use strict;
use warnings;
use List::Util 'max';
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
my #scores;
foreach my $word ( #words ) {
my $score;
foreach my $comp_word ( #words ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
say "Words with most matches:";
say for #words[#max_ixs];
This solution counts how many times per letter column each word's letters match other words. So for example:
Words: Scores: Because:
ABC 1, 2, 1 = 4 A matched once, B matched twice, C matched once.
ABD 1, 2, 1 = 4 A matched once, B matched twice, D matched once.
CBD 0, 2, 1 = 3 C never matched, B matched twice, D matched once.
BAC 0, 0, 1 = 1 B never matched, A never matched, C matched once.
That gives you the winners of ABC and ABD, each with a score of four positional matches. Ie, the cumulative times that column one, row one matched column one row two, three, and four, and so on for the subsequent columns.
It may be able to be optimized further, and re-worded to be shorter, but I tried to keep the logic fairly easy to read. Enjoy!
UPDATE / EDIT
I thought about it and realized that though my existing method does exactly what your original question requested, it did it in O(n^2) time, which is comparatively slow. But if we use hash keys for each column's letters (one letter per key), and do a count of how many times each letter appears in the column (as the value of the hash element), we could do our summations in O(1) time, and our traversal of the list in O(n*c) time (where c is the number of columns, and n is the number of words). There's some setup time too (creation of the hash). But we still have a big improvement. Here is a new version of each technique, as well as a benchmark comparison of each.
use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;
my #words = qw/
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
/;
# Just a test run for each solution.
my( $top, $indexes_ref );
($top, $indexes_ref ) = find_top_matches_force( \#words );
print "Testing force method: $top matches.\n";
print "#words[#$indexes_ref]\n";
( $top, $indexes_ref ) = find_top_matches_hash( \#words );
print "Testing hash method: $top matches.\n";
print "#words[#$indexes_ref]\n";
my $count = 20000;
cmpthese( $count, {
'Hash' => sub{ find_top_matches_hash( \#words ); },
'Force' => sub{ find_top_matches_force( \#words ); },
} );
sub find_top_matches_hash {
my $words = shift;
my #scores;
my $columns;
my $max_col = max( map { length $_ } #{$words} ) - 1;
foreach my $col_idx ( 0 .. $max_col ) {
$columns->[$col_idx]{ substr $_, $col_idx, 1 }++
for #{$words};
}
foreach my $word ( #{$words} ) {
my $score = sum(
map{
$columns->[$_]{ substr $word, $_, 1 } - 1
} 0 .. $max_col
);
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
sub find_top_matches_force {
my $words = shift;
my #scores;
foreach my $word ( #{$words} ) {
my $score;
foreach my $comp_word ( #{$words} ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if
substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
The output is:
Testing force method: 39 matches.
APPRECIATION
Testing hash method: 39 matches.
APPRECIATION
Rate Force Hash
Force 2358/s -- -74%
Hash 9132/s 287% --
I realize your original spec changed after you saw some of the other options provided, and that's sort of the nature of innovation to a degree, but the puzzle was still alive in my mind. As you can see, my hash method is 287% faster than the original method. More fun in less time!
As a starting point, you can efficiently check how many letters they have in common with:
$count = ($word1 ^ $word2) =~ y/\0//;
But that's only useful if you loop through all possible pairs of words, something that isn't necessary in this case:
use strict;
use warnings;
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
# you want a hash to indicate which letters are present how many times in each position:
my %count;
for my $word (#words) {
my #letters = split //, $word;
$count{$_}{ $letters[$_] }++ for 0..$#letters;
}
# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:
my %max_common_letters_count;
my %max_common_letters_words;
for my $word (#words) {
my #letters = split //, $word;
my $total;
for my $position (0..$#letters, 'total') {
my $count;
if ( $position eq 'total' ) {
$count = $total;
}
else {
$count = $count{$position}{ $letters[$position] } - 1;
$total += $count;
}
if ( ! $max_common_letters_count{$position} || $count >= $max_common_letters_count{$position} ) {
if ( $max_common_letters_count{$position} && $count == $max_common_letters_count{$position} ) {
push #{ $max_common_letters_words{$position} }, $word;
}
else {
$max_common_letters_count{$position} = $count;
$max_common_letters_words{$position} = [ $word ];
}
}
}
}
# then show the maximum words for each position and in total:
for my $position ( sort { $a <=> $b } grep $_ ne 'total', keys %max_common_letters_count ) {
printf( "Position %s had a maximum of common letters of %s in words: %s\n",
$position,
$max_common_letters_count{$position},
join(', ', #{ $max_common_letters_words{$position} })
);
}
printf( "The maximum total common letters was %s in words(s): %s\n",
$max_common_letters_count{'total'},
join(', ', #{ $max_common_letters_words{'total'} })
);
Here's a complete script. It uses the same idea that ysth mentioned (although I had it independently). Use bitwise xor to combine the strings, and then count the number of NULs in the result. As long as your strings are ASCII, that will tell you how many matching letters there were. (That comparison is case sensitive, and I'm not sure what would happen if the strings were UTF-8. Probably nothing good.)
use strict;
use warnings;
use 5.010;
use List::Util qw(max);
sub findMatches
{
my ($words) = #_;
# Compare each word to every other word:
my #matches = (0) x #$words;
for my $i (0 .. $#$words-1) {
for my $j ($i+1 .. $#$words) {
my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;
$matches[$i] += $m;
$matches[$j] += $m;
}
}
# Find how many matches in the best word:
my $max = max(#matches);
# Find the words with that many matches:
my #wanted = grep { $matches[$_] == $max } 0 .. $#matches;
wantarray ? #$words[#wanted] : $words->[$wanted[0]];
} # end findMatches
my #words = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
say for findMatches(\#words);
Haven't touched perl in a while, so pseudo-code it is. This isn't the fastest algorithm, but it will work fine for a small amount of words.
totals = new map #e.g. an object to map :key => :value
for each word a
for each word b
next if a equals b
totals[a] = 0
for i from 1 to a.length
if a[i] == b[i]
totals[a] += 1
end
end
end
end
return totals.sort_by_key.last
Sorry about the lack of perl, but if you code this into perl, it should work like a charm.
A quick note on run-time: this will run in time number_of_words^2 * length_of_words, so on a list of 100 words, each of length 10 characters, this will run in 100,000 cycles, which is adequate for most applications.
Here's a version that relies on transposing the words in order to count the identical characters. I used the words from your original comparison, not the code.
This should work with any length words, and any length list. Output is:
Word score
---- -----
BALER 12
SALER 11
BAKER 11
CARER 10
RUFFR 4
The code:
use warnings;
use strict;
my #w = qw(BAKER SALER BALER CARER RUFFR);
my #tword = t_word(#w);
my #score;
push #score, str_count($_) for #tword;
#score = t_score(#score);
my %total;
for (0 .. $#w) {
$total{$w[$_]} = $score[$_];
}
print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total{$_}\n" for (sort { $total{$b} <=> $total{$a} } keys %total);
# transpose the words
sub t_word {
my #w = #_;
my #tword;
for my $word (#w) {
my $i = 0;
while ($word =~ s/(.)//) {
$tword[$i++] .= $1;
}
}
return #tword;
}
# turn each character into a count
sub str_count {
my $str = uc(shift);
while ( $str =~ /([A-Z])/ ) {
my $chr = $1;
my $num = () = $str =~ /$chr/g;
$num--;
$str =~ s/$chr/$num /g;
}
return $str;
}
# sum up the character counts
# while reversing the transpose
sub t_score {
my #count = #_;
my #score;
for my $num (#count) {
my $i = 0;
while( $num =~ s/(\d+) //) {
$score[$i++] += $1;
}
}
return #score;
}
Here is my attempt at an answer. This will also allow you to see each individual match if you need it. (ie. BALER matches 4 characters in BAKER). EDIT: It now catches all matches if there is a tie between words (I added "CAKER" to the list to test).
#! usr/bin/perl
use strict;
use warnings;
my #wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);
my %wordcomparison;
#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (#wordlist) {
my #letters = split(//, $word);
foreach my $otherword (#wordlist) {
my $count;
next if $otherword eq $word;
my #otherwordletters = split (//, $otherword);
foreach my $i (0..$#letters) {
$count++ if ( $letters[$i] eq $otherwordletters[$i] );
}
$wordcomparison{"$word"}{"$otherword"} = $count;
}
}
# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) {
foreach ( sort keys %{ $wordcomparison{$word} }) {
$wordcomparison{$word}{total} += $wordcomparison{$word}{$_};
}
}
#Want $word with highest total
my #max_match = (sort { $wordcomparison{$b}{total} <=> $wordcomparison{$a}{total} } keys %wordcomparison );
#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (#max_match) {
print "$_\n" if ($wordcomparison{$_}{total} >= $wordcomparison{$maximum}{total} )
}
The output is simply: CAKER BALER and BAKER.
The hash %wordcomparison looks like:
'SALER'
{
'RUFFR' => 1,
'BALER' => 4,
'BAKER' => 3,
'total' => 11,
'CARER' => 3
};
You can do this, using a dirty regex trick to execute code if a letter matches in its place, but not otherwise, thankfully it's quite easy to build the regexes as you go:
An example regular expression is:
(?:(C(?{ $c++ }))|.)(?:(A(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)(?:(E(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)
This may or may not be fast.
use 5.12.0;
use warnings;
use re 'eval';
my #words = qw(BAKER SALER BALER CARER RUFFR);
my ($best, $count) = ('', 0);
foreach my $word (#words) {
our $c = 0;
foreach my $candidate (#words) {
next if $word eq $candidate;
my $regex_str = join('', map {"(?:($_(?{ \$c++ }))|.)"} split '', $word);
my $regex = qr/^$regex_str$/;
$candidate =~ $regex or die "did not match!";
}
say "$word $c";
if ($c > $count) {
$best = $word;
$count = $c;
}
}
say "Matching: first best: $best";
Using xor trick will be fast but assumes a lot about the range of characters you might encounter. There are many ways in which utf-8 will break with that case.
Many thanks to all the contributers! You've certainly shown me that I still have a lot to learn, but you have also helped me tremendously in working out my own answer. I'm just putting it here for reference and possible feedback, since there are probably better ways of doing it. To me this was the simplest and most straight forward approach I could find on my own. Enjøy! :)
#!/usr/bin/perl
use strict;
use warnings;
# a list of words for testing
my #list = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
# populate two dimensional array with the list,
# so we can compare each letter with the other letters on the same row more easily
my $list_length = #list;
my #words;
for (my $i = 0; $i < $list_length; $i++) {
my #letters = split(//, $list[$i]);
my $letters_length = #letters;
for (my $j = 0; $j < $letters_length; $j++) {
$words[$i][$j] = $letters[$j];
}
}
# this gives a two-dimensionla array:
#
# #words = ( ["B", "A", "K", "E", "R"],
# ["S", "A", "L", "E", "R"],
# ["B", "A", "L", "E", "R"],
# ["C", "A", "R", "E", "R"],
# ["R", "U", "F", "F", "R"],
# );
# now, on to find the word with most letters in common with the other on the same row
# add up the score for each letter in each word
my $word_length = #words;
my #letter_score;
for my $i (0 .. $#words) {
for my $j (0 .. $#{$words[$i]}) {
for (my $k = 0; $k < $word_length; $k++) {
if ($words[$i][$j] eq $words[$k][$j]) {
$letter_score[$i][$j] += 1;
}
}
# we only want to add in matches outside the one we're testing, therefore
$letter_score[$i][$j] -= 1;
}
}
# sum each score up
my #scores;
for my $i (0 .. $#letter_score ) {
for my $j (0 .. $#{$letter_score[$i]}) {
$scores[$i] += $letter_score[$i][$j];
}
}
# find the highest score
my $max = $scores[0];
foreach my $i (#scores[1 .. $#scores]) {
if ($i > $max) {
$max = $i;
}
}
# and print it all out :D
for my $i (0 .. $#letter_score ) {
print "$list[$i]: $scores[$i]";
if ($scores[$i] == $max) {
print " <- best";
}
print "\n";
}
When run, the script yields the following:
BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4

A little help with loops on perl

I am having trouble specifiying the correct algorithm. I am iterating over an input file with loops. The issue that I have is on the last loop.
#!/usr/bin/perl
# Lab #4
# Judd Bittman
# http://www-users.cselabs.umn.edu/classes/Spring-2011/csci3003/index.php?page=labs
# this site has what needs to be in the lab
# lab4 is the lab instructions
# yeast protein is the part that is being read
use warnings;
use strict;
my $file = "<YeastProteins.txt";
open(my $proteins, $file);
my #identifier;
my #qualifier;
my #molecularweight;
my #pi;
while (my $line1 = <$proteins>) {
#print $line1;
chomp($line1);
my #line = split(/\t/, $line1);
push(#identifier, $line[0]);
push(#qualifier, $line[1]);
push(#molecularweight, $line[2]);
push(#pi, $line[3]);
}
my $extreme = 0;
my $ex_index = 0;
for (my $index = 1; $index < 6805; $index++) {
if ( defined($identifier[$index])
&& defined($qualifier[$index])
&& defined($molecularweight[$index])
&& defined($pi[$index])) {
# print"$identifier[$index]\t:\t$qualifier[$index]:\t$molecularweight[$index]:\n$pi[$index]";
}
if ( defined($identifier[$index])
&& defined($qualifier[$index])
&& defined($pi[$index])) {
if (abs($pi[$index] - 7) > $extreme && $qualifier[$index] eq "Verified")
{
$extreme = abs($pi[$index] - 7);
$ex_index = $identifier[$index];
print $extreme. " " . $ex_index . "\n";
}
}
}
print $extreme;
print "\n";
print $ex_index;
print "\n";
# the part above does part b of the assignment
# YLR204W,its part of the Mitochondrial inner membrane protein as well as a processor.
my $exindex = 0;
my $high = 0;
# two lines above and below is part c
# there is an error and I know there is something wrong
for (my $index = 1; $index < 6805; $index++) {
if ( defined($qualifier[$index])
&& ($qualifier[$index]) eq "Verified"
&& defined($molecularweight[$index])
&& (abs($molecularweight[$index]) > $high)) {
$high = (abs($molecularweight[$index]) > $high); # something wrong on this line, I know I wrote something wrong
$exindex = $identifier[$index];
}
}
print $high;
print "\n";
print $exindex;
print "\n";
close($proteins);
exit;
On the final loop I want my loop to hold on to the protein that is verified and has the highest molecular mass. This is in the input file. What code can I use to tell the program that I want to hold the highest number and its name? I feel like I am very close but I can't put my finger on it.
First, a note about perl - in general, it's more common to use foreach style loops rather than c-style indexed loops. For example:
for my $protein (#proteins) {
#do something with $p
}
(Your situation might require it, I just thought I'd mention this)
To address your specific issue though:
$high = (abs($molecularweight[$index])>$high);
$high is being set to the result of the boolean test being performed. Remove the >$high part (which is being tested in your if statement) and you'll likely end up with what you intended.
You likely want a more complex data structure, such as a nested hash. It's hard to give a solid example without more knowledge of the data, but, say your first identifier were abc, the second one was def, etc:
my %protein_entries = (
abc => {
qualifier => 'something',
molecular_weight => 1234,
pi => 'something',
},
def => {
qualifier => 'something else',
molecular_weight => 5678,
pi => 'something else',
},
# …
);
Then, rather than having several different arrays and keeping track of which belongs to which, you get at the elements like so:
Then, if you want to get at the highest by molecular weight, you can sort the identifiers by their molecular weight, then slice off the highest one:
my $highest = (sort {
$protein_entries{$a}{molecular_weight}
<=>
$protein_entries{$b}{molecular_weight}
} keys %protein_entries)[1];
You're having problem with your algorithm because you're not structuring your data properly, basically.
In this example, $highest will hold def, then later you can go back and fetch $protein_entries{def}{molecular_weight} or any of the other keys in the anonymous hash referenced by $protein_entries{def}, thus being easily able to recall any relevant associated data.
Just change:
$high = (abs($molecularweight[$index]) > $high);
To this:
$high = abs($molecularweight[$index]) if (abs($molecularweight[$index]) > $high);
At the end of the loop, $high will be the highest value in $molecularweight array.

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}