Remove "duplicate" formulas - perl

I made a program that will take in 5 numbers and will use the first four to acquire the fifth number as a solution. Solutions can only contain positive integers, and the only operators acceptable are "+ - * /". There are 11 different way that the numbers and operators can be arranged with parentheses. Ex "(n # n) # n # n" where n represents numbers and # represents operators.
I have no problem finding all the solutions, my problem is removing "duplicates". I have been able to remove most duplicates using
%Seen = ();
#solutions = grep { ! $Seen{ $_ }++ } #solutions;
However I am unable to figure out a way to remove "duplicate" formulas.
Using 21 14 2 7 to acquire 34 gives us 4 solutions after the first duplicates have been removed. Here they are
21/7=3; 14+3=17; 2*17=34
21/7=3; 3+14=17; 2*17=34
21/7=3; 3+14=17; 17*2=34
21/7=3; 14+3=17; 17*2=34
My teacher considers these mathematically the same and as such all four of them are just 1 solution. What I can't figure out how to do is find these "duplicates" and remove them. Any help is appreciated, thank you.

A more generic form of the dedupping code you used is
grep !$seen{key($_)}++, ...
In this case, key would be
sub key {
( my $key = $_[0] ) =~ s/(\d+)([*+])(\d+)/ $1 < $3 ? "$1$2$3" : "$3$2$1" /eg;
return $key;
}
In your case, you might want to simply normalise your inputs first
sub normalise(_) {
( my $s = $_[0] ) =~ s/(\d+)([*+])(\d+)/ $1 < $3 ? "$1$2$3" : "$3$2$1" /eg;
return $s;
}
#solutions = grep !$seen{$_}++, map normalise, #solutions;

For example, for commutative operations, only consider x # y where x <= y. This way, 2 * 17 is possible, but 17 * 2 is not.

Related

Binary search—Can't use string "1" as a symbol ref while strict refs is in use

I've been browsing over the already answered questions regarding this error message.
I am trying to solve a problem from the Rosalind web site that looks for some indexes using a binary search.
When my subroutine finds the number it seems to ignore it, and if I try to print the $found variable, it gives me the error
Can't use string "1" as a symbol ref while strict refs is in use
The code is this
sub binarysearch
{
my $numbertolook = shift;
my #intarray=#_;
my $lengthint = scalar #intarray;
my #sorted = sort {$a <=> $b} #intarray;
#print $numbertolook, " " , #sorted, "\n";
my $low=0;
my $high=$lengthint-1;
my $found =undef;
my $midpoint;
while ($low<$high)
{
$midpoint=int(($low+$high)/2);
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ($numbertolook<$sorted[$midpoint])
{
$high=$midpoint;
}
elsif ($numbertolook>$sorted[$midpoint])
{
$low=$midpoint;
}
elsif ($numbertolook==$sorted[$midpoint])
{
$found=1;
print $found "\n";
last;
}
if ($low==$high-1 and $low==$midpoint)
{
if ($numbertolook==$sorted[$high])
{
$found=1;
print $found "\n";
last;
}
$low=$high;
}
}
return $found;
}
You want
print $found, "\n";
Or
print $found . "\n";
With no operator between $found and the newline, it thinks $found is the filehandle to print a newline to, and is getting an error because it isn't a filehandle.
I'll try to help
First of all, as simple as it may seem, a binary search is quite difficult to code correctly. The main reason is that it's a hotbed of off-by-one errors, which are so prevalent that they have their own Wikipedia page
The issue is that an array containing, say, the values A to Z will have 26 elements with indices 0 to 25. I think FORTRAN bucks the trend, and Lua, but pretty much every other language has the first element of an array at index zero
A zero base works pretty well for everything until you start using divide and conquer algorithms. Merge Sort as well as Binary Search are such algorithms. Binary search goes
Is it in the first half?
If so then search the first half further
Else search the second half further
The hard part is when you have to decide when you've found the object, or when you need to give up looking. Splitting data in two nearly-halves is easy. Knowing when to stop is hard
It's highly efficient for sorted data, but the problem comes when implementing it that, if we do it properly, we have to deal with all sorts of weird index bases beyond zero or one.
Suppose I have an array
my #alpha = 'A' .. 'Q'
If I print scalar #alpha I will see 17, meaning the array has seventeen elements, indexed from 0 to 16
Now I'm looking for E in that array, so I do a binary search, so I want the "first half" and the "second half" of #alpha. If I add 0 to 16 and divide by 2 I get a neat "8", so the middle element is at index 8, which is H
But wait. There are 17 elements, which is an odd number, so if we say the first eight (A .. H) are left of the middle and the last eight (I .. Q) are right of the middle then surely the "middle" is I?
In truth this is all a deception, because a binary search will work however we partition the data. In this case binary means two parts, and although the search would be more efficient if those parts could be equal in size it's not necessary for the algorithm to work. So it can be the first third and the last two-thirds, or just the first element and the rest
That's why using int(($low+high)/2) is fine. It rounds down to the nearest integer so that with our 17-element array $mid is a usable 8 instead of 8.5
But your code still has to account for some unexpected things. In the case of our 17-element array we have calculated the middle index to be 8. So indexes 0 .. 7 are the "first half" while 8 .. 16 are the "second half", and the middle index is where the second half starts
But didn't we round the division down? So in the case of an odd number of elements, shouldn't our mid point be at the end of the first half, and not the start of the second? This is an arcane off-by-one error, but let's see if it still works with a simple even number of elements
#alpha = `A` .. `D`
The start and and indices are 0 and 3; the middle index is int((0+3)/2) == 1. So the first half is 0..1 and the second half is 2 .. 3. That works fine
But there's still a lot more. Say I have to search an array with two elements X and Y. That has two clear halves, and I'm looking for A, which is before the middle. So I now search the one-element list X for A. The minimum and maximum elements of the target array are both zero. The mid-point is int((0+0)/2) == 0. So what happens next?
It is similar but rather worse when we're searching for Z in the same list. The code has to be exactly right, otherwise we will be either searching off the end of the array or checking the last element again and again
Saving the worst for last, suppose
my #alpha = ( 'A', 'B, 'Y, 'Z' )
and I'm looking for M. That lest loose all sorts of optimisations that involve checks that may may the ordinary case much slower
Because of all of this it's by far the best solution to use a library or a language's built-in function to do all of this. In particular, Perl's hashes are usually all you need to check for specific strings and any associated data. The algorithm used is vastly better than a binary search for any non-trivial data sets
Wikipedia shows this algorithm for an iterative binary search
The binary search algorithm can also be expressed iteratively with two index limits that progressively narrow the search range.
int binary_search(int A[], int key, int imin, int imax)
{
// continue searching while [imin,imax] is not empty
while (imin <= imax)
{
// calculate the midpoint for roughly equal partition
int imid = midpoint(imin, imax);
if (A[imid] == key)
// key found at index imid
return imid;
// determine which subarray to search
else if (A[imid] < key)
// change min index to search upper subarray
imin = imid + 1;
else
// change max index to search lower subarray
imax = imid - 1;
}
// key was not found
return KEY_NOT_FOUND;
}
And here is a version of your code that is far from bug-free but does what you intended. You weren't so far off
use strict;
use warnings 'all';
print binarysearch( 76, 10 .. 99 ), "\n";
sub binarysearch {
my $numbertolook = shift;
my #intarray = #_;
my $lengthint = scalar #intarray;
my #sorted = sort { $a <=> $b } #intarray;
my $low = 0;
my $high = $lengthint - 1;
my $found = undef;
my $midpoint;
while ( $low < $high ) {
$midpoint = int( ( $low + $high ) / 2 );
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ( $numbertolook < $sorted[$midpoint] ) {
$high = $midpoint;
}
elsif ( $numbertolook > $sorted[$midpoint] ) {
$low = $midpoint;
}
elsif ( $numbertolook == $sorted[$midpoint] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
if ( $low == $high - 1 and $low == $midpoint ) {
if ( $numbertolook == $sorted[$high] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
return;
}
}
return $midpoint;
}
output
FOUND
66
If you call print with several parameters separated with a space print expects the first one to be a filehandle. This is interprented as print FILEHANDLE LIST from the documentation.
print $found "\n";
What you want to do is either to separate with ,, to call it as print LIST.
print $found, "\n";
or to concat as strings, which will also call it as print LIST, but with only one element in LIST.
print $found . "\n";

How can I compare two arrays with alphanumeric elements?

I have two arrays that I want to compare
#array1 = ( aaa, bbb, aaabbb, aaa23bbb, ddd555, 430hd9789);
#array2 = ( 34322hh2, jjfjr78, uuu7shv, ddd555, hjkdjroo);
I have to compare these two arrays and find duplicate and do something about it.
Conditions:
Length of each element in array can be different. There is no such fixed pattern.
Elements can be just numeric i.e. 334343, or just char i.e. "somewordexample", or it can alphanumeric i.e. wewe83493
There can be more such elements in the array.
Now I know the following about comparison operators == and eq:
== is for comparing numbers
eq is for string comparison
How can I compare alphanumeric values?
This is my code so far
for (my $i = 0 ; $i <= $#array1 ; $i++ ) {
for (my $j = 0 ; $j <= $#array2 ; $j++ ) {
if ( $array1[$i] == $arra2[$j] ) {
print "duplicate";
}
}
}
You manner is indolent, and you seem to be looking for a quick fix without caring whether you understand the solution. The posts on Stack Overflow are primarily for people other than the originator who may have a similar problem.
You should read perlfaq4. Specifically:
perldoc -q intersection - "How do I compute the difference of two arrays? How do I compute the intersection of two arrays?"
perldoc -q contained - "How can I tell whether a certain element is contained in a list or array?"
perldoc -q duplicate - "How can I remove duplicate elements from a list or array?"
Thank you for posting your misbehaving code.
There are a few problems
You must always use strict and use warnings at the top of every Perl program, and declare each variable as close as possible to its first point of use. That simple measure will reveal many faults for you that you may otherwise overlook
I have used qw to define the array data
It is much better to use the Perl foreach than the C-style for
As you appear to have discovered, the == operator is for comparing numbers. You have strings so you need eq
Apart from that, all I have changed in your code is to mention the text of the duplicate entry instead of just printing "duplicate"
use strict;
use warnings;
my #array1 = qw( aaa bbb aaabbb aaa23bbb ddd555 430hd9789 );
my #array2 = qw( 34322hh2 jjfjr78 uuu7shv ddd555 hjkdjroo );
for my $i (0 .. $#array1) {
for my $j (0 .. $#array2) {
if ( $array1[$i] eq $array2[$j] ) {
print "Duplicate '$array1[$i]'\n";
}
}
}
output
Duplicate 'ddd555'
Your alphanumeric values can still be treated as strings. If you want to find elements that are in both your lists, you can use the get_intersection function provided by the List::Compare module:
use strict;
use warnings;
use List::Compare;
my #array1 = qw(aaa bbb aaabbb aaa23bbb ddd555 430hd9789);
my #array2 = qw(34322hh2 jjfjr78 uuu7shv ddd555 hjkdjroo);
my $comp = List::Compare->new(\#array1, \#array2);
my #duplicates = $comp->get_intersection();
if (#duplicates > 0) {
print "#duplicates\n";
}
Output:
ddd555
Alphanumeric values are just strings. Numeric values are a subset of those that Perl considers to be numeric (i.e. Scalar::Util::looks_like_number() returns true). In this case, you could use eq or any other string-related function for comparison (such as the less commonly used index).
To find exact duplicates in O(n) time
my %seen;
for my $duplicate (grep { ++$seen{$_} > 1 } (#array1, #array2))
{
# Do what you need to do to the duplicates
}
If you just want to get rid of the elements of #array1 that are duplicated in #array2,
my %seen = map { $_ => 1 } #array2;
#array1 = grep { not $seen{$_} } #array1;
you can do this using exact matching regex:
if("4lph4" =~ /^4lph4$/)
{ .... }

Creating a lazy hashed iterator

In research of this question I peeked at the Iterators chapter in the book Higher Order Perl, and some of the material there was a bit over my head and didn't think necessarily addressed what I specifically want here.
What I mean by lazy hashed iterator is a way to create a structure that would emulate this behavior:
%Ds = {
'1' => 1 .. 20;
'2' => 21 .. 40;
'3' => 41 .. 60;
'4' => 61 .. 80;
...
}
Unfortunately, since this is a hash it would not be in order and thus useless in case of very large numbers.
The behavior is this:
I have a number.
I need to compare it with a sequence of ranges and as a result of the comparison the
code/sub would return another number that is the "key" of that range in case the
number is in that range. (>= with the beginning or <= with the end point of said range)
The "key" of the ranges are numbers from 1..2..3 and so on.
The code/sub will always return for a positive integer no matter how large it is.
By implementing this all lazily I mean if there is a way to emulate this behavior and not compute the sequences of ranges with their respective "keys" with every call of the sub or iteration of a loop. Basically compute once.
Yes it's true that I could choose a maximum boundary, hardcode this in a loop and be done with it but the problem is I don't know of how many of these steps I would need in the end.
Is there a way to do this with perl constructs or maybe perhaps there is a CPAN module that offers this kind of behaviour and my simple search of it didn't uncover it.
Here is a piece of code that illustrates what I mean:
sub get_nr {
my $nr = shift;
my %ds = map { $a = '1' if /1/ .. /20/;
$a = '2' if /21/ .. /40/;
$a = '3' if /41/ .. /60/;
$a = '4' if /61/ .. /80/;
$_ => $a } 1 .. 80;
while (my ($k, $v) = each %ds) {
if ( $k == $nr){
print "number is in range $v \n";
}
}
}
The output for:
get_nr(4);
get_nr(15);
get_nr(22);
get_nr(45);
Is:
number is in range 1
number is in range 1
number is in range 2
number is in range 3
Based on the discussion in the comments, the code you seem to want is a very simple subroutine
sub get_nr {
my $nr = shift;
my $range = int(($nr-1) / 20) + 1;
return $range;
}
You need to compensate for the edge cases, you wanted 20 to return 1, for example, so we need to subtract 1 from the number before dividing it.
If you want to customize further, you might use a variable for the range size, instead of a hard coded number.
sub get_range_number {
my ($n) = #_;
return int(($n-1)/20) + 1;
}
print "$_ is in range ".get_range_number($_)."\n"
for 4, 15, 22, 45;

fast way to compare rows in a dataset

I asked this question in R and got a lot of answers, but all of them crash my 4Gb Ram computer after a few hours running or they take a very long time to finish.
faster way to compare rows in a data frame
Some people said that it's not a job to be done in R. As I don't know C and I'm a little bit fluent in Perl, I'll ask here.
I'd like to know if there is a fast way to compare each row of a large dataset with the other rows, identifying the rows with a specific degree of homology. Let's say for the simple example below that I want homology >= 3.
data:
sample_1,10,11,10,13
sample_2,10,11,10,14
sample_3,10,10,8,12
sample_4,10,11,10,13
sample_5,13,13,10,13
The output should be something like:
output
sample duplicate matches
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
Matches are calculated when both lines have same numbers on same positions,
perl -F',' -lane'
$k = shift #F;
for my $kk (#o) {
$m = grep { $h{$kk}[$_] == $F[$_] } 0 .. $#F;
$m >=3 or next;
print ++$i, " $kk $k $m";
}
push #o, $k;
$h{$k} = [ #F ];
' file
output,
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
This solution provides an alternative to direct comparison, which will be slow for large data amounts.
Basic idea is to build an inverted index while reading the data.
This makes comparison faster if there are a lot of different values per column.
For each row, you look up the index and count the matches - this way you only consider the samples where this value actually occurs.
You might still have a memory problem because the index gets as large as your data.
To overcome that, you can shorten the sample name and use a persistent index (using DB_File, for example).
use strict;
use warnings;
use 5.010;
my #h;
my $LIMIT_HOMOLOGY = 3;
while(my $line = <>) {
my #arr = split /,/, $line;
my $sample_no = shift #arr;
my %sim;
foreach my $i (0..$#arr) {
my $value = $arr[$i];
our $l;
*l = \$h[$i]->{$value};
foreach my $s (#$l) {
$sim{$s}++;
}
push #$l, $sample_no;
}
foreach my $s (keys %sim) {
if ($sim{$s}>=$LIMIT_HOMOLOGY) {
say "$sample_no: $s. Matches: $sim{$s}";
}
}
}
For 25000 rows with 26 columns with random integer values between 1 and 100, the program took 69 seconds on my mac book air to finish.

What is the best way to implement the APL compress operator in Perl?

Sometimes my APL familiarity gives me algorithmic ideas for problem solving that I re-implement in a language I have - Perl, for example.
So I have processed a text file to create a boolean vector indicating the used fields in a delimited file, and now I want to output the indexes of those used fields, and the names of the used fields. In APL, I would use the compress operator over the vector of field names, and over the iota of the number of fields.
In Perl, I did this:
my #UsedFieldNames = map { $UsedFields[$_] ? $FieldNames[$_] : () } 0 .. $#UsedFields;
and
say join " ", map { $UsedFields[$_] ? $) : () } 0 .. $#UsedFields;
where #UsedFields is an array containing 0 for unused and 1 for used fields.
I don't really like using map with ?:() to simulate compress - is there a better way (my real program does it a third time when simulating a vertical or reduction over the file)?
I don't really like doing the map over the indexes to get the results - is there a better way to compute that? (I guess one optimization would be to compute the used indexes first, then
#UsedFieldNames = #FieldNames[#UsedIndexes];
The approach with grep or map is the right one, and is what APL would have been using behind the scenes. You can hide that in Perl too with a subroutine:
sub compress (\#\#) {
#{$_[0]}[ grep $_[1][$_] => 0 .. $#{$_[1]} ]
#or use:
# map {$_[1][$_] ? $_[0][$_] : ()} 0 .. $#{$_[0]}
}
my #source = qw(one two three four);
my #ok = qw(0 1 0 1 );
my #new = compress #source, #ok;
say "#new"; # two four
If you are working with array references, you have a few other syntactic options, and in this case I might write it as a scalar method for infix application:
my $compress = sub {
my $src = shift;
my $ok = #_ == 1 && ref $_[0] eq 'ARRAY' ? shift : \#_;
wantarray ? #$src[ grep $$ok[$_] => 0 .. $#$ok ]
: sub{\#_}->(#$src[ grep $$ok[$_] => 0 .. $#$ok ])
};
my $source = [qw(one two three four)];
my $ok = [qw(1 0 1 0 )];
my $new = $source->$compress($ok);
say "#$new"; # one three
say join ' ' => $source->$compress(0, 1, 1, 0); # two three
Other ways:
my #UsedFieldNames = map { ( $FieldNames[$_] ) x !!$UsedFields[$_] } 0..$#UsedFields;
my #UsedFieldNames = #FieldNames[ grep $UsedFields[$_], 0..$#UsedFields ];