I was using the following code to generate all the possible combinations of a permutation and store them in @apriorirow. This works correctly for a small $globstring, but produces an empty @apriorirow for large inputs. How can I solve this problem?
$globstring = join ',', @distinctelements;
@apriorirow = glob "{$globstring}" x $i;
print "$globstring";
What you are generating with glob is neither a combination nor a permutation, but a Cartesian product (also known as the cross product). Storing the Cartesian product in an array will use a lot of memory for non-trivial input data. You can use Set::CrossProduct to iterate through the Cartesian product one tuple at a time:
use feature 'say';
use Set::CrossProduct;
my $rows = 3;
my @elements = qw(a b c);
my @array;
push @array, \@elements for 1..$rows;
my $iterator = Set::CrossProduct->new(\@array);
while (my $tuple = $iterator->get) {
    say join ',', @$tuple;
}
Output:
a,a,a
a,a,b
a,a,c
a,b,a
a,b,b
a,b,c
a,c,a
a,c,b
a,c,c
b,a,a
b,a,b
b,a,c
b,b,a
b,b,b
b,b,c
b,c,a
b,c,b
b,c,c
c,a,a
c,a,b
c,a,c
c,b,a
c,b,b
c,b,c
c,c,a
c,c,b
c,c,c
Of course, you can also iterate with glob:
perl -wE 'say while glob "{a,b,c}" x 3'
will do essentially the same thing. But I find the intent clearer with Set::CrossProduct and prefer working with arrays instead of comma-delimited strings (not to mention the additional features Set::CrossProduct has like peeking at the next tuple without actually iterating).
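If installing a module is not an option, the one-tuple-at-a-time idea can also be sketched in core Perl. This is only a minimal illustration, not how Set::CrossProduct is implemented: product_iterator is a made-up name, and it assumes the sets are passed as array references.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one tuple (an array
# ref) per call and undef once the product is exhausted. Only the index
# counters are kept in memory, never the full product.
sub product_iterator {
    my @sets = @_;               # list of array refs
    my @idx  = (0) x @sets;      # current position in each set
    # no sets, or any empty set: there is nothing to yield
    my $done = !@sets || grep { !@$_ } @sets;

    return sub {
        return if $done;
        my $tuple = [ map { $sets[$_][ $idx[$_] ] } 0 .. $#sets ];

        # advance like an odometer, rightmost position first
        my $pos = $#sets;
        while ($pos >= 0) {
            my $size = @{ $sets[$pos] };
            last if ++$idx[$pos] < $size;
            $idx[$pos--] = 0;
        }
        $done = 1 if $pos < 0;   # every position wrapped around
        return $tuple;
    };
}

my $next = product_iterator([qw(a b)], [qw(x y)]);
while (my $tuple = $next->()) {
    print join(',', @$tuple), "\n";
}
```

For the two small sets above this prints a,x then a,y then b,x then b,y, one per line.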
I am trying to write a function in Perl that computes the cross product (cartesian product) of two Strings. I have similar code in Python that looks like this:
def cross(A, B):
    "Cross product of elements in A and elements in B."
    return [a+b for a in A for b in B]
How could I mimic this list comprehension in an elegant way?
Here is what I have so far:
# compute the cross product of two Strings
# cross('12','AB') = ((1,A), (1,B), (2,A), (2,B))
sub cross {
    # unpack strings
    my ($A, $B) = @_;
    # array to hold products
    my @out_array;
    # split strings into arrays
    my @A_array = split(//, $A);
    my @B_array = split(//, $B);
    # glue the characters together and append to output array
    for my $r (@A_array) {
        for my $c (@B_array) {
            push @out_array, [$r . $c];
        }
    }
    return \@out_array;
}
This isn't working exactly as I would expect, for some reason a reference is coming back from split() instead of a List.
Any suggestions or other more elegant cartesian product solutions would be appreciated.
Your problem is in this part:
push @out_array, [$r . $c];
$r . $c concatenates the two scalars to a string. [EXPR] creates an array reference. You don't want a reference, just plain strings:
push @out_array, $r . $c;
If you would rather have syntactic sugar than push, you can use a module that implements gather/take:
my @cross = gather {
    for my $x (@A) {
        for my $y (@B) {
            take $x . $y;
        }
    }
};
This is implemented e.g. by List::Gather or Syntax::Keyword::Gather.
I myself am fond of elaborate map expressions:
my @cross = map { my $x = $_; map $x.$_, @B } @A;
(same as for with push for all practical purposes).
Note: Perl does not have a concept of “characters” that is related to arrays. When single characters are needed, these are modelled by strings of length 1. Perl arrays always contain scalars, but for (memory) performance reasons strings are not implemented as Perl arrays, but as a pointer to a C array (of known length). The downside is different sets of operations for strings and arrays, the upside is less memory usage.
As characters are just very short strings, to join them we use standard string concatenation with the . operator.
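Putting those pieces together, the whole function from the question might look like the sketch below. It assumes you want a flat list of two-character strings back (as the Python version returns) rather than an array reference.

```perl
use strict;
use warnings;

# cross('12', 'AB') returns the list ('1A', '1B', '2A', '2B')
sub cross {
    my ($left, $right) = @_;
    my @right_chars = split //, $right;   # strings of length 1
    return map { my $l = $_; map $l . $_, @right_chars } split //, $left;
}

print join(' ', cross('12', 'AB')), "\n";   # 1A 1B 2A 2B
```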
I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
    chomp;
    my @fields = split;
    # create the normalized representation by lowercasing and sorting the fields
    my @normalized_fields = sort map lc, @fields;
    # find or create the path in the lookup
    my $pointer = \$cache;
    $pointer = \${$pointer}->{$_} for @normalized_fields;
    # if this is an unknown value, make it known, and output the line
    unless (defined $$pointer) {
        $$pointer = 1; # set some defined value
        print "$_\n"; # emit the unique line
    }
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as the value in the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-separated values here, but you could replace the split with a call to Text::CSV or something similar.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
    $$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if @fields > $fields;
...; # make normalized fields
push @normalized_fields, ("\x00") x ($fields - @normalized_fields);
...; # do the lookup
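To make the "more complex nodes" idea concrete, here is a minimal self-contained sketch. It swaps the two-slot array nodes for small hashes with named keys (children and seen); that layout and the name first_sighting are my own choices for illustration. The point is that a short record is only treated as known if its own node was marked, not merely because a longer record passed through it.

```perl
use strict;
use warnings;

# Lookup tree. Each node is a hash: { children => { field => node }, seen => count }
my %root;

# Hypothetical helper: true the first time a given set of fields is offered,
# regardless of field order or letter case.
sub first_sighting {
    my @normalized = sort map lc, @_;
    my $node = \%root;
    # walk down the tree, creating missing nodes on the way
    $node = $node->{children}{$_} //= {} for @normalized;
    return !$node->{seen}++;
}

for my $line ('X1 y1', 'X2 y2', 'Y1 x1', 'x1', 'X1') {
    print "$line\n" if first_sighting(split ' ', $line);
}
```

With this input, the lines X1 y1, X2 y2 and x1 are printed: the one-field record x1 is new even though x1 already exists as an inner node, while Y1 x1 and the final X1 are recognized as repeats.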
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
    chomp;
    my $key = join ',', sort map lc, split /,/;
    push @{$data{$key}}, $.;
}
foreach my $list (values %data) {
    next unless @$list > 1;
    print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
Output:
Lines 1, 3 are equivalent
1. Make two hash tables A and B.
2. Stream through your input one line at a time.
3. For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;).
4. For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B. If it does, you have a reverse match; if not, repeat the addition process from step 3 to add it to the hash tables.
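A rough sketch of this streaming idea follows. Note that it is a stricter variant, not a literal transcription: instead of only testing whether the second field exists as a key, it checks that the stored partner is actually the first field, and it uses a single hash of hashes (%partners, my own name) so that one value can have several partners. reverse_matches is likewise a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based line numbers of rows whose
# reversed pair appeared on an earlier row.
sub reverse_matches {
    my @lines = @_;
    my (%partners, @hits);
    for my $n (1 .. scalar @lines) {
        my ($x, $y) = split /,/, $lines[$n - 1];
        if (exists $partners{$y} && $partners{$y}{$x}) {
            push @hits, $n;            # "y,x" was stored earlier
        }
        else {
            $partners{$x}{$y} = 1;     # remember this pair for later rows
        }
    }
    return @hits;
}

print "Reverse match on line $_\n"
    for reverse_matches('x1,y1', 'x2,y2', 'y1,x1', 'x3,y3');
```

For the sample data this reports line 3.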
To do a version of amon's answer without a hash table, if your data are numerical, you could:
1. Stream through input line by line, sorting fields one and two by numerical ordering.
2. Pipe the result to UNIX sort on the first and second fields.
3. Stream through the sorted output line by line, checking if the current line matches the previous line (reporting a reverse match, if true).
This has the advantage of using less memory than hash tables, but may take more time to process.
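The three steps can be sketched in one small Perl script. Be aware that this sketch uses Perl's built-in sort as a stand-in for the external UNIX sort, so it holds every line in memory and gives up the memory advantage; it only illustrates the normalize, sort, compare-with-previous logic. The helper names are made up.

```perl
use strict;
use warnings;

# Step 1: put the numerically smaller field first.
sub normalize_pair {
    my ($line) = @_;
    return join ',', sort { $a <=> $b } split /,/, $line;
}

# Steps 2 and 3: sort the normalized lines, then report every line that
# equals its predecessor.
sub duplicate_pairs {
    my @normalized = map { normalize_pair($_) } @_;
    @normalized = sort @normalized;    # stand-in for piping through UNIX sort
    my @dups;
    for my $i (1 .. $#normalized) {
        push @dups, $normalized[$i] if $normalized[$i] eq $normalized[$i - 1];
    }
    return @dups;
}

print "Duplicate pair: $_\n" for duplicate_pairs('3,7', '10,2', '7,3', '5,5');
```

Here the only pair reported is 3,7.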
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
    warn "Found duplicate: $_" if $_ =~ $re;
    next unless /^(.*),(.*)$/;
    die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
    $re = qr/^\Q$2,$1\E$|$re/;
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
I am using the uniq function exported by the module List::MoreUtils to find the unique elements in an array. However, I want it to find the unique elements in a case-insensitive way. How can I do that?
I have dumped the output of the Array using Data::Dumper:
#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use List::MoreUtils qw(uniq);
use feature "say";
my @elements=<array is formed here>;
my @words=uniq @elements;
say Dumper \@words;
Output:
$VAR1 = [
'John',
'john',
'JohN',
'JOHN',
'JoHn',
'john john'
];
Expected output should be: john, john john
Only 2 elements, rest all should be filtered since they are the same word, only the difference is in case.
How can I remove the duplicate elements ignoring the case?
Lowercase the elements with lc in a map statement:
my @uniq_no_case = uniq map lc, @elements;
The reason List::MoreUtils' uniq is case sensitive is that it relies on the deduping characteristics of hashes, whose keys are also case sensitive. The code for uniq looks like so:
sub uniq {
    my %seen = ();
    grep { not $seen{$_}++ } @_;
}
If you want to use this sub directly in your own code, you could incorporate lc in there:
sub uniq_no_case {
    my %seen = ();
    grep { not $seen{$_}++ } map lc, @_;
}
Explanation of how this works:
@_ contains the arguments to the subroutine, and they are fed to a grep statement. Any elements that return true when passed through the code block are returned by the grep statement. The code block consists of a few finer points:
$seen{$_}++ returns 0 the first time an element is seen. The value is still incremented to 1, but only after it is returned (as opposed to ++$seen{$_}, which would increment first, then return).
By negating the result of the incrementation, we get true for the first key, and false for every following such key. Hence, the list is deduped.
grep as the last statement in the sub will return a list, which in turn is returned by the sub.
map lc, @_ simply applies the lc function to all elements in @_.
Use a hash to keep track of the words you have already seen, but also normalize them for upper/lower case:
my %seen;
my @unique;
for my $w (@words) {
    next if $seen{lc($w)}++;
    push(@unique, $w);
}
# @unique has the unique words
Note that this will preserve the case of the original words.
UPDATE: As noted in the comments, it's not clear exactly what the OP needs, but I wrote the solution this way to illustrate a general technique for selecting unique representatives from a list under some "equivalence relation." In this case the equivalence relationship is word $a is equivalent to word $b if and only if lc($a) eq lc($b).
Most equivalence relationships can be expressed in this way, that is, the relationship is defined by a classifier function f() such that $a is equivalent to $b if and only if f($a) eq f($b). For instance, if we want to say that two words are equivalent if they have the same length, then f() would be length().
So now you might see why I wrote the algorithm this way - the classifier function may not produce values that are part of the original list. In the case of f = length, we want to select words, but f of a word is a number.
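To show the general technique with the classifier pulled out as a parameter, here is a small sketch. uniq_by is a hypothetical helper name (not something exported by List::MoreUtils here); it takes the classifier f() as a code reference and keeps the first element of each equivalence class.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the first element of each equivalence class,
# where the class of an element is named by $classifier->($element).
sub uniq_by {
    my ($classifier, @items) = @_;
    my %seen;
    return grep { !$seen{ $classifier->($_) }++ } @items;
}

my @words = qw(John john JOHN apple Kiwi);
print join(' ', uniq_by(sub { lc $_[0] },     @words)), "\n";   # John apple Kiwi
print join(' ', uniq_by(sub { length $_[0] }, @words)), "\n";   # John apple
```

Note how the second call selects words even though the classifier values (4 and 5) are numbers that never appear in the list.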
I have a file with around 25000 records; each record has more than 13 entries, which are drug names. I want to form all the possible pair combinations of these entries. E.g., if a line has three entries A, B, C, I should form the combinations 1) A B 2) A C 3) B C. Below is the code I got from the internet; it works only if a single line is assigned to an array:
use Math::Combinatorics;
my @n = qw(a b c);
my $combinat = Math::Combinatorics->new(
    count => 2,
    data => [@n],
);
while ( my @combo = $combinat->next_combination ) {
    print join( ' ', @combo ) . "\n";
}
The code I am using doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
    chomp $_;
    @Drugs = split /\t/, $_;
    @n = $Drugs[1];
    my $combinat = Math::Combinatorics->new(
        count => 2,
        data => [@n],
    );
    while ( my @combo = $combinat->next_combination ) {
        print join( ' ', @combo ) . "\n";
    }
    print "\n";
}
Can you please suggest a solution to this problem?
You're setting @n to be an array containing only the second value of the @Drugs array. Try just using data => \@Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.
All pairs from an array are straightforward to compute. Using drugs A, B, and C as in your question, you might think of them as forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
1. Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
2. Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
3. Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
    my @a = @_;
    my @pairs;
    foreach my $i (0 .. $#a) {
        foreach my $j ($i+1 .. $#a) {
            push @pairs, [ @a[$i,$j] ];
        }
    }
    wantarray ? @pairs : \@pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
    print "@$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium
I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.
I noticed Math::Cartesian::Product returns an array of blessed objects instead of a simple array of arrays. I couldn't figure out why. I actually need to do some extra work (unbless) to use the results...
I added a cartesian function to List::Gen recently:
cartesian CODE LIST_of_ARRAYREF
cartesian computes the cartesian product of any number of array refs, each which can be any size. returns a generator
use List::Gen 'cartesian';
my $product = cartesian {$_[0] . $_[1]} [qw/a b/], [1, 2];
print "@$product"; # 'a1 a2 b1 b2'
The "generator" returned is a lazy tied array that will generate values when asked for them. There are also iterative and other accessor methods:
my $pairs = cartesian {@_} [qw/$ @ %/], ['a'..'z'], [1 .. 3];
while (my @tuple = $pairs->next) { # $pairs->reset; #$pairs->index = 5; ...
    print @tuple, ', ';
}
# $a1, $a2, $a3, $b1, $b2, $b3, $c1, $c2, $c3, $d1, $d2, $d3, $e1 ...
I don't know how large the sets you will be working with are, but the advantage to using the above approach is that the storage requirements for the generator remain O(1)
my $digits = cartesian {join '' => #_} ([0..9]) x 10;
say for @$digits[10**9 - 3 .. 10**9 + 3];
# 0999999998
# 0999999999
# 1000000000
# 1000000001
# 1000000002
# 1000000003
which calculates only 6 elements of the set, and stores nothing.
As you can see from the examples, the return value of cartesian itself is a generator object, but that object's subsequent return values are whatever the coderef passed to cartesian returns. So if you want array references, it's as simple as: cartesian {\@_} ...
Also, what extra work did you have to do to deal with the blessed reference? A blessed array is still an array in every sense except for what ref will return. If you are writing switch logic based on reference type, Scalar::Util's reftype is what you should use.
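To illustrate that last point, here is a short sketch using only core Perl. The class name Some::Class is made up for the example; the array is blessed by hand rather than coming from any particular module.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# A hand-blessed array reference; Some::Class is a hypothetical name.
my $tuple = bless [ 'a', 1 ], 'Some::Class';

print ref($tuple), "\n";        # Some::Class
print reftype($tuple), "\n";    # ARRAY
print "@$tuple\n";              # a 1  (dereferences like any array ref)
print scalar(@$tuple), "\n";    # 2

my @plain = @$tuple;            # shallow copy into an ordinary array
```

So code that only dereferences the result does not need to unbless anything; only code that branches on ref needs to switch to reftype.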
One alternative is the module Set::CrossProduct, which will yield ordinary, unblessed array references:
use Set::CrossProduct;
my $iter = Set::CrossProduct->new([ \@foo, \@bar ]);
while (my $tuple = $iter->get){
    ...
}
Or get all tuples at once:
my @tuples = $iter->combinations;
It blesses the arrays returned by cartesian so that when code such as the following is run
$b = $cartesian $a1, $a2;
$c = $cartesian $b, $a3;
... it can detect that $b is the result of a previous call to the module.
Computing a Cartesian product is fairly trivial. If the data returned by that module does not fit your needs, consider writing the operation yourself from scratch.
Anyway, inspecting the module source code reveals it is not too great.
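As a rough idea of what "from scratch" could look like, here is a minimal sketch that builds the whole product eagerly and returns plain, unblessed array references. cartesian_product is my own name for it, and it assumes the sets are passed as array references; for large inputs an iterator would be kinder to memory.

```perl
use strict;
use warnings;

# Hypothetical helper: takes any number of array refs and returns a list
# of plain array refs, one per tuple. Every tuple is built by extending
# each partial tuple with each element of the next set.
sub cartesian_product {
    my @tuples = ([]);
    for my $set (@_) {
        @tuples = map { my $head = $_; map [ @$head, $_ ], @$set } @tuples;
    }
    return @tuples;
}

print join('', @$_), "\n" for cartesian_product([qw(a b)], [1, 2]);
```

This prints a1, a2, b1 and b2, one per line.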