Perl forming string random string combination - perl

I have a file with around 25000 records, each records has more than 13 entries are drug names. I want to form all the possible pair combination for these entries. Eg: if a line has three records A, B, C. I should form combinations as 1) A B 2) A C 3)B C. Below is the code I got from internet, it works only if a single line is assigned to an array:
use Math::Combinatorics;
my #n = qw(a b c);
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
The code I am using, it doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
chomp $_;
#Drugs = split /\t/, $_;
#n = $Drugs[1];
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
print "\n";
}
Can you please suggest me a solution to this problem?

You're setting #n to be an array containing the second value of the #Drugs array, try just using data => \#Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.

All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
my #a = #_;
my #pairs;
foreach my $i (0 .. $#a) {
foreach my $j ($i+1 .. $#a) {
push #pairs, [ #a[$i,$j] ];
}
}
wantarray ? #pairs : \#pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
print "#$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium

I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.

Related

Sort the column values and search the value

Input to my script is this file which contains data as below.
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85
For each distinct value of the second column, I want to print any line where the number in the third column is different.
Example output
C food 85
A car 69
Here is my Perl code.
#! /usr/local/bin/perl
use strict;
use warning;
my %data = ();
open FILE, '<', 'data.txt' or die $!;
while ( <FILE> ) {
chomp;
$data{$1} = $2 while /\s*(\S+),(\S+)/g;
}
close FILE;
print $_, '-', $data{$_}, $/ for keys %data;
I am able to print the hash keys and values, but not able to get the desired output.
Any pointers on how to do that using Perl?
As far as I can tell from your question, you want a list of all the lines where there is an "odd one out" with the same item type and a different number in the third column from all the rest
I think this is what you need
It reads all the data into hash %data, so that $data{$type}{$n} is a (reference to an) array of all the data lines that use that object type and number
Then the hash is scanned again, looking for and printing all instances that have only a single line with the given type/number and where there are other values for the same object type (otherwise it would be the only entry and not an "odd one out")
use strict;
use warnings 'all';
use autodie;
my %data;
open my $fh, '<', 'data.txt';
while ( <$fh> ) {
my ( $label, $type, $n) = split;
push #{ $data{$type}{$n} }, $_;
}
for my $type ( keys %data ) {
my $items = $data{$type};
next unless keys %$items > 1;
for my $n ( keys %$items ) {
print $items->{$n}[0] if #{ $items->{$n} } == 1;
}
}
output
C food 85
A car 69
Note that this may print multiple lines for a given object type if the input looks like, say
B car 22
A car 33
B car 136
C car 136
This has two "odd ones out" that appear only once for the given object type, so both B car 22 and A car 33 will be printed
Here are the pointers:
First, you need to remember lines somewhere before outputting them.
Second, you need to discard previously remembered line for the object according to the rules you set.
In your case, the rule is to discard when the number for the object differs from the previous remembered.
Both tasks can be accomplished with the hash.
For each line:
my ($letter, $object, $number)=split /\s+/, $line;
if (!defined($hash{$object}) || $hash{$object}[0]!=$number) {
$hash{$object}=[$number, $line];
}
Third, you need to output the hash:
for my $object(keys %hash) {
print $hash{$object}[1];
}
But there is the problem: a hash is an unordered structure, it won't return its keys in the order you put them into the hash.
So, the fourth: you need to add the ordering to your hash data, which can be accomplished like this:
$hash{$object}=[$number,$line,$.]; # $. is the row number over all the input files or STDIN, we use it for sorting
And in the output part you sort with the stored row number
(see sort for details about $a, $b variables):
for my $object(sort { $hash{$a}[2]<=>$hash{$b}[2] } keys %hash) {
print $hash{$object}[1];
}
Regarding the comments
I am certain that my code does not contain any errors.
If we look at the question before it was edited by some high rep users, it states:
[cite]
Now where if Numeric column(Third) has different value (Where in 2nd column matches) ...Then print only the mismatched number line. example..
A food 75
B car 136
A car 69
A house 179
B food 75
B car 136
C food 85
Example output (As number columns are not matching)
C food 85
[/cite]
I can only interpret that print only the mismatched number line as: to print the last line for the object where the number changed. That clearly matches the example the OP provided.
Even so, in my answer I addressed the possibility of misinterpretation, by stating that line omitting is done according to whatever rules the OP wants.
And below that I indicated what was the rule by that time in my opinion.
I think it well addressed the OP problem, because, after all, the OP wanted the pointers.
And now my answer is critiqued because it does not match the edited (long after and not by OP) requirements.
I disagree.
Regarding the whitespace: specifying /\s+/ for split is not an error here, despite of some comments trying to assert that.
While I agree that " " is common for split, I would disagree that there are a lot of cases where you must use " " instead of /\s+/.
/\s+/ is a regular expression which is the conventional argument for split, while " " is the shorthand, that actually masks the meaning.
With that I decided to use explicit split /\s+/, $line in my example instead of just split " ", $line or just split specifically to show the innerworkings of perl.
I think it is important to any one new to perl.
It is perfectly ok to use /\s+/, but be careful if you expect to have leading whitespace in your data, consult perldoc -f split and decide whether /\s+/ suits your needs or not.

How can I split a word into its component letters?

I am working with Perl and I have an array with only one word:
#example = ("helloword")
I want to generate another array in which each element is a letter from the word:
#example2 = ("h", "e", "l"...)
I need to do that because I need to count the numbers of "h", "e"... How can I do this?
To count how many times letter occurred in a string,
print "helloword" =~ tr/h//; # for 'h' letter
otherwise you can split string and assign list to an array,
my #example2 = split //, $example[0];
I don't completely grasp exactly what you need to count, but perhaps you can take pieces from this example, which uses a hash to store the letters and counts of each...
use warnings;
use strict;
my #array = 'helloworld';
my %letters;
$letters{$_}++ for split //, $array[0];
my $total;
while (my ($k, $v) = each %letters){
$total += $v;
print "$k: $v\n";
}
print "Total letters in string: $total\n",
Output:
w: 1
d: 1
l: 3
o: 2
e: 1
r: 1
h: 1
Total letters in string: 10
Try using this code, found here: http://www.comp.leeds.ac.uk/Perl/split.html
#chars = split(//, $word);
You can of course use split(//,"helloworld"), but that's not as efficient as unpack. Figuring out the template to provide to unpack can be somewhat steep, but this should work for you: unpack('(A)*',"helloworld"). For example:
perl -e 'print(join("\n",unpack("(A)*","helloworld")),"\n")'
h
e
l
l
o
w
o
r
l
d
To count the number of letters, you could either assume that every character of a "word" you split the string up into is a letter and simply evaluate the list in scalar context (or use 'length'), e.g. print(scalar(#letters),"\n"); or print(length(#letters),"\n"), OR you could create a count variable and increment it in a map when a letter pattern is matched, e.g.:
my $cnt = 0;
foreach(#chars){$cnt++ if(/\w/)}
print("$cnt\n");
Or you can use the same evaluation of a list in scalar trick with a grep:
print(scalar(grep {/\w/} #chars),"\n");
There are of course, in perl, other ways to do it.
EDIT: In case I misinterpreted the question, and you want to know how many of each letter there is in the string, then this should suffice:
$cnt = 0;
foreach(unpack("(A)*","helloworld")))
{
next unless(/\w/);
$hash->{$_}->{ORD} = $cnt++ unless(exists($hash->{$_}));
$hash->{$_}->{CNT}++;
}
foreach(sort {$hash->{$a}->{ORD} <=> $hash->{$b}->{ORD}}
keys(%$hash))
{print("$_\t$hash->{$_}->{CNT}\n")}
This solution has the advantage of keeping the unique letters in the order of their first occurrence in the word they were found in.

Glob for generating all permutations

I was using the following code to generate all the possible combinations of a permutation and store it in #apriorirow , This works correctly for small $globstring, but produces null #apriorirow for large inputs. How can I solve this problem?
$globstring = join ',', #distinctelements;
#apriorirow = glob "{$globstring}" x $i;
print "$globstring";
What you are generating with glob is neither a combination nor a permutation, but a Cartesian product (also known as the cross product). Storing the Cartesian product in an array will use a lot of memory for non-trivial input data. You can use Set::CrossProduct to iterate through the Cartesian product one tuple at a time:
use Set::CrossProduct;
my $rows = 3;
my #elements = qw(a b c);
my #array;
push #array, \#elements for 1..$rows;
my $iterator = Set::CrossProduct->new(\#array);
while (my $tuple = $iterator->get) {
say join ',', #$tuple;
}
Output:
a,a,a
a,a,b
a,a,c
a,b,a
a,b,b
a,b,c
a,c,a
a,c,b
a,c,c
b,a,a
b,a,b
b,a,c
b,b,a
b,b,b
b,b,c
b,c,a
b,c,b
b,c,c
c,a,a
c,a,b
c,a,c
c,b,a
c,b,b
c,b,c
c,c,a
c,c,b
c,c,c
Of course, you can also iterate with glob:
perl -wE 'say while glob "{a,b,c}" x 3'
will do essentially the same thing. But I find the intent clearer with Set::CrossProduct and prefer working with arrays instead of comma-delimited strings (not to mention the additional features Set::CrossProduct has like peeking at the next tuple without actually iterating).

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match as they contain the same data in a diffrent order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparision for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my #fields = split;
# create the normalized representation by lowercasing and sorting the fields
my #normalized_fields = sort map lc, #fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for #normalized_fields;
# if this is an unknow value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-seperated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for #normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistant field (e.g. the seperator of the original file). Here an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push #normalized_fields, ("\x00") x ($fields - #normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push #{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless #$list > 1;
print "Lines ", join(', ', #$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
Make two hash tables A and B
Stream through your input one line at a time
For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;)
For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B — if it does, you have a reverse match — if not, then repeat the addition process from step 3 to add it to the hash tables
To do a version of amon's answer without a hash table, if your data are numerical, you could:
Stream through input line by line, sorting fields one and two by numerical ordering
Pipe result to UNIX sort on first and second fields
Stream through sorted output line by line, checking if current line matches the previous line (reporting a reverse match, if true)
This has the advantage of using less memory than hash tables, but may take more time to process.
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

Perl hash when both the keys and the values are array references

I have a problem where pairs of numbers map to other pairs of numbers. For instance, (1,2)->(12,97). Some pairs may map to multiple other pairs, so what I really need is the ability to map a pair into a list of lists, like (1,2)->((12,97),(4,1)). At the end of the day I want to process each of the values (i.e., each list of lists) separately.
In Python, I could do this by simply saying:
key = ( x, y )
val = [ a, b ]
if (x,y) not in my_dict:
my_dict[ (x,y) ] = []
my_dict[ (x,y) ].append( [a,b] )
However, in Perl, I have to use refs for the keys and values. So I can certainly say:
$keyref = [ x1, y1 ]
$valref = [ a, b ]
%my_hash = { $keyref => $valref }
But what happens when another pair (x2,y2) comes along? Even if x2==x1 and y2==y1, $keyref=[x2,y2] will differ from the previous keyref generated, so I do not see a way to do the lookup. Of course, I could compare (x2,y2) with each dereferenced hash key, but after all, God gave us hash tables precisely to avoid the need to do so.
Is there a Perl solution?
Thanks,
-W.
In Perl, all hash keys are strings, or are "stringified" before lookup. Using an array reference as a key is usually the wrong approach.
What about using a "two-dimensional" hash?
$hash{$x1}{$y1} = [ $a, $b ];
# or
%hash = ( $x1 => { $y1 => [ $a, $b ] } );
($x2,$y2)=($x1,$y1);
print #{$hash{$x2}{$y2}}; # will print $a and $b
Like most things in Perl, TMTOWTDI.
Option 1: Use multidimensional array emulation
$hash{$x,$y} = [$a, $b];
See also the documentation for the built-in variable $;.
Option 2: Use the Hash::MultiKey module
tie %hash, 'Hash::MultiKey';
$hash{[$x, $y]} = [$a, $b];
Option 3: Use a HoH (hash of hashes) instead
$hash{$x}{$y} = [$a, $b];
I ended up using Socket Puppet's solution (in the form of Michael Carmen's Option 3). FYI, here is a little Perl script that carries out all the operations I need in my app.
Printed lines 2:,3: and 4:,5: just use different syntax to do the same thing, and lines 0: and 1: were just intended as sanity checks along the way.
What this this adds to the suggested solution is the use of an array of arrays as the value that goes along with a key.
#k1 = ( 12, 13 );
$aref = [ 11, 22 ];
$bref = [ 33, 44 ];
%h = {};
if( not exists $h{$k1[0]}{$k1[1]} ) {
print "initializing\n";
$h{$k1[0]}{$k1[1]} = [];
}
push #{$h{$k1[0]}{$k1[1]}}, $aref;
push #{$h{$k1[0]}{$k1[1]}}, $bref;
print "0: ", join ':', #{$h{$k1[0]}{$k1[1]}}, "\n";
print "1: ", join ':', ${$h{$k1[0]}{$k1[1]}}[0], "\n";
print "2: ", join ':', #{${$h{$k1[0]}{$k1[1]}}[0]}, "\n";
print "3: ", join ':', #{${$h{$k1[0]}{$k1[1]}}[1]}, "\n";
print "4: ", join ':', #{$h{$k1[0]}{$k1[1]}->[0]}, "\n";
print "5: ", join ':', #{$h{$k1[0]}{$k1[1]}->[1]}, "\n";
P.S. I would have added this as a comment but it was too long, and I thought it made sense to include a worked example.