How do I sort an array of strings given an arbitrary ordering of those strings? - perl

I wish to sort an array of strings so that the strings wind up in the following order:
@set = ('oneM', 'twoM', 'threeM', 'sixM', 'oneY', 'twoY', 'oldest');
As you may notice, these represent time periods so oneM is the first month, etc. My problem is that I want to sort by the time period, but with the strings as they are I can't just use 'sort', so I created this hash to express how the strings should be ordered:
my %comparison = (
oneM => 1,
twoM => 2,
threeM => 3,
sixM => 6,
oneY => 12,
twoY => 24,
oldest => 25,
);
This I was hoping would make my life easier where I can do something such as:
foreach my $s (@set) {
foreach my $k (%comparison) {
if ($s eq $k) {
something something something
I'm getting the feeling that this is a long winded way of doing things and I wasn't actually sure how I would actually sort it once I've found the equivalent... I think I'm missing my own plot a bit so any help would be appreciated
As requested, the expected output would be like how it is shown in @set above. I should have mentioned that the values in @set will be part of that set, but not necessarily all of them and not in the same order.

You've chosen a good strategy in precomputing the data into a form that is easy to sort. You could calculate this data right inside the sort itself, but then you'd be wasting time recalculating it each time sort needs to compare a value, which happens more than once per element during the process. On the other hand, the drawback of a cache is, obviously, that you need additional memory to store it, and it might slow down your sort under low-memory conditions, despite doing fewer calculations overall.
With your current setup, sorting is as easy as:
my @sorted = sort { $comparison{$a} <=> $comparison{$b} } @set;
If you want to save memory at the expense of CPU, it would be:
my @sorted = sort { calculate_integer_based_on_input($a) <=> calculate_integer_based_on_input($b) } @set;
with a separate calculate_integer_based_on_input function that converts oneY and the like to 12 or another corresponding value on the fly, or just an inline conversion of the input to something suitable for sorting.
You might also want to check out common idioms for sorting with cached computations, such as the Schwartzian transform and the Guttman-Rosler transform.
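To make the two approaches above concrete, here is a minimal sketch using only core Perl. It assumes the %comparison hash from the question; rank_of is a hypothetical helper standing in for calculate_integer_based_on_input, and the input list is made up for illustration.

```perl
use strict;
use warnings;

# Rank table from the question.
my %comparison = (
    oneM => 1, twoM => 2, threeM => 3, sixM => 6,
    oneY => 12, twoY => 24, oldest => 25,
);

# Hypothetical helper: stands in for calculate_integer_based_on_input.
sub rank_of {
    my ($label) = @_;
    die "unknown period '$label'\n" unless exists $comparison{$label};
    return $comparison{$label};
}

# Made-up input: a subset of the known labels, in arbitrary order.
my @set = ('twoY', 'oneM', 'oldest', 'threeM');

# Direct hash lookup inside the comparator.
my @sorted = sort { $comparison{$a} <=> $comparison{$b} } @set;

# Schwartzian transform: compute each rank once, sort on it, strip it.
my @sorted_st =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ rank_of($_), $_ ] } @set;

print "@sorted\n";       # oneM threeM twoY oldest
print "@sorted_st\n";    # oneM threeM twoY oldest
```

For a plain hash lookup the transform buys little, since the lookup is already cheap. It starts to pay off when the per-element computation is expensive.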

Giving an example with the input and your expected result would help. I guess that this is what you are looking for:
my @data = ( ... );
my %comparison = (
oneM => 1, twoM => 2, threeM => 3,
sixM => 6, oneY => 12, twoY => 24,
oldest => 25,
);
my @sorted = sort { $comparison{$a} <=> $comparison{$b} } @data;
There are plenty of examples in the documentation for the sort function in the perlfunc manual page. ("perldoc -f sort")

Related

Scala: For loop that matches ints in a List

New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met and 90 times condition 'b'. However I want the 10 a's to occur at random.
The best way I can think is to create a val of 10 random integers, then loop through 1 to 100 ints.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
whenever i == to a number in z: 'Condition a met: do something'
else {
'condition b met: do something else'
}
}
I tried using contains and == and =! but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers 1-100 (by generating a randomly shuffled sequence of 1-100 and taking first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition a range 1-100 into those who are contained in your randomly generated list and those who are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
case i if (r.contains(i)) => println("yes: " + i) // or whatever
case i => println("no: " + i)
}
What you consider to be a simple for-loop actually isn't one. It's a for-comprehension, which is syntactic sugar that desugars into chained calls to map, flatMap and filter. Yes, it can be used in the same way as you would use the classical for-loop, but this is only because List is in fact a monad. Without going into too much detail, if you want to do things the idiomatic Scala way (the "functional" way), you should avoid trying to write classical iterative for loops and prefer getting a collection of your data and then mapping over its elements to perform whatever it is that you need. Note that collections have a really rich library behind them which allows you to invoke cool methods such as partition.
EDIT (for completeness):
Also, you should avoid side-effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Doing it like this is bad. Btw note that you could use foreach instead of map in that case, because you're not collecting results, just performing the side effects.
A better way would be to compute the needed stuff by transforming each element into an appropriate string. So, calculate the needed strings and accumulate them into results:
val results = (1 to 100).map {
case i if (r.contains(i)) => ("yes: " + i) // or whatever
case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a list of a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a special part of code for doing that and you will not be mixing it into your mapping logic which checks if number is contained in the random sequence.
(1 to 100).foreach { x =>
if(z.contains(x)) {
// do something
} else {
// do something else
}
}
or you can use a partial function, like so:
(1 to 100).foreach {
case x if(z.contains(x)) => // do something
case _ => // do something else
}

Best way to store data in to hash for flexible "pivot-table" like calculations

I have a data set with following fields.
host name, model, location, port number, activated?, up?
I would convert them into a hash structure (perhaps similar to below)
my %switches = (
a => {
"hostname" => "SwitchA",
"model" => "3750",
"location" => "Building A",
"total_ports" => 48,
"configured_ports" => 30,
"used_ports" => 24,
},
b => {
"hostname" => "SwitchB",
"model" => "3560",
"location" => "Building B",
"total_ports" => 48,
"configured_ports" => 36,
"used_ports" => 20,
},
);
In the end I want to generate statistics such as:
No. of switches per building,
No. of switches of each model per building
Total no. of up ports per building
The statistics may not be just restricted to building wise, may be even switch based (i.e, no. of switches 95% used etc.,). With the given data structure how can I enumerate those counters?
Conversely, is there a better way to store my data? I can think of at least one format:
<while iterating over records>
{
$hash{$location}{$model_name}{count}++;
if ($State eq 'Active') { $hash{$location}{up_ports}{count}++ }
}
What would be the better way to go about this? If I chose the first format (where all information is intact inside the hash) how can I mash the data to produce different statistics? (some example code snippets would be of great help!)
If you want querying flexibility, a "database" strategy is often good. You can do that directly, by putting the data into something like SQLite. Under that approach, you would be able to issue a wide variety of queries against the data without much coding of your own.
Alternatively, if you're looking for a pure Perl approach, the way to approximate a database table is by using an array-of-arrays or, even better for code readability, an array-of-hashes. The outer array is like the database table. Each hash within that array is like a database record. Your Perl-based queries would end up looking like this:
my @query_result = grep {
$_->{foo} == 1234 and
$_->{bar} eq 'fubb'
} @data;
If you have so many rows that query performance becomes a bottleneck, you can create your own indexes, using a hash.
%data_by_switch = (
'SwitchA' => [0, 4, 13, ...], # Subscripts to @data.
'SwitchB' => [1, 12, ...],
...
);
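As a rough sketch of how the array-of-hashes layout answers the statistics asked for in the question: the records below are made up, the field names follow the question's first hash, and the counter names (%switch_count, %model_count, %used_ports) are hypothetical.

```perl
use strict;
use warnings;

# Made-up sample records: one hash per switch (an array of hashes).
my @data = (
    { hostname => 'SwitchA', model => '3750', location => 'Building A',
      total_ports => 48, used_ports => 24 },
    { hostname => 'SwitchB', model => '3560', location => 'Building B',
      total_ports => 48, used_ports => 20 },
    { hostname => 'SwitchC', model => '3750', location => 'Building A',
      total_ports => 48, used_ports => 47 },
);

# One pass over the records fills all per-building counters.
my (%switch_count, %model_count, %used_ports);
for my $switch (@data) {
    my $loc = $switch->{location};
    $switch_count{$loc}++;
    $model_count{$loc}{ $switch->{model} }++;
    $used_ports{$loc} += $switch->{used_ports};
}

# Switch-level query: switches with at least 95% of their ports in use.
my @nearly_full = grep { $_->{used_ports} / $_->{total_ports} >= 0.95 } @data;

for my $loc (sort keys %switch_count) {
    print "$loc: $switch_count{$loc} switches, $used_ports{$loc} ports in use\n";
    for my $model (sort keys %{ $model_count{$loc} }) {
        print "  model $model: $model_count{$loc}{$model}\n";
    }
}
print "nearly full: $_->{hostname}\n" for @nearly_full;
```

Keeping the raw records intact and deriving counters on demand means a new statistic only needs another loop or grep, not a new storage layout.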
My answer is based on answers I received for this question, which has some similarities with your question.
As far as I can see you have a list of tuples, for the sake of the discussion it is enough to consider objects with 2 attributes, for example location and ports_used. So, for example:
(["locA", 23], ["locB", 42], ["locA", 13]) # just the values as tuples, no keys
And you want a result like:
("locA" => 36, "locB" => 42)
Is this correct? If so, what is the problem you are facing?

Wanted: a quicker way to check all combinations within a very large hash

I have a hash with about 130,000 elements, and I am trying to check all combinations within that hash for something (130,000 x 130,000 combinations). My code looks like this:
foreach $key1 (keys %CNV)
{
foreach $key2 (keys %CNV)
{
if (blablabla){do something that doesn't take as long}
}
}
As you might expect, this takes ages to run. Does anyone know a quicker way to do this? Many thanks in advance!!
-Abdel
Edit: Update on the blablabla.
Hey guys, thanks for all the feedback! Really appreciate it. I changed the foreach statement to:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (keys %{$CNV{$j}})
{
foreach $key2 (keys %{$CNV{$j}})
{
if (blablabla){do something}
}
}
}
The hash is now multidimensional:
$CNV{chromosome}{$start,$end}
I'll elaborate on what I'm exactly trying to do, as requested.
The blablabla is the following:
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
(($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1}))
)
In short: The hash elements represent a specific part of the DNA (a so-called "CNV", think of it like a gene for now), with a start and an end (which are integers representing their position on that particular chromosome, stored in hashes with the same keys: %CNVstart & %CNVend). I'm trying to check for every combination of CNVs whether they overlap. If there are two elements that overlap within a family (I mean a family of persons whose DNA I have and read in; there is also a for-statement inside the foreach-statement that lets the program check this for every family, which makes it take even longer), I check whether they also have the same "copy number" (which is stored in another hash with the same keys) and print out the result.
Thank you guys for your time!
It sounds like Algorithm::Combinatorics may help you here. It's intended to provide "efficient generation of combinatorial sequences." From its docs:
Algorithm::Combinatorics is an efficient generator of combinatorial sequences. ... Iterators do not use recursion, nor stacks, and are written in C.
You could use its combinations sub-routine to provide all possible 2 key combos from your full set of keys.
On the other hand, Perl itself is written in C. So I honestly have no idea whether or not this would help at all.
Maybe by using concurrency? But you would have to be careful with what you do on a positive match so as not to run into problems.
E.g. take $key1 and split it into $key1A and $key1B. Then create two separate threads, each containing "half of the loop".
I am not sure exactly how expensive it is to start new threads in Perl, but if your positive action doesn't have to be synchronized I imagine that on matching hardware you would be faster.
Worth a try imho.
define blah blah.
You could write it like this:
foreach $key1 (keys %CNV)
{
if (blah1)
{
foreach $key2 (keys %CNV)
{
if (blah2){do something that doesn't take as long}
}
}
}
If blah1 rules out most values of $key1, the inner loop rarely runs and this approaches O(N); in the worst case it is still O(N^2).
The data structure in the question is not a good fit to the problem. Let's try it this way.
use 5.010; # enables say
use Set::IntSpan::Fast::XS;
my @CNV;
for ([3, 7], [4, 8], [9, 11]) {
my $set = Set::IntSpan::Fast::XS->new;
$set->add_range(@{$_});
push @CNV, $set;
}
# The comparison is commutative, so we can cut the total number in half.
for my $index1 (0 .. -1+@CNV) {
for my $index2 (0 .. $index1) {
next if $index1 == $index2; # skip if it's the same CNV
say sprintf(
'overlap of CNV %s, %s at indices %d, %d',
$CNV[$index1]->as_string, $CNV[$index2]->as_string, $index1, $index2
) unless $CNV[$index1]->intersection($CNV[$index2])->is_empty;
}
}
Output:
overlap of CNV 4-8, 3-7 at indices 1, 0
We will not get the overlap of 3-7, 4-8 because it is a duplicate.
There's also Bio::Range, but it doesn't look so efficient to me. You should definitely get in touch with the bio.perl.org/open-bio people; chances are that what you're doing has been done a million times before and they already have the optimal algorithm all figured out.
I think I found the answer :-)
Couldn't have done it without you guys though. I found a way to skip most of the comparisons I make:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (sort keys %{$CNV{$j}})
{
foreach $key2 (sort keys %{$CNV{$j}})
{
if (($CNVstart{$j}{$key2} < $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} < $CNVstart{$j}{$key1}))
{
next;
}
if (($CNVstart{$j}{$key2} > $CNVend{$j}{$key1}) && ($CNVend{$j}{$key2} > $CNVend{$j}{$key1}))
{
last;
}
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
(($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1}))
) {print some stuff out}
}
}
}
What I did is:
sort the keys of the hash for each foreach loop
do "next" if the CNVs with $key2 still haven't reached the CNV with $key1 (i.e. start2 and end2 are both smaller than start1)
and probably the most time-saving: end the foreach loop if the CNV with $key2 has overtaken the CNV with $key1 (i.e. start2 and end2 are both larger than end1)
Thanks a lot for your time and feedback guys!
Your optimisation with taking out the j into the outer loop was good, but the solution is still far from optimal.
Your problem does have a simple solution that takes O(N+M) once the CNV boundaries are sorted (the sort itself is O(N log N)), where N is the total number of CNVs and M is the number of overlaps.
The idea is: you walk through the length of DNA while keeping track of all the "current" CNVs. If you see a new CNV start, you add it to the list and you know that it overlaps with all the other CNVs currently in the list. If you see a CNV end, you just remove it from the list.
I am not a very good perl programmer, so treat the following as a pseudo-code (it's more like a mix of Java and C# :)):
// input:
Map<CNV, int> starts;
Map<CNV, int> ends;
// temporary:
List<Tuple<int, bool, CNV>> boundaries;
foreach(CNV cnv in starts)
boundaries.add(starts[cnv], false, cnv);
foreach(CNV cnv in ends)
boundaries.add(ends[cnv], true, cnv);
// Sort first by position,
// then where position is equal we put "starts" first, "ends" last
boundaries = boundaries.OrderBy(t => t.first*2 + (t.second?1:0));
HashSet<CNV> current;
// main loop:
foreach((int position, bool isEnd, CNV cnv) in boundaries)
{
if(isEnd)
current.remove(cnv);
else
{
foreach(CNV otherCnv in current)
OVERLAP(cnv, otherCnv); // output of the algorithm
current.add(cnv);
}
}
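For readers who would rather see the pseudo-code above in Perl, here is a minimal sketch of the same sweep for a single chromosome. The %cnv sample data and the ids are made up, the ends are assumed to be inclusive (as in the question's >= and <= tests), and overlaps are simply collected as "id/id" strings rather than checked for family or copy number.

```perl
use strict;
use warnings;

# Made-up sample CNVs on one chromosome: id => [start, end], ends inclusive.
my %cnv = (
    a => [3, 7],
    b => [4, 8],
    c => [9, 11],
    d => [8, 9],
);

# One boundary event per start and per end: [position, is_end, id].
my @boundaries;
for my $id (keys %cnv) {
    my ($start, $end) = @{ $cnv{$id} };
    push @boundaries, [ $start, 0, $id ], [ $end, 1, $id ];
}

# Sort by position; at equal positions starts (0) come before ends (1),
# so CNVs that merely touch at one position still count as overlapping.
@boundaries = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @boundaries;

my %current;     # CNVs that are open at the current position
my @overlaps;    # result: one "id/id" string per overlapping pair
for my $event (@boundaries) {
    my ($position, $is_end, $id) = @$event;
    if ($is_end) {
        delete $current{$id};
    }
    else {
        # A newly opened CNV overlaps every CNV that is still open.
        for my $other (keys %current) {
            push @overlaps, join('/', sort { $a cmp $b } ($other, $id));
        }
        $current{$id} = 1;
    }
}

print join(' ', sort @overlaps), "\n";    # a/b b/d c/d
```

In the question's setup this would run once per chromosome, which replaces the nested foreach over all key pairs with one sort and one pass.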
Now I'm not a perl warrior, but based on the information given it is the same in any programming language; unless you sort the "hash" on the property you want to check and do a binary lookup you won't improve any performance in a lookup.
You can also if it is possible calculate which indexes in your hash would have the properties you are interested in, but as you have no information regarding such a possibility, this would perhaps not be a solution.

How can I filter a Perl DBIx recordset with 2 conditions on the same column?

I'm getting my feet wet in DBIx::Class — loving it so far.
One problem I am running into is that I want to query records, filtering out records that aren't in a certain date range.
It took me a while to find out how to do a <= type of match instead of an equality match:
my $start_criteria = ">= $start_date";
my $end_criteria = "<= $end_date";
my $result = $schema->resultset('MyTable')->search(
{
'status_date' => \$start_criteria,
'status_date' => \$end_criteria,
});
The obvious problem with this is that since the filters are in a hash, I am overwriting the value for "status_date", and am only searching where the status_date <= $end_date. The SQL that gets executed is:
SELECT me.* from MyTable me where status_date <= '9999-12-31'
I've searched CPAN, Google and SO and haven't been able to figure out how to apply 2 conditions to the same column. All documentation I've been able to find shows how to filter on more than 1 column, but not 2 conditions on the same column.
I'm sure I'm missing something obvious. Can someone here point it out to me?
IIRC, you should be able to pass an array reference of multiple search conditions (each in its own hashref.) For example:
my $result = $schema->resultset('MyTable')->search(
[ { 'status_date' => \$start_criteria },
{ 'status_date' => \$end_criteria },
]
);
Edit: Oops, never mind. That does an OR, as opposed to an AND.
It looks like the right way to do it is to supply a hashref for a single status_date:
my $result = $schema->resultset('MyTable')->search(
{ status_date => { '>=' => $start_date, '<=' => $end_date } }
);
This stuff is documented in SQL::Abstract, which DBIC uses under the hood.
There is BETWEEN in SQL and in DBIx::Class it's supported:
my $result = $schema->resultset('MyTable')
->search({status_date => {between => [$start_date,$end_date]}});

How can I generate all subsets of a list in Perl?

I have a mathematical set in a Perl array: (1, 2, 3). I'd like to find all the subsets of that set: (1), (2), (3), (1,2), (1,3), (2,3).
With 3 elements this isn't too difficult but if set has 10 elements this gets tricky.
Thoughts?
You can use Data::PowerSet like Matthew mentioned. However, if, as indicated in your example, you only want proper subsets and not every subset, you need to do a little bit more work.
# result: all subsets, except {68, 22, 43}.
my $values = Data::PowerSet->new({max => 2}, 68, 22, 43);
Likewise, if you want to omit the null set, just add the min parameter:
# result: all subsets, except {} and {68, 22, 43}.
my $values = Data::PowerSet->new({min => 1, max => 2}, 68, 22, 43);
Otherwise, to get all subsets, just omit both parameters:
# result: every subset.
my $values = Data::PowerSet->new(68, 22, 43);
See Data::PowerSet, http://coding.derkeiler.com/Archive/Perl/comp.lang.perl/2004-01/0076.html , etc.
Since you say "mathematical set", I assume you mean there are no duplicates.
A naive implementation that works for up to 32 elements:
my $set = [1,2,3];
my @subsets;
for my $count ( 1..(1<<@$set)-2 ) {
push @subsets, [ map $count & (1<<$_) ? $set->[$_] : (), 0..$#$set ];
}
(For the full range of subsets, loop from 0 to (1<<@$set)-1; excluding 0 excludes the null set, excluding (1<<@$set)-1 excludes the original set.)
Update: I'm not advocating this over using a module, just suggesting it in case you are looking to understand how to go about such a problem. In general, each element is either included or excluded from any given subset. You want to pick an element and generate first all possible subsets of the other elements not including your picked element and then all possible subsets of the other elements including your picked element. Recursively apply this to the "generate all possible subsets". Finally, discard the null subset and the non-proper subset. In the above code, each element is assigned a bit. First all subsets are generated with the high bit off, then all those with it on. For each of those alternatives, subsets are generated first with the next-to-highest bit off, then on. Continuing this until you are just working on the lowest bit, what you end up with is all the possible numbers, in order.
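The recursive description above can also be written out directly. This is only a sketch of that idea, not a replacement for the bit-mask loop; the subsets function name is made up, and the final grep drops the null set and the full set as described.

```perl
use strict;
use warnings;

# Return every subset of the given elements as a list of array refs.
sub subsets {
    my @elements = @_;
    return ([]) unless @elements;        # no elements: only the empty set
    my $picked  = shift @elements;
    my @without = subsets(@elements);    # subsets that leave $picked out
    my @with;                            # the same subsets with $picked added
    push @with, [ $picked, @$_ ] for @without;
    return (@without, @with);
}

my @set = (1, 2, 3);
my $n   = @set;

# Keep only proper, non-empty subsets (drop the null set and the full set).
my @proper = grep { my $size = @$_; $size > 0 && $size < $n } subsets(@set);

my @shown;
push @shown, '(' . join(',', @$_) . ')' for @proper;
print "@shown\n";    # (3) (2) (2,3) (1) (1,3) (1,2)
```

The recursion doubles the result at every level, so like the bit-mask version it produces 2^N subsets and is only practical for small sets.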
If you don't want to use an existing module or can't then you can simply code your own subset generation algorithm using a bit-mask and a binary counter. Sample code follows -
#!/usr/bin/perl
use strict;
use warnings;
my @set = (1, 2, 3);
my @bitMask = (0, 0, 0); # Same size as @set, initially filled with zeroes
printSubset(\@bitMask, \@set) while ( genMask(\@bitMask, \@set) );
sub printSubset {
my ($bitMask, $set) = @_;
for (0 .. @$bitMask-1) {
print "$set->[$_]" if $bitMask->[$_] == 1;
}
print"\n";
}
sub genMask {
my ($bitMask, $set) = @_;
my $i;
for ($i = 0; $i < @$set && $bitMask->[$i]; $i++) {
$bitMask->[$i] = 0;
}
if ($i < @$set) {
$bitMask->[$i] = 1;
return 1;
}
return 0;
}
Note: I haven't been able to test the code, some bugs might need to be ironed out.
Use Algorithm::ChooseSubsets.
It's a counting problem - for N elements there are exactly 2^N subsets and you have to count from 0 to 2^N - 1 in binary to list them all.
E.g. for 3 items there are 8 possible subsets: 000, 001, 010, 011, 100, 101, 110 and 111 - the digits show which members are present.