Perl CPAN module fisher exact test - perl

Is there any module in CPAN that can provide a method to compute the Fishers exact tests?
example in R:
in a 2x2 contingency table like:
17 12
8842559 10003821
fisher.test(matrix(data = c(17,8842559,12,10003821), nrow = 2))
Fisher's Exact Test for Count Data
data: counts
p-value = 0.2642
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7213591 3.6778630
sample estimates:
odds ratio
1.602697
I used Text::NSP::Measures::2D::Fisher module, but I am not sure it does the same as above.
use Text::NSP::Measures::2D::Fisher::twotailed;
my $npp = 10003821;
my $n1p = 8842559;
my $np1 = 12;
my $n11 = 17;
my $twotailed_value = calculateStatistic(
n11 => $n11,
n1p => $n1p,
np1 => $np1,
npp => $npp,
);
if( (my $errorCode = getErrorCode()) ) {
print STDERR $errorCode, " - ", getErrorMessage();
} else {
print getStatisticName, "value for bigram is ", $twotailed_value, "\n";
}
but it does not give me anything

The matrix differ from R to perl. You can't use the same matrix !
for perl this is the matrix (into brackets):
word2 ~word2
word1 [n11] n12 | [n1p]
~word1 n21 n22 | n2p
--------------
[np1] np2 [npp]
For R this is the matrix (into brackets):
word2 ~word2
word1 [n11] [n12] | n1p
~word1 [n21] [n22] | n2p
--------------
np1 np2 npp

Related

What unicode characters are suitable for making a heatmap in console?

I want to make a command that takes a matrix or vectors of numbers and prints a string-representation where every number is mapped to a character with varying darkness depending on its value.
The only characters I've found that gives a consistent shape but varying color is the unicode block elements ' ░▒▓█' (see e.g. wikipedia), but this only gives me 5 possible shades (space, 3 shades, 1 filled block). I use every character twice so the widhts is approximately the same as the height.
What other characters are suitable for drawing a heatmap in console?
See example code in python below. The question is of course applicable for other languages as well.
import numpy as np
def ascii_heatmap(matrix: np.ndarray, disp=True):
assert matrix.ndim <= 2
matrix = np.atleast_2d(matrix)
vmax = matrix.max()
vmin = matrix.min()
symbolrange= ' ░▒▓█'
symbol_index_matrix = (matrix - vmin) * (len(symbolrange)-1) / (vmax-vmin)
heatmap_rows = []
for row in symbol_index_matrix:
heatmap_rows.append("".join(map(lambda x: symbolrange[int(x)]*2, row)))
heatmap = "\n".join(heatmap_rows)
if disp==True:
print(heatmap)
return heatmap
#Examples with vector and matrix
ascii_heatmap(np.array([1,2,3,4,5]))
ascii_heatmap(np.arange(9).reshape((3,3)))
Use the turbo ramp (Python code) and true colour terminal output.
#!/usr/bin/env perl
my $step = 17;
for (my $r = 0; $r <= 255; $r += $step) {
for (my $g = 0; $g <= 255; $g += $step) {
for (my $b = 0; $b <= 255; $b += $step) {
print "\e[48;2;$r;$g;${b}m ";
# ↑ escape char ↑ coloured space char
}
}
}

Calculating Factorials using QBasic

I'm writing a program that calculates the Factorial of 5 numbers and output the results in a Tabular form but I keep getting Zeros.
Factorial Formula:. n! = n×(n-1)!
I tried:
CLS
DIM arr(5) AS INTEGER
FOR x = 1 TO 5
INPUT "Enter Factors: ", n
NEXT x
f = 1
FOR i = 1 TO arr(n)
f = f * i
NEXT i
PRINT
PRINT "The factorial of input numbers are:";
PRINT
FOR x = 1 TO n
PRINT f(x)
NEXT x
END
and I'm expecting:
Numbers Factorrials
5 120
3 6
6 720
8 40320
4 24
You did some mistakes
FOR i = 1 TO arr(n)
where is n defined
you also never stored actual values into arr
PRINT f(x)
here you take from array f that is also not defined in your code
Possible solution to calculate arrays of factorials:
CLS
DIM arr(5) AS INTEGER
DIM ans(5) AS LONG
FOR x = 1 TO 5
INPUT "Enter Factors: ", arr(x)
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial of input numbers are:";
PRINT
PRINT "Numbers", "Factorials"
FOR x = 1 TO 5
PRINT arr(x), ans(x)
NEXT x
END
I don't have a BASIC interpreter right in front of me, but I think this is what you're looking for:
CLS
DIM arr(5) AS INTEGER
DIM ans(5) AS LONG 'You need a separate array to store results in.
FOR x = 1 TO 5
INPUT "Enter Factors: ", arr(x)
NEXT x
FOR x = 1 to 5
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial of input numbers are:";
PRINT
PRINT "Numbers", "Factorials"
FOR x = 1 TO 5
PRINT STR$(arr(x)), ans(x)
NEXT x
END
Just a comment though: In programming, you should avoid reusing variables unless you are short on memory. It can be done right, but it creates many opportunities for hard to find bugs in larger programs.
Possible solution to calculate arrays of factorials and square roots:
CLS
PRINT "Number of values";: INPUT n
DIM arr(n) AS INTEGER
DIM ans(n) AS LONG
FOR x = 1 TO n
PRINT "Enter value"; x;: INPUT arr(x)
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial/square root of input numbers are:";
PRINT
PRINT "Number", "Factorial", "Squareroot"
FOR x = 1 TO n
PRINT arr(x), ans(x), SQR(arr(x))
NEXT x
END

Randomly selecting letters by frequency of use

After feeding few Shakespeare books to my Perl script I have a hash with 26 english letters as keys and the number of their occurences in texts - as value:
%freq = (
a => 24645246,
b => 1409459,
....
z => 807451,
);
and of course the total number of all letters - let's say in the $total variable.
Is there please a nice trick to generate a string holding 16 random letters (a letter can occur several times there) - weighted by their frequency of use?
To be used in a word game similar to Ruzzle:
Something elegant - like picking a random line from a file, as suggested by a Perl Cookbook receipt:
rand($.) < 1 && ($line = $_) while <>;
The Perl Cookbook trick for picking a random line (which can also be found in perlfaq5) can be adapted for weighted sampling too:
my $chosen;
my $sum = 0;
foreach my $item (keys %freq) {
$sum += $freq{$item};
$chosen = $item if rand($sum) < $freq{$item};
}
Here, $sum corresponds to the line counter $. and $freq{$item} to the constant 1 in the Cookbook version.
If you're going to be picking a lot of weighted random samples, you can speed this up a bit with some preparation (note that this destroys %freq, so make a copy first if you want to keep it):
# first, scale all frequencies so that the average frequency is 1:
my $avg = 0;
$avg += $_ for values %freq;
$avg /= keys %freq;
$_ /= $avg for values %freq;
# now, prepare the array we'll need for fast weighted sampling:
my #lookup;
while (keys %freq) {
my ($lo, $hi) = (sort {$freq{$a} <=> $freq{$b}} keys %freq)[0, -1];
push #lookup, [$lo, $hi, $freq{$lo} + #lookup];
$freq{$hi} -= (1 - $freq{$lo});
delete $freq{$lo};
}
Now, to draw a random weighted sample from the prepared distribution, you just do this:
my $r = rand #lookup;
my ($lo, $hi, $threshold) = #{$lookup[$r]};
my $chosen = ($r < $threshold ? $lo : $hi);
(This is basically the Square Histogram method described in Marsaglia, Tsang & Wang (2004), "Fast Generation of Discrete Random Variables", J. Stat. Soft. 11(3) and originally due to A.J. Walker (1974).)
I have no clue about Perl syntax so I'll just write pseudo-code. You can do something like that
sum <= 0
foreach (letter in {a, z})
sum <= sum + freq[letter]
pick r, a random integer in [0, sum[
letter <= 'a' - 1
do
letter <= letter + 1
r <= r - freq(letter)
while r > 0
letter is the resulting value
The idea behind this code is to make a stack of boxes for each letter. The size of each box is the frequency of the letter. Then we choose a random location on this stack and see which letter's box we landed.
Example :
freq(a) = 5
freq(b) = 3
freq(c) = 3
sum = 11
| a | b | c |
- - - - - - - - - - -
When we choose a 0 <= r < 11, we have the following probabilities
Pick a 'a' = 5 / 11
Pick a 'b' = 3 / 11
Pick a 'c' = 3 / 11
Which is exactly what we want.
You can first built a table of the running sum of the frequency. So if you have the following data:
%freq = (
a => 15,
b => 25,
c => 30,
d => 20
);
the running sum would be;
%running_sums = (
a => 0,
b => 15,
c => 40, # 15 + 25
d => 70, # 15 + 25 + 30
);
$max_sum = 90; # 15 + 25 + 30 + 20
To pick a single letter with the weighted frequency, you need to select a number between [0,90), then you can do a linear search on the running_sum table for the range that includes the letter. For example, if your random number is 20 then the appropriate range is 15-40, which is for the letter 'b'. Using linear search gives a total running time of O(m*n) where m is the number of letters we need and n is the size of the alphabet (therefore m=16, n=26). This is essentially what #default locale do.
Instead of linear search, you can also do a binary search on the running_sum table to get the closest number rounded down. This gives a total running time of O(m*log(n)).
For picking m letters though, there is a faster way than O(m*log(n)), perticularly if n < m. First you generate m random numbers in sorted order (which can be done without sorting in O(n)) then you do a linear matching for the ranges between the list of sorted random numbers and the list of running sums. This gives a total runtime of O(m+n). The code in its entirety running in Ideone.
use List::Util qw(shuffle);
my %freq = (...);
# list of letters in sorted order, i.e. "a", "b", "c", ..., "x", "y", "z"
# sorting is O(n*log(n)) but it can be avoided if you already have
# a list of letters you're interested in using
my #letters = sort keys %freq;
# compute the running_sums table in O(n)
my $sum = 0;
my %running_sum;
for(#letters) {
$running_sum{$_} = $sum;
$sum += $freq{$_};
}
# generate a string with letters in $freq frequency in O(m)
my $curmax = 1;
my $curletter = $#letters;
my $i = 16; # the number of letters we want to generate
my #result;
while ($i > 0) {
# $curmax generates a uniformly distributed decreasing random number in [0,1)
# see http://repository.cmu.edu/cgi/viewcontent.cgi?article=3483&context=compsci
$curmax = $curmax * (1-rand())**(1. / $i);
# scale the random number $curmax to [0,$sum)
my $num = int ($curmax * $sum);
# find the range that includes $num
while ($num < $running_sum{$letters[$curletter]}) {
$curletter--;
}
push(#result, $letters[$curletter]);
$i--;
}
# since $result is sorted, you may want to use shuffle it first
# Fisher-Yates shuffle is O(m)
print "", join('', shuffle(#result));

How many random strings does this code generate?

I am considering this random string generator in perl:
sub generate_random_string {
my $length = 12;
my #chars = qw/2 3 4 5 6 7 8 9 A B C D E F G H J K M N P Q R S T U V W X Y Z/;
my $str = '';
$str .= $chars[int rand #chars] for 1..$length;
return $str;
}
How many unique strings will this generate? If I extend the length of the string, how many more unique strings are available?
Also, how do I calculate the probability of generating the same string twice (assuming the length of the string stays at 12)?
The answer is: (1/31) ^ 12
Or more generically: (1/(number of characters)) ^ length

How can I create combinations of several lists without hardcoding loops?

I have data that looks like this:
my #homopol = (
["T","C","CC","G"], # part1
["T","TT","C","G","A"], #part2
["C","CCC","G"], #part3 ...upto part K=~50
);
my #prob = ([1.00,0.63,0.002,1.00,0.83],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02]);
# Note also that the dimension of #homopol is always exactly the same with #prob.
# Although number of elements can differ from 'part' to 'part'.
What I want to do is to
Generate all combinations of elements in part1 through out partK
Find the product of the corresponding elements in #prob.
Hence at the end we hope to get this output:
T-T-C 1 x 0.72 x 1 = 0.720
T-T-CCC 1 x 0.72 x 0.97 = 0.698
T-T-G 1 x 0.72 x 0.02 = 0.014
...
G-G-G 1 x 0.85 x 0.02 = 0.017
G-A-C 1 x 1 x 1 = 1.000
G-A-CCC 1 x 1 x 0.97 = 0.970
G-A-G 1 x 1 x 0.02 = 0.020
The problem is that the following code of mine does that by hardcoding
the loops. Since the number of parts of #homopol is can be varied and large
(e.g. ~K=50), we need a flexible and compact way to get the same result. Is there any?
I was thinking to use Algorithm::Loops, but not sure how to achieve that.
use strict;
use Data::Dumper;
use Carp;
my #homopol = (["T","C","CC","G"],
["T","TT","C","G","A"],
["C","CCC","G"]);
my #prob = ([1.00,0.63,0.002,1.00,0.83],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02]);
my $i_of_part1 = -1;
foreach my $base_part1 ( #{ $homopol[0] } ) {
$i_of_part1++;
my $probpart1 = $prob[0]->[$i_of_part1];
my $i_of_part2 =-1;
foreach my $base_part2 ( #{ $homopol[1] } ) {
$i_of_part2++;
my $probpart2 = $prob[1]->[$i_of_part2];
my $i_of_part3 = -1;
foreach my $base_part3 ( #{ $homopol[2] } ) {
$i_of_part3++;
my $probpart3 = $prob[2]->[$i_of_part3];
my $nstr = $base_part1."".$base_part2."".$base_part3;
my $prob_prod = sprintf("%.3f",$probpart1 * $probpart2 *$probpart3);
print "$base_part1-$base_part2-$base_part3 \t";
print "$probpart1 x $probpart2 x $probpart3 = $prob_prod\n";
}
}
}
I would recommend Set::CrossProduct, which will create an iterator to yield the cross product of all of your sets. Because it uses an iterator, it does not need to generate every combination in advance; rather, it yields each one on demand.
use strict;
use warnings;
use Set::CrossProduct;
my #homopol = (
[qw(T C CC G)],
[qw(T TT C G A)],
[qw(C CCC G)],
);
my #prob = (
[1.00,0.63,0.002,1.00],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02],
);
# Prepare by storing the data in a list of lists of pairs.
my #combined;
for my $i (0 .. $#homopol){
push #combined, [];
push #{$combined[-1]}, [$homopol[$i][$_], $prob[$i][$_]]
for 0 .. #{$homopol[$i]} - 1;
};
my $iterator = Set::CrossProduct->new([ #combined ]);
while( my $tuple = $iterator->get ){
my #h = map { $_->[0] } #$tuple;
my #p = map { $_->[1] } #$tuple;
my $product = 1;
$product *= $_ for #p;
print join('-', #h), ' ', join(' x ', #p), ' = ', $product, "\n";
}
A solution using Algorithm::Loops without changing the input data would look something like:
use Algorithm::Loops;
# Turns ([a, b, c], [d, e], ...) into ([0, 1, 2], [0, 1], ...)
my #lists_of_indices = map { [ 0 .. #$_ ] } #homopol;
NestedLoops( [ #lists_of_indices ], sub {
my #indices = #_;
my $prob_prod = 1; # Multiplicative identity
my #base_string;
my #prob_string;
for my $n (0 .. $#indices) {
push #base_string, $hompol[$n][ $indices[$n] ];
push #prob_string, sprintf("%.3f", $prob[$n][ $indices[$n] ]);
$prob_prod *= $prob[$n][ $indices[$n] ];
}
print join "-", #base_string; print "\t";
print join "x", #prob_string; print " = ";
printf "%.3f\n", $prob_prod;
});
But I think that you could actually make the code clearer by changing the structure to one more like
[
{ T => 1.00, C => 0.63, CC => 0.002, G => 0.83 },
{ T => 0.72, TT => 0.03, ... },
...
]
because without the parallel data structures you can simply iterate over the available base sequences, instead of iterating over indices and then looking up those indices in two different places.
Why don't you use recursion? Pass the depth as a parameter and let the function call itself with depth+1 inside the loop.
you could do it by creating an array of indicies the same length as the #homopol array (N say), to keep track of which combination you are looking at. In fact this array is just like a
number in base N, with the elements being the digits. Iterate in the same way as you would write down consectutive numbers in base N, e.g (0 0 0 ... 0), (0 0 0 ... 1), ...,(0 0 0 ... N-1), (0 0 0 ... 1 0), ....
Approach 1: Calculation from indices
Compute the product of lengths in homopol (length1 * length2 * ... * lengthN). Then, iterate i from zero to the product. Now, the indices you want are i % length1, (i / length1)%length2, (i / length1 / length2) % length3, ...
Approach 2: Recursion
I got beaten to it, see nikie's answer. :-)