calculate co-occurrences - perl

I have a file as shown in the screenshot attached. There are 61 events(peaks) and I want to find how often each peak occurs with the other (co-occurrence) for all possible combinations. The file has the frequency (number of times the peak appears in the 47 samples) and probability (no.of times the peak occurs divided by total number of samples).
Then I want to find mutually exclusive peaks using the formula p(x,y) / p(x)*p(y), where p(x,y) is the probability that x and y co-occur, p(x) is probability of peak (x) and p(y) is the probability of peak y.
What is the best way to solve such a problem? Do I need to write a Perl script or are there some R functions I could use? I am a biologist trying to learn Perl and R so I would appreciate some example code to solve this problem.

In the following, I've assumed that what you alternately call p(xy) and p(x,y) should actually be the probability (rather than the number of times) that x and y co-occur. If that's not correct, just remove the division by nrow(X) from the 2nd line below.
# As an example, create a sub-matrix of your data
X <- cbind(c(0,0,0,0,0,0), c(1,0,0,1,1,1), c(1,1,0,0,0,0))
num <- (t(X) %*% X)/nrow(X) # The numerator of your expression
means <- colMeans(X) # A vector of means of each column
denom <- outer(colMeans(X), colMeans(X)) # The denominator
out <- num/denom
# [,1] [,2] [,3]
# [1,] NaN NaN NaN
# [2,] NaN 1.50 0.75
# [3,] NaN 0.75 3.00
Note: The NaNs in the results are R's way of indicating that those cells are "Not a number" (since they are each the result of dividing 0 by 0).

your question is not completely clear without a proper example but I am thinking this result is along the lines of what you want i.e. "I want to find how often each peak occurs with the other (co-occurrence) "
library(igraph)
library(tnet)
library(bipartite)
#if you load your data in as a matrix e.g.
mat<-matrix(c(1,1,0,2,2,2,3,3,3,4,4,0),nrow=4,byrow=TRUE) # e.g.
# [,1] [,2] [,3] # your top line as columns e.g.81_05 131_00 and peaks as rows
#[1,] 1 1 0
#[2,] 2 2 2
#[3,] 3 3 3
#[4,] 4 4 0
then
pairs<-web2edges(mat,return=TRUE)
pairs<- as.tnet(pairs,type="weighted two-mode tnet")
peaktopeak<-projecting_tm(pairs, method="sum")
peaktopeak
#peaktopeak
# i j w
#1 1 2 2 # top row here says peak1 and peak2 occurred together twice
#2 1 3 2
#3 1 4 2
#4 2 1 4
#5 2 3 6
#6 2 4 4
#7 3 1 6
#8 3 2 9
#9 3 4 6
#10 4 1 8
#11 4 2 8
#12 4 3 8 # peak4 occured with peak3 8 times
EDIT: If mutually exclusive peaks that do not occur are just those that do not share 1s in the same columns as your original data then you can just see this in peaktopeak. For instance if peak 1 and 3 never occur they wont be found in peaktopeak in the same row.
To look at this easier you could:
peakmat <- tnet_igraph(peaktopeak,type="weighted one-mode tnet")
peakmat<-get.adjacency(peakmat,attr="weight")
e.g.:
# [,1] [,2] [,3] [,4]
#[1,] 0 2 2 2
#[2,] 4 0 6 4
#[3,] 6 9 0 6
#[4,] 8 8 8 0 # zeros would represent peaks that never co occur.
#In this case everything shares at least 2 co-occurrences
#diagonals are 0 as saying peak1 occurs with itself is obviously silly.

Related

How to increment this vector: v = [1ˆ2, 3ˆ2, 5ˆ2, ..., (2n+1)ˆ2] with given pattern (in description)

I am lost as to how to increment this vector. I know that the value from each number squared increase by odd numbers starting from 3. From 1^2 to 2^2 we have a space of three, from 2^2 to 3^2 we have a space of 5 in between and then 7 in between for 3^2 to 4^2 and 9 in between 4^2 and 5^2 and so on and so forth. But I just can't think of how I would write those increments for a general case as I have to do in this given problem.
You cannot define d in the a:d:b vector with one formula because it changes constantly. Therefore, you need to define your vector as [1 3 5 7 ... 2n+1] and square it.
(1:2:2*n+1).^2
ans =
1 9 25 49 81 121

Choose multiple combination (matlab)

I have 3 sets of array each contains 12 elements of same type
a=[1 1 1 1 1 1 1 1 1 1 1];
b=[2 2 2 2 2 2 2 2 2 2 2];
c=[3 3 3 3 3 3 3 3 3 3 3];
I have to find how many ways it can be picked up if I need to pickup 12 items at a time
here, 1 1 2 is same as 2 1 1
I found this link Generating all combinations with repetition using MATLAB.
Ho this can be done in matlab within reasonable time.
is that way correct
abc=[a b c];
allcombs=nmultichoosek(abc,12);
combs=unique(allcombs,'rows');
If you only need to find the number of ways to select the items, then using generating functions is a way to very efficiently compute that, even for fairly large values of N and k.
If you are not familiar with generating functions, you can read up on the math background here:
http://mathworld.wolfram.com/GeneratingFunction.html
and here:
http://math.arizona.edu/~faris/combinatoricsweb/generate.pdf
The solution hinges on the fact that the number of ways to choose k items from 36, with each of 3 items repeated 12 times, can be determined from the product of the generating function:
g(x) = 1 + x + x^2 + x^3 + ... + x^12
with itself 3 times. The 12 comes from the fact the elements are repeated 12 times (NOT from the fact you are choosing 12), and multiplying by itself 3 times is because there are three different sets of elements. The number of ways to choose 12 elements is then just the coefficient of the power of x^12 in this product of polynomials (try it for smaller examples if you want to prove to yourself that it works).
The great thing about that is that MATLAB has a simple function conv for multiplying polynomials:
>> g = ones(1,13); %% array of 13 ones, for a 12th degree polynomial with all `1` coefficents
>> prod = conv(g, conv(g, g)); %% multiply g by itself 3 times, as a polynomial
>> prod(13)
ans =
91
So there are 91 ways to select 12 elements from your list of 36. If you want to select 11 elements, that's z(12) = 78. If you want to select 13 elements, that's z(14) = 102.
Finally, if you had different numbers of elements in the sets, say 10 1's, 12 2's, and 14 3's, then you would have 3 distinct polynomials of the same form, 1 + x + x^2 + ..., with degrees 10, 12 and 14 respectively. Inspecting the coefficient of the degree k term again gives you the number of ways to choose k elements from this set.

MATLAB: sample from population randomly many times?

I am aware of MATLAB's datasample which allows to select k times from a certain population. Suppose population=[1,2,3,4] and I want to uniformly sample, with replacement, k=5 times from it. Then:
datasample(population,k)
ans =
1 3 2 4 1
Now, I want to repeat the above experiment N=10000 times without using a for loop. I tried doing:
datasample(repmat(population,N,1),5,2)
But the output I get is (just a short excerpt below):
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
Every row (result of an experiment) is the same! But obviously they should be different... It's as though some random seed is not updating between rows. How can I fix this? Or some other method I could use that avoids a for loop? Thanks!
You seem to be confusing the way datasample works. If you read the documentation on the function, if you specify a matrix, it will generate a data sampling from a selection of rows in the matrix. Therefore, if you simply repeat the population vector 10000 times, and when you specify the second parameter of the function - which in this case is how many rows of the matrix to extract, even though the actual row locations themselves are different, the actual rows over all of the matrix is going to be the same which is why you are getting that "error".
As such, I wouldn't use datasample here if it is your intention to avoid looping. You can use datasample, but you'd have to loop over each call and you explicitly said that this is not what you want.
What I would recommend you do is first create your population vector to have whatever you desire in it, then generate a random index matrix where each value is between 1 up to as many elements as there are in population. This matrix is in such a way where the number of columns is the number of samples and the number of rows is the number of trials. Once you create this matrix, simply use this to index into your vector to achieve the desired sampling matrix. To generate this random index matrix, randi is a fine choice.
Something like this comes to mind:
N = 10000; %// Number of trials
M = 5; %// Number of samples per trial
population = 1:4; %// Population vector
%// Generate random indices
ind = randi(numel(population), N, M);
%// Get the stuff
out = population(ind);
Here's the first 10 rows of the output:
>> out(1:10,:)
ans =
4 3 1 4 2
4 4 1 3 4
3 2 2 2 3
1 4 2 2 2
1 2 3 4 2
2 2 3 2 1
4 1 3 2 4
1 4 1 3 1
1 1 2 4 4
1 2 4 2 1
I think the above does what you want. Also keep in mind that the above code generalizes to any population vector you want. You simply have to change the vector and it will work as advertised.
datasample interprets each column of your data as one element of your population, sampling among all columns.
To fix this you could call datasample N times in a loop, instead I would use randi
population(randi(numel(population),N,5))
assuming your population is always 1:p, you could simplify to:
randi(p,N,5)
Ok so both of the current answers both say don't use datasample and use randi instead. However, I have a solution for you with datasample and arrayfun.
>> population = [1 2 3 4];
>> k = 5; % Number of samples
>> n = 1000; % Number of times to execute datasample(population, k)
>> s = arrayfun(#(k) datasample(population, k), n*ones(k, 1), 'UniformOutput', false);
>> s = cell2mat(s);
s =
1 4 1 4 4
4 1 2 2 4
2 4 1 2 1
1 4 3 3 1
4 3 2 3 2
We need to make sure to use 'UniformOutput', false with arrayfun as there is more than one output. The cell2mat call is needed as the result of arrayfun is a cell array.

How to count patterns columnwise in Matlab?

I have a matrix S in Matlab that looks like the following:
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1
I would like to count patterns of values column-wise. I am interested into the frequency of the numbers that follow right after number 3 in any of the columns. For instance, number 3 occurs three times in the first column. The first time we observe it, it is followed by 3, the second time it is followed by 3 again and the third time it is followed by 4. Thus, the frequency for the patters observed in the first column would look like:
3-3: 66.66%
3-4: 33.33%
3-1: 0%
3-2: 0%
To generate the output, you could use the convenient tabulate
S = [
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1];
idx = find(S(1:end-1,:)==3);
S2 = S(2:end,:);
tabulate(S2(idx))
Value Count Percent
1 0 0.00%
2 0 0.00%
3 4 66.67%
4 2 33.33%
Here's one approach, finding the 3's then looking at the following digits
[i,j]=find(S==3);
k=i+1<=size(S,1);
T=S(sub2ind(size(S),i(k)+1,j(k))) %// the elements of S that are just below a 3
R=arrayfun(#(x) sum(T==x)./sum(k),1:max(S(:))).' %// get the number of probability of each digit
I'm going to restate your problem statement in a way that I can understand and my solution will reflect this new problem statement.
For a particular column, locate the locations that contain the number 3.
Look at the row immediately below these locations and look at the values at these locations
Take these values and tally up the total number of occurrences found.
Repeat these for all of the columns and update the tally, then determine the percentage of occurrences for the values.
We can do this by the following:
A = [2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]; %// Define your matrix
[row,col] = find(A(1:end-1,:) == 3);
vals = A(sub2ind(size(A), row+1, col));
h = 100*accumarray(vals, 1) / numel(vals)
h =
0
0
66.6667
33.3333
Let's go through the above code slowly. The first few lines define your example matrix A. Next, we take a look at all of the rows except for the last row of your matrix and see where the number 3 is located with find. We skip the last row because we want to be sure we are within the bounds of your matrix. If there is a number 3 located at the last row, we would have undefined behaviour if we tried to check the values below the last because there's nothing there!
Once we do this, we take a look at those values in the matrix that are 1 row beneath those that have the number 3. We use sub2ind to help us facilitate this. Next, we use these values and tally them up using accumarray then normalize them by the total sum of the tallying into percentages.
The result would be a 4 element array that displays the percentages encountered per number.
To double check, if we look at the matrix, we see that the value of 3 follows other values of 3 for a total of 4 times - first column, row 3, row 4, second column, row 2 and third column, row 6. The value of 4 follows the value of 3 two times: first column, row 6, second column, row 3.
In total, we have 6 numbers we counted, and so dividing by 6 gives us 4/6 or 66.67% for number 3 and 2/6 or 33.33% for number 4.
If I got the problem statement correctly, you could efficiently implement this with MATLAB's logical indexing and an approach that is essentially of two lines -
%// Input 2D matrix
S = [
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]
Labels = [1:4]'; %//'# Label array
counts = histc(S([false(1,size(S,2)) ; S(1:end-1,:) == 3]),Labels)
Percentages = 100*counts./sum(counts)
Verify/Present results
The styles for presenting the output results listed next use MATLAB's table for a well human-readable format of data.
Style #1
>> table(Labels,Percentages)
ans =
Labels Percentages
______ ___________
1 0
2 0
3 66.667
4 33.333
Style #2
You can do some fancy string operations to present the results in a more "representative" manner -
>> Labels_3 = strcat('3-',cellstr(num2str(Labels','%1d')'));
>> table(Labels_3,Percentages)
ans =
Labels_3 Percentages
________ ___________
'3-1' 0
'3-2' 0
'3-3' 66.667
'3-4' 33.333
Style #3
If you want to present them in descending sorted manner based on the percentages as listed in the expected output section of the question, you can do so with an additional step using sort -
>> [Percentages,idx] = sort(Percentages,'descend');
>> Labels_3 = strcat('3-',cellstr(num2str(Labels(idx)','%1d')'));
>> table(Labels_3,Percentages)
ans =
Labels_3 Percentages
________ ___________
'3-3' 66.667
'3-4' 33.333
'3-1' 0
'3-2' 0
Bonus Stuff: Finding frequency (counts) for all cases
Now, let's suppose you would like repeat this process for say 1, 2 and 4 as well, i.e. find occurrences after 1, 2 and 4 respectively. In that case, you can iterate the above steps for all cases and for the same you can use arrayfun -
%// Get counts
C = cell2mat(arrayfun(#(n) histc(S([false(1,size(S,2)) ; S(1:end-1,:) == n]),...
1:4),1:4,'Uni',0))
%// Get percentages
Percentages = 100*bsxfun(#rdivide, C, sum(C,1))
Giving us -
Percentages =
90.9091 20.0000 0 100.0000
9.0909 20.0000 0 0
0 60.0000 66.6667 0
0 0 33.3333 0
Thus, in Percentages, the first column are the counts of [1,2,3,4] that occur right after there is a 1 somewhere in the input matrix. As as an example, one can see column -3 of Percentages is what you had in the sample output when looking for elements right after 3 in the input matrix.
If you want to compute frequencies independently for each column:
S = [2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]; %// data: matrix
N = 3; %// data: number
r = max(S(:));
[R, C] = size(S);
[ii, jj] = find(S(1:end-1,:)==N); %// step 1
count = full(sparse(S(ii+1+(jj-1)*R), jj, 1, r, C)); %// step 2
result = bsxfun(#rdivide, count, sum(S(1:end-1,:)==N)); %// step 3
This works as follows:
find is first applied to determine row and col indices of occurrences of N in S except its last row.
The values in the entries right below the indices of step 1 are accumulated for each column, in variable count. The very convenient sparse function is used for this purpose. Note that this uses linear indexing into S.
To obtain the frequencies for each column, count is divided (with bsxfun) by the number of occurrences of N in each column.
The result in this example is
result =
0 0 0 NaN
0 0 0 NaN
0.6667 0.5000 1.0000 NaN
0.3333 0.5000 0 NaN
Note that the last column correctly contains NaNs because the frequency of the sought patterns is undefined for that column.

Find the increasing and decreasing trend in a curve MATLAB

a=[2 3 6 7 2 1 0.01 6 8 10 12 15 18 9 6 5 4 2].
Here is an array i need to extract the exact values where the increasing and decreasing trend starts.
the output for the array a will be [2(first element) 2 6 9]
a=[2 3 6 7 2 1 0.01 6 8 10 12 15 18 9 6 5 4 2].
^ ^ ^ ^
| | | |
Kindly help me to get the result in MATLAB for any similar type of array..
You just have to find where the sign of the difference between consecutive numbers changes.
With some common sense and the functions diff, sign and find, you get this solution:
a = [2 3 6 7 2 1 0.01 6 8 10 12 15 18 9 6 5 4 2];
sda = sign(diff(a));
idx = [1 find(sda(1:end-1)~=sda(2:end))+2 ];
result = a(idx);
EDIT:
The sign function messes things up when there are two consecutive numbers which are the same, because sign(0) = 0, which is falsely identified as a trend change. You'd have to filter these out. You can do this by first removing the consecutive duplicates from the original data. Since you only want the values where the trend change starts, and not the position where it actually starts, this is easiest:
a(diff(a)==0) = [];
This is a great place to use the diff function.
Your first step will be to do the following:
B = [0 diff(a)]
The reason we add the 0 there is to keep the matrix the same length because of the way the diff function works. It will start with the first element in the matrix and then report the difference between that and the next element. There's no leading element before the first one so is just truncates the matrix by one element. We add a zero because there is no change there as it's the starting element.
If you look at the results in B now it is quite obvious where the inflection points are (where you go from positive to negative numbers).
To pull this out programatically there are a number of things you can do. I tend to use a little multiplication and the find command.
Result = find(B(1:end-1).*B(2:end)<0)
This will return the index where you are on the cusp of the inflection. In this case it will be:
ans =
4 7 13