Rewrite the seqneighjoin function in MATLAB

I have been given the task of rewriting the seqneighjoin function in MATLAB so that it takes the frequency of each sequence into account. After searching, I understand that this function returns a phylogenetic tree object built by the neighbor-joining method described at http://en.wikipedia.org/wiki/Neighbor_joining
Now, I have the following two questions.
(1): What is the data structure of the phytree object returned by this function, and how is it represented? For the similar linkage function, which also returns a hierarchical cluster tree, the data structure is very clear: it is a matrix with three columns, where the i-th row indicates which nodes are combined and the distance at which they are combined.
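For comparison, here is how I have been inspecting both structures (a sketch; the sequences are made up, and I assume the Statistics and Bioinformatics Toolboxes are available):
% linkage: each row of Z names the two nodes joined and their distance
X = rand(5,2);
Z = linkage(pdist(X)) % a (5-1)-by-3 matrix
% phytree: the object wraps an analogous pointers matrix plus branch distances
seqs = {'ACGT','ACGA','TCGA','TGGA'}; % made-up sequences
tree = seqneighjoin(seqpdist(seqs), 'equivar', seqs);
ptrs = get(tree, 'Pointers') % (numLeaves-1)-by-2, like the first two columns of Z
dists = get(tree, 'Distances') % branch lengths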
(2): Based on the wiki article, how am I supposed to add the frequencies to the function seqneighjoin? I am totally confused.
Thanks so much for your time and attention. I truly appreciate that.
EDIT: the following is the code.
function z = seqneighjoin(D_all, freq)
n = size(D_all, 2);
m = (1 + sqrt(8*n + 1))/2; % number of sequences, recovered from the pair count
z = zeros(m-1, 3);
str = zeros(m, m);
% Initialize the distance matrix d from the pairwise vector D_all.
% Filling both triangles directly scrambles the order (the upper triangle
% is traversed column-major), so fill the lower triangle and mirror it.
d = zeros(m, m);
d(tril(true(m), -1)) = D_all;
d = d + d.';
d(eye(m,m) == 1) = 1:m; % the diagonal entries of d hold the indices of the clusters
% Initialize the frequency-weighted matrix str (kept symmetric).
% NOTE: str is computed here but not yet used below.
for r = 1:m
    for c = 1:m
        str(r,c) = freq(r)*freq(c)*d(r,c);
        str(c,r) = str(r,c);
    end
end
% Loop m-1 times to fill the matrix z
for k = 1:m-1
    % Build the Q-matrix: q(i,j) = (a-2)*d(i,j) - sum_l d(i,l) - sum_l d(j,l).
    % The diagonal of d stores cluster indices, not distances, so subtract
    % it from the row and column sums.
    a = size(d, 2);
    colSum = sum(d, 1) - diag(d).';
    rowSum = sum(d, 2) - diag(d);
    colSumM = colSum(ones(a,1), :);
    rowSumM = rowSum(:, ones(1,a));
    q = (a-2)*d - colSumM - rowSumM;
    q(eye(a,a) == 1) = Inf; % never select a diagonal entry as the minimum
    % Find the minimum element of the matrix q
    u = min(q);
    v = min(u);
    [i, j] = find(q == v);
    r = i(1);
    c = j(1);
    % Join clusters d(r,r) and d(c,c) into the new node m+k
    % (v is the minimum Q value, not a branch length)
    z(k,:) = [d(r,r), d(c,c), v];
    % Distance between the new node m+k and every other remaining node
    d(r,:) = (d(r,:) + d(c,:) - d(r,c))/2;
    d(:,r) = d(r,:).'; % keep d symmetric
    d(r,r) = m + k;
    d(c,:) = [];
    d(:,c) = [];
end
Here, D_all is the vector representation of a distance matrix as returned by the seqpdist function in MATLAB, and freq is a vector giving the frequency of each sequence.
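For illustration, the function above can be called like this (the sequences and frequencies are made up):
seqs = {'ACGTACGT','ACGTACGA','TCGTACGA'}; % made-up sequences
freq = [10 3 5]; % how often each sequence was observed
D_all = seqpdist(seqs);
z = seqneighjoin(D_all, freq) % the rewritten function above, not the built-in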

Related

Vectorize function that finds an array of nearest values

I am still wrapping my head around vectorization and I'm having a difficult time trying to resolve the following function I made...
for i = 1:size(X, 1)
    min_n = inf;
    for j = 1:K
        val = X(i,:)' - centroids(j,:)';
        diff = val'*val;
        if (diff < min_n)
            idx(i) = j;
            min_n = diff;
        end
    end
end
X is an array of (x,y) coordinates...
2 5
5 6
...
...
centroids in this example is limited to 3 rows. It is also in (x,y) format as shown above.
For every (x,y) pair in X I am computing the closest centroid. I then store the index of that centroid in idx.
So idx(i) = j means that I am storing the index j of the centroid at index i, where i corresponds to the index of X. This means the closest centroid to pair X(i, :) is at idx(i).
Can I possibly simplify this via vectorization? I struggle with just vectorizing the inner loop.
Here are three options. But please note that the disadvantage of vectorization, as compared to your double loops, is that it stores all the difference operation results at once, which means that if your matrices have many rows, you might run out of memory. On the other hand, the vectorized approach is probably much faster.
Option 1
If you have access to the Statistics and Machine Learning Toolbox, you can use the function pdist2 to get all the pairwise distances between the rows of two matrices. Then the min function gives you the minimum of each column of the result. Its first output contains the minimal values, and its second the indices, which is what you need for idx:
diff = pdist2(centroids,X);
[~,idx] = min(diff);
Option 2
If you don't have access to the toolbox, you can use bsxfun. This lets you compute the difference between the two matrices even if their dimensions don't agree. All you need to do is use shiftdim to reshape X' to have size [1,size(X,2),size(X,1)]; then reshapedX and centroids have compatible dimensions (see the documentation of bsxfun), and you can take the difference between their values. The result is a three dimensional array, which you sum along the second dimension to get the squared norms of the differences between rows. At this point you can proceed as in Option 1.
reshapedX = shiftdim(X',-1);
diff = bsxfun(@minus,centroids,reshapedX);
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);
Note: starting with MATLAB R2016b, implicit expansion takes over the role of bsxfun, so you no longer need to call it. The line with bsxfun can be replaced with the simpler diff = centroids-reshapedX.
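Concretely, the R2016b-and-later version of Option 2 becomes:
reshapedX = shiftdim(X',-1);
diff = centroids - reshapedX; % implicit expansion, no bsxfun needed
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);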
Option 3
Use the function dsearchn, which does exactly what you need:
idx = dsearchn(centroids,X);
It could also be done using pdist2, which computes pairwise distances between rows of two matrices:
% random data
X = rand(500,2);
centroids = rand(3,2);
% pairwise distances
D = pdist2(X,centroids);
% closest centroid index for each X coordinates
[~,idx] = min(D,[],2)
% plot
scatter(centroids(:,1),centroids(:,2),300,(1:size(centroids,1))','filled');
hold on;
scatter(X(:,1),X(:,2),30,idx);
legend('Centroids','data');

Understanding PCA in MATLAB

What are the difference between the following two functions?
prepTransform.m
function [mu trmx] = prepTransform(tvec, comp_count)
% Computes transformation matrix to PCA space
% tvec - training set (one row represents one sample)
% comp_count - count of principal components in the final space
% mu - mean value of the training set
% trmx - transformation matrix to comp_count-dimensional PCA space
% this is memory-hungry version
% commented out is the version proper for Win32 environment
tic;
mu = mean(tvec);
cmx = cov(tvec);
%cmx = zeros(size(tvec,2));
%f1 = zeros(size(tvec,1), 1);
%f2 = zeros(size(tvec,1), 1);
%for i=1:size(tvec,2)
% f1(:,1) = tvec(:,i) - repmat(mu(i), size(tvec,1), 1);
% cmx(i, i) = f1' * f1;
% for j=i+1:size(tvec,2)
% f2(:,1) = tvec(:,j) - repmat(mu(j), size(tvec,1), 1);
% cmx(i, j) = f1' * f2;
% cmx(j, i) = cmx(i, j);
% end
%end
%cmx = cmx / (size(tvec,1)-1);
toc
[evec eval] = eig(cmx);
eval = sum(eval);
[eval evid] = sort(eval, 'descend');
evec = evec(:, evid(1:size(eval,2)));
% save 'nist_mu.mat' mu
% save 'nist_cov.mat' evec
trmx = evec(:, 1:comp_count);
pcaTransform.m
function [pcaSet] = pcaTransform(tvec, mu, trmx)
% tvec - matrix containing vectors to be transformed
% mu - mean value of the training set
% trmx - pca transformation matrix
% pcaSet - output set transformed to PCA space
pcaSet = tvec - repmat(mu, size(tvec,1), 1);
%pcaSet = zeros(size(tvec));
%for i=1:size(tvec,1)
% pcaSet(i,:) = tvec(i,:) - mu;
%end
pcaSet = pcaSet * trmx;
Which one is actually doing PCA?
If one is doing PCA, what is the other one doing?
The first function prepTransform is actually doing the PCA on your training data, where you determine the new axes that represent your data in a lower dimensional space. It finds the eigenvectors of the covariance matrix of your data and then orders the eigenvectors so that the eigenvector with the largest eigenvalue appears in the first column of the eigenvector matrix evec and the eigenvector with the smallest eigenvalue appears in the last column. What's important with this function is that you can define how many dimensions you want to reduce the data down to by keeping the first N columns of evec, which reduces your data to N dimensions. Discarding the other columns and keeping only the first N is what is stored as trmx in the code. The variable N is defined by the comp_count parameter of prepTransform.
The second function pcaTransform finally transforms data that is defined within the same domain as your training data, but not necessarily the training data itself (it could be, if you wish), onto the lower dimensional space defined by the eigenvectors of the covariance matrix. To perform the dimensionality reduction, as it is popularly known, you subtract each feature's mean from your data and multiply the result by the matrix trmx. Note that prepTransform outputting the mean of each feature in the vector mu is important for mean-subtracting your data when you finally call pcaTransform.
How to use these functions
To use these functions effectively, first determine the trmx matrix, which contains the principal components of your data, by defining how many dimensions you want to reduce your data down to, as well as the mean of each feature stored in mu:
N = 2; % Reduce down to two dimensions for example
[mu, trmx] = prepTransform(tvec, N);
Next, you can perform dimensionality reduction on any data defined within the same domain as tvec (it can even be tvec itself, but it doesn't have to be) by:
pcaSet = pcaTransform(tvec, mu, trmx);
In terms of vocabulary, pcaSet contains what are known as the principal scores of your data: the term for your data after transformation to the lower dimensional space.
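To make the scores concrete, you can map them back to the original space: since the columns of trmx are orthonormal eigenvectors, multiplying by trmx' (approximately) undoes the projection, up to the discarded dimensions. A quick sketch reusing the variables above:
pcaSet = pcaTransform(tvec, mu, trmx); % principal scores
tvecApprox = pcaSet * trmx' + repmat(mu, size(tvec,1), 1); % back-projection
reconErr = norm(tvec - tvecApprox, 'fro') % shrinks as N grows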
If I can recommend something...
Finding the PCA through the eigenvector approach is known to be numerically unstable. I highly recommend you use the Singular Value Decomposition via svd on the covariance matrix, where the V matrix of the result already gives you the eigenvectors, sorted in descending order of variance, that correspond to your principal components:
mu = mean(tvec, 1);
[~,~,V] = svd(cov(tvec));
Then perform the transformation by taking the mean subtracted data per feature and multiplying by the V matrix, once you subset and grab the first N columns of V:
N = 2;
X = bsxfun(@minus, tvec, mu);
pcaSet = X*V(:, 1:N);
X is the mean-subtracted data, which performs the same thing as pcaSet = tvec - repmat(mu, size(tvec,1), 1);, but without explicitly replicating the mean vector over each training example: bsxfun does that for you internally. From MATLAB R2016b onward, implicit expansion does this replication without the explicit call to bsxfun:
X = tvec - mu;
Further Reading
If you fully want to understand the code that was written and the theory behind what it's doing, I recommend the following two Stack Overflow posts that I have written that talk about the topic:
What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?
How to use eigenvectors obtained through PCA to reproject my data?
The first post puts the code you presented into context, performing PCA using the eigenvector approach. The second post touches on how you'd do it using the SVD towards the end of the answer. The answer I've written here is a mix of the two posts above.

select optimal column vector from a matrix subject to a localised goal vector constraint

How can I automatically select the column of a matrix whose scalar values, over a chosen subset of elements, are closest to those of a predefined goal vector over the same subset?
I solved the problem and tested the method on a 100-by-10 matrix; it works, and it should also work for larger matrices while hopefully not becoming too computationally expensive.
%% Selection of optimal source function
% Now need to select the best source function in the data matrix
% k = 1,2,...n within which scalar values of a random set of elements are
% closest to a pre-defined goal vector with the same random set
% Proposed Method:
% Project the columns of the data matrix onto the goal vector
% Calculate the projection error vector matrix; the null space of the
% local goal vector is orthogonal to its row space
% The column holding the minimum error vector is the optimal column
% [1] find the null space of the goal vector, containing the projection
% errors
mpg = pinv(gloc);
xstar = mpg*A;
p = gloc*xstar;
nA = A-p;
% [2] the minimum error vector will correspond to the optimal source
% function
normnA = zeros(1,n);
for i = 1:n
normnA(i) = norm(nA(:,i));
end
minnA = min(normnA);
[~,k] = find(normnA == minnA);
disp('The optimal source function is: ')
disp(k)
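For concreteness, here is a self-contained sketch of the same idea; A, gloc, n and the element subset below are random stand-ins for my actual data:
ncols = 10; % number of candidate source functions (n above)
Afull = rand(100, ncols); % data matrix: one source function per column
gfull = rand(100, 1); % predefined goal vector
subset = randperm(100, 20); % the localised subset of elements
A = Afull(subset, :); % restrict the data to the subset
gloc = gfull(subset); % local goal vector over the same subset
xstar = pinv(gloc)*A; % project each column onto gloc
nA = A - gloc*xstar; % projection error vectors
[~, k] = min(sqrt(sum(nA.^2, 1))) % column with the smallest error norm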

How to find N values of 3D matrix that satisfy condition

I have a 3D array denoted by features. Each element of features is a number x. I take that number and calculate g(x) and f(x), where g and f are functions of x. My problem is how to find the N features that maximize the absolute difference between g(x) and f(x). The function should return an array with those N elements, but I don't know how to get them. Could you help me?
This is my code:
%features is 3D array
%N is the number of elements that we need
%Fs,sigmas,thetas give the sizes of the array dimensions
% return the N elements of features that maximize abs(f_x-g_x)
function features_filter = gabor_sort(features,N,Fs,sigmas,thetas)
for k = 1:numel(sigmas)
    for j = 1:numel(Fs)
        for i = 1:numel(thetas)
            x = features(:,:,k,j,i);
            f_x = x.^2;
            g_x = x.^3+1;
            s1 = abs(f_x-g_x);
            %% Do something in here to get maximization of s1
        end
    end
end
end
This isn't a problem. Create two arrays: one to store the feature matrix for each combination of sigma, Fs and theta, and one to hold the corresponding absolute values. When you're done, sort those distances in descending order; the second output of sort then gives the locations of the features that maximize this distance. In other words, do this:
%features is 3D array
%N is the number of elements that we need
%Fs,sigmas,thetas give the sizes of the array dimensions
% return the N elements of features that maximize abs(f_x-g_x)
function features_filter = gabor_sort(features,N,Fs,sigmas,thetas)
s1 = [];        %// s1 array to store our distances
xFeatures = []; %// Features to return
for k = 1:numel(sigmas)
    for j = 1:numel(Fs)
        for i = 1:numel(thetas)
            x = features(:,:,k,j,i);
            xFeatures = cat(3,xFeatures,x); %// Stack features in a 3D matrix
            x = x(:); %// Convert to 1D as per your comments
            f_x = mean(x.^2); %// Per your comment
            g_x = mean(x.^3+1); %// Per your comment
            s1 = [s1 abs(f_x-g_x)]; %// Add to s1 array
        end
    end
end
[~,sortInd] = sort(s1, 'descend');
%// Return a 3D matrix where each slice is a feature matrix
%// The first slice is the one that maximized abs(f_x - g_x) the most
%// The second slice is the one that maximized abs(f_x - g_x) the second most, etc.
features_filter = xFeatures(:,:,sortInd(1:N));
Minor note: this code is untested, since I don't have access to your data and can't really reproduce your setup. Hope this works!
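A quick way to smoke-test it with random data (these sizes are arbitrary):
features = rand(8, 8, 3, 4, 6); % rows x cols x sigmas x Fs x thetas
top5 = gabor_sort(features, 5, 1:4, 1:3, 1:6);
size(top5) % 8 8 5: the five slices with the largest abs(f_x - g_x)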

Matlab vectorizing equations and matrix multiplication

I have a program that currently uses a for loop to iterate through a set of functions. I've tried using parfor but that only works on the university's version of Matlab. I'd like to vectorize the handling of this so that a for loop isn't necessary. The equations I'm using basically call different types of Bessel functions and are contained in separate functions.
Here's what I'm trying to do:
For each value of m, build a vector of matrix elements for each required matrix. Then build each full matrix. I think this is working correctly.
Where it's throwing an error is on the final matrix multiplication... even if I just multiply the left 2x2 by the middle 2x2 I get the dreaded error:
??? Error using ==> mtimes
Inner matrix dimensions must agree.
Error in ==> @(m)CL(m)*CM(m)*CR(m)
% Vector for summation. 1 row, 301 columns with data from 0->300
m_max=301;
m=[0:m_max-1];
% Build the 301 elements for each entry of the left 2x2 matrix.
CL_11=@(m) H1(m,alpha1);
CL_12=@(m) H2(m,alpha1);
CL_21=@(m) n1*dH1(m,alpha1);
CL_22=@(m) n1*dH2(m,alpha1);
% Build the 301 elements for each entry of the middle 2x2 matrix.
CM_11=@(m) n1*dH2(m,alpha2);
CM_12=@(m) -1*H2(m,alpha2);
CM_21=@(m) -1*n1*dH1(m,alpha2);
CM_22=@(m) H1(m,alpha2);
% Build the 301 elements for each entry of the right 2x1 matrix.
CR_11=@(m) J(m,alpha3);
CR_21=@(m) n2*dJ(m,alpha3);
% Build the left (CL), middle (CM) and right (CR) matrices.
CL=@(m) [CL_11(m) CL_12(m);CL_21(m) CL_22(m)];
CM=@(m) [CM_11(m) CM_12(m);CM_21(m) CM_22(m)];
CR=@(m) [CR_11(m);CR_21(m)];
% Build the vector containing the products of each triplet of
% matrices.
C=@(m) CL(m)*CM(m)*CR(m);
cl=CL(m)
cm=CM(m)
cr=CR(m)
c=CL(m)*CM(m)*CR(m)
If you have any suggestions or recommendations, I'd greatly appreciate it! I'm still a newbie with Matlab and am trying to develop a higher level of ability with use of matrices and vectors.
Thanks!!
Your matrices are not 2x2. When you call CL_11(m) with m a 1x301 vector, CL_11(m) is 1x301 as well, so CL(m) ends up 2x602 rather than 2x2. To get around this, you have to calculate the matrices one by one. There are two approaches here.
c=arrayfun(C,m,'UniformOutput',false)
will return a cell array, and so c{1} corresponds to m(1), c{2} to m(2), etc.
On the other hand, you can do
c = zeros(2, 1, m_max); % preallocate the result
for i=1:m_max
    c(:,:,i)=C(m(i));
end
and then c(:,:,i) corresponds to m(i).
I'm not sure which version will be faster, but you can test it easily enough with your code.
If you go through the Symbolic Math Toolbox you can construct a function that is easier to handle.
%% symbolic
CL = sym('CL',[2,2])
CM = sym('CM',[2,2])
CR = sym('CR',[2,1])
r = CL*CM*CR
f = matlabFunction(r)
%% use some simple functions so it can be calculated as example
CL_11=@(m) m+1;
CL_12=@(m) m;
CL_21=@(m) m-1;
CL_22=@(m) m+2;
CM_11=@(m) m;
CM_12=@(m) m;
CM_21=@(m) 2*m;
CM_22=@(m) 2*m;
CR_11=@(m) m;
CR_21=@(m) 1-m;
%% here the substitution happens:
fh = @(m) f(CL_11(m),CL_12(m),CL_21(m),CL_22(m),CM_11(m),CM_12(m),CM_21(m),CM_22(m),CR_11(m),CR_21(m))
Out of interest I did a small speed test:
N=1e5;
v = 1:N;
tic
% .... insert symbolic stuff from above
r1 = fh(v);
t1=toc % gives 0.0842s for me
vs
CL=@(m) [CL_11(m) CL_12(m);CL_21(m) CL_22(m)];
CM=@(m) [CM_11(m) CM_12(m);CM_21(m) CM_22(m)];
CR=@(m) [CR_11(m);CR_21(m)];
C=@(m) CL(m)*CM(m)*CR(m);
tic
r2 =arrayfun(C,v,'UniformOutput',false);
t2=toc % gives 7.6874s for me
and
tic
r3 = nan(2,N);
for i=1:N
    r3(:,i)=C(v(i));
end
t3=toc % 8.1503s for me