Task: I am working in Matlab and I have to construct a dendrogram from maximum values of a matrix of Euclidean distance.
What have I done so far: I have constructed the distance matrix based on the correlation coefficients of returns of prices (this is what I have in my application). I have also built the MST based on these distances. Now I have to construct the ultrametric matrix which is obtained by defining the subdominant ultrametric distance D*ij between i and j as the maximum value of any Euclidean distance Dkl detected by moving in single steps from i to j in the MST.
CorrelMatrix=corrcoef(Returns);
DistMatrix=sqrt(2.*(1-CorrelMatrix));
DG=sparse(DistMatrix);
[ST,pred] = graphminspantree(DG,'Method','Prim');
Z = linkage(DistMatrix);
dendrogram(Z)
I am new to Matlab and do not know whether there is a function I should use to find the maximum distance between two nodes and then store it in a matrix.
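The subdominant ultrametric described above is the minimax-path distance, so one possible sketch is a Floyd-Warshall-style sweep over the full distance matrix. This gives the same values as walking the MST, because the MST path between two nodes minimises the largest step. The toy matrix D and the names Dstar and viaK are illustrative assumptions, not part of the original code; in the application D would be DistMatrix.

```matlab
% Toy symmetric distance matrix with zero diagonal (illustrative values)
D = [0 1 4;
     1 0 2;
     4 2 0];

% Minimax-path (subdominant ultrametric) distances, Floyd-Warshall style
Dstar = D;
n = size(D, 1);
for k = 1:n
    % Largest step on the route i -> k -> j, for every pair (i,j)
    viaK = max(repmat(Dstar(:,k), 1, n), repmat(Dstar(k,:), n, 1));
    % Keep the route whose largest step is smallest
    Dstar = min(Dstar, viaK);
end
disp(Dstar)
```

If the Statistics Toolbox is available, single linkage on the condensed distances (squareform of the distance matrix) followed by cophenet should produce the same values, since the single-linkage merge heights are these minimax distances.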
Euclidean is the distance transform of a. I am using Euclidean = bwdist(a,'euclidean');
Based on this, may I know how the calculation works? From which point to which point does MATLAB measure to get the Euclidean distance?
The formula sqrt[(x2-x1)^2 + (y2-y1)^2] needs two points. How does MATLAB do the calculation for each pixel? Thank you.
I think the following link explains about the function quite directly.
https://uk.mathworks.com/help/images/ref/bwdist.html
D = bwdist(BW) computes the Euclidean distance transform of the binary image BW. For each pixel in BW, the distance transform assigns a number that is the distance between that pixel and the nearest nonzero pixel of BW.
For your first point a(1,1), the nearest nonzero pixel is a(2,2), so the distance is sqrt(2).
For a(1,2), the nearest nonzero pixel is also a(2,2), so the distance is sqrt(1) = 1.
For a(2,2), the nearest nonzero pixel is itself, so the distance is sqrt(0) = 0.
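To make the three cases above concrete, here is a minimal sketch that computes the same distances by brute force in base MATLAB. The 3x3 matrix a is a hypothetical stand-in with a single nonzero pixel at (2,2), since the original image is not shown; bwdist itself is not called here.

```matlab
a = [0 0 0;
     0 1 0;
     0 0 0];                       % hypothetical image, nonzero pixel at (2,2)

[r, c] = find(a);                  % row/column coordinates of nonzero pixels
[R, C] = ndgrid(1:size(a,1), 1:size(a,2));   % coordinates of every pixel

D = inf(size(a));
for p = 1:numel(r)
    % Distance from every pixel to the p-th nonzero pixel; keep the smallest
    D = min(D, sqrt((R - r(p)).^2 + (C - c(p)).^2));
end
disp(D)   % D(1,1) is sqrt(2), D(1,2) is 1, D(2,2) is 0
```

For this input the result should agree with bwdist(a), up to bwdist returning single precision.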
Good luck.
Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10th, 20th, ..., 90th percentiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the midpoints of the bins
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after, but maybe it helps. For a continuous probability distribution, P(X = x) = 0 for every point x: the probability of X taking any single value is infinitesimally small and is therefore regarded as 0.
What you could do instead, in order to approximate it to a discrete probability space, is to define some points (x_1, x_2, ..., x_n), and let their discrete probabilities be the integral of some range of the PDF (from your continuous probability distribution), that is
P(x_1) = P(X \in (-infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +infty))
:-)
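A minimal sketch of that idea for a standard normal distribution, using only base MATLAB. The bin boundaries in cuts are hypothetical, and the helper name normcdf_base is an assumption introduced here (it builds the normal CDF from erfc rather than relying on a toolbox function).

```matlab
% Standard normal CDF written with base MATLAB's erfc
normcdf_base = @(x) 0.5 * erfc(-x ./ sqrt(2));

cuts = [-1.5 -0.5 0.5 1.5];            % hypothetical interior bin boundaries
F = [0, normcdf_base(cuts), 1];        % CDF at -inf, at each cut, and at +inf

% Probability of each bin = integral of the PDF over that bin
pmass = diff(F);
disp(pmass)
disp(sum(pmass))                       % the masses add up to 1
```

Each entry of pmass is then attached to one representative point of its bin; how to pick that point (midpoint, conditional mean, median) is a separate design choice.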
I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix whose rows are the dimensions and whose columns are the points. I compute the inter-point distances and then record the point with the minimum distance together with that distance. Below is the code snippet. The code works fine for a small dimension and a small set of points, but it takes a long time for a large data set (N = 1 million data points, and the dimension is also high). Is there a more efficient way?
I suggest that you use pdist to do the heavy lifting for you. This function computes the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform in order to find the minimal value for each point:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of the nearest-neighbour distances over all points
This computes every pair twice, but I think this is true for your original implementation.
The idea is that pdist computes the pairwise distance between the rows of its input. Since Z stores one point per column, we pass the transpose of Z into pdist. Because the full output is always a symmetric matrix with zero diagonal, pdist only returns the values above the diagonal, in a vector. So a call to squareform is needed to get the proper distance matrix. Then the row-wise minimum of this matrix has to be found, but first we have to exclude the zeros on the diagonal. I was lazy, so I put inf on the diagonal to make sure that the minimum is elsewhere. In the end we just have to sum up the minimal distances.
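For N around one million the full N-by-N matrix from squareform will not fit in memory, so here is a hedged sketch that keeps only one row of distances at a time. It is still quadratic in time; it only removes the memory bottleneck. The sizes and the names nearest and d2 are illustrative assumptions.

```matlab
N = 200;
Z = rand(3, N);                 % each column is a point, as in the question

mindists = inf(N, 1);           % nearest-neighbour distance of each point
nearest  = zeros(N, 1);         % index of that nearest neighbour
for i = 1:N
    % Squared distances from point i to all points (1-by-N)
    d2 = sum((Z - repmat(Z(:,i), 1, N)).^2, 1);
    d2(i) = inf;                % exclude the zero self-distance
    [m, nearest(i)] = min(d2);
    mindists(i) = sqrt(m);
end
sum_dist = sum(mindists);
```

With the Statistics Toolbox, knnsearch(Z.', Z.', 'K', 2) is another option worth trying: the second column of its output is the nearest other point, provided there are no duplicate points.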
I have two correlated Nx3 datasets (one is xyz points, the other is the normal vector for those points). I have a point in my first dataset and now I want to find the matching row in the second dataset. What's the best way to do this? I was thinking print out the row number but not sure exactly what the code is to do that?
Given that you have a point in your one dataset that is size 1 x 3, there are two possible ways that you can do this.
Method #1 - Using knnsearch
The easiest way would be to use knnsearch from the Statistics Toolbox.
knnsearch stands for K-Nearest Neighbour search. Given an input query point, knnsearch finds the k points in your dataset that are closest to it. In your case, k=1. The default distance metric is the Euclidean distance, but seeing how your points are in 3D Cartesian space, I don't see this being a problem.
Therefore, assuming your xyz points are stored in X and the query point (normal vector) is in y, just do this:
IDX = knnsearch(X, y);
The above defaults to k=1. If you'd like more than 1 point returned, you'd do this:
IDX = knnsearch(X, y, 'K', n);
n is the number of points you want returned or the n closest points given the query y. IDX contains the index of which point in X is closest to y. I would also like to point out that X is arranged such that each row is a point and each column is a variable.
Therefore, the closest point using IDX would be:
closest_point = X(IDX,:);
Method #2 - Using bsxfun
If you don't have the Statistics Toolbox, you can very easily achieve the same thing using bsxfun. Bear in mind that the code I will write is only for returning the closest point, or k=1:
dists = sqrt(sum(bsxfun(@minus, X, y).^2, 2));
[~,IDX] = min(dists);
The bsxfun call first determines the component-wise difference between y and every point in X. We then square each component, add the components together and take the square root. This gives the Euclidean distance between y and each of the points in X, that is, N distances where N is the total number of points in the dataset. Finally we find the minimum distance with min; its second output is the index of the point in X that is closest to y.
If you'd like to extend this to more than one point, you'd sort the distances in ascending order, then retrieve those number of points with the smallest distances. Remember, smaller Euclidean distances mean that the points are similar, which is why we sort in ascending order. Something like this:
dists = sqrt(sum(bsxfun(@minus, X, y).^2, 2));
[~,ind] = sort(dists);
IDX = ind(1:n);
Just a small step upwards from what we had before. Instead of using min, you'd use sort and get the second output of sort to determine the locations of the minimum distances. We'd then index into ind to get the n closest indices and finally index into X to get our actual points.
You would again do the same thing to retrieve the actual points that are closest:
closest_point = X(IDX,:);
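Putting this together for the original question, here is a small sketch with made-up data. X, Nrm and p are hypothetical; the only assumption carried over is that row k of the normals belongs to row k of the points. When the query point is an exact copy of a row, ismember with the 'rows' option returns the row number directly.

```matlab
X   = [0 0 0; 1 0 0; 0 2 0];     % hypothetical xyz points, one per row
Nrm = [0 0 1; 0 1 0; 1 0 0];     % hypothetical normals, row k pairs with X(k,:)
p   = [1 0 0];                   % query point taken from X

% Exact match: row number of p inside X (0 if p is not a row of X)
[found, rowIdx] = ismember(p, X, 'rows');

% Nearest match: same bsxfun approach as above
dists = sqrt(sum(bsxfun(@minus, X, p).^2, 2));
[~, IDX] = min(dists);

matching_normal = Nrm(IDX, :);   % row of the second dataset for that point
```

For an exact match rowIdx and IDX coincide; the distance-based version also copes with small floating-point differences between the two datasets.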
Some Bonus Material
If you'd like to read more about how K-Nearest Neighbour works, I encourage you to read my post about it here:
Finding K-nearest neighbors and its implementation
Good luck!
I am trying to find the Mahalanobis distance of some points from the origin. The MATLAB command for that is mahal(Y,X).
But if I use this I get NaN, because the matrix X = 0 since the distance needs to be found from the origin. Can someone please help me with this? How should it be done?
I think you are a bit confused about what mahal() is doing. First, computation of the Mahalanobis distance requires a population of points, from which the covariance will be calculated.
In the Matlab docs for this function it makes it clear that the distance being computed is:
d(I) = (Y(I,:)-mu)*inv(SIGMA)*(Y(I,:)-mu)'
where mu is the population average of X and SIGMA is the population covariance matrix of X. Since your population consists of a single point (the origin), it has no covariance, and so the SIGMA matrix is not invertible, hence the error where you get NaN/Inf values in the distances.
If you know the covariance structure that you want to use for the Mahalanobis distance, then you can just use the formula above to compute it for yourself. Let's say that the covariance you care about is stored in a matrix S. You want the distance w.r.t. the origin, so you don't need to subtract anything from the values in Y, all you need to compute is:
d = zeros(size(Y,1), 1); % preallocate the output
for ii = 1:size(Y,1)
    % Y(ii,:) is a row vector; Y(ii,:)/S equals Y(ii,:)*inv(S) without forming the inverse
    d(ii) = (Y(ii,:)/S) * Y(ii,:)';
end
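The same computation can be sketched without a loop. The covariance S and the points Y below are made-up values chosen so the result is easy to check by hand. Note that this formula gives the squared Mahalanobis distance; take sqrt if the distance itself is needed.

```matlab
S = [4 0;
     0 1];                 % hypothetical covariance matrix
Y = [2 0;
     0 1;
     2 1];                 % hypothetical points, one per row

% Row-wise Y(ii,:) * inv(S) * Y(ii,:)' for all rows at once
d = sum((Y / S) .* Y, 2);  % squared distances from the origin: [1; 1; 2]
dist = sqrt(d);            % Mahalanobis distances
```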