Knowing which elements from dataset went into tallest bin in histogram? MATLAB - matlab

I want to know which elements from my data set went into the tallest bin in a bivariate histogram, and have not found information on how to do this online. I suspect this is possible since it is fairly useful.
I know I can do some other code that helps me find it but I was wondering if there is a succinct way of doing this. For example I could search through the dataset with a conditional that helps me extract the things falling into the bins but I'm not interested in that. Right now I have written
X = [Eavg,Estdev];
hist3(X,[15 15])
The result is a 15x15 bin bivariate histogram. I want to extract the elements in the tallest bin in a very terse manner.
I'm doing a statistical mechanics (Monte Carlo) simulation, if it's worth mentioning...

The signature [N, CEN] = hist3(...) returns the bin counts and the bin centers. The bin centers can be converted to bin edges, and the edges can then be used to find which data elements fall into a specific bin.
X = randi([1 100],10,2);
[N, CEN] = hist3(X,[5 5]);
%find row and column of highest value of histogram
%since there may be multiple histogram values that
%are equal to maximum value then we select the first one
[r,c]= find(N==max(N(:)),1);
%convert cell of bin centers to vector
R = [CEN{1}];
C = [CEN{2}];
%convert bin centers to edges
%realmax used to include values that
%are beyond the first and the last computed edges
ER = [-realmax R(1:end-1)+diff(R)/2 realmax];
EC = [-realmax C(1:end-1)+diff(C)/2 realmax];
%logical indices of rows where data fall into specified bin
IDX = X(:,1)>= ER(r) & X(:,1)< ER(r+1) & X(:,2)>= EC(c) & X(:,2)< EC(c+1)
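If a terser route is wanted, one option (a sketch, assuming a MATLAB release that ships histcounts2, which is not part of the answer above) is to let histcounts2 return the bin index of every observation directly. Its automatically chosen edges are not guaranteed to match the ones hist3 draws, so treat this as an alternative binning rather than a reproduction of the hist3 plot. The sample data and the variable names inTallest and tallest are made up for illustration.

```matlab
rng(0);                                   % reproducible example data (assumption)
X = randi([1 100], 200, 2);

% binX(i), binY(i) are the bin subscripts of observation i
[N, ~, ~, binX, binY] = histcounts2(X(:,1), X(:,2), [5 5]);

% subscripts of the tallest bin (first one if there are ties)
[~, k] = max(N(:));
[r, c] = ind2sub(size(N), k);

% logical index of the rows of X that fell into that bin, and the rows themselves
inTallest = (binX == r) & (binY == c);
tallest = X(inTallest, :);
```

The same last step works with the IDX vector computed above: X(IDX,:) returns the elements in the tallest hist3 bin.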

Related

Histogram with logarithmic bins and normalized

I want to make a histogram of every column of a matrix, but I want the bins to be logarithmic and also normalized. And after I create the histogram I want to make a fit on it without showing the bars. This is what I have tried:
y=histogram(x,'Normalized','probability');
This gives me the histogram normalized, but I don't know how to make the bins logarithmic.
There are two different ways of creating a logarithmic histogram:
Compute the histogram of the logarithm of the data. This is probably the nicest approach, as you let the software decide on how many bins to create, etc. The x-axis now doesn't match your data, it matches the log of your data. For fitting a function, this is likely beneficial, but for display it could be confusing. Here I change the tick mark labels to show the actual value, keeping the tick marks themselves at their original values:
y = histogram(log(x),'Normalization','probability');
h = gca;
h.XTickLabels = exp(h.XTick);
Determine your own bin edges, on a logarithmic scale. Here you need to determine how many bins you need, depending on the number of samples and the distribution of samples.
b = 2.^(1:0.25:3);
y = histogram(x,b,'Normalization','probability');
set(gca,'XTick',b) % This just puts the tick marks in between bars so you can see what we did.
Method 1 lets MATLAB determine the number of bins and the bin edges automatically, depending on the input data. Hence it is not suitable for creating multiple matching histograms. For that case, use method 2. The bin edges can be obtained more simply this way:
N = 10; % number of bins
start = min(x); % first bin edge
stop = max(x); % last bin edge
b = 2.^linspace(log2(start),log2(stop),N+1);
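For the "fit without showing the bars" part of the question, a possible sketch is to skip histogram altogether and compute the normalized counts with histcounts, then plot them at the geometric bin centers. The lognormal sample data and the names p and centers are assumptions for illustration; the first and last edges are pinned to the data extremes so that rounding in 2.^log2(...) cannot push a sample outside the bins.

```matlab
rng(0);
x = exp(randn(1000, 1));                  % example positive data (assumption)

N = 10;                                   % number of bins
b = 2.^linspace(log2(min(x)), log2(max(x)), N+1);
b(1) = min(x);                            % guard against rounding at the ends
b(end) = max(x);

% normalized bin counts, no bars drawn
p = histcounts(x, b, 'Normalization', 'probability');

% geometric mean of neighbouring edges = bin center on a log axis
centers = sqrt(b(1:end-1) .* b(2:end));

semilogx(centers, p, 'o-');               % points to fit against, log x-axis
```

The pairs (centers, p) can then be passed to whatever fitting routine you prefer.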
I think the correct syntax would be Normalization.
To make it logarithmic, you have to change the axes object.
For example :
ha = axes;
y = histogram( x,'Normalization','probability' );
ha.YScale = 'log';

Matlab : Help in finding minimum distance

I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix where the rows are the dimension and columns indicate points. Computing the inter-point distances, and then recording the point with minimum distance and its distance as well. Below is the code snippet. The code works fine for a small dimension and small set of points. But, it takes a long time for large data set (N = 1 million data points and dimension is also high). Is there an efficient way?
I suggest that you use pdist to do the heavy lifting for you. This function computes the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform in order to find the minimal distance for each point:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of minimal distance between each pair of points
This computes every pair twice, but I think this is true for your original implementation.
The idea is that pdist computes the pairwise distance between the rows of its input, so we pass the transpose of Z to pdist. Since the full output is always a symmetric square matrix with zero diagonal, pdist is implemented such that it only returns the values above the diagonal, in a vector. So a call to squareform is needed to get the proper distance matrix. Then the row-wise minimum of this matrix has to be found, but first we have to exclude the zeros on the diagonal. I was lazy so I put inf on the diagonal, to make sure that the minimum is elsewhere. In the end we just have to sum up the minimal distances.
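For large N the full N-by-N matrix may not fit in memory. A possible alternative (a sketch, not part of the answer above, and assuming the Statistics Toolbox function knnsearch is available) is to ask for the two nearest neighbours of every point within the same set: the first hit is the point itself at distance 0, the second is its nearest other point. This assumes there are no duplicate points; with duplicates the first column is not guaranteed to be the point itself.

```matlab
rng(1);
N = 100;
Z = rand(2, N);                           % each column is a 2-dimensional point

% rows of Z.' are the points; search the set against itself
[idx, d] = knnsearch(Z.', Z.', 'K', 2);

nearest  = idx(:, 2);                     % index of the closest other point
mindists = d(:, 2);                       % distance to that point
sum_dist = sum(mindists);
```

This also records which point is the nearest one, not only the distance.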

Find the minimum absolute values along the third dimension in a 3D matrix and ensuring the sign is maintained

I have a m X n X k matrix and I want to find the elements that have minimal absolute value along the third dimension for each unique 2D spatial coordinate. An additional constraint is that once I find these minimum values, the sign of these values (i.e. before I took the absolute value) must be maintained.
The code I wrote to accomplish this is shown below.
tmp = abs(dist); %the size(dist)=[m,n,k]
[v,ind] = min(tmp,[],3); %find the index of minimal absolute value in the 3rd dimension
ind = reshape(ind,m*n,1);
[col,row]=meshgrid(1:n,1:m); row = reshape(row,m*n,1); col = reshape(col,m*n,1);
ind2 = sub2ind(size(dist),row,col,ind); % row, col, ind are sub
dm = dist(ind2); %take the signed value from dist
dm = reshape(dm,m,n);
The resulting matrix dm which is m X n corresponds to the matrix that is subject to the constraints that I have mentioned previously. However, this code sounds a little bit inefficient since I have to generate linear indices. Is there any way to improve this?
If I'm interpreting your problem statement correctly, you wish to find the minimum absolute value along the third dimension for each unique 2D spatial coordinate in your 3D matrix. This is already being done by the first two lines of your code.
However, a small caveat is that once you find these minimum values, you must ensure that the original sign of these values (i.e. before taking the absolute value) are respected. That is the purpose of the rest of the code.
If you want to select the original values, you don't have a choice but to generate the correct linear indices and sample from the original matrix. However, a lot of code is rather superfluous. There is no need to perform any kind of reshaping.
We can simplify your method by using ndgrid to generate the correct spatial coordinates to sample from the 3D matrix then use ind from your code to reference the third dimension. After, use this to sample dist and complete your code:
%// From your code
[v,ind] = min(abs(dist),[],3);
%// New code
[row,col] = ndgrid(1:size(dist,1), 1:size(dist,2));
%// Output
dm = dist(sub2ind(size(dist), row, col, ind));
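If you would rather avoid both ndgrid and sub2ind, the linear indices can also be built by hand, since element (r,c,p) of an m-by-n-by-k array sits at linear index r + (c-1)*m + (p-1)*m*n. The following is only a sketch of that idea on random example data; the variable lin is a name introduced here.

```matlab
rng(0);
dist = randn(3, 4, 5);                    % example m-by-n-by-k data (assumption)
[m, n, ~] = size(dist);

[~, ind] = min(abs(dist), [], 3);         % page index of the smallest |value|

% linear index within a page, plus an offset of (page-1)*m*n
lin = reshape(1:m*n, m, n) + (ind - 1) * (m*n);

dm = dist(lin);                           % signed values, already m-by-n
```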

How to understand the Matlab build in function "kmeans"?

Suppose I have a matrix A, the size of which is 2000*1000 double. Then I apply the MATLAB built-in function "kmeans" to the matrix A.
k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');
I get C = 8*1000 double; idx = 2000*1 double, with values from 1 to 8;
According to the documentation, C returns the k cluster centroid locations in the k-by-p (8 by 1000) matrix. And idx returns an n-by-1 vector containing cluster indices of each observation.
My question is:
1) I do not know how to understand the C, the centroid locations. Locations should be represented as (x,y), right? How to understand the matrix C correctly?
2) What are the final centers c1, c2,...,ck? Are they just values or locations?
3) For each cluster, if I only want to get the vector closest to the center of this cluster, how to calculate and get it?
Thanks!
Before I answer the three parts, I'll just explain the syntax that is used in MATLAB's explanation of k-means (http://www.mathworks.com/help/stats/kmeans.html).
A is your data matrix (it's represented as X in the link). There are n rows (in this case, 2000), which represent the number of observations/data points that you have. There are also p columns (in this case, 1000), which represent the number of "features" that each data points has. For example, if your data consisted of 2D points, then p would equal 2.
k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
Now I will answer the three parts:
The C matrix has dimensions k x p. Each row represents a centroid. Centroid locations DO NOT have to be (x, y) at all. The dimensions of the centroid locations are equal to p. In other words, if you have 2D points, you could graph the centroids as (x, y). If you have 3D points, you could graph the centroids as (x, y, z). Since each data point in A has 1000 features, your centroids therefore have 1000 dimensions.
This is sort of difficult to explain without knowing what your data is exactly. Centroids are certainly not just values, and they may not necessarily be locations. If your data A were coordinate points, you could certainly represent the centroids as locations. However, we can view it more generally. If you had a cluster centroid i and the data points v that are grouped with that centroid, the centroid would be a representative point, typically a mean of the cluster's members rather than one of the data points themselves, that summarizes the points in its cluster. Hopefully, that makes sense, and I can give a clearer explanation if necessary.
The k-means method actually gives us a good way to accomplish this. The function actually has 4 possible outputs, but I will focus on the 4th, which I will call D:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
D has dimensions n x k. For a data point i, the row i in the D matrix gives the distance from that point to every centroid. Therefore, for each centroid, you simply need to find the data point closest to this, and return that corresponding data point. I can supply the short code for this if you need it.
Also, just a tip. You should probably use kmeans++ method of initializing the centroids. It's faster and generally better. You can call it using this:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
Edit:
Here is the code necessary for part 3:
[~, min_idxs] = min(D, [], 1);
closest_vecs = A(min_idxs, :);
Each row i of closest_vecs is the vector that is closest to centroid i.
OK, before we actually get into the details, let's give a brief overview on what K-means clustering is first.
k-means clustering works such that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
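To make the loop described above concrete, here is a minimal sketch of those iterations in plain MATLAB with the squared Euclidean distance. It is not how the built-in kmeans is implemented, only an illustration of the assign/update steps; the example data, the iteration cap and the tolerance are arbitrary assumptions, and the implicit expansion in the distance line needs R2016b or later.

```matlab
rng(0);
X = [randn(50, 2); randn(50, 2) + 5];     % two example blobs (assumption)
K = 2;

% start from K randomly chosen data points
C = X(randperm(size(X, 1), K), :);

for iter = 1:100
    % squared Euclidean distance from every point to every centroid (n-by-K)
    D = sum(X.^2, 2) + sum(C.^2, 2).' - 2 * (X * C.');
    [~, idx] = min(D, [], 2);             % assign each point to its closest centroid

    % update: each centroid becomes the mean of the points assigned to it
    Cnew = C;
    for j = 1:K
        if any(idx == j)
            Cnew(j, :) = mean(X(idx == j, :), 1);
        end
    end

    % stop when the centroids no longer move
    if max(abs(Cnew(:) - C(:))) < 1e-10
        break;
    end
    C = Cnew;
end
```

After the loop, idx plays the role of IDX and C the role of the centroid matrix discussed below.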
How you use the kmeans function in MATLAB is that assuming you have a data matrix (A in your case), it is arranged such that each row is a sample and each column is a feature / dimension of a sample. For example, we could have N x 2 or N x 3 arrays of Cartesian coordinates, either in 2D or 3D. In colour images, we could have N x 3 arrays where each column is a colour component in an image - red, green or blue.
How you invoke kmeans in MATLAB is the following way:
[IDX, C] = kmeans(X, K);
X is the data matrix we talked about, K is the total number of clusters / groups you would like to see and the outputs IDX and C are respectively an index and centroid matrix. IDX is a N x 1 array where N is the total number of samples that you have put into the function. Each value in IDX tells you which centroid the corresponding sample / row in X best matched. You can also override the distance measure used to measure the distance between points. By default, this is the squared Euclidean distance, but you used the cosine distance in your invocation.
C has K rows where each row is a centroid. Therefore, for the case of Cartesian coordinates, this would be a K x 2 or K x 3 array. Therefore, you would interpret IDX as telling which group / centroid that the point is closest to when computing k-means. As such, if we got a value of IDX=1 for a point, this means that the point best matched with the first centroid, which is the first row of C. Similarly, if we got a value of IDX=3 for a point, this means that the point best matched with the third centroid, which is the third row of C.
Now to answer your questions:
We just talked about C and IDX so this should be clear.
The final centres are stored in C. Each row gives you a centroid / centre that is representative of a group.
It sounds like you want to find the data point closest to each cluster centre. That's easy to do with knnsearch, which performs a K-nearest-neighbour search: given a set of query points, it returns the K points within your data that are closest to each query point. Here the data matrix A is the set being searched and the centroids C are the query points. Note that the centroids returned by kmeans are cluster means and are generally not rows of A, so the nearest neighbour of each centroid is already the point you want and K=1 is enough. Also, knnsearch uses the Euclidean distance by default; since you clustered with the cosine distance, pass the same 'Distance' option.
You can do that by the following, assuming you already ran kmeans:
out = knnsearch(A, C, 'K', 1, 'Distance', 'cosine');
out will give you which rows of your data matrix A are closest to each centroid. To get the actual points, do this:
pts = A(out,:);
Hope this helps!

different sized bins in matlab

In Matlab I have a vector Muen which I want to reduce in size by dividing it in to different length bins. The vector has a few values that need high accuracy bins and a lot of values that are roughly equal and could be collected into bins with size of up to a few hundred values.
I also need to know the index for all old bins going into a new bin in order to shorten a second vector fluence.
The goal is to speed up a summation of two vectors sum(fluence.*Muen) by using different sized bins determined by Muen and do the sum of fluence into the new bins before the vector multiplication.
For this I try to use
edges=[min(Muen):0.0001:Muen(13),Muen(12:-1:1)];
[N,bin]=histc(Muen,edges)
The problem is how to make the vector edges, as there is a large difference between the maximum and minimum of Muen and a small difference between other values. Is there a way to make the steps of edges depending on the derivative Muen?
In order to get the shorter version of Muen would be something like
MuenShort=N.*edges;
but it did not work quite right (could be a fault in edges), any suggestions?
I also do not really get how bin gives the index of the values that go into the new bins?
clarification:
What I want to do is, from a vector m or Muen, take the elements that are roughly equal and replace them with one element, while keeping track of the index of each element that goes into a new vector n or MuenShort. Example:
{m1}->n1,(1), {m2}->n2,(2), {m3,m4}-> m3=m4=n3,(3,4),{m5,m6,m7,m8}-> m5=m6=m7=m8=n4,{5,6,7,8}...
where n1>>n2 but the difference between n3 and n4 might not be so large. the number of m-elements in each n-element should be determined by the number of m-elements that are roughly equal to each other, or rather lies between two limits. So the bin size should vary between one element to a few hundred elements.
Then I want to use indexes to make the fluence vector shorter
fluenceShort(1:length(MuenShort))= [sum(fluence(1)),sum(fluence(2)),sum(fluence(3,4)),sum(fluence(5,6,7,8))...];
goal=sum(fluenceShort.*MuenShort)
Is there a way to implement this in Matlab?
Even if I don't understand your question clearly, I would suggest this. Perhaps you could sort your vector muen, pick a fixed number n, and define each bin so that it contains exactly n values of muen. For simplicity, the length of muen is assumed to be a multiple of n:
n = 10; % number of values per bin
muen_sorted = sort(muen); % muen is assumed to be a row vector
m = length(muen_sorted)/n; % number of bins
edges = [-inf mean([muen_sorted(n:n:end-1); muen_sorted(n+1:n:end)]) inf ];
muen_short = mean(reshape(muen_sorted,n,m));
Note that m+1 edges (vector edges) are obtained, corresponding to m bins. Bin edges lie exactly between the closest values of neighbouring bins. Thus, the upper edge of the first bin is (muen_sorted(n)+muen_sorted(n+1))/2; the upper edge of the next bin is (muen_sorted(2*n)+muen_sorted(2*n+1))/2, and so on.
The "representative value" of each bin (vector muen_short) is computed as the mean of the values that lie in that bin. Or perhaps the median would make more sense, depending on your application.
As a result of this code, muen_short(1) is the value corresponding to the bin with edges edges(1) and edges(2); muen_short(2) is the value corresponding to the bin with edges edges(2) and edges(3), etc.
You can now use the variable edges to build the histogram of fluence with those same edges.
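One way this last step might look (a sketch under assumptions, not the answerer's code): discretize gives the bin index of every original element of muen, and accumarray then sums fluence per bin. The random example data and the names fluence_short, approx and exact are introduced here for illustration, and the values in muen are assumed to be distinct so that the edges are strictly increasing.

```matlab
rng(0);
muen    = rand(1, 1000).^4;               % example data (assumption)
fluence = rand(1, 1000);

n = 10;                                   % values per bin
muen_sorted = sort(muen);
m = numel(muen_sorted) / n;               % number of bins
edges = [-inf mean([muen_sorted(n:n:end-1); muen_sorted(n+1:n:end)]) inf];

% bin index of every element of muen, in its original order
bin = discretize(muen, edges);

% one representative muen value and one summed fluence value per bin
muen_short    = accumarray(bin(:), muen(:),    [m 1], @mean).';
fluence_short = accumarray(bin(:), fluence(:), [m 1]).';

approx = sum(fluence_short .* muen_short);   % shortened sum
exact  = sum(fluence .* muen);               % full sum, for comparison
```

The vector bin is the index bookkeeping asked about in the question: elements with the same value of bin end up in the same entry of the shortened vectors.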