Storing a dynamic array of structures in Matlab - matlab

I'm new to Matlab, and I want to do the following.
I have 2500 data points that can be clustered into 10 groups. My aim is to find the top 5 data points of each cluster that is closest to the centroid. To do that, I did the following.
1) Find the distance between each point to each centroid, and allocate the closest cluster to each data point.
2) Store the data point's index (1,...,2500) and the corresponding distance in a cluster{index} array (not sure what data type this should be), where index = 1,2,...,10.
3) Go through each cluster to find the 5 closest data points.
My problem is I don't know how many data points will be stored in each cluster, so I don't know which data type I should use for my clusters and how to add to them in Step 2. I think a cell array may be what I need, but then I'll need one for the data point index and one for the distance. Or can I create a cell array of structure (each structure consisting of 2 members - index and distance). Again, how could I dynamically add to each cluster then?

I would suggest you keep the data in a normal array; this is usually the fastest option in Matlab.
You could do as follows: (assuming p is an n=2500 by dim matrix of data points, and c is an m=10 by dim matrix of centroids):
dists = zeros(n,m);
for i = 1:m
dists(:,i) = sqrt(sum(bsxfun(@minus,p,c(i,:)).^2,2));
end
[mindists,groups] = min(dists,[],2);
orderOfClosenessInGroup = zeros(size(groups));
for i = 1:m
[~,permutation] = sort(mindists(groups==i));
[~,orderOfClosenessInGroup(groups==i)] = sort(permutation);
end
Then groups will be an n by 1 matrix of values 1 to m telling you which centroid the corresponding data point is closest to, and orderOfClosenessInGroup is an n by 1 matrix telling you the order of closeness inside each group (orderOfClosenessInGroup <= 5 will give you a logical vector of which data points are among the 5 closest to their centroid in their group). To illustrate it, try the following example:
n = 2500;
m = 10;
dim = 2;
c = rand(m,dim);
p = rand(n,dim);
Then run the above code, and finally plot the data as follows:
scatter(p(:,1),p(:,2),100./orderOfClosenessInGroup,[0,0,1],'x');hold on;scatter(c(:,1),c(:,2),50,[1,0,0],'o');
figure;scatter(p(orderOfClosenessInGroup<=5,1),p(orderOfClosenessInGroup<=5,2),50,[0,0,1],'x');hold on;scatter(c(:,1),c(:,2),50,[1,0,0],'o');
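This produces two figures (images not reproduced here): the first shows all data points, with marker size shrinking as the rank within the group grows, and the second shows only the 5 points closest to each centroid.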
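If the point indices per cluster are needed rather than a logical mask, a cell array works well because each cell can hold a different number of entries. The following is only a minimal sketch: the name closest and the small stand-in values for mindists and groups are assumptions for illustration, and k would be 5 in the question.

```matlab
% Stand-in data; in practice mindists, groups and m come from the code above
mindists = [0.3; 0.1; 0.5; 0.2; 0.4; 0.05];
groups   = [1; 1; 2; 1; 2; 2];
m = 2;                              % number of clusters
k = 2;                              % points to keep per cluster (5 in the question)

closest = cell(m, 1);               % hypothetical name: one index list per cluster
for i = 1:m
    members = find(groups == i);            % indices of the points in cluster i
    [~, order] = sort(mindists(members));   % ascending distance to the centroid
    nKeep = min(k, numel(order));           % a cluster may hold fewer than k points
    closest{i} = members(order(1:nKeep));
end
```

closest{i} then lists the indices of the (at most) k points nearest to centroid i, and mindists(closest{i}) gives the matching distances.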

Related

Finding Location of matrices within a structure in matlab

I am importing an RGB image U of the stars and doing the following:
im=rgb2gray(U);
img=(im>200);
BW=im2bw(img,0);
L=bwlabeln(BW,18);
b=regionprops(L,'PixelList');
The goal of this program is to find the largest and most prominent stars in this picture of hundreds of stars. b is a 2566x1 struct array that contains all the points with a value greater than 200. If a certain connected region within the image contains multiple values over 200, b will store a coordinate matrix of these points. Otherwise, it will only store a single coordinate pair.
I need a way to find all the rows within b that contain matrices? If possible, a way to find all the rows within b that contain matrices that contain 30 or more points?
You can use the arrayfun function to apply a function to each element in an array. Note that this is just a shorter way of writing a loop.
In this case you need to apply the test size(b(i).PixelList, 1) >= 30 (30 or more points) to each element i of the struct array b:
m = arrayfun(@(x) size(x.PixelList, 1) >= 30, b)
This is identical to:
m = false(size(b));
for i=1:numel(b)
m(i) = size(b(i).PixelList, 1) >= 30;
end
m is a logical array, so you can use it to index as b(m). You can also get the indices using find(m). To find every row that holds a matrix rather than a single coordinate pair, use > 1 as the test instead.
If you also include 'Area' in the properties calculated by regionprops, you'll already have the number of pixels in each component:
b=regionprops(L,'PixelList','Area');
idx = [b.Area] >= 30;
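To try the arrayfun filter without an image, here is a small sketch on a hand-built struct array. The PixelList contents are made-up placeholders standing in for regionprops output, and the name big is hypothetical.

```matlab
% Hand-built stand-in for the regionprops output: 3 regions with 1, 35 and 30 pixels
b = struct('PixelList', {ones(1, 2), ones(35, 2), ones(30, 2)});

% Logical mask of the regions with 30 or more pixels
m = arrayfun(@(x) size(x.PixelList, 1) >= 30, b);

big = b(m);      % struct array holding only the large regions
idx = find(m);   % their positions in b
```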

Plot symbols depending on vector values

I have a dataset of points represented by a 2D vector (X).
Each point belongs to a class (Y) represented by an integer value (from 1 to 4).
I want to plot each point with a different symbol depending on its class.
Toy example:
X = randi(100,10,2); % 10 points ranging 1:100 in 2D space
Y = randi(4,10,1); % class of the points (1 to 4)
I create a vector of symbols for each class:
S = {'bx' 'rx' 'b.' 'r.'};
Then I try:
plot(X(:,1), X(:,2), S(Y))
Error using plot
Invalid first data argument
How can I assign to each point of X a different symbol based on the value of Y?
Of course I can use a loop over the classes and plot them one by one. But is there a method to directly plot each class with a different symbol?
No need for a loop, use gscatter:
X = randi(100,10,2); % 10 points ranging 1:100 in 2D space
Y = randi(4,10,1); % class of the points (1 to 4)
color = 'brbr';
symbol = 'xx..';
gscatter(X(:,1),X(:,2),Y,color,symbol)
and you will get a scatter plot with one color and symbol combination per class (figure not reproduced here).
If X has many rows, but there are only a few S types, then I suggest you check out the second approach first. It's optimized for speed instead of readability. It's about twice as fast if the vector has 10 elements, and more than 200 times as fast if the vector has 1000 elements.
First approach (easy to read):
Regardless of approach, I think you need a loop for this:
hold on
arrayfun(@(n) plot(X(n,1), X(n,2), S{Y(n)}), 1:size(X,1))
Or, to write the loop in the "conventional way":
hold on
for n = 1:size(X,1)
plot(X(n,1), X(n,2), S{Y(n)})
end
Second approach (gives same plot as above):
If your dataset is large, you can sort [Y_sorted, sort_idx] = sort(Y), then use sort_idx to reorder the rows of X, like this: X_sorted = X(sort_idx, :);. After this, you split X_sorted into 4 groups, one for each of the individual Y-values, using histc and mat2cell. Then you loop over the four groups and plot each one individually.
This way you only need to loop through four values, regardless of the number of elements in your data. This should be a lot faster if the number of elements is high.
[Y_sorted, Y_index] = sort(Y);
X_sorted = X(Y_index, :);
X_cell = mat2cell(X_sorted, histc(Y,1:numel(S)));
hold on
for ii = 1:numel(X_cell)
plot(X_cell{ii}(:,1),X_cell{ii}(:,2),S{ii})
end
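As a side note, histc is marked as not recommended in current MATLAB documentation. One possible substitute for the per-class counts is accumarray, sketched below on small made-up data; the variable nClasses is introduced only for this illustration.

```matlab
% Small made-up example: 5 points, classes 1..4 (class 3 is empty)
X = [10 1; 20 2; 30 3; 40 4; 50 5];
Y = [2; 1; 4; 2; 2];
nClasses = 4;                                  % hypothetical name, numel(S) above

counts = accumarray(Y, 1, [nClasses 1]);       % number of points per class
[~, Y_index] = sort(Y);                        % group rows by class
X_cell = mat2cell(X(Y_index, :), counts, 2);   % one cell per class, empty if unused
```

Each X_cell{ii} can then be plotted with S{ii} exactly as in the loop above.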
Benchmarking:
I did a very simple benchmarking of the two approaches using timeit. The result shows that the second approach is a lot faster:
For 10 elements (times in seconds, as returned by timeit):
First approach: 0.0086
Second approach: 0.0037
For 1000 elements:
First approach: 0.8409
Second approach: 0.0039

Average part of a multidimensional array based on another array (Matlab)

B = randn(1,25,10);
Z = [1;1;1;2;2;3;4;4;4;3];
Ok, so, I want to find the locations where Z=1(or any numbers that are equal to each other), then average across each of the 25 points at these specific locations. In the example you would end with a 1*25*4 array.
Is there an easy way to do this?
I'm not the most versed in Matlab.
First things first: break down the problem.
Define the groups (i.e. the set of unique Z values)
Find elements which belong to these groups
Take the average.
Once you have done that, you can begin to see it's a pretty standard for loop and "Select columns which meet criteria".
Something along the lines of:
B = randn(1,25,10);
Z = [1;1;1;2;2;3;4;4;4;3];
groups = unique(Z); %//find the set of groups
C = nan(1,25,length(groups)); %//predefine the output space for efficiency
for gi = 1:length(groups) %//for each group
idx = Z == groups(gi); %//find its members
C(:,:,gi) = mean(B(:,:,idx), 3); %//select and mean across the third dimension
end
If B = randn(10,25); then it's very easy, because Matlab functions such as mean usually work down the rows (along the first dimension).
Using logical indexing:
ind = Z == 1;
mean(B(ind,:), 1); % force averaging down the rows, even if only one row matches
If you're dealing with multiple dimensions use permute (and reshape if you actually have 3 dimensions or more) to get yourself to a point where you're averaging down the rows as above:
B = randn(1,25,10);
BB = permute(B, [3,2,1]); % BB is now 10-by-25
% continue as above, e.g. mean(BB(ind,:), 1)
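Putting the permute idea together with the grouping, a small worked sketch might look like this. The tiny B and Z are made up so the result is easy to check by hand.

```matlab
% Tiny example: 1-by-2-by-3 array, three slices assigned to two groups
B = reshape(1:6, 1, 2, 3);       % B(1,:,1) = [1 2], B(1,:,2) = [3 4], B(1,:,3) = [5 6]
Z = [1; 1; 2];

BB = permute(B, [3 2 1]);        % 3-by-2: one row per slice of B
groups = unique(Z);
C = zeros(numel(groups), size(BB, 2));
for gi = 1:numel(groups)
    C(gi, :) = mean(BB(Z == groups(gi), :), 1);   % average the rows of this group
end
C = permute(C, [3 2 1]);         % back to 1-by-2-by-(number of groups)
```

Here C(:,:,1) is the average of the first two slices, [2 3], and C(:,:,2) is the third slice on its own, [5 6].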

How to see resampled data after BOOTSTRAP

I was trying to resample (with replacement) my database using 'bootstrap' in Matlab as follows:
D = load('Data.txt');
lead = D(:,1);
depth = D(:,2);
X = D(:,3);
Y = D(:,4);
%Bootstrapping to resample 100 times
[resampling100,bootsam] = bootstrp(100,'corr',lead,depth);
%plotting the bootstrapping result as histogram
hist(resampling100,10);
... ... ...
... ... ...
Though the script written above is correct, I wonder how I would be able to see/load the resampled 100 datasets created through bootstrap? 'bootsam(:)' display the indices of the data/values selected for the bootstrap samples, but not the new sample values!! Isn't it funny that I'm creating fake data from my original data and I can't even see what is created behind the scene?!?
My second question: is it possible to resample the whole matrix (in this case, D) altogether without using any function? However, I know how to create random values from a vector data using 'unidrnd'.
Thanks in advance for your help.
The answer to question 1 is that bootsam provides the indices of the resampled data. Specifically, the nth column of bootsam provides the indices of the nth resampled dataset. In your case, to obtain the nth resampled dataset you would use:
lead_resample_n = lead(bootsam(:, n));
depth_resample_n = depth(bootsam(:, n));
Regarding the second question, I'm guessing what you mean is, how would you just get a re-sampled dataset without worrying about applying a function to the resampled data. Personally, I would use randi, but in this situation, it is irrelevant whether you use randi or unidrnd. An example follows that assumes 4 columns of some data matrix D (as in your question):
%# Build an example dataset
T = 10;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, 1);
%# Obtain the resampled data
DResampled = D(Ind, :);
To create multiple re-sampled data, you can simply loop over the creation of random indices. Or you could do it in one step by creating a matrix of random indices and using that to index D. With careful use of reshape and permute you can turn this into a T*4*M array, where indexing m = 1, ..., M along the third dimension yields the mth resampled dataset. Example code follows:
%# Build an example dataset
T = 10;
M = 3;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, M);
%# Obtain the resampled data
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);
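A quick way to convince yourself that the reshape and permute combination does what is claimed is to compare each slice with the directly indexed dataset. This is only a sanity-check sketch; the loop variable mIdx and the flag allMatch are names introduced here.

```matlab
T = 10;
M = 3;
D = randn(T, 4);
Ind = randi(T, T, M);            % column mIdx holds the row indices of resample mIdx
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);

% Each slice along the third dimension should equal D indexed by one column of Ind
allMatch = true;
for mIdx = 1:M
    allMatch = allMatch && isequal(DResampled(:, :, mIdx), D(Ind(:, mIdx), :));
end
```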

MATLAB: Indexing a large matrix for Monte Carlo Simulation

I'm trying to index a large matrix in MATLAB that contains numbers monotonically increasing across rows, and across columns, i.e. if the matrix is called A, for every (i,j), A(i+1,j) > A(i,j) and A(i,j+1) > A(i,j).
I need to create a random number n and compare it with the values of the matrix A, to see where that random number should be placed in the matrix A. In other words, the value of n may not equal any of the contents of the matrix, but it may lie in between any two rows and any two columns, and that determines a "bin" that identifies its position in A. Once I find this position, I increment the corresponding index in a new matrix of the same size as A.
The problem is that I want to do this 1,000,000 times. I need to create a random number a million times and do the index-checking for each of these numbers. It's a Monte Carlo Simulation of a million photons coming from a point landing on a screen; the matrix A consists of angles in spherical coordinates, and the random number is the solid angle of each incident photon.
My code so far goes something like this (I haven't copy-pasted it here because the details aren't important):
for k = 1:1000000
n = rand(1,1)*pi;
for i = 2:size(A,1)-1
for j = 2:size(A,2)-1
if (n > A(i-1,j)) && (n < A(i+1,j)) && (n > A(i,j-1)) && (n < A(i,j+1))
new_img(i,j) = new_img(i,j) + 1; % new_img defined previously as zeros
end
end
end
end
The "if" statement is just checking to find the indices of A that form the bounds of n.
This works perfectly fine, but it takes ridiculously long, especially since my matrix A is an image of dimensions 11856 x 11000. Is there a quicker / cleverer / easier way of doing this?
Thanks in advance.
You can get rid of the inner loops by performing the calculation on all elements of A at once. Also, you can create the random numbers all at once, instead of one at a time. Note that the outermost pixels of new_img can never be different from zero.
randomNumbers = rand(1,1000000)*pi;
new_img = zeros(size(A));
tmp_img = zeros(size(A)-2);
for r = randomNumbers
tmp_img = tmp_img + (A(2:end-1,1:end-2)<r & A(2:end-1,3:end)>r & A(1:end-2,2:end-1)<r & A(3:end,2:end-1)>r);
end
new_img(2:end-1,2:end-1) = tmp_img;
/aside: If the arrays were smaller, I'd have used bsxfun for the comparison, but with the array sizes in the OP, the approach would run out of memory.
Are the values in A bin edges, i.e. does A specify a grid? If so, you can quickly populate the count matrix using hist3.
Here is an example:
numRand = 1e6;
data = randi(100, numRand, 1);
dataMat = [floor(data./10), mod(data,10)]; % two coordinates per sample
edges = {0:10, 0:9}; % bin edges for each coordinate
counts = hist3(dataMat, 'Edges', edges);
If your A doesn't specify a grid, then you should create all of your random values once and sort them. Then iterate through those values.
Because you know that n(i) >= n(i-1) you don't have to check bins that were too small for n(i-1). This is a very easy way to optimize away most redundant checks.
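Here is a snippet that should help a lot in the inner loop. It finds the location of the greatest element of A that is smaller than your value (assuming at least one such element exists).
idx1 = A < value; % mask of all elements below value
idx2 = idx1 & (A == max(A(idx1))); % mask of the largest of those elements
If you want the exact location, wrap it in find: [row, col] = find(idx2).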