Using bin counts as weights for random number selection - matlab

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then I populate a vector with all possible numbers covered by the bins, from which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (e.g. the count for bin 1-5 is 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whilst the count for bin 16-20 is 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bincounts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and samples often land in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?

For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
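To tie this back to the data in the question, here is a minimal sketch that draws many samples using the bin counts as weights, with base functions only (no randsample). It assumes the intended bins are 1-5, 6-10, ..., 96-100, so it passes nbins+1 edges to histc and drops the final count, because histc reserves its last output for values exactly equal to the last edge. The names edges, cdf, nSamples and resampled are illustrative and do not come from the original post.

```matlab
eventl = [4 5 6 8 10 11 12 24 32].';       % data from the question
binsize = 5;
nbins = 20;

% nbins+1 edges give nbins full bins: 1-5, 6-10, ..., 96-100
edges = 1:binsize:(binsize*nbins + 1);
bincounts = histc(eventl, edges);
bincounts = bincounts(1:nbins);            % drop the "equal to last edge" count

sizes = (1:binsize*nbins).';               % candidate values 1..100
w = repelem(bincounts, binsize);           % one weight per candidate value

% Inverse-CDF sampling: pick the first index whose cumulative weight >= u
cdf = cumsum(w) / sum(w);
nSamples = 1e4;                            % assumed sample count
u = rand(nSamples, 1);
idx = sum(bsxfun(@gt, u, cdf.'), 2) + 1;
R = sizes(idx);

% Re-bin the samples to compare against the original bin counts
resampled = histc(R, edges);
resampled = resampled(1:nbins);
```

With many samples the ratios in resampled should approach those in bincounts, and bins whose weight is zero stay empty.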

Related

Generating random numbers with weighted distribution in Matlab?

I know how to generate random numbers in a certain range in Matlab. What I am trying to do now is generate random numbers in a range where there is more chance of getting certain ones.
For example: how could I use Matlab to generate random numbers between 0 and 2, where 50% of them will be less than 0.5?
To get numbers between 0 and 2 I would use (2-0)*rand+0. How can I do this but get a certain percentage of the numbers generated to be less than 0.5? Is there a way to do this using the rand function?
Here is a suggestion:
N = 10; % how many random numbers to generate
bounds = [0 0.5 1 2]; % define the ranges
prob = cumsum([0.5 0.3 0.2]); % define the probabilities
% pick a random range with probability from 'prob':
s = size(bounds,2)-cumsum(bsxfun(@lt,rand(N,1),prob),2);
% pick a random number in this range:
b = rand(1,N).*(bounds(s(:,end)+1)-bounds(s(:,end)))+bounds(s(:,end))
Here the k-th entry of [0.5 0.3 0.2] is the probability of drawing a number between bounds(k) and bounds(k+1), and prob holds the cumulative sums of those probabilities. Basically we first draw a range with the defined probability, and then draw a number uniformly from that range. So we are interested only in b, but need s on the way (mainly for creating a lot of numbers in a vectorized manner).
so we get:
b =
Columns 1 through 5
0.5297 0.15791 0.88636 0.34822 0.062666
Columns 6 through 10
0.065076 0.54618 0.0039101 0.21155 0.82779
Or, for N = 100000, we can plot the result (figure not reproduced here), so we can see how the values are distributed between the 3 ranges in bounds.
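Since the plot itself is not available, a numeric check can stand in for it. The following is a minimal sketch, not the answerer's code: it draws the range index with a plain comparison against the cumulative probabilities and then counts the fraction of samples landing in each range. The names p, cp, k, lo, hi and frac are illustrative.

```matlab
N = 100000;
bounds = [0 0.5 1 2];                 % range edges from the answer
p = [0.5 0.3 0.2];                    % probability of each range
cp = cumsum(p);

% Range index for every sample: 1 + number of cumulative edges passed
u = rand(N, 1);
k = sum(bsxfun(@ge, u, cp(1:end-1)), 2) + 1;

% Uniform draw inside the chosen range
lo = bounds(k);     lo = lo(:);
hi = bounds(k + 1); hi = hi(:);
b = lo + rand(N, 1) .* (hi - lo);

% Fraction of samples per range; expected to be close to p
counts = histc(b, bounds);
frac = counts(1:numel(p)).' / N;
disp(frac)
```

For large N the printed fractions should sit close to 0.5, 0.3 and 0.2.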
You can use a multinomial distribution to draw the ranges, and then compute the random numbers. Here's how:
N = 10;
bounds = [0 0.5 1 2]; % define the ranges
d = diff(bounds);
% pick N random ranges from a multinomial distribution:
s = mnrnd(N,[0.5 0.3 0.2]);
% pick a random number in this range:
b = rand(1,N).*repelem(d,s)+repelem(bounds(1:end-1),s)
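so you get s (the counts shown here sum to 100, so this output comes from a run with N = 100 rather than the N = 10 set above):
s =
50 39 11
which says you take 50 values from the first range, 39 from the second, and so on.
And you get the result in b: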
b =
Columns 1 through 5
0.28212 0.074551 0.18166 0.035787 0.33316
Columns 6 through 10
0.12404 0.93468 1.9808 1.4522 1.6955
So basically it works the same as the first method I posted here, but it may be more accurate and/or readable. Also, I didn't test which method is faster.

Matlab: How to output a matrix which is sorted by distance

I have a cluster of points in a 3D point cloud, say
A = [ 1 4 3;
1 2 3;
1 6 3;
1 5 3];
The distance matrix was then found:
D= pdist(A);
Z= squareform(D);
Z =
0 2 2 1
2 0 4 3
2 4 0 1
1 3 1 0
I would like to sort the points so that the sum of the distances travelled through the points is the smallest, and output them in another matrix. This is similar to the TSP problem but in a 3D model. Is there any function that can do this?
Your help is really appreciated in advance.
This could be one approach, and it should be efficient enough for a wide range of data sizes -
D = pdist(A);
Z = squareform(D); %// Get distance matrix
N = size(A,1); %// Store the size of the input array for later usage
Z(1:N+1:end) = Inf; %// Set diagonals as Infinites as we intend to find
%// minimum along each row
%// Starting point and initialize an array to store the indices according
%// to the sorted requirements set in the question
idx = 1;
out_idx = zeros(N,1);
out_idx(1) = idx;
%// Perform an iterative search to look for nearest one starting from point-1
for k = 2:N
    start_ind = idx;
    [~,idx] = min(Z(start_ind,:));
    Z(:,start_ind) = Inf;
    out_idx(k) = idx;
end
%// Now that you have the list of indices based on the next closest one,
%// sort the input array based on those indices and have the desired output
out = A(out_idx,:)
Sample run (with an extra duplicate row [1 2 3] appended to the input from the question) -
A =
1 4 3
1 2 3
1 6 3
1 5 3
1 2 3
out =
1 4 3
1 5 3
1 6 3
1 2 3
1 2 3
The only way I can see you do this is by brute force. Also bear in mind that because this is brute force, this will scale very badly as the total number of points increases. This is fine for just 4 points, but if you want to scale this up, the total number of permutations for N points would be N! so be mindful of this before using this approach. If the number of points increases, then you may get to a point where you run out of memory. For example, for 10 points, 10! = 3628800, so this probably won't bode well with memory if you try and go beyond 10 points.
What I can suggest is to generate all possible permutations of visiting the 4 points, then for each pair of points (pt. 1 -> pt. 2, pt. 2 -> pt. 3, pt. 3 -> pt. 4), determine and accumulate the distances, then find the minimum distance accumulated. Whichever distance is the minimum will give you the sequence of nodes you need to visit.
Start with perms to generate all possible ways to visit the four points exactly once, then for each pair of consecutive points, look up the distance between them and accumulate it. Keep considering pairs of points along each permutation until we reach the end. Once we're done, find the smallest accumulated distance and return the sequence of points that produced it.
Something like:
%// Your code
A = [ 1 4 3;
1 2 3;
1 6 3;
1 5 3];
D = pdist(A);
Z = squareform(D);
%// Generate all possible permutations to visit for our points
V = perms(1:size(A,1));
%// Used to accumulate our distances per point pair
dists = zeros(size(V,1), 1);
%// For each point pair
for idx = 1 : size(V,2)-1
    %// Get the point pair in the sequence
    p1 = V(:,idx);
    p2 = V(:,idx+1);
    %// Figure out the distance between the two points and add them up
    dists = dists + Z(sub2ind(size(Z), p1, p2));
end
%// Find which sequence gave the minimum distance travelled
[~,min_idx] = min(dists);
%// Find the sequence of points to generate the minimum
seq = V(min_idx,:);
%// Give the actual points themselves
out = A(seq,:);
seq and out give the actual sequence of points we need to visit, followed by the actual points themselves. Note that we find one such possible combination. There may be a chance that there is more than one possible way to get the minimum distance travelled. This code just returns one possible combination. As such, what I get with the above is:
>> seq
seq =
3 4 1 2
>> out
out =
1 6 3
1 5 3
1 4 3
1 2 3
What the above is saying is that we need to start at point 3, then move to point 4, point 1, then end at point 2. Also, the sequence of pairs of points we need to visit is points 3 and 4, then points 4 and 1 and finally points 1 and 2. The distances are:
Pt. 3 - Pt. 4 - 1
Pt. 4 - Pt. 1 - 1
Pt. 1 - Pt. 2 - 2
Total distance = 4
If you take a look at this particular problem, the minimum possible distance would be 4 but there is certainly more than one way to get the distance 4. This code just gives you one such possible traversal.
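To compare the two answers on the 4-point input from the question, the sketch below computes the total path length of a given visiting order without pdist. The order [1 4 3 2] is what tracing the nearest-neighbour loop by hand from point 1 gives for this input (an assumption worth re-checking by running that code), and [3 4 1 2] is the seq reported above. The helper name pathLength is illustrative.

```matlab
A = [1 4 3;
     1 2 3;
     1 6 3;
     1 5 3];

% Pairwise Euclidean distances using base functions only
N = size(A, 1);
Z = zeros(N);
for i = 1:N
    for j = 1:N
        Z(i, j) = norm(A(i, :) - A(j, :));
    end
end

% Sum of the distances between consecutive points of a visiting order
pathLength = @(order) sum(Z(sub2ind(size(Z), order(1:end-1), order(2:end))));

greedyOrder = [1 4 3 2];   % nearest-neighbour order starting at point 1
bruteOrder  = [3 4 1 2];   % order found by the exhaustive search

greedyLen = pathLength(greedyOrder);
bruteLen  = pathLength(bruteOrder);
fprintf('nearest-neighbour: %g, exhaustive: %g\n', greedyLen, bruteLen);
```

On this input the nearest-neighbour order is longer than the exhaustive one, which illustrates that the greedy search is fast but not guaranteed to find the minimum.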

Matlab: Array of random integers with no direct repetition

For my experiment I have 20 categories which contain 9 pictures each. I want to show these pictures in a pseudo-random sequence where the only constraint to randomness is that one image may not be followed directly by one of the same category.
So I need something similar to
r = randi([1 20],1,180);
just with an added constraint of two numbers not directly following each other. E.g.
14 8 15 15 7 16 6 4 1 8 is not legitimate, whereas
14 8 15 7 15 16 6 4 1 8 would be.
An alternative way I was thinking of was naming the categories A,B,C,...T, have them repeat 9 times and then shuffle the bunch. But there you run into the same problem I think?
I am an absolute Matlab beginner, so any guidance will be welcome.
The following uses modulo operations to make sure each value is different from the previous one:
m = 20; %// number of categories
n = 180; %// desired number of samples
x = [randi(m)-1 randi(m-1, [1 n-1])];
x = mod(cumsum(x), m) + 1;
How the code works
In the third line, the first entry of x is a random value between 0 and m-1. Each subsequent entry represents the change that, modulo m, will give the next value (this is done in the fourth line).
The key is to choose that change between 1 and m-1 (not between 0 and m-1), to assure consecutive values will be different. In other words, given a value, there are m-1 (not m) choices for the next value.
After the modulo operation, 1 is added to transform the range of resulting values from 0,...,m-1 to 1,...,m.
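A small deterministic example may make the mechanism easier to follow. Here the random draws are replaced by a hand-picked vector (steps, a hypothetical name) with m = 5, so the arithmetic can be checked by eye.

```matlab
m = 5;
% First entry plays the role of randi(m)-1 (range 0..m-1);
% the remaining entries play the role of randi(m-1,...) (range 1..m-1)
steps = [2 1 4 3];

running = cumsum(steps);      % [2 3 7 10]
x = mod(running, m) + 1;      % [3 4 3 1]
disp(x)
```

Because every step after the first is between 1 and m-1, it can never be a multiple of m, so two consecutive entries of x always differ.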
Test
Take all (n-1) pairs of consecutive entries in the generated x vector and count occurrences of all (m^2) possible combinations of values:
count = accumarray([x(1:end-1); x(2:end)].', 1, [m m]);
imagesc(count)
axis square
colorbar
The resulting image (not reproduced here) was obtained for m=20; n=1e6;. It shows that all combinations are (more or less) equally likely, except for pairs with repeated values, which never occur.
You could look for the repetitions iteratively and draw new random integers from the same interval [1 20] only for the positions where repetitions have occurred, continuing until no repetitions are left -
interval = [1 20]; %// interval from which the random integers are chosen
r = randi(interval,1,180); %// create the first batch of numbers
idx = diff(r)==0; %// logical array, where 1s denote repetitions
while any(idx)
    rN = randi(interval,1,nnz(idx)); %// new random integers for the repeated positions
    r(find(idx)+1) = rN; %// place the random integers at their respective positions
    idx = diff(r)==0; %// re-check for repetitions after the replacement
end
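Neither approach above fixes how often each category appears, whereas the question mentions 9 pictures per category. If each category must occur exactly 9 times, one option is the shuffle idea from the question combined with rejection: shuffle, test, and reshuffle until no neighbours match. This is only a sketch under that assumption; pool, maxTries and ok are hypothetical names, and maxTries is an arbitrary safety cap.

```matlab
m = 20;                        % number of categories
reps = 9;                      % pictures per category
pool = repmat(1:m, 1, reps);   % each category exactly reps times

maxTries = 1e6;                % safety cap so the loop always terminates
ok = false;
for t = 1:maxTries
    seq = pool(randperm(numel(pool)));   % uniform shuffle of the pool
    if all(diff(seq) ~= 0)               % accept only if no neighbours match
        ok = true;
        break;
    end
end
```

Most plain shuffles contain at least one adjacent repeat, so the loop usually needs many attempts, but each attempt is cheap and every accepted sequence keeps the exact per-category counts.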

Why Matlab K-means does not find the best centroids while Excel Solver does?

I have a data set as follows:
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12]
Then I proceed with:
[idx,ctrs, sumD] = kmeans(Data,3)
It gives me the centroids and sumD (sums of point-to-centroid distances within cluster) like:
ctrs = [5.6000 3.4000; 3.5000 10.2500; 9.2000 9.0000]
sumD = [6.4000; 13.7500; 18.8000]
Whereas according to Excel Solver (from a published article), ctrs and sumD are as follows for k=3:
ctrs = [5.21815716 3.66736761; 3.615385665 10.461533; 9.47841197 8.75055345]
sumD = [5.151897802; 7.285383286; 8.573829765]
(NB: In that article, the authors give an initial (seed) centroid to each cluster such as [4 4; 5 12; 10 6] by visual decision from the plot.)
Apparently, Excel finds more accurate ctrs values and thereby smaller sumD values. I could not achieve this with Matlab, so I tried other parameters of the kmeans function: 'replicates', 'options' (MaxIter) and 'start' (even with a 3D array seed), to no avail. I even used the same initial seed from the article in Matlab. The following is what I tried without success:
First:
opts = statset('MaxIter',100);
Seed = [4 4; 5 12; 10 6];
[idx,ctrs] = kmeans(Data,3,'Replicates',50,'options',opts,'start',Seed)
This gives an error: The third dimension of the 'Start' array must match the 'replicates' parameter value.
Second:
I created a 3D array of 50 pages where the first page is the same initial seed as above and the remaining 49 are random. I created the random pages as:
T = rand(3,2,49);
After that, I created the 50 pages 3D array like this:
Seed2 = cat(3,Seed,T);
Then used kmeans:
[idx,ctrs] = kmeans(Data,3,'Replicates',50,'options',opts,'start',Seed2)
However, Matlab gave warnings indicating that all the replicates after the first one were terminated because an empty cluster was created at iteration 1. Also, the idx, ctrs and sumD values obtained were still the same as before, as if I had run my very first call above (i.e. [idx,ctrs, sumD] = kmeans(Data,3) ).
I am stuck. I am trying to verify the results of the Excel Solver published in the article using Matlab, because I will then apply the same algorithm used on the 14 observations from the article to a larger data set of 900+ observations.
What am I doing wrong? What should I correct in my code to obtain the same or a very similar result to the Excel Solver?
The difference appears to be in the choice of the measure of distance used, not in the coding. There is more than one way to define "distance" in this context.
MATLAB uses squared Euclidean distance by default. By hand calculating this with the MATLAB results I can replicate the sumD results you get. However, using squared Euclidean distance measure with the results you give from the paper gives a higher value of sumD.
I get the same results for sumD as the paper if I use plain (not squared) Euclidean distance. Using this measure the MATLAB results return higher values for sumD.
So neither result is wrong as such, they're just measuring "rightness" in different ways.
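The hand calculation described above can be sketched in a few lines. This assumes each point is assigned to its nearest centroid and uses the MATLAB centroids quoted in the question; swapping in the centroids from the paper allows the same comparison for the Excel result. The names d2, minD2, sumD_squared and sumD_plain are illustrative.

```matlab
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
ctrs = [5.6 3.4; 3.5 10.25; 9.2 9.0];    % centroids reported by kmeans

k = size(ctrs, 1);
n = size(Data, 1);

% Squared Euclidean distance from every point to every centroid
d2 = zeros(n, k);
for j = 1:k
    delta = bsxfun(@minus, Data, ctrs(j, :));
    d2(:, j) = sum(delta.^2, 2);
end

% Assign each point to its nearest centroid
[minD2, idx] = min(d2, [], 2);

% Within-cluster sums under the two distance measures
sumD_squared = accumarray(idx, minD2, [k 1]);        % squared Euclidean
sumD_plain   = accumarray(idx, sqrt(minD2), [k 1]);  % plain Euclidean
disp([sumD_squared sumD_plain])
```

The first column reproduces the sumD values from kmeans in the question, and the second shows how the same clustering scores under plain Euclidean distance.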
How can you be certain that the Excel values are correct and that MATLAB's kmeans gives you a less accurate result?
With the quick MATLAB script below, I plotted the centroids, and at least visually it seems correct
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
plot(Data(:,1), Data(:,2),'ob','markersize', 10);
axis([min(Data(:,1))-2, max(Data(:,1))+2, min(Data(:,2))-2, max(Data(:,2))+2]);
hold on;
[idx,ctrs, sumD] = kmeans(Data,3);
plot(ctrs(:,1), ctrs(:,2), '*r', 'markersize', 10);
If this is not accurate enough, then instead of trying to customize MATLAB's kmeans, we can define our own kmeans function. I implemented kmeans some time ago, and it seemed easier than asking MATLAB to fine-tune the parameters.

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
For example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed and finding the value which corresponds to the largest reduction in STD and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
    G = double(H);
    for i = 1:7
        G(i) = NaN;
        T(i) = nanstd(G);
    end
    best = find(T==min(T));
    H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This possibility bins your data and looks for the bin with most elements. If your distribution consists of well separated clusters this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb =find(abs(H-bm)<bl/2) % elements within bin
median(H(ifb)) % median of the elements in that bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
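If choosing a number of bins is awkward, another option is to sort the values and split them wherever the gap between neighbours exceeds a tolerance. This is an alternative sketch rather than the method described above, and tol = 5 is an assumed threshold for what counts as "similar" in this example. The names Hs, groupId and clumpMedian are illustrative.

```matlab
H = [99,100,101,102,103,180,181,182,5,250,17];
tol = 5;                                  % assumed maximum gap inside a clump

Hs = sort(H);
% A new group starts wherever the gap to the previous value exceeds tol
groupId = cumsum([1, diff(Hs) > tol]);

% Size of each group, then the median of the largest one
counts = accumarray(groupId(:), 1);
[~, best] = max(counts);
clumpMedian = median(Hs(groupId == best));
disp(clumpMedian)
```

The cost is dominated by the sort, and the only parameter is the gap tolerance, which may be easier to choose than a bin count when the clumps are well separated.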
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin])); % iv is the label of the largest cluster
H(km==iv)
median(H(km==iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).