Find median value of the largest clump of similar values in an array in the most computationally efficient manner - matlab

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
For example:
H = [99,100,101,102,103,180,181,182,5,250,17]
Here I would be looking for 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed, finding the value whose removal corresponds to the largest reduction in the standard deviation, and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
    G = double(H);
    for i = 1:numel(G)
        G(i) = NaN;           % temporarily drop element i
        T(i) = nanstd(G);     % spread with that element removed
        G(i) = H(i);          % restore it before testing the next element
    end
    best = find(T==min(T),1); % element whose removal shrinks the std most
    H(best) = NaN;            % discard it permanently
end
x = find(H==max(H));
Any thoughts?

This approach bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H);           % <-- set # of bins here
[v, bins] = hist(H, nbins);
[vm, im] = max(v);           % find max in histogram
bl = bins(2) - bins(1);      % bin size
bm = bins(im);               % position of bin with max #
ifb = find(abs(H-bm) < bl/2) % elements within that bin
median(H(ifb))               % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
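To illustrate, here is the same procedure with the number of bins set to 3 as suggested above (a minimal sketch; only nbins changes):
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = 3;                            % coarser binning than length(H)
[v, bins] = hist(H, nbins);
[~, im] = max(v);                     % most populated bin
bl = bins(2) - bins(1);               % bin width
ifb = find(abs(H - bins(im)) < bl/2); % elements within that bin
median(H(ifb))                        % still returns 101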
I should also note that there are clustering methods (e.g. kmeans) that may work better, though perhaps less efficiently. For instance, this is the output of [H' kmeans(H',4)]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);
[mv, iv] = max(histc(km, 1:nbin)); % iv = label of the most populated cluster
H(km==iv)
median(H(km==iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
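One way to reduce that run-to-run variability, assuming your kmeans supports the 'Replicates' option (the Statistics Toolbox version does), is to let kmeans keep the best of several random restarts rather than averaging yourself:
nbin = 4;
km = kmeans(H', nbin, 'Replicates', 5); % keep the best of 5 random restarts
[~, iv] = max(histc(km, 1:nbin));       % label of the most populated cluster
median(H(km == iv))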
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).

Using bin counts as weights for random number selection

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then I populate a vector with all the numbers covered by the bins, from which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (i.e. bincount(1-5) = 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whilst bincount(16-20) = 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bin counts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and data is often found in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
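As a rough check, you can tabulate the sample and compare the empirical frequencies against the normalized weights (a sketch reusing the variables above):
counts = histc(sample, values);  %# occurrences of each value in the sample
[counts(:)/numberOfElements, weights(:)/sum(weights)]  %# empirical vs. target probabilities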

Matlab: Array of random integers with no direct repetition

For my experiment I have 20 categories which contain 9 pictures each. I want to show these pictures in a pseudo-random sequence where the only constraint to randomness is that one image may not be followed directly by one of the same category.
So I need something similar to
r = randi([1 20],1,180);
just with an added constraint of two numbers not directly following each other. E.g.
14 8 15 15 7 16 6 4 1 8 is not legitimate, whereas
14 8 15 7 15 16 6 4 1 8 would be.
An alternative way I was thinking of was naming the categories A,B,C,...T, have them repeat 9 times and then shuffle the bunch. But there you run into the same problem I think?
I am an absolute Matlab beginner, so any guidance will be welcome.
The following uses modulo operations to make sure each value is different from the previous one:
m = 20; %// number of categories
n = 180; %// desired number of samples
x = [randi(m)-1 randi(m-1, [1 n-1])];
x = mod(cumsum(x), m) + 1;
How the code works
In the third line, the first entry of x is a random value between 0 and m-1. Each subsequent entry represents the change that, modulo m, will give the next value (this is done in the fourth line).
The key is to choose that change between 1 and m-1 (not between 0 and m-1), to assure consecutive values will be different. In other words, given a value, there are m-1 (not m) choices for the next value.
After the modulo operation, 1 is added to transform the range of resulting values from 0,...,m-1 to 1,...,m.
Test
Take all (n-1) pairs of consecutive entries in the generated x vector and count occurrences of all (m^2) possible combinations of values:
count = accumarray([x(1:end-1); x(2:end)].', 1, [m m]);
imagesc(count)
axis square
colorbar
The following image has been obtained for m=20; n=1e6;. It is seen that all combinations are (more or less) equally likely, except for pairs with repeated values, which never occur.
You could look for repetitions iteratively and place a new set of integers from the same interval [1 20] only at the positions where repetitions occurred, continuing until no repetitions are left -
interval = [1 20]; %// interval from which the random integers are chosen
r = randi(interval,1,180); %// create the first batch of numbers
idx = diff(r)==0; %// logical array, where 1s denote repetitions for first batch
while nnz(idx)~=0
    idx = diff(r)==0;                %// logical array, where 1s denote repetitions
                                     %// for subsequent batches
    rN = randi(interval,1,nnz(idx)); %// new set of random integers to be placed
                                     %// at the positions where repetitions occurred
    r(find(idx)+1) = rN;             %// place random integers at their respective positions
end
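A quick sanity check once the loop exits (just an assertion to confirm the constraint holds):
assert(all(diff(r) ~= 0), 'two equal consecutive entries remain')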

Genetic algorithm: Minimum Number of Generations?

I have a Matlab script (actually a function, funModel), which I'm trying to solve with 7 integer variables via a genetic algorithm:
nvars = 7; %number of variables
Aineq = [1 1 1 1 1 1 1]; Aeq = [];
bineq = [VesMaxCrew]; beq = [];
LowBound = [1 1 1 1 1 4 0];
UpBound = [1 1 VesMaxCrew 1 VesMaxCrew VesMaxCrew VesMaxCrew];
Nonlcon = [];
IntCon = [1:7]; % all 7 variables to be treated as integers
Options = gaoptimset('Display','iter',... %display every iteration
    'Generations',70,... %maximum number of generations is 70
    'TolFun',1,... %tolerance for optimisation is 1
    'TolCon',1,...
    'PlotFcns',@gaplotbestf);
OptimisedValue = ga(@funModel,nvars,Aineq,bineq,Aeq,beq,LowBound,UpBound,Nonlcon,IntCon,Options);
The genetic algorithm works fine and finds a good solution, easily within 70 generations (as can be seen with the plot function @gaplotbestf). With the current input, the optimal solution is chosen for every individual after 25 to 30 generations. The algorithm, however, continues to run until 51 generations have been made. This would seem like at least 20 generations too many.
Even if I change the input parameters of funModel, the genetic algorithm still runs at least 51 generations, like there is some constraint or setting saying the algorithm has to run 51 generations minimum. (As can be seen, a maximum number of generations has been entered)
Why doesn't the algorithm stop between 25 or 30 generations? (or just after 30 generations)
And more importantly, does anyone know how to alter this?
(I haven't been able to find anything about a setting (gaoptimset) of minimum generations in the Matlab documentation. Neither have I been able to find somebody with the same problem/question.)
"Stall generations" option has default value of 50. This is actually the point where it stops in your case. This can be considered as a minimum number of generations. For more details please check here.

MATLAB vector: prevent consecutive values from same range

Okay, this might seem like a weird question, but bear with me.
So I have a random vector in a .m file, with certain constraints built into it. Here is my code:
randvecall = randsample(done, done, true);
randvec = randvecall([1;diff(randvecall(:))]~=0);
"Done" is just the range of values we take the sample from, so don't worry about that. As you can see, this randsamples from a range of values, and then prunes this random vector with the diff function, so that consecutive duplicate values are removed. There is still the potential for duplicate values in the vector, but they simply cannot be consecutive.
This is all well and good, and works perfectly fine.
So, say, randvec looks like this:
randvec =
54
47
52
26
39
2
14
51
24
6
19
56
34
46
12
7
41
18
29
7
It is actually a lot longer, with something like 60-70 values, but you get the point.
What I want to do is add a little extra constraint on to this vector. When I sample from this vector, the values are classified according to their range. So values from 1-15 are category 1, 16-30 are category 2, and so on. The reasons for this are unimportant, but it is a pretty important part of the program. So if you look at the values I provided above, you see a section like this:
7
41
18
29
7
This is actually bad for my program. Because the value ranges are treated separately, 41, 18, and 29 are used differently than 7 is. So, for all intents and purposes, 7 is appearing consecutively in my script. What I want to do is somehow parse/modify/whatever the vector when it is generated so that the same number from a certain range cannot appear twice "in a row," regardless of how many other numbers from different ranges are between them. Does this make sense/did I describe this well? So, I want MATLAB to search the vector, and for all values within certain ranges (1-15,16-30,31-45,46-60) make sure that "consecutive" values from the same range are not identical.
So, then, that is what I want to do. This may not by any means be the best way to do this, so any advice/alternatives are, of course, appreciated. I know I can do this better with multiple vectors, but for various reasons I need this to be a single, long vector (the way my script is designed it just wouldn't work if I had a separate vector for each range of values).
What you may want to do is create four random vectors, one for each category, ensure that they do not contain any two consecutive equal values, and then build your final random vector by ordered picking of values from random categories, i.e.
%# make a 50-by-nCategories array of random numbers
categories = [1,16,31,46;15,30,45,60]; %# category min/max
nCategories = size(categories,2);
randomCategories = zeros(50,nCategories);
for c = 1:nCategories
    %# draw 100 numbers for good measure
    tmp = randi(categories(:,c)',[100 1]);
    tmp([false; diff(tmp)==0]) = []; %# remove consecutive duplicates
    %# store
    randomCategories(:,c) = tmp(1:50);
end
%# select from which bins to pick. Use half the numbers, so that we don't force
%# the number of entries per category to be exactly equal
bins = randi(nCategories,[100,1]);
%# combine the output, i.e. replace e.g. the numbers
%# '3' in 'bins' with the consecutive entries
%# from the third category
out = zeros(100,1);
for c = 1:nCategories
    cIdx = find(bins==c);
    out(cIdx) = randomCategories(1:length(cIdx),c);
end
First we assign to each element the bin number of the range it lies in:
[~,bins] = histc(randvec, [1 16 31 46 61]);
Next we loop over each range and find the elements in that category. For example, for the first range of 1-15, we get:
>> ind = find(bins==1); %# bin #1: values 1-15
>> x = randvec(ind)
x =
2
14
6
12
7
7
Now you can apply the same process of removing consecutive duplicates:
>> idx = ([1;diff(x)] == 0)
idx =
0
0
0
0
0
1
>> problematicIndices = ind(idx) %# indices into the vector: randvec
Do this for all ranges, and collect those problematic indices. Next decide how you want to deal with them (remove them, generate other numbers in their place, etc...)
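For example, a sketch that loops over all four ranges and collects the problematic indices into one list:
[~, bins] = histc(randvec, [1 16 31 46 61]);
problematic = [];
for b = 1:4
    ind = find(bins == b);         %# positions of values in range b
    x = randvec(ind);
    idx = ([1; diff(x(:))] == 0);  %# consecutive duplicates within the range
    problematic = [problematic; ind(idx)];
end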
If I understand your problem correctly, here is one solution. It uses unique, but applies it to each of the subranges of the vector. The values that are duplicated within a range of indices are identified so you can deal with them.
cat_inds = [1,16,31,46,60]; % need to include the last element
for i = 2:numel(cat_inds)
    randvec_part = randvec( cat_inds(i-1):cat_inds(i) );
    % Find the indices for the first unique elements in this part of the array
    [~,uniqInds] = unique(randvec_part,'first');
    % this logical vector identifies the indices that are duplicated in
    % this part of randvec
    %
    % NB: they are indices into randvec_part
    %
    inds_of_duplicates = ~ismember(1:numel(randvec_part), uniqInds);
    % code to deal with the problem indices goes here; modify randvec_part accordingly...
    % Write it back to the original vector (assumes that the length is the same)
    randvec( cat_inds(i-1):cat_inds(i) ) = randvec_part;
end
Here's a different approach from what everyone else has been tossing up. The premise I'm working from is that you want a random arrangement of values in a vector without repetition. I'm not sure what other constraints you are applying before the point where we are given the input.
My thought is to use the randperm function.
Here's some sample code how it would work:
%randvec is your vector of random values
randvec2 = unique(randvec); % returns the sorted list of unique values from randvec
randomizedvector = randvec2(randperm(length(randvec2)));
% Note: if randvec is multidimensional you'll have to use numel instead of length
At this point randomizedvector contains all the unique values from the initial randvec, 'shuffled' or re-randomized after the unique call. You could also seed randvec differently to avoid the unique call entirely, since calling randperm(n) already returns a randomized vector with values ranging from 1 to n.
Just an off the wall 2 cents there =P enjoy!

Extremely large weighted average

I am using 64-bit MATLAB with 32 GB of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast: the millionth point is still weighted at 0.25 compared to the original point's weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data = data.';
total = length(data);
x = 1:total;
datapad = [zeros(1,total) data];             % zero-pad to avoid wrap-around in the circular convolution
weights = ([(total+1):-1:2 1:total]).^(-.4); % kernel: 1/(distance+1)^0.4
weights = weights/sum(weights);              % normalize to get a weighted mean
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;                % convolution = multiplication in the frequency domain
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, which underpins the filter function and similar operations. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
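To match the actual weighting in the question, only the kernel line changes; a sketch with the (distance+1)^-0.1 weights (the same construction the code appended to the question uses):
k = ([(n+1):-1:2, 1:n]) .^ (-0.1); % weights fall off as 1/(distance+1)^0.1
k = k / sum(k);                    % normalize so each output is a weighted mean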
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices, each consisting of the entries of your weight matrix that share the value 1/i (where i = 1 .. 1.3 million), multiply each with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
      0 1 0;
      0 0 1];
b2 = [0 1 0;
      1 0 1;
      0 1 0] / 2;
b3 = [0 0 1;
      0 0 0;
      1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted fewer iterations of the inner loop, you could put more than one of the i's in each matrix.
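In practice you would build each matrix directly in sparse form, e.g. with spdiags; a sketch for the same example, written so it generalizes to any n:
n = 3;
a = rand(n,1);
c = zeros(n,1);
for d = 0:n-1
    if d == 0
        Bd = speye(n); % distance 0: weight 1
    else
        % ones on the +d and -d diagonals, scaled by 1/(d+1)
        Bd = spdiags(ones(n,2), [-d d], n, n) / (d+1);
    end
    c = c + Bd * a;
end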
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
"I can't use matlab's 'filter' function, because it can only average things before the current point, right?"
That is not correct. You can always add or remove samples (i.e. zeros) from your data or from the filtered data. Since filtering with filter (you can also use conv, by the way) is a linear operation, this does not change the result: padding with zeros does nothing, and linearity lets you swap the order to add samples -> filter -> remove samples.
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights, 1, [data, zeros(1,3)]); % or conv(data, weights)
% removing the delay introduced by the kernel
result = result(3:end-1);
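A quick check with the 3-element example from the question:
data = [2 6 9];
weights = 1 ./ [3 2 1 2 3];
result = filter(weights, 1, [data, zeros(1,3)]);
result = result(3:end-1) % returns [8 11.5 12.6667], matching data*weights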
You considered only 2 options: multiplying the 1.3M-by-1.3M matrix with a vector once, or multiplying two 1.3M vectors 1.3M times.
But you can divide your weight matrix into as many sub-matrices as you wish and do a multiplication of an n-by-1.3M matrix with the vector 1.3M/n times.
I assume the fastest will be when there is the smallest number of iterations and n is such that it creates the largest sub-matrix that fits in your memory, without making your computer start swapping pages to your hard drive.
With your memory size you should start with something like n = 1000 (an n-by-1.3M double matrix takes about n * 10 MB, so n = 1000 is roughly 10 GB).
You can also make it faster by using parfor (with n divided by the number of processors).
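A sketch of that chunked multiplication; the block size blk is an assumption to tune (each chunk needs roughly blk * 1.3e6 * 8 bytes, so blk = 1000 is about 10 GB):
n = numel(data);
blk = 1000;            % rows of the weight matrix to build per iteration
results = zeros(1, n);
for s = 1:blk:n
    e = min(s + blk - 1, n);
    % rows s..e of the full weight matrix, built on the fly
    W = (abs(bsxfun(@minus, (s:e).', 1:n)) + 1) .^ (-0.1);
    results(s:e) = (W * data(:)).';
end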
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-n:n) + 1).^-0.1; % weight = 1/(distance+1)^0.1
For each element in the vector:
Index the relevant portion of the weights vector to treat the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 multiplications and n^2 additions. With n equal to 1.3 million that's about 3.4 trillion operations. A single core of a modern 3 GHz CPU can do, say, 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.
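A sketch of that algorithm, assuming data is a row vector and the (distance+1)^-0.1 weighting from the question:
n = numel(data);
weights = (abs(-n:n) + 1) .^ (-0.1); % every weight any element will need
results = zeros(1, n);
for i = 1:n
    wi = weights(n+2-i : 2*n+1-i);        % the n weights centred on element i
    results(i) = (data * wi.') / sum(wi); % fast dot product, then one scalar division
end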