I have several variables, all of which are numeric ranges: (intervals in rows)
a = [ 1 4; 5 9; 11 15; 20 30];
b = [ 2 6; 12 14; 19 22];
c = [ 15 22; 24 29; 33 35];
d = [ 0 3; 15 17; 23 26];
(The values in my real dataset are not integers, but are represented here as such for clarity).
I would like to find intervals in which at least 3 of the variables intersect. In the above example [20 22] and [24 26] would be two such cases.
One way to approach this is to bin my values and add the bins together, but as my values are continuous, that'd create an 'edge effect', and I'd lose time binning the values in the first place. (binning my dataset at my desired resolution would create hundreds of GB of data).
Another approach, which doesn't involve binning, would use pairwise intersections (let's call it X) between all possible combinations of variables, and then the intersection of X with all other variables O(n^3).
What are your thoughts on this? Are there algorithms/libraries which have tools to solve this?
I was thinking of using sort of a geometric approach to solve this: Basically, if I considered that my intervals were segments in 1D space, then my desired output would be points where three segments (from three variables) intersect. I'm not sure if this is efficient algorithmically though. Advice?
O(N lg N) method:
Convert each interval (t_A, t_B) to a pair of tagged endpoints ('begin', t_A), ('end', t_B)
Sort all the endpoints by time, this is the most expensive step
Do one pass through, tracking nesting depth (increment if tag is 'start', decrement if tag is 'end'). This takes linear time.
When the depth changes from 2 to 3, it's the start of an output interval.
When it changes from 3 to 2, it's the end of an interval.
Related
I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then populate a matrix with all possible numbers covered by the bins which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection i.e. bincount (1-5) = 2, thus the weight for choosing 1,2,3,4 or 5 = 2 whilst (16-20) = 0 so 16,17,18, 19 or 20 can never be chosen, I simply take the bincounts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that was sufficient sampling depth, the binned version of R would be identical to the binned version of eventl however there is significant variation and often data found in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
Okay, this is a bit tricky to explain, but I have a long .txt file with data (only one column). It could look like this:
data=[18
32
50
3
19
31
48
2
18
33
51
4]
Now, every fourth value (e.g. 18, 19, 18) represents the same physical quantity, just from different measurements. Now, I want Matlab to take every fourth value and put it into an array X=[18 19 18], and like wise for the other quantities.
My solution so far looks like this:
for i=1:3;
for j=1:4:12;
X(i)=data(j);
end
end
... in this example, because there are three of each quantity (therefore i=1:3), and there are 12 datapoints in total (therefore j=1:4:12, in steps of 4). data is simply the loaded list of datapoints (this works fine, I can test it in command window - e.g. data(2)=32).
My problem, doing this, is, that my array turns out like X=[18 18 18] - i.e. only the last iteration is put into the array
Of course, in the end, I would like to do it for all points; saving the 2nd, 6th, and 10th datapoint into Y and so on. But this is simply having more for-loops I guess.
I hope this question makes sense. I guess it is an easy problem to solve.
Why don't you just do?
>> X = data(1:4:end)
X =
18
19
18
>> Y = data(2:4:end)
Y =
32
31
33
You can reshape the data and then either split it up into different variables or just know that each column is a different variable (I'm now assuming each measurement occurs the same number of times i.e. length(data) is a multiple of 4)
data = reshape(data, 4, []).';
So now if you want
X = data(:,1);
Y = data(:,2);
%// etc...
But also you could just leave it as data all in one variable since calling data(:,1) is hardly more hassle than X.
Now, you should NOT use for-loops for this, but I'm gong to address what's wrong with your loops and how to solve this using loops purely as an explanation of the logic. You have a nested loop:
for i=1:3;
for j=1:4:12;
X(i)=data(j);
end
end
Now what you were hoping was that i and j would each move one iteration forward together. So when i==1 then j==1, when i==2 then j==5 etc but this is not what happens at all. To best understand what's going on I suggest you print out the variables at each iteration:
disp(sprintf('i: \tj:'));
for i=1:3;
for j=1:4:12;
disp(sprintf(' %d\t %d',i,j));
end
end
This prints out
i: j:
1 1
1 5
1 9
2 1
2 5
2 9
3 1
3 5
3 9
What you wanted was
disp(sprintf('i: \tj:'));
for i=1:3;
disp(sprintf(' %d\t %d',i,4*i-3));
end
which outputs:
i: j:
1 1
2 5
3 9
applied to your problem:
%// preallocation!
X = zeros(size(data,1)/4, 1)
for i=1:3
X(i)=data(i*4 - 3);
end
Or alternatively you can keep a separate count of either i or j:
%// preallocation!
X = zeros(size(data,1)/4, 1)
i = 1;
for j=1:4:end;
X(i)=data(j);
i = i+1;
end
Just for completeness your own solution should have read
i = 0;
for j=1:4:12;
i = i+1;
X(i)=data(j);
end
Of course am304's answer is a better way of doing it.
For my experiment I have 20 categories which contain 9 pictures each. I want to show these pictures in a pseudo-random sequence where the only constraint to randomness is that one image may not be followed directly by one of the same category.
So I need something similar to
r = randi([1 20],1,180);
just with an added constraint of two numbers not directly following each other. E.g.
14 8 15 15 7 16 6 4 1 8 is not legitimate, whereas
14 8 15 7 15 16 6 4 1 8 would be.
An alternative way I was thinking of was naming the categories A,B,C,...T, have them repeat 9 times and then shuffle the bunch. But there you run into the same problem I think?
I am an absolute Matlab beginner, so any guidance will be welcome.
The following uses modulo operations to make sure each value is different from the previous one:
m = 20; %// number of categories
n = 180; %// desired number of samples
x = [randi(m)-1 randi(m-1, [1 n-1])];
x = mod(cumsum(x), m) + 1;
How the code works
In the third line, the first entry of x is a random value between 0 and m-1. Each subsequent entry represents the change that, modulo m, will give the next value (this is done in the fourth line).
The key is to choose that change between 1 and m-1 (not between 0 and m-1), to assure consecutive values will be different. In other words, given a value, there are m-1 (not m) choices for the next value.
After the modulo operation, 1 is added to to transform the range of resulting values from 0,...,m-1 to 1,...,m.
Test
Take all (n-1) pairs of consecutive entries in the generated x vector and count occurrences of all (m^2) possible combinations of values:
count = accumarray([x(1:end-1); x(2:end)].', 1, [m m]);
imagesc(count)
axis square
colorbar
The following image has been obtained for m=20; n=1e6;. It is seen that all combinations are (more or less) equally likely, except for pairs with repeated values, which never occur.
You could look for the repetitions in an iterative manner and put new set of integers from the same group [1 20] only into those places where repetitions have occurred. We continue to do so until there are no repetitions left -
interval = [1 20]; %// interval from where the random integers are to be chosen
r = randi(interval,1,180); %// create the first batch of numbers
idx = diff(r)==0; %// logical array, where 1s denote repetitions for first batch
while nnz(idx)~=0
idx = diff(r)==0; %// logical array, where 1s denote repetitions for
%// subsequent batches
rN = randi(interval,1,nnz(idx)); %// new set of random integers to be placed
%// at the positions where repetitions have occured
r(find(idx)+1) = rN; %// place ramdom integers at their respective positions
end
I have two matrices like this:
gt = [30 40 20 40] and
de = [32 42 20 40; 34 12 20 40; 36 84 20 40]
I want to calculate the overlap area between gt and 3 rows of de respectively and the overlap is calculated by a function I write myself. Then I want to store the result in a new column vector like
result = [result1; result2; result3].
Could you tell me how to write a vectorization codes to achieve this?
Thanks!
The vectorization can only happen inside the overlap function. The only thing you can do outside it is replicate the vector gt, using repmat or bsxfun. You don't explain how the overlap function works. I suppose it has to do with co-ordinates, so I give an example for euclidean distance which works in a similar logic.
If you had to calculate the distance between point gt = [1 2] and points de = [5 6; 10 12; 0 -1] you would define
function result = dist(x, y)
result = sum(sqrt((x(:,1) - y(:,1)).^2 + (x(:,2) - y(:,2)).^2), 2)
and you would call it replicating the gt vector
dist(de, repmat(gt, 3, 1))
Alternatively, you could use bsxfun instead of repmat, which might have better performance (depending on various factors)
The key to vectorizing is performing operations column-wise (in this specific case it could be vectorized even further, however I am writing it this way to emphasize the column-wise operations)
Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed and finding the value which corresponds to the largest reduction in STD and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
G = double(H);
for i = 1:7
G(i) = NaN;
T(i) = nanstd(G);
end
best = find(T==min(T));
H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This possibility bins your data and looks for the bin with most elements. If your distribution consists of well separated clusters this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb =find(abs(H-bm)<bl/2) % elements within bin
median(H(ifb)) % average over those elements in bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin]));
H(km==km(iv))
median(H(km==km(iv)))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).