MATLAB/General CS: Sampling Without Replacement From Multiple Sets (+Keeping Track of Unsampled Cases) - matlab

I am currently implementing an optimization algorithm that requires me to sample without replacement from several sets. Although I am coding in MATLAB, this is essentially a CS question.
The situation is as follows:
I have a finite number of sets (A, B, C), each with a finite but possibly different number of elements (a1,a2...a8, b1,b2...b10, c1,c2...c25). I also have a vector of probabilities for each set, listing a probability for each element in that set (i.e. for set A, P_A = [p_a1 p_a2... p_a8] where sum(P_A) = 1). I normally use these to create a probability generating function for each set which, given a uniform number between 0 and 1, spits out one of the elements from that set (i.e. a function P_A(u) which, given u = 0.25, will select a2).
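For concreteness, a minimal sketch of such a generator using the inverse-CDF method (the probability values below are hypothetical, chosen so that u = 0.25 selects a2):
P_A = [0.2 0.1 0.15 0.05 0.1 0.2 0.1 0.1]; % hypothetical probabilities, sum to 1
u = 0.25; % uniform draw in (0,1)
idx = find(u <= cumsum(P_A), 1); % first element whose cumulative probability reaches u; here idx = 2, i.e. a2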
I am looking to sample without replacement from the sets A, B, and C. Each "full sample" is a sequence containing one element from each of the different sets, e.g. (a1, b3, c2). Note that the space of full samples is the Cartesian product of the sets A, B, and C. In the example above, this space is (a1,a2...a8) x (b1,b2...b10) x (c1, c2...c25) and there are 8*10*25 = 2000 unique "full samples" in my space.
The annoying part of sampling without replacement with this setup is that if my first sample is (a1, b3, c2), that does not mean I cannot sample the element a1 again - it just means that I cannot sample the full sequence (a1, b3, c2) again. Another annoying part is that the algorithm I am working with requires me to do a function evaluation for all permutations of elements that I have not sampled.
The best method at my disposal right now is to keep track of the sampled cases. This is a little inefficient since my sampler is forced to reject any case that has been sampled before (since I'm sampling without replacement). I then do the function evaluations for the unsampled cases by going through each permutation (ax, by, cz) using nested for loops and only doing the function evaluation if that combination of (ax, by, cz) is not included in the sampled cases. Again, this is a little inefficient since I have to "check" whether each permutation (ax, by, cz) has already been sampled.
I would appreciate any advice in regards to this problem. In particular, I am looking for a method to sample without replacement and keep track of unsampled cases that does not explicitly list out the full sample space (I usually work with 10 sets of 10 elements each, so listing out the full sample space would require a 10^10 x 10 matrix). I realize that this may be impossible, though finding an efficient way to do it will allow me to demonstrate the true limits of the algorithm.

Do you really need to keep track of all of the unsampled cases? Even if you had a 1-by-10^10 vector that stored a logical value of true or false indicating whether each permutation had been sampled, that would still require about 10 GB of storage, and MATLAB is likely to either throw an "Out of Memory" error or bring your entire machine to a screeching halt if you try to create a variable of that size.
An alternative to consider is storing a sparse vector of indicators for the permutations you've already sampled. Let's consider your smaller example:
A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);
The 1-by-2000 sparse matrix beenSampled is empty to start (i.e. it contains all zeroes) and we will add a one at a given index for each sampled permutation. We can get a new sample permutation using the function RANDI to give us indices into A, B, and C for the new set of values:
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
We can then convert these three indices into a single unique linear index into beenSampled using the function SUB2IND:
index = sub2ind([nA nB nC],indexA,indexB,indexC);
Now we can test the indexed element in beenSampled to see if it has a value of 1 (i.e. we sampled it already) or 0 (i.e. it is a new sample). If it has been sampled already, we repeat the process of finding a new set of indices above. Once we have a permutation we haven't sampled yet, we can process it:
while beenSampled(index)
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...
The use of a sparse array will save you a lot of space if you're only going to end up sampling a small portion of all of the possible permutations. For smaller total numbers of permutations, like in the above example, I would probably just use a logical vector instead of a sparse vector.
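To also address the second half of the question - evaluating all unsampled cases without nested loops and per-permutation membership checks - the same linear indexing can be inverted with the function IND2SUB. A minimal sketch reusing the variables above (note this materializes the full index list, so it only helps when the total number of permutations fits in memory; it won't rescue the 10^10 case):
allIdx = 1:(nA*nB*nC); %# every possible linear index
unsampledIdx = setdiff(allIdx, find(beenSampled)); %# drop the sampled ones
[ia,ib,ic] = ind2sub([nA nB nC],unsampledIdx); %# recover per-set indices
for k = 1:numel(unsampledIdx)
newSample = [A(ia(k)) B(ib(k)) C(ic(k))];
%# ...do the function evaluation for this unsampled case...
end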

Check the MATLAB documentation for the randi function; you'll just want to use that in conjunction with the length function to choose random entries from each vector. Keeping track of each sampled vector should be as simple as concatenating it to a matrix:
used_values = []; % start with an empty history of samples
current_values = [5 89 45]; % let's say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat
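A minimal sketch of the rejection step this implies (assuming A, B and C are the vectors being sampled, and using ismember with the 'rows' option to test whether a candidate triple was drawn before):
candidate = [A(randi(length(A))) B(randi(length(B))) C(randi(length(C)))];
while ~isempty(used_values) && ismember(candidate, used_values, 'rows')
candidate = [A(randi(length(A))) B(randi(length(B))) C(randi(length(C)))]; % resample until new
end
used_values = [used_values; candidate];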

Related

How to perform operations along a certain dimension of an array?

I have a 3D array containing five 3-by-4 slices, defined as follows:
rng(3372061);
M = randi(100,3,4,5);
I'd like to collect some statistics about the array:
The maximum value in every column.
The mean value in every row.
The standard deviation within each slice.
This is quite straightforward using loops:
sz = size(M);
colMax = zeros(1,4,5);
rowMean = zeros(3,1,5);
sliceSTD = zeros(1,1,5);
for indS = 1:sz(3)
sl = M(:,:,indS);
sliceSTD(indS) = std(sl(1:sz(1)*sz(2))); % std of all elements in the slice, via linear indexing
for indRow = 1:sz(1)
rowMean(indRow,1,indS) = mean(sl(indRow,:));
end
for indCol = 1:sz(2)
colMax(1,indCol,indS) = max(sl(:,indCol));
end
end
But I'm not sure that this is the best way to approach the problem.
A common pattern I noticed in the documentation of max, mean and std is that they allow specifying an additional dim input. For instance, in max:
M = max(A,[],dim) returns the largest elements along dimension dim. For example, if A is a matrix, then max(A,[],2) is a column vector containing the maximum value of each row.
How can I use this syntax to simplify my code?
Many functions in MATLAB allow the specification of a "dimension to operate over" when it matters for the result of the computation (several common examples are min, max, sum, prod, mean, std, size, median, prctile, and bounds), which is especially important for multidimensional inputs. When the dim input is not specified, MATLAB chooses a dimension on its own, as explained in the documentation; for example, in max:
If A is a vector, then max(A) returns the maximum of A.
If A is a matrix, then max(A) is a row vector containing the maximum value of each column.
If A is a multidimensional array, then max(A) operates along the first array dimension whose size does not equal 1, treating the elements as vectors. The size of this dimension becomes 1 while the sizes of all other dimensions remain the same. If A is an empty array whose first dimension has zero length, then max(A) returns an empty array with the same size as A.
Then, using the ...,dim) syntax we can rewrite the code as follows:
rng(3372061);
M = randi(100,3,4,5);
colMax = max(M,[],1);
rowMean = mean(M,2);
sliceSTD = std(reshape(M,1,[],5),0,2); % use reshape to turn each slice into a vector
This has several advantages:
The code is easier to understand.
The code is potentially more robust, being able to handle inputs beyond those it was initially designed for.
The code is likely faster.
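As a quick sanity check (a minimal sketch, assuming the loop results from above were stored under different names such as colMaxLoop), the two approaches can be compared directly:
isequal(colMaxLoop, colMax) % exact match for the column maxima
max(abs(rowMeanLoop(:) - rowMean(:))) % ~0, up to floating-point rounding
max(abs(sliceSTDLoop(:) - sliceSTD(:))) % ~0, up to floating-point rounding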
In conclusion: it is always a good idea to read the documentation of functions you're using, and experiment with different syntaxes, so as not to miss similar opportunities to make your code more succinct.

Finding the best threshold level without a for loop in MATLAB

Let A and B be two matrices of the same size. For a matrix M, let ht(M,t) threshold all the entries of M by t; that is, all entries whose absolute value is less than t are set to 0. Suppose I want to find the optimal threshold t such that norm(ht(A,t)-B,'fro')^2 is minimized.
The only way that I can see to do this is deficient: loop over the unique values of A, and for each candidate threshold t set C = ht(A,t)-B and compute sum(sum(C.*C)).
This is just too slow when A is large. I have considered sorting the elements of A and finding some efficient way to set a few entries to zero at a time, but I'm not sure this can all be done without a for loop.
Is there a way to do it?
Here's a very simple example (so simple a for loop works easily in this case):
B =
0.101508820368332 0
0 0.301996943246957
Set
A=B+.1*ones(2)
A =
0.201508820368332 0.1
0.1 0.401996943246957
Simple inspection shows that if we zero out the off-diagonal entries of A we minimize the difference between A and B. There are 3 possible threshold values, given by unique(A)=[.1,.2015,.402]. Given a potential threshold value t, we can hard threshold A by:
function [A_thresholded] = ht(A,t)
% Hard threshold: keep only entries of A whose magnitude exceeds t
A_thresholded = A .* (abs(A)>t);
The fact that the data are in a matrix is irrelevant: you can convert them to vectors and compute the squared norm directly. In fact, you can sort the contents of A by increasing magnitude (permuting B the same way to preserve the pairing). Each time you raise the threshold past one more value of A, the squared norm changes only by that entry's contribution, so after the initial sort every candidate threshold can be evaluated in constant time. Therefore, you can find your solution in O(n log n). Hope this helps.
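That observation translates directly into code. Below is a minimal sketch of the O(n log n) idea (the function name and details are mine; it assumes ht zeroes entries with |A| <= t, matching the abs(A)>t test above, and with tied magnitudes the cumulative cost should be read at the last tied index):
function [t_best, cost_best] = best_threshold(A,B)
% Minimize norm(ht(A,t)-B,'fro')^2 over candidate thresholds t
a = A(:); b = B(:);
[absA, order] = sort(abs(a)); % candidate thresholds, in increasing magnitude
a = a(order); b = b(order);
base = sum((a - b).^2); % cost at t = 0, i.e. nothing zeroed
delta = b.^2 - (a - b).^2; % cost change when entry k becomes zeroed
costs = [base; base + cumsum(delta)]; % costs(k+1): the k smallest-|A| entries zeroed
[cost_best, k_best] = min(costs);
if k_best == 1
t_best = 0; % zeroing nothing is best
else
t_best = absA(k_best - 1); % zero all entries with |A| <= t_best
end
On the 2-by-2 example above this returns t_best = 0.1, zeroing exactly the off-diagonal entries.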

Retrieve a specific permutation without storing all possible permutations in Matlab

I am working on 2D rectangular packing. I want to minimize the length of an infinite sheet (the width is constant) by changing the order in which parts are placed. For example, 11 parts can be placed in 11! different orders.
I could label those parts and save all possible permutations using the perms function and run them one by one, but I would need a huge amount of memory even for 11 parts. I'd like to be able to do this for around 1000 parts.
Luckily, I don't need every possible sequence. I would like to index each permutation by a number, test a random sequence, and then use a GA to converge towards the optimal sequence.
Therefore, I need a function which, unlike randperm, returns the same specific permutation every time it is called with a given index.
For example, function(5,6) should always return, say, [1 4 3 2 5 6] for 6 parts. I don't need the sequences in any particular order, but the function must give the same sequence for the same index, and different indexes must not give the same sequence.
So far, I have used randperm to generate random sequences for around 2000 iterations and picked the best sequence by comparing lengths, but this works only for a small number of parts. Also, randperm may produce repeated sequences instead of unique ones.
I can't save the outputs of randperm because I won't have a searchable function space. I don't want to find the length of the sheet for all sequences; I only need to do it for certain sequences identified by certain indexes determined by the genetic algorithm. If I use randperm, I won't have a sequence for every index (even though I only need some of them).
For example, take some function, 'y = f(x)', in the range [0,10] say. For each value of x, I get a y. Here y is my sheet length. x is the index of permutation. For any x, I find its sequence (the specific permutation) and then its corresponding sheet length. Based on the results of some random values of x, GA will generate me a new list of x to find a more optimal y.
I need a function that duplicates what perms does (I guess perms follows the same order of permutations each time it is run, because perms(1:4) yields the same results when run any number of times) without actually storing the values.
Is there a way to write the function? If not, then how do i solve my problem?
Edit (how I approached the problem):
In a genetic algorithm you need to cross over parents (permutations), but if you cross over permutations directly, numbers get repeated: e.g. crossing over 1 2 3 4 with 3 2 1 4 may result in something like 3 2 3 4. To avoid repetition, I thought of indexing each parent by a number, converting that number to binary, crossing over the binary indices to get a new binary number, converting it back to decimal, and finding its specific permutation. Later on, I discovered I could just use ordered crossover on the permutations themselves instead of crossing over their indices.
More details on Ordered Crossover could be found here
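For reference, here is a minimal sketch of one common ordered-crossover variant (OX1); the function name is mine, and the parents are assumed to be permutations of positive integers:
function child = ordered_crossover(p1, p2)
% Copy a random segment from parent 1, then fill the remaining positions
% with parent 2's genes in the order they appear in parent 2
n = numel(p1);
cut = sort(randperm(n, 2)); % random segment boundaries
child = zeros(1, n);
child(cut(1):cut(2)) = p1(cut(1):cut(2)); % segment from parent 1
rest = p2(~ismember(p2, child)); % genes not yet used, in parent-2 order
child(child == 0) = rest; % fill the gaps
For example, crossing over 1 2 3 4 with 3 2 1 4 using segment positions 2:3 copies 2 3 from the first parent and fills the remaining slots with 1 4 (their order in the second parent), always giving a valid permutation.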
Below are two functions that together will generate permutations in lexicographic order and return the nth permutation.
For example, I can call
nth_permutation(5, [1 2 3 4])
And the output will be [1 4 2 3]
Intuitively, the time this method takes is linear in n; the size of the set doesn't matter. I benchmarked nth_permutation(n, 1:1000) averaged over 100 iterations, and the timing indeed grows linearly with n. So timewise it seems okay.
function [permutation] = nth_permutation(n, set)
%%NTH_PERMUTATION Generates n permutations of set in lexicographic order and
%%outputs the last one
%% set is a 1 by m matrix
set = sort(set);
permutation = set; %First permutation
for ii=2:n
permutation = next_permute(permutation);
end
end
function [p] = next_permute(p)
%Following algorithm from https://en.wikipedia.org/wiki/Permutation#Generation_in_lexicographic_order
%Find the largest index k such that p[k] < p[k+1]
larger = p(1:end-1) < p(2:end);
k = find(larger, 1, 'last');
%If no such index exists, the permutation is the last permutation.
if isempty(k)
disp('Last permutation reached');
return
end
%Find the largest index l greater than k such that p[k] < p[l].
larger = [false(1, k), p(k+1:end) > p(k)];
l = find(larger, 1, 'last');
%Swap the value of p[k] with that of p[l].
p([k, l]) = p([l, k]);
%Reverse the sequence from p[k + 1] up to and including the final element p[n].
p(k+1:end) = p(end:-1:k+1);
end
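As an aside (not part of the answer above): stepping through permutations one at a time is infeasible when n can be astronomically large, as with 1000 parts. A standard alternative is to unrank directly via the factorial number system (Lehmer code). A minimal sketch - exact only while the factorials fit in a double, i.e. for sets of up to about 18 elements; beyond that you would need arbitrary-precision integers:
function [p] = unrank_permutation(n, set)
% Return the n-th (1-based) permutation of set in lexicographic order,
% without generating the n-1 preceding ones
set = sort(set);
m = numel(set);
k = n - 1; % 0-based rank
p = zeros(1, m);
for ii = 1:m
f = factorial(m - ii); % number of permutations of the remaining tail
d = floor(k / f); % which remaining element leads this block
k = mod(k, f);
p(ii) = set(d + 1);
set(d + 1) = []; % remove the chosen element
end
end
unrank_permutation(5, [1 2 3 4]) returns the same [1 4 2 3] as nth_permutation above.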

Matlab: Comparing two vectors with different length and different values?

Let's say I have two vectors A and B with different lengths (length(A) is not equal to length(B)), and the values in vector A are not the same as those in vector B. I want to compare each value of B with the values of A, where "compare" means checking whether B(i) is almost equal to some value in A, for example B(i)-Tolerance < A(j) < B(i)+Tolerance.
How can I do this without using a for loop, since the data is huge?
I know ismember, intersect, repmat and find, but none of those functions can really help me here.
You may try a solution along these lines:
tol = 0.1;
N = 1000000;
a = randn(1, N)*1000; % random data
b = a + tol*rand(1, N); % b is at most "tol" away from a
a_bin = floor(a/tol);
b_bin = floor(b/tol);
result = ismember(b_bin, a_bin) | ...
ismember(b_bin, a_bin-1) | ...
ismember(b_bin, a_bin+1);
find(result==0) % should be empty matrix.
The idea is to discretize the a and b variables to bins of size tol. Then, you ask whether b is found in the same bin as any element from a, or in the bin to the left of it, or in the bin to the right of it.
Advantages: I believe ismember is clever inside: it first sorts the elements of a and then performs a sublinear (log(N)) search per element of b. This is unlike approaches which explicitly construct the differences of each element in b with every element of a, whose complexity is linear in the number of elements of a.
Comparison: for N=100000 this runs in 0.04s on my machine, compared to 20s using linear search (timed using Alan's nice and concise tf = arrayfun(@(bi) any(abs(a - bi) < tol), b); solution).
Disadvantages: the effective tolerance is anything between tol and (almost) 2*tol, since values in adjacent bins can be nearly two bin-widths apart. Whether you can live with that depends on your task (if the only concern is floating-point comparison, you can).
Note: whether this is a viable approach depends on the ranges of a and b and on the value of tol. If a and b can be very big and tol is very small, a_bin and b_bin will not be able to resolve individual bins (then you would have to work with integer types, again checking carefully that their ranges suffice). The solution with loops is a safer one, but if you really need speed, you can invest in optimizing the presented idea. Another option, of course, would be to write a MEX extension.
It sounds like what you are trying to do is have an ismember function for use on real-valued data: that is, check for each value B(i) in your vector B whether B(i) is within the tolerance threshold T of at least one value in your vector A.
This works out something like the following:
tf = false(1, length(b)); %//the result vector, true if that element of b is in a
t = 0.01; %// the tolerance threshold
for i = 1:length(b)
%// is the absolute difference between the
%//element of a and b less that the threshold?
matches = abs(a - b(i)) < t;
%// if b(i) matches any of the elements of a
tf(i) = any(matches);
end
Or, in short:
t = 0.01;
tf = arrayfun(@(bi) any(abs(a - bi) < t), b);
Regarding avoiding the for loop: while this might benefit from vectorization, you may also want to consider parallelization if your data is that huge. In that case, having a for loop as in my first example can be handy, since you can do a basic version of parallel processing simply by changing the for to parfor.
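A minimal sketch of that parfor variant (requires the Parallel Computing Toolbox):
t = 0.01;
tf = false(1, length(b));
parfor i = 1:length(b)
tf(i) = any(abs(a - b(i)) < t); %// same test as before, iterations distributed across workers
end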
Here is a fully vectorized solution. Note that I would actually recommend the solution given by @Alan, as mine is not likely to work for big datasets.
[X, Y] = meshgrid(A, B);
M = abs(X - Y) < tolerance;
Now the logical index of elements in A that are within the tolerance can be obtained with any(M), and the index for B is found with any(M,2).
bsxfun to the rescue
>> M = abs( bsxfun(@minus, A, B') ); % difference of every pair
>> M < tolerance
Another way to do what you want is with a logical expression.
Since A and B are vectors of different sizes you can't simply subtract and look for values that are smaller than the tolerance, but you can do the following:
Lmat = sparse((abs(repmat(A,[numel(B) 1])-repmat(B',[1 numel(A)])))<tolerance);
and you will get a sparse logical matrix with as many ones in it as equal elements (within tolerance). You could then count how many of those elements you have by writing:
Nequal = sum(sum(Lmat));
You could also get the indexes of the corresponding elements by writing:
[r,c] = find(Lmat);
then the following will hold (for all j in 1:numel(r)):
abs(B(r(j)) - A(c(j))) < tolerance
Finally, you should note that this way you get multiple counts in case there are duplicate entries in A or in B. It may be advisable to use the unique function first. For example:
A_new = unique(A);

Matlab fast neighborhood operation

I have a problem. I have a matrix with integer values between 0 and 5, for example:
x=randi(5,10,10)
Now I want to apply a filter of size 3x3 which gives me the most common value in each neighborhood.
I have tried 2 solutions:
fun = @(z) mode(z(:));
y1 = nlfilter(x,[3 3],fun);
which takes a very long time...
and
y2 = colfilt(x,[3 3],'sliding',#mode);
which also takes a long time.
I have some really big matrices and both solutions take a long time.
Is there any faster way?
+1 to @Floris for the excellent suggestion to use hist. It's very fast. You can do a bit better though: hist is based on histc, which can be used instead. histc is a compiled function, i.e., not written in MATLAB, which is why the solution is much faster.
Here's a small function that attempts to generalize what @Floris did (his solution also returns a vector rather than the desired matrix) and achieve what you're doing with nlfilter and colfilt. It doesn't require that the input have particular dimensions and uses im2col to efficiently rearrange the data. In fact, the first three lines and the call to im2col are virtually identical to what colfilt does in your case.
function a=intmodefilt(a,nhood)
%INTMODEFILT Sliding-window mode filter for integer-valued matrices
[ma,na] = size(a);
aa(ma+nhood(1)-1,na+nhood(2)-1) = 0; % allocate a zero-padded array
aa(floor((nhood(1)-1)/2)+(1:ma),floor((nhood(2)-1)/2)+(1:na)) = a; % center the data in the padding
[~,a(:)] = max(histc(im2col(aa,nhood,'sliding'),min(a(:))-1:max(a(:)))); % per-window histogram; take the fullest bin
a = a-1; % map bin indices back to values
Usage:
x = randi(5,10,10);
y3 = intmodefilt(x,[3 3]);
For large arrays, this is over 75 times faster than colfilt on my machine. Replacing hist with histc is responsible for a factor of two speedup. There is of course no input checking so the function assumes that a is all integers, etc.
Lastly, note that randi(IMAX,N,N) returns values in the range 1:IMAX, not 0:IMAX as you seem to state.
One suggestion would be to reshape your array so each 3x3 block becomes a column vector. If your initial array dimensions are divisible by 3, this is simple. If they aren't, you need to work a little bit harder. And you need to repeat this nine times, starting at different offsets into the matrix - I will leave that as an exercise.
Here is some code that shows the basic idea (using only functions available in FreeMat - I don't have Matlab on my machine at home...):
N = 100;
A = randi(0,5*ones(3*N,3*N)); % FreeMat syntax; in MATLAB this would be A = randi([0 5],3*N,3*N)
B = reshape(permute(reshape(A,[3 N 3 N]),[1 3 2 4]), [ 9 N*N]);
hh = hist(B, 0:5); % histogram of each 3x3 block: bin with largest value is the mode
[mm mi] = max(hh); % mi will contain bin with largest value
figure; hist(B(:),0:5); title 'histogram of B'; % flat, as expected
figure; hist(mi-1, 0:5); title 'histogram of mi' % not flat?...
The strange thing, when you run this code, is that the distribution of mi is not flat, but skewed towards smaller values. When you inspect the histograms, you will see why: you will frequently have more than one bin with the "max" count in it, and in that case max returns the first such bin. This is obviously going to skew your results badly. A much better filter might be a median filter - the one that has equal numbers of neighboring pixels above and below. That has a unique solution (while the mode can have up to four candidate values for nine pixels - namely, four bins with two values each).
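(If you have the Image Processing Toolbox, such a sliding median is a one-liner:)
y_med = medfilt2(x, [3 3]); % median over each 3x3 neighborhood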
Something to think about.
Can't show you a mex example today (wrong computer); but there are ample good examples on the Mathworks website (and all over the web) that are quite easy to follow. See for example http://www.shawnlankton.com/2008/03/getting-started-with-mex-a-short-tutorial/