Permuting n elements by swapping each element by no more than k positions - matlab

What I have is a vector (n = 4 in the example):
x = '0123';
What I want is a vector y of the same size of x and with the same elements as in x in different order:
y = ['0123'; '0132'; '0213'; '0231'; '0312'; '0321'; '1023'; '1032'; '1203'; '1302'; '2013'; '2031'; '2103'; '2301'];
y(ceil(rand * numel(y(:, 1))), :)
i.e. a permutation such that each element in y is allowed to randomly change no more than k positions with respect to its original position in x (k = 2 in the example). The probability distribution must be uniform (i.e. each permutation must be equally likely to occur).
An obvious but inefficient way to do it is of course to find a random unconstrained permutation and check ex post whether or not this happens to respect the constraint. For small vectors you can find all the permutations, delete those that are not allowed and randomly pick among the remaining ones.
Any idea about how to do the same more efficiently, for example by actually swapping the elements?

Generating all the permutations can be done easily using constraint programming. Here is a short model using MiniZinc for the above example (note that we assume that x will contain n different values here):
include "globals.mzn";
int: k = 2;
int: n = 4;
array[1..n] of int: x = [0, 1, 2, 3];
array[1..n] of var int: y;
constraint forall(i in 1..n) (
y[i] in {x[i + offset] | offset in -min(k, i-1)..min(k, n-i)}
);
constraint all_different(y);
solve :: int_search(y, input_order, indomain_min, complete)
satisfy;
output [show(y)];
In most cases, constraint programming systems have the possibility to use a random search. However, this would not give you a uniform distribution of the results. Using CP will however generate all valid permutations more efficiently than the naive method (generate and test for validity).
If you need to generate a random permutation of your kind efficiently, I think that it would be possible to modify the standard Fisher-Yates shuffle to handle it directly. The standard algorithm uses the rest of the array to choose the next value from, and chooses the value with a probability distribution that is uniform. It should be possible to keep a list of only the currently valid choices, and to change the probability distribution of the values to match the desired output.

I don't see any approach other than the rejection method that you mention. However, instead of listing all allowed permutations and then picking one, it's more efficient to avoid that listing. Thus, you can randomly generate a permutation, check if it's valid, and repeat if it's not:
x = '0123';
k = 2;
n = numel(x);
done = 0;
while ~done
perm = randperm(n);
done = all( abs(perm-(1:n)) <= k ); %// check condition
end
y = x(perm);

Related

Generating all ordered samples with replacement

I would like to generate an array which contains all ordered samples of length k taken from a set of n elements {a_1,...,a_n}, that is all the k-tuples (x_1,...,x_k) where each x_j can be any of the a_i (repetition of elements is allowed), and whose total number is n^k.
Is there a built-in function in Matlab to obtain it?
I have tried to write a code that iteratively uses the datasample function, but I couldn't get what desired so far.
An alternative way to get all the tuples is based on k-base integer representation.
If you take the k-base representation of all integers from 0 to n^k - 1, it gives you all possible set of k indexes, knowing that these indexes start at 0.
Now, implementing this idea is quite straightforward. You can use dec2base if k is lower than 10:
X = A(dec2base(0:(n^k-1), k)-'0'+1));
For k between 10 and 36, you can still use dec2base but you must take care of letters as there is a gap in ordinal codes between '9' and 'A':
X = A(dec2base(0:(n^k-1), k)-'0'+1));
X(X>=17) = X(X>=17)-7;
Above 36, you must use a custom made code for retrieving the representation of the integer, like this one. But IMO you may not need this as 2^36 is quite huge.
What you are looking for is ndgrid: it generates the grid elements in any dimension.
In the case k is fixed at the moment of coding, get all indexes of all elements a this way:
[X_1, ..., X_k] = ndgrid(1:n);
Then build the matrix X from vector A:
X = [A(X_1(:)), ..., A(X_k(:))];
If k is a parameter, my advice would be to look at the code of ndgrid and adapt it in a new function so that the output is a matrix of values instead of storing them in varargout.
What about this solution, I don't know if it's as fast as yours, but do you think is correct?
function Y = ordsampwithrep(X,K)
%ordsampwithrep Ordered samples with replacement
% Generates an array Y containing in its rows all ordered samples with
% replacement of length K with elements of vector X
X = X(:);
nX = length(X);
Y = zeros(nX^K,K);
Y(1,:) = datasample(X,K)';
k = 2;
while k < nX^K +1
temprow = datasample(X,K)';
%checknew = find (temprow == Y(1:k-1,:));
if not(ismember(temprow,Y(1:k-1,:),'rows'))
Y(k,:) = temprow;
k = k+1;
end
end
end

Multiply two vectors with dimensions increasing along time

I have two vectors (called A and B) with length N. Then I need to multiply both of them, but as an "integration" process. Which means I have to multiply first A(1)*B(1), then A(1:2)*B(1:2), until A(1:N)*B(1:N). The result of multiplying booth vector is a number, since B is a column vector. I've done it with a for loop:
for k = 1:N
C(k) = A(1:k) * B(1:k).';
end
But I wanted to ask you if this is the best solution or there is any other option more time-efficient, since N is very large (about 110,000)
C = cumsum(A.*B)
does the same thing without for loop. As EBH suggested in the comments if you are not sure whether A and B have same orientation, then use
C = cumsum(A(:).*B(:))

MATLAB: Find abbreviated version of matrix that minimises sum of matrix elements

I have a 151-by-151 matrix A. It's a correlation matrix, so there are 1s on the main diagonal and repeated values above and below the main diagonal. Each row/column represents a person.
For a given integer n I will seek to reduce the size of the matrix by kicking people out, such that I am left with a n-by-n correlation matrix that minimises the total sum of the elements. In addition to obtaining the abbreviated matrix, I also need to know the row number of the people who should be booted out of the original matrix (or their column number - they'll be the same number).
As a starting point I take A = tril(A), which will remove redundant off-diagonal elements from the correlation matrix.
So, if n = 4 and we have the hypothetical 5-by-5 matrix above, it's very clear that person 5 should be kicked out of the matrix, since that person is contributing a lot of very high correlations.
It's also clear that person 1 should not be kicked out, since that person contributes a lot of negative correlations, and thus brings down the sum of the matrix elements.
I understand that sum(A(:)) will sum everything in the matrix. However, I'm very unclear about how to search for the minimum possible answer.
I noticed a similar question Finding sub-matrix with minimum elementwise sum, which has a brute force solution as the accepted answer. While that answer works fine there it's impractical for a 151-by-151 matrix.
EDIT: I had thought of iterating, but I don't think that truly minimizes the sum of elements in the reduced matrix. Below I have a 4-by-4 correlation matrix in bold, with sums of rows and columns on the edges. It's apparent that with n = 2 the optimal matrix is the 2-by-2 identity matrix involving Persons 1 and 4, but according to the iterative scheme I would have kicked out Person 1 in the first phase of iteration, and so the algorithm makes a solution that is not optimal. I wrote a program that always generated optimal solutions, and it works well when n or k are small, but when trying to make an optimal 75-by-75 matrix from a 151-by-151 matrix I realised my program would take billions of years to terminate.
I vaguely recalled that sometimes these n choose k problems can be resolved with dynamic programming approaches that avoid recomputing things, but I can't work out how to solve this, and nor did googling enlighten me.
I'm willing to sacrifice precision for speed if there's no other option, or the best program will take more than a week to generate a precise solution. However, I'm happy to let a program run for up to a week if it will generate a precise solution.
If it's not possible for a program to optimise the matrix within an reasonable timeframe, then I would accept an answer that explains why n choose k tasks of this particular sort can't be resolved within reasonable timeframes.
This is an approximate solution using a genetic algorithm.
I started with your test case:
data_points = 10; % How many data points will be generated for each person, in order to create the correlation matrix.
num_people = 25; % Number of people initially.
to_keep = 13; % Number of people to be kept in the correlation matrix.
to_drop = num_people - to_keep; % Number of people to drop from the correlation matrix.
num_comparisons = 100; % Number of times to compare the iterative and optimization techniques.
for j = 1:data_points
rand_dat(j,:) = 1 + 2.*randn(num_people,1); % Generate random data.
end
A = corr(rand_dat);
then I defined the functions you need to evolve the genetic algorithm:
function individuals = user1205901individuals(nvars, FitnessFcn, gaoptions, num_people)
individuals = zeros(num_people,gaoptions.PopulationSize);
for cnt=1:gaoptions.PopulationSize
individuals(:,cnt)=randperm(num_people);
end
individuals = individuals(1:nvars,:)';
is the individual generation function.
function fitness = user1205901fitness(ind, A)
fitness = sum(sum(A(ind,ind)));
is the fitness evaluation function
function offspring = user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
offspring=zeros(length(parents),nvars);
for cnt=1:length(parents)
original = thisPopulation(parents(cnt),:);
extraneus = setdiff(1:num_people, original);
original(fix(rand()*nvars)+1) = extraneus(fix(rand()*(num_people-nvars))+1);
offspring(cnt,:)=original;
end
is the function to mutate an individual
function children = user1205901crossover(parents, options, nvars, FitnessFcn, unused, thisPopulation)
children=zeros(length(parents)/2,nvars);
cnt = 1;
for cnt1=1:2:length(parents)
cnt2=cnt1+1;
male = thisPopulation(parents(cnt1),:);
female = thisPopulation(parents(cnt2),:);
child = union(male, female);
child = child(randperm(length(child)));
child = child(1:nvars);
children(cnt,:)=child;
cnt = cnt + 1;
end
is the function to generate a new individual coupling two parents.
At this point you can define your problem:
gaproblem2.fitnessfcn=#(idx)user1205901fitness(idx,A)
gaproblem2.nvars = to_keep
gaproblem2.options = gaoptions()
gaproblem2.options.PopulationSize=40
gaproblem2.options.EliteCount=10
gaproblem2.options.CrossoverFraction=0.1
gaproblem2.options.StallGenLimit=inf
gaproblem2.options.CreationFcn= #(nvars,FitnessFcn,gaoptions)user1205901individuals(nvars,FitnessFcn,gaoptions,num_people)
gaproblem2.options.CrossoverFcn= #(parents,options,nvars,FitnessFcn,unused,thisPopulation)user1205901crossover(parents,options,nvars,FitnessFcn,unused,thisPopulation)
gaproblem2.options.MutationFcn=#(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation) user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
gaproblem2.options.Vectorized='off'
open the genetic algorithm tool
gatool
from the File menu select Import Problem... and choose gaproblem2 in the window that opens.
Now, run the tool and wait for the iterations to stop.
The gatool enables you to change hundreds of parameters, so you can trade speed for precision in the selected output.
The resulting vector is the list of indices that you have to keep in the original matrix so A(garesults.x,garesults.x) is the matrix with only the desired persons.
If I have understood you problem statement, you have a N x N matrix M (which happens to be a correlation matrix), and you wish to find for integer n where 2 <= n < N, a n x n matrix m which minimises the sum over all elements of m which I denote f(m)?
In Matlab it is fairly easy and fast to obtain a sub-matrix of a matrix (see for example Removing rows and columns from matrix in Matlab), and the function f is relatively inexpensive to evaluate for n = 151. So why can't you implement an algorithm that solves this backwards dynamically in a program as below where I have sketched out the pseudocode:
function reduceM(M, n){
m = M
for (ii = N to n+1) {
for (jj = 1 to ii) {
val(jj) = f(m) where mhas column and row jj removed, f(X) being summation over all elements of X
}
JJ(ii) = jj s.t. val(jj) is smallest
m = m updated by removing column and row JJ(ii)
}
}
In the end you end up with an m of dimension n which is the solution to your problem and a vector JJ which contains the indices removed at each iteration (you should easily be able to convert these back to indices applicable to the full matrix M)
There are several approaches to finding an approximate solution (eg. quadratic programming on relaxed problem or greedy search), but finding the exact solution is an NP-hard problem.
Disclaimer: I'm not an expert on binary quadratic programming, and you may want to consult the academic literature for more sophisticated algorithms.
Mathematically equivalent formulation:
Your problem is equivalent to:
For some symmetric, positive semi-definite matrix S
minimize (over vector x) x'*S*x
subject to 0 <= x(i) <= 1 for all i
sum(x)==n
x(i) is either 1 or 0 for all i
This is a quadratic programming problem where the vector x is restricted to taking only binary values. Quadratic programming where the domain is restricted to a set of discrete values is called mixed integer quadratic programming (MIQP). The binary version is sometimes called Binary Quadratic Programming (BQP). The last restriction, that x is binary, makes the problem substantially more difficult; it destroys the problem's convexity!
Quick and dirty approach to finding an approximate answer:
If you don't need a precise solution, something to play around with might be a relaxed version of the problem: drop the binary constraint. If you drop the constraint that x(i) is either 1 or 0 for all i, then the problem becomes a trivial convex optimization problem and can be solved nearly instantaneously (eg. by Matlab's quadprog). You could try removing entries that, on the relaxed problem, quadprog assigns the lowest values in the x vector, but this does not truly solve the original problem!
Note also that the relaxed problem gives you a lower bound on the optimal value of the original problem. If your discretized version of the solution to the relaxed problem leads to a value for the objective function close to the lower bound, there may be a sense in which this ad-hoc solution can't be that far off from the true solution.
To solve the relaxed problem, you might try something like:
% k is number of observations to drop
n = size(S, 1);
Aeq = ones(1,n)
beq = n-k;
[x_relax, f_relax] = quadprog(S, zeros(n, 1), [], [], Aeq, beq, zeros(n, 1), ones(n, 1));
f_relax = f_relax * 2; % Quadprog solves .5 * x' * S * x... so mult by 2
temp = sort(x_relax);
cutoff = temp(k);
x_approx = ones(n, 1);
x_approx(x_relax <= cutoff) = 0;
f_approx = x_approx' * S * x_approx;
I'm curious how good x_approx is? This doesn't solve your problem, but it might not be horrible! Note that f_relax is a lower bound on the solution to the original problem.
Software to solve your exact problem
You should check out this link and go down to the section on Mixed Integer Quadratic Programming (MIQP). It looks to me that Gurobi can solve problems of your type. Another list of solvers is here.
Working on a suggestion from Matthew Gunn and also some advice at the Gurobi forums, I came up with the following function. It seems to work pretty well.
I will award it the answer, but if someone can come up with code that works better I'll remove the tick from this answer and place it on their answer instead.
function [ values ] = the_optimal_method( CM , num_to_keep)
%the_iterative_method Takes correlation matrix CM and number to keep, returns list of people who should be kicked out
N = size(CM,1);
clear model;
names = strseq('x',[1:N]);
model.varnames = names;
model.Q = sparse(CM); % Gurobi needs a sparse matrix as input
model.A = sparse(ones(1,N));
model.obj = zeros(1,N);
model.rhs = num_to_keep;
model.sense = '=';
model.vtype = 'B';
gurobi_write(model, 'qp.mps');
results = gurobi(model);
values = results.x;
end

Matlab: Find first item in a vector that satisfies criteria, then skip over 100+ values to find the next first

In Matlab R2010a:
I am familiar with finding values based on criteria as well as finding the first value in a vector that satisfies criteria. However, how does one find X's and not Y's in the following example? In this case, X's are the first values of a group of values that are findable given my criteria, and there are multiple groups like this amidst thousands of junk values.
I have an vector with 10,000 or more values. Let J be junk values, while X and Y are both values my find criteria will pick up. X's are interesting to me because they are the 'first' values of a series of values that satisfy my criteria before becoming J's. Assume that there are hundreds or thousands more J's in between the X's and Y's, but here is a small example
[J,J,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J];
Assuming that you're not doing something to strange with those Xs and Ys, this is quite easy. You just need to find the beginning of each cluster:
% Create data using your example (Y can equal X, but we make it different)
J = 1; X = 2; Y = 3;
A = [J,J,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J];
a0 = (A==X); % Logical indices of A that match X condition
start = find([a0(1) diff(a0)]==1) % Start index of each group beginning with X
vals = A(start) % Should all be equal to X
which returns
start =
12 26
vals =
2 2
The J values don't even need to be all the same, just not equal to what ever you're detecting as X. You might also find my answer to this similar question helpful.
A = [J,J,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J,J,J,J,J,J,J,J,J,X,Y,Y,Y,Y,J]; % created vector
I = A(A~=J); % separated out all values that are not junk
V = I(I==I(1)); % separated all values that match the first non-junk value

MATLAB: Indexing a large matrix for Monte Carlo Simulation

I'm trying to index a large matrix in MATLAB that contains numbers monotonically increasing across rows, and across columns, i.e. if the matrix is called A, for every (i,j), A(i+1,j) > A(i,j) and A(i,j+1) > A(i,j).
I need to create a random number n and compare it with the values of the matrix A, to see where that random number should be placed in the matrix A. In other words, the value of n may not equal any of the contents of the matrix, but it may lie in between any two rows and any two columns, and that determines a "bin" that identifies its position in A. Once I find this position, I increment the corresponding index in a new matrix of the same size as A.
The problem is that I want to do this 1,000,000 times. I need to create a random number a million times and do the index-checking for each of these numbers. It's a Monte Carlo Simulation of a million photons coming from a point landing on a screen; the matrix A consists of angles in spherical coordinates, and the random number is the solid angle of each incident photon.
My code so far goes something like this (I haven't copy-pasted it here because the details aren't important):
for k = 1:1000000
n = rand(1,1)*pi;
for i = length(A(:,1))
for j = length(A(1,:))
if (n > A(i-1,j)) && (n < A(i+1,j)) && (n > A(i,j-1)) && (n < A(i,j+1))
new_img(i,j) = new_img(i,j) + 1; % new_img defined previously as zeros
end
end
end
end
The "if" statement is just checking to find the indices of A that form the bounds of n.
This works perfectly fine, but it takes ridiculously long, especially since my matrix A is an image of dimensions 11856 x 11000. is there a quicker / cleverer / easier way of doing this?
Thanks in advance.
You can get rid of the inner loops by performing the calculation on all elements of A at once. Also, you can create the random numbers all at once, instead of one at a time. Note that the outermost pixels of new_img can never be different from zero.
randomNumbers = rand(1,1000000)*pi;
new_img = zeros(size(A));
tmp_img = zeros(size(A)-2);
for r = randomNumbers
tmp_img = tmp_img + A(:,1:end-2)<r & A(:,3:end)>r & A(1:end-1,:)<r & A(3:end,:)>r;
end
new_img(2:end-1,2:end-1) = tmp_img;
/aside: If the arrays were smaller, I'd have used bsxfun for the comparison, but with the array sizes in the OP, the approach would run out of memory.
Are the values in A bin edges? Ie does A specify a grid? If this is the case then you can QUICKLY populate A using hist3.
Here is an example:
numRand = 1e
n = randi(100,1e6,1);
nMatrix = [floor(data./10), mod(data,10)];
edges = {0:1:9, 0:10:99};
A = hist3(dataMat, edges);
If your A doesn't specify a grid, then you should create all of your random values once and sort them. Then iterate through those values.
Because you know that n(i) >= n(i-1) you don't have to check bins that were too small for n(i-1). This is a very easy way to optimize away most redundant checks.
Here is a snippet that should help a lot in the inner loop, it finds the location of the greatest point that is smaller than your value.
idx1 = A<value
idx2 = A(idx1) == max(A(idx1))
if you want to find the exact location you can wrap it with a find.