Matlab: optimal-distance matching of the elements of two sets - matlab

I have two floating-point number vectors which contain the same values up to a small error, but not necessarily sorted in the same way; for instance, A=rand(10);a=eig(A);b=eig(A+1e-10); (remember that eig outputs eigenvalues in no specified order).
I need to find a permutation p that matches the corresponding elements, i.e. p=mysterious_function(a,b) such that norm(a-b(p)) is small.
Is there an existing function that does this in a sane and safe way, or do I really need to roll out my own slow and poorly-error-checked implementation?
I need this only for test purposes for now, it need not be excessively optimized. Notice that the solution which involves sorting both vectors with sort fails in case of vectors containing complex equal-modulus arguments, such as the typical output of eig().

You seem to want to solve the linear assignment problem. I haven't tested it myself, but this piece of code should help you.

I believe that the sort() solution you discarded might actually work for you; The criteria you have defined minimize norm(a-b) is, by definition, considering the modulus (absolute value) of the complex number: norm(a-b) == sqrt(sum(abs(a-b).^2))
And as you know, SORT orders complex numbers based on their absolute value: sort(a) is equivalent to sort(abs(a)) for complex input.
%# sort by complex-magnitude
[sort(a) sort(b)]
As long as the same order is applied to both, you might as well try lexicographic ordering (sort by real part, if equal, then sort by imaginary part):
%# lexicographic sort order
[~,ordA] = sortrows([real(a) imag(a)],[1 2]);
[~,ordB] = sortrows([real(b) imag(b)],[1 2]);
[b(ordB) a(ordA)]
If you are too lazy to implement the Hungarian algorithm that #AnthonyLabarre suggested, go for brute-forcing:
A = rand(5);
a = eig(A);
b = eig(A+1e-10);
bb = perms(b); %# all permutations of b
nrm = sqrt( sum(abs(bsxfun(#minus, a,bb')).^2) ); %'
[~,idx] = min(nrm); %# argmin norm(a-bb(i,:))
[bb(idx,:)' a]
Beside the fact that eigenvalues returned by EIG are not guaranteed to be sorted, there is another difficulty you have to deal with if you to match eigenvectors as well: they are not unique in the sense that if v is an eigenvector, then k*v is also one, especially for k=-1. Usually you would enforce a sign convention like: multiply by -/+1 so that the largest element in each vector have a positive sign.

Related

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

Reduced row echelon matrix equivalence

I've been trying to solve whether two m-by-n matrices A and B are equivalent via
[a,ja] = rref(A,tol)
[b,jb] = rref(B,tol)
and then comparing
isequal(a,b) & isequal(ja,jb)
First, I don't really understand what ja and jb are. My problem is that the reduced row echelon form is very simple for both A and B and identical in all cases. I don't know whether this is on purpose or not. For example, I get equivalence for just
A = rand(40,3)
B = rand(40,3)
which I'm not sure is nonsense or not.
It looks like you're trying to check if the reduced row echelon forms of two matrices are element-wise equivalent. From how you've defined A and B, they are (it's effectively an overdetermined system, I believe). However, I think that you may have flipped your rows and columns. If instead you create A and B such that there are more columns than rows (i.e., an augmented matrix of an underdetermined system):
A = rand(3,40)
B = rand(3,40)
then when you run rref, you'll see a much different output and your comparison will return false as you perhaps expected.
Also, I think that it is sufficient to use the following, as two matrices that are equal element-wise will surely share the same rank (or approximation thereof):
a = rref(A,tol);
b = rref(B,tol);
isequal(a,b)

Matlab: Comparing two vectors with different length and different values?

Lets say I have two vectors A and B with different lengths Length(A) is not equal to Length(B) and the Values in Vector A, are not the same as in Vector B. I want to compare each value of B with Values of A (Compare means if Value B(i) is almost the same value of A(1:end) for example B(i)-Tolerance<A(i)<B(i)+Tolerance.
How Can I do this without using for loop since the data is huge?
I know ismember(F), intersect,repmat,find but non of those function can really help me
You may try a solution along these lines:
tol = 0.1;
N = 1000000;
a = randn(1, N)*1000; % create a randomly
b = a + tol*rand(1, N); % b is "tol" away from a
a_bin = floor(a/tol);
b_bin = floor(b/tol);
result = ismember(b_bin, a_bin) | ...
ismember(b_bin, a_bin-1) | ...
ismember(b_bin, a_bin+1);
find(result==0) % should be empty matrix.
The idea is to discretize the a and b variables to bins of size tol. Then, you ask whether b is found in the same bin as any element from a, or in the bin to the left of it, or in the bin to the right of it.
Advantages: I believe ismember is clever inside, first sorting the elements of a and then performing sublinear (log(N)) search per element b. This is unlike approaches which explicitly construct differences of each element in b with elements from a, meaning the complexity is linear in the number of elements in a.
Comparison: for N=100000 this runs 0.04s on my machine, compared to 20s using linear search (timed using Alan's nice and concise tf = arrayfun(#(bi) any(abs(a - bi) < tol), b); solution).
Disadvantages: this leads to that the actual tolerance is anything between tol and 1.5*tol. Depends on your task whether you can live with that (if the only concern is floating point comparison, you can).
Note: whether this is a viable approach depends on the ranges of a and b, and value of tol. If a and b can be very big and tol is very small, the a_bin and b_bin will not be able to resolve individual bins (then you would have to work with integral types, again checking carefully that their ranges suffice). The solution with loops is a safer one, but if you really need speed, you can invest into optimizing the presented idea. Another option, of course, would be to write a mex extension.
It sounds like what you are trying to do is have an ismember function for use on real valued data.
That is, check for each value B(i) in your vector B whether B(i) is within the tolerance threshold T of at least one value in your vector A
This works out something like the following:
tf = false(1, length(b)); %//the result vector, true if that element of b is in a
t = 0.01; %// the tolerance threshold
for i = 1:length(b)
%// is the absolute difference between the
%//element of a and b less that the threshold?
matches = abs(a - b(i)) < t;
%// if b(i) matches any of the elements of a
tf(i) = any(matches);
end
Or, in short:
t = 0.01;
tf = arrayfun(#(bi) any(abs(a - bi) < t), b);
Regarding avoiding the for loop: while this might benefit from vectorization, you may also want to consider looking at parallelisation if your data is that huge. In that case having a for loop as in my first example can be handy since you can easily do a basic version of parallel processing by changing the for to parfor.
Here is a fully vectorized solution. Note that I would actually recommend the solution given by #Alan, as mine is not likely to work for big datasets.
[X Y]=meshgrid(A,B)
M=abs(X-Y)<tolerance
Now the logical index of elements in a that are within the tolerance can be obtained with any(M) and the index for B is found by any(M,2)
bsxfun to the rescue
>> M = abs( bsxfun(#minus, A, B' ) ); %//' difference
>> M < tolerance
Another way to do what you want is with a logical expression.
Since A and B are vectors of different sizes you can't simply subtract and look for values that are smaller than the tolerance, but you can do the following:
Lmat = sparse((abs(repmat(A,[numel(B) 1])-repmat(B',[1 numel(A)])))<tolerance);
and you will get a sparse logical matrix with as many ones in it as equal elements (within tolerance). You could then count how many of those elements you have by writing:
Nequal = sum(sum(Lmat));
You could also get the indexes of the corresponding elements by writing:
[r,c] = find(Lmat);
then the following code will be true (for all j in numel(r)):
B(r(j))==A(c(j))
Finally, you should note that this way you get multiple counts in case there are duplicate entries in A or in B. It may be advisable to use the unique function first. For example:
A_new = unique(A);

MATLAB/General CS: Sampling Without Replacement From Multiple Sets (+Keeping Track of Unsampled Cases)

I currently implementing an optimization algorithm that requires me to sample without replacement from several sets. Although I am coding in MATLAB, this is essentially a CS question.
The situation is as follows:
I have a finite number of sets (A, B, C) each with a finite but possibly different number of elements (a1,a2...a8, b1,b2...b10, c1, c2...c25). I also have a vector of probabilities for each set which lists a probability for each element in that set (i.e. for set A, P_A = [p_a1 p_a2... p_a8] where sum(P_A) = 1). I normally use these to create a probability generating function for each set, which given a uniform number between 0 to 1, can spit out one of the elements from that set (i.e. a function P_A(u), which given u = 0.25, will select a2).
I am looking to sample without replacement from the sets A, B, and C. Each "full sample" is a sequence of elements from each of the different sets i.e. (a1, b3, c2). Note that the space of full samples is the set of all permutations of the elements in A, B, and C. In the example above, this space is (a1,a2...a8) x (b1,b2...b10) x (c1, c2...c25) and there are 8*10*25 = 2000 unique "full samples" in my space.
The annoying part of sampling without replacement with this setup is that if my first sample is (a1, b3, c2) then that does not mean I cannot sample the element a1 again - it just means that I cannot sample the full sequence (a1, b3, c2) again. Another annoying part is that the algorithm I am working with requires me do a function evaluation for all permutations of elements that I have not sampled.
The best method at my disposal right now is to keep track the sampled cases. This is a little inefficient since my sampler is forced to reject any case that has been sampled before (since I'm sampling without replacement). I then do the function evaluations for the unsampled cases, by going through each permutation (ax, by, cz) using nested for loops and only doing the function evaluation if that combination of (ax, by, cz) is not included in the sampled cases. Again, this is a little inefficient since I have to "check" whether each permutation (ax, by, cz) has already been sampled.
I would appreciate any advice in regards to this problem. In particular, I am looking a method to sample without replacement and keep track of unsampled cases that does not explicity list out the full sample space (I usually work with 10 sets with 10 elements each so listing out the full sample space would require a 10^10 x 10 matrix). I realize that this may be impossible, though finding efficient way to do it will allow me to demonstrate the true limits of the algorithm.
Do you really need to keep track of all of the unsampled cases? Even if you had a 1-by-1010 vector that stored a logical value of true or false indicating if that permutation had been sampled or not, that would still require about 10 GB of storage, and MATLAB is likely to either throw an "Out of Memory" error or bring your entire machine to a screeching halt if you try to create a variable of that size.
An alternative to consider is storing a sparse vector of indicators for the permutations you've already sampled. Let's consider your smaller example:
A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);
The 1-by-2000 sparse matrix beenSampled is empty to start (i.e. it contains all zeroes) and we will add a one at a given index for each sampled permutation. We can get a new sample permutation using the function RANDI to give us indices into A, B, and C for the new set of values:
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
We can then convert these three indices into a single unique linear index into beenSampled using the function SUB2IND:
index = sub2ind([nA nB nC],indexA,indexB,indexC);
Now we can test the indexed element in beenSampled to see if it has a value of 1 (i.e. we sampled it already) or 0 (i.e. it is a new sample). If it has been sampled already, we repeat the process of finding a new set of indices above. Once we have a permutation we haven't sampled yet, we can process it:
while beenSampled(index)
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...
The use of a sparse array will save you a lot of space if you're only going to end up sampling a small portion of all of the possible permutations. For smaller total numbers of permutations, like in the above example, I would probably just use a logical vector instead of a sparse vector.
Check the matlab documentation for the randi function; you'll just want to use that in conjunction with the length function to choose random entries from each vector. Keeping track of each sampled vector should be as simple as just concatenating it to a matrix;
current_values = [5 89 45]; % lets say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat

Compact MATLAB matrix indexing notation

I've got an n-by-k sized matrix, containing k numbers per row. I want to use these k numbers as indexes into a k-dimensional matrix. Is there any compact way of doing so in MATLAB or must I use a for loop?
This is what I want to do (in MATLAB pseudo code), but in a more MATLAB-ish way:
for row=1:1:n
finalTable(row) = kDimensionalMatrix(indexmatrix(row, 1),...
indexmatrix(row, 2),...,indexmatrix(row, k))
end
If you want to avoid having to use a for loop, this is probably the cleanest way to do it:
indexCell = num2cell(indexmatrix, 1);
linearIndexMatrix = sub2ind(size(kDimensionalMatrix), indexCell{:});
finalTable = kDimensionalMatrix(linearIndexMatrix);
The first line puts each column of indexmatrix into separate cells of a cell array using num2cell. This allows us to pass all k columns as a comma-separated list into sub2ind, a function that converts subscripted indices (row, column, etc.) into linear indices (each matrix element is numbered from 1 to N, N being the total number of elements in the matrix). The last line uses these linear indices to replace your for loop. A good discussion about matrix indexing (subscript, linear, and logical) can be found here.
Some more food for thought...
The tendency to shy away from for loops in favor of vectorized solutions is something many MATLAB users (myself included) have become accustomed to. However, newer versions of MATLAB handle looping much more efficiently. As discussed in this answer to another SO question, using for loops can sometimes result in faster-running code than you would get with a vectorized solution.
I'm certainly NOT saying you shouldn't try to vectorize your code anymore, only that every problem is unique. Vectorizing will often be more efficient, but not always. For your problem, the execution speed of for loops versus vectorized code will probably depend on how big the values n and k are.
To treat the elements of the vector indexmatrix(row, :) as separate subscripts, you need the elements as a cell array. So, you could do something like this
subsCell = num2cell( indexmatrix( row, : ) );
finalTable( row ) = kDimensionalMatrix( subsCell{:} );
To expand subsCell as a comma-separated-list, unfortunately you do need the two separate lines. However, this code is independent of k.
Convert your sub-indices into linear indices in a hacky way
ksz = size(kDimensionalMatrix);
cksz = cumprod([ 1 ksz(1:end-1)] );
lidx = ( indexmatrix - 1 ) * cksz' + 1; #'
% lindx is now (n)x1 linear indices into kDimensionalMatrix, one index per row of indexmatrix
% access all n values:
selectedValues = kDimensionalMatrix( lindx );
Cheers!