I have two vectors (called A and B) with length N. Then I need to multiply both of them, but as an "integration" process. Which means I have to multiply first A(1)*B(1), then A(1:2)*B(1:2), until A(1:N)*B(1:N). The result of multiplying booth vector is a number, since B is a column vector. I've done it with a for loop:
for k = 1:N
C(k) = A(1:k) * B(1:k).';
end
But I wanted to ask you if this is the best solution or there is any other option more time-efficient, since N is very large (about 110,000)
C = cumsum(A.*B)
does the same thing without for loop. As EBH suggested in the comments if you are not sure whether A and B have same orientation, then use
C = cumsum(A(:).*B(:))
I have a 151-by-151 matrix A. It's a correlation matrix, so there are 1s on the main diagonal and repeated values above and below the main diagonal. Each row/column represents a person.
For a given integer n I will seek to reduce the size of the matrix by kicking people out, such that I am left with a n-by-n correlation matrix that minimises the total sum of the elements. In addition to obtaining the abbreviated matrix, I also need to know the row number of the people who should be booted out of the original matrix (or their column number - they'll be the same number).
As a starting point I take A = tril(A), which will remove redundant off-diagonal elements from the correlation matrix.
So, if n = 4 and we have the hypothetical 5-by-5 matrix above, it's very clear that person 5 should be kicked out of the matrix, since that person is contributing a lot of very high correlations.
It's also clear that person 1 should not be kicked out, since that person contributes a lot of negative correlations, and thus brings down the sum of the matrix elements.
I understand that sum(A(:)) will sum everything in the matrix. However, I'm very unclear about how to search for the minimum possible answer.
I noticed a similar question Finding sub-matrix with minimum elementwise sum, which has a brute force solution as the accepted answer. While that answer works fine there it's impractical for a 151-by-151 matrix.
EDIT: I had thought of iterating, but I don't think that truly minimizes the sum of elements in the reduced matrix. Below I have a 4-by-4 correlation matrix in bold, with sums of rows and columns on the edges. It's apparent that with n = 2 the optimal matrix is the 2-by-2 identity matrix involving Persons 1 and 4, but according to the iterative scheme I would have kicked out Person 1 in the first phase of iteration, and so the algorithm makes a solution that is not optimal. I wrote a program that always generated optimal solutions, and it works well when n or k are small, but when trying to make an optimal 75-by-75 matrix from a 151-by-151 matrix I realised my program would take billions of years to terminate.
I vaguely recalled that sometimes these n choose k problems can be resolved with dynamic programming approaches that avoid recomputing things, but I can't work out how to solve this, and nor did googling enlighten me.
I'm willing to sacrifice precision for speed if there's no other option, or the best program will take more than a week to generate a precise solution. However, I'm happy to let a program run for up to a week if it will generate a precise solution.
If it's not possible for a program to optimise the matrix within an reasonable timeframe, then I would accept an answer that explains why n choose k tasks of this particular sort can't be resolved within reasonable timeframes.
This is an approximate solution using a genetic algorithm.
I started with your test case:
data_points = 10; % How many data points will be generated for each person, in order to create the correlation matrix.
num_people = 25; % Number of people initially.
to_keep = 13; % Number of people to be kept in the correlation matrix.
to_drop = num_people - to_keep; % Number of people to drop from the correlation matrix.
num_comparisons = 100; % Number of times to compare the iterative and optimization techniques.
for j = 1:data_points
rand_dat(j,:) = 1 + 2.*randn(num_people,1); % Generate random data.
end
A = corr(rand_dat);
then I defined the functions you need to evolve the genetic algorithm:
function individuals = user1205901individuals(nvars, FitnessFcn, gaoptions, num_people)
individuals = zeros(num_people,gaoptions.PopulationSize);
for cnt=1:gaoptions.PopulationSize
individuals(:,cnt)=randperm(num_people);
end
individuals = individuals(1:nvars,:)';
is the individual generation function.
function fitness = user1205901fitness(ind, A)
fitness = sum(sum(A(ind,ind)));
is the fitness evaluation function
function offspring = user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
offspring=zeros(length(parents),nvars);
for cnt=1:length(parents)
original = thisPopulation(parents(cnt),:);
extraneus = setdiff(1:num_people, original);
original(fix(rand()*nvars)+1) = extraneus(fix(rand()*(num_people-nvars))+1);
offspring(cnt,:)=original;
end
is the function to mutate an individual
function children = user1205901crossover(parents, options, nvars, FitnessFcn, unused, thisPopulation)
children=zeros(length(parents)/2,nvars);
cnt = 1;
for cnt1=1:2:length(parents)
cnt2=cnt1+1;
male = thisPopulation(parents(cnt1),:);
female = thisPopulation(parents(cnt2),:);
child = union(male, female);
child = child(randperm(length(child)));
child = child(1:nvars);
children(cnt,:)=child;
cnt = cnt + 1;
end
is the function to generate a new individual coupling two parents.
At this point you can define your problem:
gaproblem2.fitnessfcn=#(idx)user1205901fitness(idx,A)
gaproblem2.nvars = to_keep
gaproblem2.options = gaoptions()
gaproblem2.options.PopulationSize=40
gaproblem2.options.EliteCount=10
gaproblem2.options.CrossoverFraction=0.1
gaproblem2.options.StallGenLimit=inf
gaproblem2.options.CreationFcn= #(nvars,FitnessFcn,gaoptions)user1205901individuals(nvars,FitnessFcn,gaoptions,num_people)
gaproblem2.options.CrossoverFcn= #(parents,options,nvars,FitnessFcn,unused,thisPopulation)user1205901crossover(parents,options,nvars,FitnessFcn,unused,thisPopulation)
gaproblem2.options.MutationFcn=#(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation) user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
gaproblem2.options.Vectorized='off'
open the genetic algorithm tool
gatool
from the File menu select Import Problem... and choose gaproblem2 in the window that opens.
Now, run the tool and wait for the iterations to stop.
The gatool enables you to change hundreds of parameters, so you can trade speed for precision in the selected output.
The resulting vector is the list of indices that you have to keep in the original matrix so A(garesults.x,garesults.x) is the matrix with only the desired persons.
If I have understood you problem statement, you have a N x N matrix M (which happens to be a correlation matrix), and you wish to find for integer n where 2 <= n < N, a n x n matrix m which minimises the sum over all elements of m which I denote f(m)?
In Matlab it is fairly easy and fast to obtain a sub-matrix of a matrix (see for example Removing rows and columns from matrix in Matlab), and the function f is relatively inexpensive to evaluate for n = 151. So why can't you implement an algorithm that solves this backwards dynamically in a program as below where I have sketched out the pseudocode:
function reduceM(M, n){
m = M
for (ii = N to n+1) {
for (jj = 1 to ii) {
val(jj) = f(m) where mhas column and row jj removed, f(X) being summation over all elements of X
}
JJ(ii) = jj s.t. val(jj) is smallest
m = m updated by removing column and row JJ(ii)
}
}
In the end you end up with an m of dimension n which is the solution to your problem and a vector JJ which contains the indices removed at each iteration (you should easily be able to convert these back to indices applicable to the full matrix M)
There are several approaches to finding an approximate solution (eg. quadratic programming on relaxed problem or greedy search), but finding the exact solution is an NP-hard problem.
Disclaimer: I'm not an expert on binary quadratic programming, and you may want to consult the academic literature for more sophisticated algorithms.
Mathematically equivalent formulation:
Your problem is equivalent to:
For some symmetric, positive semi-definite matrix S
minimize (over vector x) x'*S*x
subject to 0 <= x(i) <= 1 for all i
sum(x)==n
x(i) is either 1 or 0 for all i
This is a quadratic programming problem where the vector x is restricted to taking only binary values. Quadratic programming where the domain is restricted to a set of discrete values is called mixed integer quadratic programming (MIQP). The binary version is sometimes called Binary Quadratic Programming (BQP). The last restriction, that x is binary, makes the problem substantially more difficult; it destroys the problem's convexity!
Quick and dirty approach to finding an approximate answer:
If you don't need a precise solution, something to play around with might be a relaxed version of the problem: drop the binary constraint. If you drop the constraint that x(i) is either 1 or 0 for all i, then the problem becomes a trivial convex optimization problem and can be solved nearly instantaneously (eg. by Matlab's quadprog). You could try removing entries that, on the relaxed problem, quadprog assigns the lowest values in the x vector, but this does not truly solve the original problem!
Note also that the relaxed problem gives you a lower bound on the optimal value of the original problem. If your discretized version of the solution to the relaxed problem leads to a value for the objective function close to the lower bound, there may be a sense in which this ad-hoc solution can't be that far off from the true solution.
To solve the relaxed problem, you might try something like:
% k is number of observations to drop
n = size(S, 1);
Aeq = ones(1,n)
beq = n-k;
[x_relax, f_relax] = quadprog(S, zeros(n, 1), [], [], Aeq, beq, zeros(n, 1), ones(n, 1));
f_relax = f_relax * 2; % Quadprog solves .5 * x' * S * x... so mult by 2
temp = sort(x_relax);
cutoff = temp(k);
x_approx = ones(n, 1);
x_approx(x_relax <= cutoff) = 0;
f_approx = x_approx' * S * x_approx;
I'm curious how good x_approx is? This doesn't solve your problem, but it might not be horrible! Note that f_relax is a lower bound on the solution to the original problem.
Software to solve your exact problem
You should check out this link and go down to the section on Mixed Integer Quadratic Programming (MIQP). It looks to me that Gurobi can solve problems of your type. Another list of solvers is here.
Working on a suggestion from Matthew Gunn and also some advice at the Gurobi forums, I came up with the following function. It seems to work pretty well.
I will award it the answer, but if someone can come up with code that works better I'll remove the tick from this answer and place it on their answer instead.
function [ values ] = the_optimal_method( CM , num_to_keep)
%the_iterative_method Takes correlation matrix CM and number to keep, returns list of people who should be kicked out
N = size(CM,1);
clear model;
names = strseq('x',[1:N]);
model.varnames = names;
model.Q = sparse(CM); % Gurobi needs a sparse matrix as input
model.A = sparse(ones(1,N));
model.obj = zeros(1,N);
model.rhs = num_to_keep;
model.sense = '=';
model.vtype = 'B';
gurobi_write(model, 'qp.mps');
results = gurobi(model);
values = results.x;
end
I need to create a function that has the input argument n, a integer , n>1 , and an output argument v, which is a column vector of length n containing all the positive integers smaller than or equal to n, arranged in such a way that no element of the vector equals its own index.
I know how to define the function
This is what I tried so far but it doesn't work
function[v]=int_col(n)
[1,n] = size(n);
k=1:n;
v=n(1:n);
v=k'
end
Let's take a look at what you have:
[1,n] = size(n);
This line doesn't make a lot of sense: n is an integer, which means that size(n) will give you [1,1], you don't need that. (Also an expression like [1,n] can't be on the left hand side of an assignment.) Drop that line. It's useless.
k=1:n;
That line is pretty good, k is now a row vector of size n containing the integers from 1 to n.
v=n(1:n);
Doesn't make sense. n isn't a vector (or you can say it's a 1x1 vector) either way, indexing into it (that's what the parentheses do) doesn't make sense. Drop that line too.
v=k'
That's also a nice line. It makes a column vector v out of your row vector k. The only thing that this doesn't satisfy is the "arranged in such a way that no element of the vector equals its own index" part, since right now every element equals its own index. So now you need to find a way to either shift those elements or shuffle them around in some way that satisfies this condition and you'd be done.
Let's give a working solution. You should really look into it and see how this thing works. It's important to solve the problem in smaller steps and to know what the code is doing.
function [v] = int_col(n)
if n <= 1
error('argument must be >1')
end
v = 1:n; % generate a row-vector of 1 to n
v = v'; % make it a column vector
v = circshift(v,1); % shift all elements by 1
end
This is the result:
>> int_col(5)
ans =
5
1
2
3
4
Instead of using circshift you can do the following as well:
v = [v(end);v(1:end-1)];
I need some help to vectorize the following operation since I'm a little confused.
So, I have a m-by-2 matrix A and n-by-1 vector b. I want to create a n-by-1 vector c whose entries should be the values of the second column of A whose line is given by the line where the correspondent value of b would fall...
Not sure if I was clear enough. Anyway, the code below does compute c correctly so you can understand what is my desired output. However, I want to vectorize this function since my real n and m are in the order of many thousands.
Note that values of bare non-integer and not necessarily equal to any of those in the first column of A (these ones could be non-integers too!).
m = 5; n = 10;
A = [(0:m-1)*1.1;rand(1,m)]'
b = (m-1)*rand(n,1)
[bincounts, ind] = histc(b,A(:,1))
for i = 1:n
c(i) = A(ind(i),2);
end
All you need is:
c = A(ind,2);
Using MATLAB,
Imagine a Nx6 array of numbers which represent N segments with 3+3=6 initial and end point coordinates.
Assume I have a function Calc_Dist( Segment_1, Segment_2 ) that takes as input two 1x6 arrays, and that after some operations returns a scalar, namely the minimal euclidean distance between these two segments.
I want to calculate the pairwise minimal distance between all N segments of my list, but would like to avoid a double loop to do so.
I cannot wrap my head around the documentation of the bsxfun function of MATLAB, so I cannot make this work. For the sake of a minimal example (the distance calculation is obviously not correct):
function scalar = calc_dist( segment_1, segment_2 )
scalar = sum( segment_1 + segment_2 )
end
and the main
Segments = rand( 1500, 6 )
Pairwise_Distance_Matrix = bsxfun( #calc_dist, segments, segments' )
Is there any way to do this, or am I forced to use double loops ?
Thank you for any suggestion
I think you need pdist rather than bsxfun. pdist can be used in two different ways, the second of which is applicable to your problem:
With built-in distance functions, supplied as strings, such as 'euclidean', 'hamming' etc.
With a custom distance function, a handle to which you supply.
In the second case, the distance function
must be of the form
function D2 = distfun(XI, XJ),
taking as arguments a 1-by-N vector XI containing a single row of X, an
M2-by-N matrix XJ containing multiple rows of X, and returning an
M2-by-1 vector of distances D2, whose Jth element is the distance
between the observations XI and XJ(J,:).
Although the documentation doesn't tell, it's very likely that the second way is not as efficient as the first (a double loop might even be faster, who knows), but you can use it. You would need to define your function so that it fulfills the stated condition. With your example function it's easy: for this part you'd use bsxfun:
function scalar = calc_dist( segment_1, segment_2 )
scalar = sum(bsxfun(#plus, segment_1, segment_2), 2);
end
Note also that
pdist works with rows (not columns), which is what you need.
pdist reduces operations by exploiting the properties that any distance function must have. Namely, the distance of an element to itself is known to be zero; and the distance for each pair can be computed just once thanks to symmetry. If you want to arrange the output in the form of a matrix, use squareform.
So, after your actual distance function has been modified appropriately (which may be the hard part), use:
distances = squareform(pdist(segments, #calc_dist));
For example:
N = 4;
segments = rand(N,6);
distances = squareform(pdist(segments, #calc_dist));
produces
distances =
0 6.1492 7.0886 5.5016
6.1492 0 6.8559 5.2688
7.0886 6.8559 0 6.2082
5.5016 5.2688 6.2082 0
Unfortunately I don't see any "smarter" (i.e. read faster) solution than the double loop. For speed consideration I'd organize the points as a 6×N array, not the other way, because column access is way faster than row access in MATLAB.
So:
N = 150000;
Segments = rand(6, N);
Pairwise_Distance_Matrix = Inf(N, N);
for i = 1:(N-1)
for j = (i+1):N
Pairwise_Distance_Matrix(i,j) = calc_dist(Segments(:,i), Segments(:,j));
end;
end;
Minimum_Pairwise_Distance = min(min(Pairwise_Distance_Matrix));
Contrary to common wisdom, explicit loops are faster now in MATLAB compared to the likes of arrayfun, cellfun or structfun; bsxfun beats everything else in terms of speed, but it doesn't apply to your case.