Adding a sparse vector to a dense vector in Matlab - matlab

Suppose I have a high dimensional vector v which is dense and another high dimensional vector x which is sparse and I want to do an operation which looks like
v = v + x
Ideally since one needs to update only a few entries in v this operation should be fast but it is still taking a good amount of time even when I have declared x to be sparse. I have tried with v being in full as well as v being in sparse form and both are fairly slow.
I have also tried to extract the indices from the sparse vector x by calling a find and then updating the original vector in a for loop. This is faster than the above operations, but is there a way to achieve the same with much less code.
Thanks

Quoting from the Matlab documentation (emphasis mine):
Binary operators yield sparse results if both operands are sparse, and full results if both are full. For mixed operands, the result is full unless the operation preserves sparsity. If S is sparse and F is full, then S+F, S*F, and F\S are full, while S.*F and S&F are sparse. In some cases, the result might be sparse even though the matrix has few zero elements.
Therefore, if you wish to keep x sparse, I think using logical indexing to update v with the nonzero values of x is best. Here is a sample function that shows either logical indexing or explicitly full-ing x is best (at least on my R2015a install):
function [] = blur()
n = 5E6;
v = rand(n,1);
x = sprand(n,1,0.001);
xf = full(x);
vs = sparse(v);
disp(['Full-Sparse: ',num2str(timeit(#() v + x) ,'%9.5f')]);
disp(['Full-Full: ',num2str(timeit(#() v + xf) ,'%9.5f')]);
disp(['Sparse-Sparse: ',num2str(timeit(#() vs + x) ,'%9.5f')]);
disp(['Logical Index: ',num2str(timeit(#() update(v,x)),'%9.5f')]);
end
function [] = update(v,x)
mask = x ~= 0;
v(mask) = v(mask) + x(mask);
end

Related

Row--wise application of an inline function

I defined an inline function f that takes as argument a (1,3) vector
a = [3;0.5;1];
b = 3 ;
f = #(x) x*a+b ;
Suppose I have a matrix X of size (N,3). If I want to apply f to each row of X, I can simply write :
f(X)
I verified that f(X) is a (N,1) vector such that f(X)(i) = f(X(i,:)).
Now, if I a add a quadratic term :
f = #(x) x*A*x' + x*a + b ;
the command f(X) raises an error :
Error using +
Matrix dimensions must agree.
Error in #(x) x*A*x' + x*a + b
I guess Matlab is considering the whole matrix X as the input to f. So it does not create a vector with each row, i, being equal to f(X(i,:)). How can I do it ?
I found out that there exist a built-in function rowfun that could help me, but it seems to be available only in versions r2016 (I have version r2015a)
That is correct, and expected.
MATLAB tries to stay close to mathematical notation, and what you are doing (X*A*X' for A 3×3 and X N×3) is valid math, but not quite what you intend to do -- you'll end up with a N×N matrix, which you cannot add to the N×1 matrix x*a.
The workaround is simple, but ugly:
f_vect = #(x) sum( (x*A).*x, 2 ) + x*a + b;
Now, unless your N is enormous, and you have to do this billions of times every minute of every day, the performance of this is more than acceptable.
Iff however this really and truly is your program's bottleneck, than I'd suggest taking a look at MMX on the File Exchange. Together with permute(), this will allow you to use those fast BLAS/MKL operations to do this calculation, speeding it up a notch.
Note that bsxfun isn't going to work here, because that does not support mtimes() (matrix multiplication).
You can also upgrade to MATLAB R2016b, which will have built-in implicit dimension expansion, presumably also for mtimes() -- but better check, not sure about that one.

MATLAB: Find abbreviated version of matrix that minimises sum of matrix elements

I have a 151-by-151 matrix A. It's a correlation matrix, so there are 1s on the main diagonal and repeated values above and below the main diagonal. Each row/column represents a person.
For a given integer n I will seek to reduce the size of the matrix by kicking people out, such that I am left with a n-by-n correlation matrix that minimises the total sum of the elements. In addition to obtaining the abbreviated matrix, I also need to know the row number of the people who should be booted out of the original matrix (or their column number - they'll be the same number).
As a starting point I take A = tril(A), which will remove redundant off-diagonal elements from the correlation matrix.
So, if n = 4 and we have the hypothetical 5-by-5 matrix above, it's very clear that person 5 should be kicked out of the matrix, since that person is contributing a lot of very high correlations.
It's also clear that person 1 should not be kicked out, since that person contributes a lot of negative correlations, and thus brings down the sum of the matrix elements.
I understand that sum(A(:)) will sum everything in the matrix. However, I'm very unclear about how to search for the minimum possible answer.
I noticed a similar question Finding sub-matrix with minimum elementwise sum, which has a brute force solution as the accepted answer. While that answer works fine there it's impractical for a 151-by-151 matrix.
EDIT: I had thought of iterating, but I don't think that truly minimizes the sum of elements in the reduced matrix. Below I have a 4-by-4 correlation matrix in bold, with sums of rows and columns on the edges. It's apparent that with n = 2 the optimal matrix is the 2-by-2 identity matrix involving Persons 1 and 4, but according to the iterative scheme I would have kicked out Person 1 in the first phase of iteration, and so the algorithm makes a solution that is not optimal. I wrote a program that always generated optimal solutions, and it works well when n or k are small, but when trying to make an optimal 75-by-75 matrix from a 151-by-151 matrix I realised my program would take billions of years to terminate.
I vaguely recalled that sometimes these n choose k problems can be resolved with dynamic programming approaches that avoid recomputing things, but I can't work out how to solve this, and nor did googling enlighten me.
I'm willing to sacrifice precision for speed if there's no other option, or the best program will take more than a week to generate a precise solution. However, I'm happy to let a program run for up to a week if it will generate a precise solution.
If it's not possible for a program to optimise the matrix within an reasonable timeframe, then I would accept an answer that explains why n choose k tasks of this particular sort can't be resolved within reasonable timeframes.
This is an approximate solution using a genetic algorithm.
I started with your test case:
data_points = 10; % How many data points will be generated for each person, in order to create the correlation matrix.
num_people = 25; % Number of people initially.
to_keep = 13; % Number of people to be kept in the correlation matrix.
to_drop = num_people - to_keep; % Number of people to drop from the correlation matrix.
num_comparisons = 100; % Number of times to compare the iterative and optimization techniques.
for j = 1:data_points
rand_dat(j,:) = 1 + 2.*randn(num_people,1); % Generate random data.
end
A = corr(rand_dat);
then I defined the functions you need to evolve the genetic algorithm:
function individuals = user1205901individuals(nvars, FitnessFcn, gaoptions, num_people)
individuals = zeros(num_people,gaoptions.PopulationSize);
for cnt=1:gaoptions.PopulationSize
individuals(:,cnt)=randperm(num_people);
end
individuals = individuals(1:nvars,:)';
is the individual generation function.
function fitness = user1205901fitness(ind, A)
fitness = sum(sum(A(ind,ind)));
is the fitness evaluation function
function offspring = user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
offspring=zeros(length(parents),nvars);
for cnt=1:length(parents)
original = thisPopulation(parents(cnt),:);
extraneus = setdiff(1:num_people, original);
original(fix(rand()*nvars)+1) = extraneus(fix(rand()*(num_people-nvars))+1);
offspring(cnt,:)=original;
end
is the function to mutate an individual
function children = user1205901crossover(parents, options, nvars, FitnessFcn, unused, thisPopulation)
children=zeros(length(parents)/2,nvars);
cnt = 1;
for cnt1=1:2:length(parents)
cnt2=cnt1+1;
male = thisPopulation(parents(cnt1),:);
female = thisPopulation(parents(cnt2),:);
child = union(male, female);
child = child(randperm(length(child)));
child = child(1:nvars);
children(cnt,:)=child;
cnt = cnt + 1;
end
is the function to generate a new individual coupling two parents.
At this point you can define your problem:
gaproblem2.fitnessfcn=#(idx)user1205901fitness(idx,A)
gaproblem2.nvars = to_keep
gaproblem2.options = gaoptions()
gaproblem2.options.PopulationSize=40
gaproblem2.options.EliteCount=10
gaproblem2.options.CrossoverFraction=0.1
gaproblem2.options.StallGenLimit=inf
gaproblem2.options.CreationFcn= #(nvars,FitnessFcn,gaoptions)user1205901individuals(nvars,FitnessFcn,gaoptions,num_people)
gaproblem2.options.CrossoverFcn= #(parents,options,nvars,FitnessFcn,unused,thisPopulation)user1205901crossover(parents,options,nvars,FitnessFcn,unused,thisPopulation)
gaproblem2.options.MutationFcn=#(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation) user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
gaproblem2.options.Vectorized='off'
open the genetic algorithm tool
gatool
from the File menu select Import Problem... and choose gaproblem2 in the window that opens.
Now, run the tool and wait for the iterations to stop.
The gatool enables you to change hundreds of parameters, so you can trade speed for precision in the selected output.
The resulting vector is the list of indices that you have to keep in the original matrix so A(garesults.x,garesults.x) is the matrix with only the desired persons.
If I have understood you problem statement, you have a N x N matrix M (which happens to be a correlation matrix), and you wish to find for integer n where 2 <= n < N, a n x n matrix m which minimises the sum over all elements of m which I denote f(m)?
In Matlab it is fairly easy and fast to obtain a sub-matrix of a matrix (see for example Removing rows and columns from matrix in Matlab), and the function f is relatively inexpensive to evaluate for n = 151. So why can't you implement an algorithm that solves this backwards dynamically in a program as below where I have sketched out the pseudocode:
function reduceM(M, n){
m = M
for (ii = N to n+1) {
for (jj = 1 to ii) {
val(jj) = f(m) where mhas column and row jj removed, f(X) being summation over all elements of X
}
JJ(ii) = jj s.t. val(jj) is smallest
m = m updated by removing column and row JJ(ii)
}
}
In the end you end up with an m of dimension n which is the solution to your problem and a vector JJ which contains the indices removed at each iteration (you should easily be able to convert these back to indices applicable to the full matrix M)
There are several approaches to finding an approximate solution (eg. quadratic programming on relaxed problem or greedy search), but finding the exact solution is an NP-hard problem.
Disclaimer: I'm not an expert on binary quadratic programming, and you may want to consult the academic literature for more sophisticated algorithms.
Mathematically equivalent formulation:
Your problem is equivalent to:
For some symmetric, positive semi-definite matrix S
minimize (over vector x) x'*S*x
subject to 0 <= x(i) <= 1 for all i
sum(x)==n
x(i) is either 1 or 0 for all i
This is a quadratic programming problem where the vector x is restricted to taking only binary values. Quadratic programming where the domain is restricted to a set of discrete values is called mixed integer quadratic programming (MIQP). The binary version is sometimes called Binary Quadratic Programming (BQP). The last restriction, that x is binary, makes the problem substantially more difficult; it destroys the problem's convexity!
Quick and dirty approach to finding an approximate answer:
If you don't need a precise solution, something to play around with might be a relaxed version of the problem: drop the binary constraint. If you drop the constraint that x(i) is either 1 or 0 for all i, then the problem becomes a trivial convex optimization problem and can be solved nearly instantaneously (eg. by Matlab's quadprog). You could try removing entries that, on the relaxed problem, quadprog assigns the lowest values in the x vector, but this does not truly solve the original problem!
Note also that the relaxed problem gives you a lower bound on the optimal value of the original problem. If your discretized version of the solution to the relaxed problem leads to a value for the objective function close to the lower bound, there may be a sense in which this ad-hoc solution can't be that far off from the true solution.
To solve the relaxed problem, you might try something like:
% k is number of observations to drop
n = size(S, 1);
Aeq = ones(1,n)
beq = n-k;
[x_relax, f_relax] = quadprog(S, zeros(n, 1), [], [], Aeq, beq, zeros(n, 1), ones(n, 1));
f_relax = f_relax * 2; % Quadprog solves .5 * x' * S * x... so mult by 2
temp = sort(x_relax);
cutoff = temp(k);
x_approx = ones(n, 1);
x_approx(x_relax <= cutoff) = 0;
f_approx = x_approx' * S * x_approx;
I'm curious how good x_approx is? This doesn't solve your problem, but it might not be horrible! Note that f_relax is a lower bound on the solution to the original problem.
Software to solve your exact problem
You should check out this link and go down to the section on Mixed Integer Quadratic Programming (MIQP). It looks to me that Gurobi can solve problems of your type. Another list of solvers is here.
Working on a suggestion from Matthew Gunn and also some advice at the Gurobi forums, I came up with the following function. It seems to work pretty well.
I will award it the answer, but if someone can come up with code that works better I'll remove the tick from this answer and place it on their answer instead.
function [ values ] = the_optimal_method( CM , num_to_keep)
%the_iterative_method Takes correlation matrix CM and number to keep, returns list of people who should be kicked out
N = size(CM,1);
clear model;
names = strseq('x',[1:N]);
model.varnames = names;
model.Q = sparse(CM); % Gurobi needs a sparse matrix as input
model.A = sparse(ones(1,N));
model.obj = zeros(1,N);
model.rhs = num_to_keep;
model.sense = '=';
model.vtype = 'B';
gurobi_write(model, 'qp.mps');
results = gurobi(model);
values = results.x;
end

How to find a value of one vector, in a range of value in another vector in matlab

I have a vector A with size of 54000 x 1 and vector B with size of 54000 x 1 which is standard deviation of elements of A. From the other side, I have vector C with size of 300000 x 1. Now I want to find that each element of vector C correspondences to which row of vector A with accepted range 3*standard deviation? I have written the below code and it works fine for small vector but for large vector such that I have it is too too slow!!
for i=1:length(A)
L=A(i,1)- 3*B(i,1);
U=A(i,1)+ 3*B(i,1);
inds{i,1} = not(abs(sign(sign(L - C) + sign(U - C))));
end
Does anybody know how can I make this code faster or does anybody know another solution? THX.
MATLAB is an acronym for Matrix Laboratory and was developed to simply and speed up matrix(vector) calculations. Compared to C or any other programming language you can often skip the for loops when working with matrices. For your code you should be able to skip the for loop and do it as this:
L = A - 3*STD;
U = A + 3*STD;
inds = not(abs(sign(sign(L - C) + sign(U - C))));
And remember that i means the complex number in Matlab. Don't know if it affects speed though.
Edit:
Getting the result as a cell:
inds = num2cell(inds)

Accelerate the calculation of inv(X'*X)*Q*inv(X'*X) in Matlab?

I have to calculate Newey-West standard errors for large multiple regression models.
The final step of this calculation is to obtain
nwse = sqrt(diag(N.*inv(X'*X)*Q*inv(X'*X)));
This file exchange contribution implements this as
nwse = sqrt(diag(N.*((X'*X)\Q/(X'*X))));
This looks reasonable, but in my case (5000x5000 sparse Q and X'*X) it's far too slow for my needs (about 30secs, I have to repeat this for about one million different models). Any ideas how to make this line faster?
Please note that I need only the diagonal, not the entire matrix and that both Q and (X'*X) are positive-definite.
I believe you can save a lot of computation time by explicitly doing an LU factorization, [l, u, p, q] = lu(X'*X); and use those factors when doing the calculations. Also, since X are constant for about 100 models, pre-calculating X'*X will most likely save you some time.
Note that in your case, the most time demanding operation might very well be the sqrt-function.
% Constant for every 100 models or so:
M = X'*X;
[l, u, p, q] = lu(M);
% Now, I guess this should be quite a bit faster (I might have messed up the order):
nwse = sqrt(diag(N.*(q * ( u \ (l \ (p * Q))) * q * (u \ (l \ p)))));
The first two terms are commonly used:
l the lower triangular matrix
u the upper triangular matrix
Now, p and q are a bit more uncommon.
p is a row permutation matrix used to obtain numerical stability. There is not much performance gain in doing [l, u, p] = lu(M) compared to [l, u] = lu(M) for sparse matrices.
q however offers a significant performance gain. q is a column permutation matrix that is used to reduce the amount of fill when doing the factorization.
Note that the [l, u, p, q] = lu(M) is only a valid syntax for sparse matrices.
As for why using full pivoting as described above should be faster:
Try the following to see the purpose of the column permutation matrix q. It is easier to work with elements that are aligned around the diagonal.
S = sprand(100,100,0.01);
[l, u, p] = lu(S);
spy(l)
figure
spy(u)
Now, compare it with this:
[ll, uu, pp, qq] = lu(S);
spy(ll);
figure
spy(uu);
Unfortunately, I don't have MATLAB here right now, so I can't guarantee that I put all the arguments in the correct order, but I think it's correct.
Following the helpful answer of Robert_P and the comments of Parag, I found the following to be the fastest for my particular large-scale sparse data:
L=chol(X'*X,'lower'); L=full(L);
invXtX = L'\(L\ speye(size(X,2)));
nwse = sqrt(N.*sum(invXtX.*(Q*invXtX)));
The last line computes the diagonal efficiently, idea taken from here.

Atomic sparse-matrix-multiply-and-compare to avoid out-of-memory error

I have two sparse matrices A (logical, 80274 x 80274) and B (non-negative integer, 21018 x 80274) and a vector c (positive integer, 21018 x 1).
I'd like to find the result res (logical, 21018 x 80274) of
mat = B * A;
res = mat > sparse(diag(c - 1)) * spones(mat);
# Note that since c is positive, this is equivalent to
# res = bsxfun(#gt, mat, c-1)
# but octave's sparse bsxfun support is somewhat shoddy,
# so I'm doing this instead as a workaround
The problem is B * A has enough nonzero values (I think 60824321 which doesn't seem like a lot, but somehow, the computation of spones(mat) uses up over a gigabyte of memory before octave crashes) to exhaust all my machine's memory even though most of these do not exceed c-1.
Is there a way to do this without computing the intermediate matrix mat = B * A ?
CLARIFICATION: It probably doesn't matter, but B and c are actually double matrices that happen to only be holding integer values (and B is sparse).
Can't you just work on the nonzero values of mat? (for the zero values you know the result will be 0):
c = c(:); % make c a column
ind = find(mat>0); % linear index
[row, ~] = ind2sub(size(mat),ind); % row index within mat (for use in c)
res = mat(ind) > c(row)-1; % results for the nonzero values of mat
You can try to update to Octave 3.8.1, bsxfun has been updated to be sparse aware, this greatly enhances the performances!