In Matlab, how to fast generate 'sparse random matrix', and fast multiply it with a dense vector? - matlab

I am using this function X = randsrc(250,600,[[-1,0,1];[0.5/ps,1-1/ps,0.5/ps]])) with ps=2373 It shows that 250*600 matrix is generated. Its entries only contain -1,0 or 1. And -1,0,1 is randomly choosed according to the probability distribution 0.5/ps,1-1/ps,0.5/ps.
So that the density is about 0.00042.
The above X is called sparse random projection matrix, see https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf. It can be used to compress a data vector from dimension 600 to 250 with some nice geometric properties guaranteed.
The problem is that in Matlab, randsrc seems to be very slow (e.g., compared with randn(250,600)). Then, how can I fast generate the above matrix?
BTW, how can I fast calculate X*y? where y may be a dense vector.
My code is:
ps=2373;
tic;
X = randsrc(250,600,[[-1,0,1];[0.5/ps,1-1/ps,0.5/ps]]));
toc
a = randn(600,1);
tic;
X*a;
toc
Also, I have tried a same Python function http://scikit-learn.org/stable/modules/generated/sklearn.random_projection.SparseRandomProjection.html, it is twice faster than Matlab.

You can use sprand to generate a sparsity structure, then find to extract the rows and columns of the non-zero elements. Finally randsample will select values -1,1 with 50% probability of each:
ps=2373;
tic
[i,j,~] = find(sprand(250,600,1/ps))
X = sparse(i,j,randsample([-1,1],length(i),true))
toc
MATLAB is very fast at multiplying matrices so X*a is very fast.

Related

Calculating the most similar pair of column vectors using cosine distance in a matrix

I have a 943x1682 matrix in which I want to calculate the two most similar vectors in this matrix. So I want see the cosine distance of each vector in the matrix to each vector in the matrix, of course not including the vector with itself, if one cannot do that I can just ignore those.
I made this loop to try to calculate this, so I can get a 1682x1682 matrix, with each cell corresponding to the similarity between i and j. However when I run this, it takes forever to run, and when I try to open the resulting matrix in my workspace, it says:
Cannot display summaries of variables with more than 524288 elements.
Is there an easier way to do this or am I doing something wrong?
Cross posted on MATLAB Answers. Repeating answer here:
Use a standard matrix multiply to get the dot products. MATLAB is very fast at standard matrix multiplies. And then normalize the result. E.g.,
AA = A' * A; % the column dot products via a standard matrix multiply
Anorm = sqrt(diag(AA)); % the norms of the columns
Adist = AA ./ (Anorm .* Anorm.'); % normalize the column dot products into cosine distances
Then pick off the maximum value for your answer, disregarding the diagonal. E.g.,
n = size(A,2); % the number of columns
Adist(1:n+1:end) = -inf; % disregard the diagonal (column compared to itself)
[~,x] = max(Adist(:)); % find the max cosine distance linear index
[col1,col2] = ind2sub(size(Adist),x); % convert linear index into the original columns
Then col1 and col2 are the column numbers of the most similiar columns using cosine distance as a measure.
You can normalise the columns of the matrix first, then the cosine similarity equation simplifies to a matrix multiplication:
aNorm = normc(A);
cosSim = aNorm' * aNorm;
Generally, matrix multiplication is more performant than looping. In a quick test, with N = 1000, the looping code takes ~7 seconds and the matrix multiplication code ~0.5 seconds.
The resultant matrix may still be too large to open in your workspace, you could copy any individual rows or columns into a temporary and view those, or do a contour plot (heat-map) of the matrix to get a visual representation.

Scale every row in a sparse matrix by an element of a vector in MATLAB

I have a sparse matrix
obj.resOp = sparse(row,col,val);
and a vector containing the sums of each row in the matrix
sums = sparse(sum(obj.resOp,2));
Now what I want to do is
obj.resOp = obj.resOp ./ sums;
which would scale every row in the matrix so that the rowsum in each row is 1.
However in this last line, MATLAB internally seems to construct a full matrix from obj.resOp and hence I get this error:
Error using ./ Requested 38849x231827 (17.5GB) array exceeds maximum
array size preference. Creation of arrays greater than this limit may
take a long time and cause MATLAB to become unresponsive. See array size limit or preference
panel for more information.
for sufficiently large matrices.
In theory I think that expanding to a full matrix is not necessary. Is there any MATLAB formulation of what I want to achieve while keeping the sparsity of obj.resOp?
You can do this with a method similar to the one described in this answer.
Start with some sparse matrix
% Random sparse matrix: 10 rows, 4 cols, density 20%
S = sprand(10,4, 0.2);
Get the row sums, note that sum returns a sparse matrix from sparse inputs, so no need for your additional conversion (docs).
rowsums = sum(S,2);
Find all non-zero indices and their values
[rowidx, colidx, vals] = find(S)
Now create a sparse matrix from the element-wise division
out = sparse(rowidx, colidx, vals./rowsums(rowidx), size(S,1), size(S,2));
The equivalent calculation
obj.resOp = inv(diag(sums)) * obj.resOp;
works smoothly.

Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate the simulated multivariate gaussian distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
unit matrix by 10 times is a symmetric positive semi-definite matrix. And the gaussian distribution will be more round than other type of sigma.
then you can use it as the above example of mvnrnd function and see the plot.

Duplicating a 2d matrix in matlab along a 3rd axis MANY times

I'm looking to duplication a 784x784 matrix in matlab along a 3rd axis. The following code seems to work:
mat = reshape(repmat(mat, 1,10000),784,784,10000);
Unfortunately, it takes so long to run it's worthless (changing the 10,000s to 1000 makes it take a few minutes, and using 10,000 makes my whole machine freeze up practically). is there a faster way to do this?
For reference, I'm looking to use mvnpdf on 10,000 vectors each of length 784, using the same covariance matrix for each. So my final call looks like
mvnpdf(X,mu,mat)
%size(X) = (10000,784), size(mu) = (10000,784), size(mat) = 784,784,10000
If there's a way to do this that's not repeating the covariance matrix 10,000 times, that'd be helpful too. Thanks!
For replication in more than 2 dimensions, you need to supply the replication counts as an array:
out = repmat(mat,[1,1,10000])
Creating a 784x784 matrix 10,000 times isn't going to take advantage of the vectorization in MATLAB, which is going to be more useful for small arrays. Avoiding a for loop also won't help too much, given the following:
The main speedup you can gain here is by computing the inverse of the covariance matrix once, and then computing the pdf yourself. The inverse of sigma takes O(n^3), and you are needlessly doing that 10,000 times. (Also, the square root determinant can be precomputed.) For reference, the PDF of the multivariate normal distribution is computed as follows:
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Properties
Better to just compute the inverse once, and then compute z = x - mu for each value, then doing z'Sz for each pdf value, and applying a simple function and a constant. But wait! You can vectorize that, too.
I don't have MATLAB in front of me, but this is basically what you need to do, and it'll run in an instant.
s = inv(sigma);
c = -0.5*log(det(s)) - (k/2)*log(2*pi);
z = x - mu; % 10000 x 784 matrix
ans = exp( c - 0.5 .* dot(z*s, z, 2) ); % 10000 x 1 vector

How to compute 99% coverage in MATLAB?

I have a matrix in MATLAB and I need to find the 99% value for each column. In other words, the value such that 99% of the population has a larger value than it. Is there a function in MATLAB for this?
Use QUANTILE function.
Y = quantile(X,P);
where X is a matrix and P is scalar or vector of probabilities. For example, if P=0.01, the Y will be vector of values for each columns, so that 99% of column values are larger.
The simplest solution is to use the function QUANTILE as yuk suggested.
Y = quantile(X,0.01);
However, you will need the Statistics Toolbox to use the function QUANTILE. A solution that is not dependent on toolboxes can be found by noting that QUANTILE calls the function PRCTILE, which itself calls the built-in function INTERP1Q to do the primary computation. For the general case of a 2-D matrix that contains no NaN values you can compute the quantiles of each column using the following code:
P = 0.01; %# Your probability
S = sort(X); %# Sort the columns of your data X
N = size(X,1); %# The number of rows of X
Y = interp1q([0 (0.5:(N-0.5))./N 1]',S([1 1:N N],:),P); %'# Get the quantiles
This should give you the same results as calling QUANTILE, without needing any toolboxes.
If you do not have the Statistics Toolbox, there is always
y=sort(x);
y(floor(length(y)*0.99))
or
y(floor(length(y)*0.01))
depending on what you meant.