Let's say I have a sparse matrix A. I want to do heavy calculations on it. The calculations do not modify A; they only access its elements, e.g. take a row of A and multiply it with something. I wonder whether I should convert A to a full matrix before doing any calculation, or just work on it directly.
In other words, is accessing elements in a sparse matrix slower than a full matrix?
Slicing sparse matrices across columns is much faster than slicing across rows in MATLAB. So you should prefer accessing M(:,i) over M(i,:).
Internally MATLAB uses the Compressed Sparse Column (CSC) storage:
pr: the non-zero elements, stored in a 1-D double array of length nzmax arranged in column-major order (with a companion array pi for the imaginary parts if the matrix is complex)
ir: an integer array of the same length holding the corresponding row indices
jc: an integer array of length n+1 (where n is the number of columns) holding column pointers. By definition, the last value of jc contains nnz (the number of non-zeros stored).
As an example, take the following sparse matrix:
>> M = sparse([1 3 5 3 4 1 5], [1 1 1 2 2 4 4], [1 7 5 3 4 2 6], 5, 4);
This is what it looks like stored in memory (I'm using 0-based indices for ir and jc):
1 . . 2
. . . .
7 3 . .
. 4 . .
5 . . 6
pr = 1 7 5 3 4 2 6
ir = 0 2 4 2 3 0 4
jc = 0 3 5 5 7
nzmax = at least 7
nnz = 7
m = 5
n = 4
To retrieve the i-th column of the sparse matrix M(:,i), it suffices to do: pr(jc(i):jc(i+1)-1) (to keep it simple, I'm not paying attention to 0- vs 1-based indexing). On the other hand, accessing a matrix row involves more calculations and array traversals (it is no longer spatial-locality friendly).
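A quick way to see the difference yourself (a rough sketch; exact timings depend on the machine and MATLAB version):
S = sprand(10000, 10000, 0.01);              % random sparse matrix, ~1% density
tic; for k = 1:1000, c = S(:,k); end; toc    % column slices: fast
tic; for k = 1:1000, r = S(k,:); end; toc    % row slices: noticeably slower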
Here are some links to the MATLAB documentation for more info: Sparse Matrices, Manipulating Sparse Matrices
It's worth checking out the original paper by John R. Gilbert, Cleve Moler, and Robert Schreiber: "Sparse Matrices in MATLAB: Design and Implementation", SIAM Journal on Matrix Analysis and Applications, 13(1):333–356, 1992.
Here are some quotes from the above paper to answer your question about the overhead of sparse storage:
The computational complexity of simple array operations should be
proportional to nnz, and perhaps also depend linearly on m or n,
but be independent of the product m*n. The complexity of more
complicated operations involves such factors as ordering and fill-in,
but an objective of a good sparse matrix algorithm should be:
The time required for a sparse matrix operation should be proportional
to the number of arithmetic operations on nonzero quantities.
We call this the "time is proportional to flops" rule; it is a
fundamental tenet of our design.
and
This (column-oriented sparse matrix) scheme is not efficient
for manipulating matrices one element at a time: access to a single
element takes time at least proportional to the logarithm of the
length of its column; inserting or removing a nonzero may require
extensive data movement. However, element-by-element manipulation is
rare in MATLAB (and is expensive even in full MATLAB). Its most common
application would be to create a sparse matrix, but this is more
efficiently done by building a list [i,j,s] of matrix elements in
arbitrary order and then using sparse(i,j,s) to create the matrix.
The sparse data structure is allowed to have unused elements after the
end of the last column of the matrix. Thus an algorithm that builds up
a matrix one column at a time can be implemented efficiently by
allocating enough space for all the expected nonzeros at the outset.
Also section 3.1.4 "asymptotic complexity analysis" should be of interest (it's too long to quote here).
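As a user-level illustration of that last point, one common pattern (a sketch of the idea, not code from the paper) is to reserve space with spalloc and then fill the matrix one column at a time, in column order:
m = 1000; n = 500; expectedNnz = 10*n;
S = spalloc(m, n, expectedNnz);      % reserve space for the expected non-zeros up front
for c = 1:n
    rows = randi(m, 10, 1);          % hypothetical row indices for this column
    S(rows, c) = 1;                  % filling columns in order avoids moving earlier data
end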
Related
I have a big matrix (1,000 rows and 50,000 columns). I know some columns are correlated (the rank is only 100) and I suspect some columns are even proportional. How can I find such proportional columns? (One way would be to loop over corr(M(:,j),M(:,k)), but is there anything more efficient?)
If I am understanding your problem correctly, you wish to determine those columns in your matrix that are linearly dependent, meaning that one column is a scalar multiple of another. There's a very basic algorithm based on QR decomposition: you can take any matrix and decompose it into a product of two matrices, Q and R. In other words:
A = Q*R
Q is an orthogonal matrix whose columns are unit vectors, such that multiplying Q by its transpose gives the identity matrix (Q^{T}*Q = I). R is a right-triangular or upper-triangular matrix. A very useful result from Golub and Van Loan's 1996 book Matrix Computations is that a matrix is considered full rank if all of the diagonal elements of R are non-zero. Because of the finite floating-point precision on computers, we have to use a threshold instead: any diagonal entry of R whose absolute value is greater than this tolerance marks the corresponding column as an independent column. So we simply take the absolute values of all of the diagonal entries and check whether they are greater than some tolerance.
We can slightly modify this so that we instead search for values that are less than the tolerance, which would mean that the corresponding column is not independent. The way you call the QR factorization is:
[Q,R] = qr(A, 0);
Q and R are what I just talked about, and you specify the matrix A as input. The second argument 0 requests an economy-size factorization: for an m x n matrix with m > n, it returns Q as m x n and R as n x n, instead of the full m x m Q and m x n R. For example, for an 8 x 5 matrix, the economy-size call gives an 8 x 5 Q and a 5 x 5 R, whereas the full factorization gives an 8 x 8 Q. When there are at least as many columns as rows (as in your 1,000 x 50,000 case), the two forms coincide.
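For illustration, a quick check of the sizes involved (assuming an 8-by-5 example matrix):
A = rand(8, 5);
[Qf, Rf] = qr(A);      %// full factorization: Qf is 8x8, Rf is 8x5
[Qe, Re] = qr(A, 0);   %// economy size: Qe is 8x5, Re is 5x5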
Now, what we actually need is this style of invocation:
[Q,R,E] = qr(A, 0);
In this case, E would be a permutation vector, such that:
A(:,E) = Q*R;
The reason why this is useful is that it reorders the columns of Q and R so that the first column of the reordered version is the one most likely to be linearly independent, followed by the remaining columns in decreasing order of "strength". E therefore tells you, in decreasing order, how likely each column is to be linearly independent, and those "strengths" are exactly captured by the magnitudes of the diagonals of R under this reordering, with the largest in the first position. What you should do is check which diagonals of R in the rearranged version are greater than this first (largest) diagonal scaled by the tolerance, and use these to determine which of the corresponding columns are linearly independent.
However, I'm going to flip this around and determine the point in the R diagonals where the last possible independent columns are located. Anything after this point would be considered linearly dependent. This is essentially the same as checking to see if any diagonals are less than the threshold, but we are using the re-ordering of the matrix to our advantage.
In any case, putting what I have mentioned in code, this is what you should do, assuming your matrix is stored in A:
%// Step #1 - Define tolerance
tol = 1e-10;
%// Step #2 - Do QR Factorization
[Q, R, E] = qr(A,0);
diag_R = abs(diag(R)); %// Extract diagonals of R
%// Step #3 -
%// Find the LAST column in the re-arranged result that
%// satisfies the linearly independent property
r = find(diag_R >= tol*diag_R(1), 1, 'last');
%// Step #4
%// Anything after r means that the columns are
%// linearly dependent, so let's output those columns to the
%// user
idx = sort(E(r+1:end));
Note that E will be a permutation vector, and I'm assuming you want the indices sorted, which is why we sort them after the point where the columns stop being linearly independent. Let's test out this theory. Suppose I have this matrix:
A =
1 1 2 0
2 2 4 9
3 3 6 7
4 4 8 3
You can see that the first two columns are the same, and the third column is a multiple of the first or second. You would just have to multiply either one by 2 to get the result. If we run through the above code, this is what I get:
idx =
1 2
If you also take a look at E, this is what I get:
E =
4 3 2 1
This means that column 4 was the "best" linearly independent column, followed by column 3. Because we returned [1,2] as the linearly dependent columns, columns 1 and 2, which both equal [1;2;3;4], are scalar multiples of some other column. In this case, that would be column 3, as columns 1 and 2 are each half of column 3.
Hope this helps!
Alternative Method
If you don't want to do any QR factorization, I can suggest reducing your matrix to its reduced row echelon form and determining the basis vectors that make up the column space of your matrix A. Essentially, the column space gives you the minimum set of columns that can generate all possible linear combinations of output vectors if you were to apply this matrix via matrix-vector multiplication. You can determine which columns form the column space by using the rref command. Its second output gives you a vector of elements that tell you which columns are linearly independent, i.e. which columns form a basis of the column space for that matrix. As such:
[B,RB] = rref(A);
RB gives you the locations of the columns that form the column space, and B is the reduced row echelon form of the matrix A. Because you want to find those columns that are linearly dependent, you want to return the set of indices that are not among these locations. As such, define a linearly increasing vector from 1 to as many columns as you have, then use RB to remove those entries from this vector; the result is the set of linearly dependent columns you are seeking. In other words:
[B,RB] = rref(A);
idx = 1 : size(A,2);
idx(RB) = [];
By using the above code, this is what we get:
idx =
2 3
Bear in mind that we identified columns 2 and 3 as linearly dependent, which makes sense as both are multiples of column 1. The identification of which columns are linearly dependent differs from the QR factorization method, as QR orders the columns based on how likely each particular column is to be linearly independent. Because columns 1 to 3 are related to each other, it shouldn't matter which of them is returned; any one of them forms a basis for the other two.
I haven't tested the efficiency of rref in comparison to the QR method. I suspect that rref performs Gaussian elimination, whose performance compares poorly with QR factorization, since the latter is highly optimized and efficient. Because your matrix is rather large, I would stick to the QR method, but try rref anyway and see what you get!
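If you want a rough feel for the difference yourself, a quick (unscientific) comparison might look like this; the size here is arbitrary, and rref becomes noticeably slower as the matrix grows:
A = rand(200, 400);
tic; [Q, R, E] = qr(A, 0); toc    %// pivoted economy-size QR
tic; [B, RB] = rref(A); toc       %// reduced row echelon form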
If you normalize each column by dividing by its maximum, proportionality becomes equality. This makes the problem easier.
Now, to test for equality you can use a single (outer) loop over columns; the inner loop is easily vectorized with bsxfun. For greater speed, compare each column only with the columns to its right.
Also to save some time, the result matrix is preallocated to an approximate size (you should set that). If the approximate size is wrong, the only penalty will be a little slower speed, but the code works.
As usual, tests for equality between floating-point values should include a tolerance.
The result is given as a 2-column matrix (S), where each row contains the indices of two columns that are proportional.
A = [1 5 2 6 3 1
2 5 4 7 6 1
3 5 6 8 9 1]; %// example data matrix
tol = 1e-6; %// relative tolerance
A = bsxfun(@rdivide, A, max(A,[],1)); %// normalize A
C = size(A,2);
S = NaN(round(C^1.5),2); %// preallocate result to *approximate* size
used = 0; %// number of rows of S already used
for c = 1:C
ind = c+find(all(abs(bsxfun(@rdivide, A(:,c), A(:,c+1:end))-1)<tol));
u = numel(ind); %// number of columns proportional to column c
S(used+1:used+u,1) = c; %// fill in result
S(used+1:used+u,2) = ind; %// fill in result
used = used + u; %// update number of results
end
S = S(1:used,:); %// remove unused rows of S
In this example, the result is
S =
1 3
1 5
2 6
3 5
meaning column 1 is proportional to column 3; column 1 is proportional to column 5 etc.
If the determinant of a 2-by-2 matrix is zero, then its columns are proportional.
There are 50,000 columns, i.e. 25,000 pairs of columns, and the determinant of a 2-by-2 matrix is the easiest to compute.
Hence, a low-tech way to find the proportional columns is to:
define the big matrix on a spreadsheet;
apply the 2-by-2 determinant formula to a square of cells, beginning with the first square on the left;
copy the formula across every row and column to arrive at the answer in the next sheet;
find the columns whose determinants are zero.
This is quite basic, not very time consuming, and should get results, done either manually or in an Excel spreadsheet.
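For reference, the same 2-by-2 determinant test can also be written directly in MATLAB (a small sketch, not part of the spreadsheet recipe above): two columns u and v are proportional exactly when every 2-by-2 determinant formed from their rows is zero.
u = [1; 2; 3];  v = [2; 4; 6];             % example pair of columns
D = u*v.' - v*u.';                          % D(p,q) = det([u(p) v(p); u(q) v(q)])
isProportional = all(abs(D(:)) < 1e-12)     % true here, since v is 2*u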
I have a question about SVD. In the literature I have read, it says we have to convert our input matrix into a covariance matrix first, and then apply MATLAB's svd function.
But on the MathWorks website, the svd function is applied directly to the input matrix (no need to convert it into a covariance matrix):
[U,S,V]=svd(inImageD);
Which one is correct?
And if we want to do dimensionality reduction, we have to project our data onto the eigenvectors. But where are the eigenvectors in the output of the svd function?
I know that S contains the eigenvalues, but what are U and V?
To reduce the dimensionality of our data, do we need to subtract the mean from the input matrix and then multiply by the eigenvectors, or can we just multiply the input matrix by the eigenvectors (without subtracting the mean first)?
EDIT
Suppose I want to do classification using SIFT as the features and SVM as the classifier.
I have 10 images for training and I arrange them in different rows,
so the 1st row is for the 1st image, the 2nd row for the 2nd image, and so on:
Feat = [1 2 5 6 7   % Image 1
        2 9 0 6 5   % Image 2
        3 4 7 8 2   % Image 3
        2 3 6 3 1   % Image 4
        ...         % and so on
       ];
To do dimensionality reduction (on my 10x5 matrix), we have to do A*EigenVector.
And from what you explained (@Sam Roberts), I can compute it by using the EIGS function on the covariance matrix (instead of using the SVD function).
And as I arrange the image features in different rows, I need to do A'*A.
So it becomes:
Matrix = A'*A;
MAT_Cov = cov(Matrix);
[EigVector, EigValue] = eigs(MAT_Cov);
is that right??
Eigenvector decomposition (EVD) and singular value decomposition (SVD) are closely related.
Let's say you have some data a = rand(3,4);. Note that this is not a square matrix - it represents a dataset of observations (rows) and variables (columns).
Do the following:
[u1,s1,v1] = svd(a);
[u2,s2,v2] = svd(a');
[e1,d1] = eig(a*a');
[e2,d2] = eig(a'*a);
Now note a few things.
Up to the sign (+/-), which is arbitrary, u1 is the same as v2. Up to a sign and an ordering of the columns, they are also equal to e1. (Note that there may be some very very tiny numerical differences as well, due to slight differences in the svd and eig algorithms).
Similarly, u2 is the same as v1 and e2.
s1 and s2 contain the same singular values (one is the transpose of the other), and apart from some extra columns and rows of zeros and a possible reordering, both also equal sqrt(d1) and sqrt(d2). Again, there may be some very tiny numerical differences just due to algorithmic issues (they'll be on the order of something to the -10 or so).
Note also that a*a' is basically the covariances of the rows, and a'*a is basically the covariances of the columns (that's not quite true - a would need to be centred first by subtracting the column or row mean for them to be equal, and there might be a multiplicative constant difference as well, but it's basically pretty similar).
Now to answer your questions, I assume that what you're really trying to do is PCA. You can do PCA either by taking the original data matrix and applying SVD, or by taking its covariance matrix and applying EVD. Note that Statistics Toolbox has two functions for PCA - pca (in older versions princomp) and pcacov.
Both do essentially the same thing, but from different starting points, because of the above equivalences between SVD and EVD.
Strictly speaking, u1, v1, u2 and v2 above are not eigenvectors, they are singular vectors - and s1 and s2 are singular values. They are singular vectors/values of the matrix a. e1 and d1 are the eigenvectors and eigenvalues of a*a' (not a), and e2 and d2 are the eigenvectors and eigenvalues of a'*a (not a). a does not have any eigenvectors - only square matrices have eigenvectors.
Centring by subtracting the mean is a separate issue - you would typically do that prior to PCA, but there are situations where you wouldn't want to. You might also want to normalise by dividing by the standard deviation but again, you wouldn't always want to - it depends what the data represents and what question you're trying to answer.
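To make the connection concrete, here is a minimal sketch (assuming the Statistics Toolbox function pca is available) showing that PCA computed via svd on the centred data agrees, up to sign, with pca:
a = rand(10, 4);                      % observations in rows, variables in columns
a0 = bsxfun(@minus, a, mean(a, 1));   % centre each column
[~, s, v] = svd(a0, 'econ');          % right singular vectors = principal directions
coeff = pca(a);                       % pca centres internally and returns the same directions
max(abs(abs(v(:)) - abs(coeff(:))))   % ~0 up to numerical precision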
Matlab has a built-in function for calculating the rank of a matrix with decimal numbers as well as with finite-field numbers. However, if I am not wrong, it calculates only the lowest rank (the lesser of the row rank and column rank). I would like to calculate only the row rank, i.e. find the number of independent rows of a matrix (over a finite field in my case). Is there a function or way to do this?
In linear algebra the column rank and the row rank are always equal (see proof), so just use rank
(if you're computing the rank of a matrix over Galois fields, consider using gfrank instead, like @DanBecker suggested in his comment):
Example:
>> A = [1 2 3; 4 5 6]
A =
1 2 3
4 5 6
>> rank(A)
ans =
2
All three columns may seem to be linearly independent, but they are not:
[1 2; 4 5] \ [3; 6]
ans =
-1
2
meaning that -1 * [1; 4] + 2 * [2; 5] = [3; 6]
Schwartz,
Two comments:
You state in a comment "The rank function works just fine in Galois fields as well!" I don't think this is correct. Consider the example given on the documentation page for gfrank:
A = [1 0 1;
2 1 0;
0 1 1];
gfrank(A,3) % gives answer 2
rank(A) % gives answer 3
But it is possible I am misunderstanding things!
You also said "How to check if the rows of a matrix are linearly independent? Does the solution I posted above seem legit to you i.e. taking each row and finding its rank with all the other rows one by one?"
I don't know why you say "find its rank with all the other rows one by one". It is possible to have a set of vectors which are pairwise linearly independent, yet linearly dependent taken as a group. Just consider the vectors [0 1], [1 0], [1 1]. No vector is a multiple of any other, yet the set is not linearly independent.
Your problem appears to be that you have a set of vectors that you know are linearly independent. You add a vector to that set, and want to know whether the new set is still linearly independent. As @EitanT said, all you need to do is combine the (row) vectors into a matrix and check whether its rank (or gfrank) is equal to the number of rows. No need to do anything "one-by-one".
Since you know that the "old" set is linearly independent, perhaps there is a nice fast algorithm to check whether the new vector makes things linearly dependent. Maybe at each step you orthogonalize the set, and perhaps that would make the process of checking for linear independence given the new vector faster. That might make an interesting question somewhere like mathoverflow.
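Over the reals (not over a Galois field), that incremental idea might look roughly like this hypothetical helper, which keeps an orthonormal basis Q of the rows accepted so far and tests each candidate row x against it:
function [Q, isIndep] = addRowIfIndependent(Q, x, tol)
    % Q: matrix whose columns are an orthonormal basis of the accepted rows
    % x: candidate row (any orientation), tol: relative tolerance
    x = x(:);
    r = x - Q*(Q'*x);                  % remove the part already spanned by Q
    isIndep = norm(r) > tol*norm(x);
    if isIndep
        Q = [Q, r/norm(r)];            % extend the basis with the new direction
    end
end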
Preface: as the MATLAB guidelines state, when one wants to efficiently populate a sparse matrix in MATLAB, one should usually create a vector of indices into the matrix and a vector of values to assign, and then concentrate all the assignments into one operation, so as to allow MATLAB to "prepare" the matrix in advance and optimize the assignment speed. A simple example:
A=sparse([]);
inds=some_index_generating_method();
vals=some_value_generating_method();
A(inds)=vals;
My question: what can I do in the case where inds contains repeated indices, e.g. inds = [4 17 8 17 9], where 17 appears twice?
In this case, what I would want to happen is that the matrix would be assigned the addition of all the values that are mapped to the same index, i.e for the previous example
A(17)=vals(2)+vals(4) %as inds(2)==inds(4)
Is there any straightforward and, most importantly, fast way to achieve this? I have no way of generating the indexes and values in a "smarter" way.
This might help:
S = sparse(i,j,s,m,n,nzmax) uses vectors i, j, and s to generate an m-by-n sparse matrix such that S(i(k),j(k)) = s(k), with space allocated for nzmax nonzeros. Vectors i, j, and s are all the same length. Any elements of s that are zero are ignored, along with the corresponding values of i and j. Any elements of s that have duplicate values of i and j are added together.
See more at MATLAB documentation for sparse function
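For the linear indices in the question, a minimal sketch (the matrix size m-by-n is an assumption here) would convert them to subscripts first and let sparse do the summing:
m = 5; n = 5;                        % assumed size of A
inds = [4 17 8 17 9];                % linear indices; 17 appears twice
vals = [1 2 3 4 5];                  % example values
[i, j] = ind2sub([m n], inds);       % convert linear indices to row/column subscripts
A = sparse(i, j, vals, m, n);        % duplicate (i,j) pairs are added together
full(A(17))                          % gives vals(2) + vals(4) = 6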
I have a sparse logical matrix, which is quite large. I would like to draw random non-zero elements from it without storing all of its non-zero elements in a separate vector (eg. by using find command). Is there an easy way to do this?
Currently I am implementing rejection sampling, which is drawing a random element and checking whether that is non-zero or not. But it is not efficient when the ratio of non-zero elements is small.
A sparse logical matrix is not a very practical representation of your data if you want to pick random locations. Rejection sampling and find are the only two ways that make sense to me. Here's how you can do them efficiently (assuming you want to get 4 random locations):
%# using find
idx = find(S);
%# draw 4 without replacement
fourRandomIdx = idx(randperm(length(idx),4));
%# draw 4 with replacement
fourRandomIdx = idx(randi(numel(idx), 4, 1));
%# get row, column values
[row,col] = ind2sub(size(S),fourRandomIdx);
%# using rejection sampling
density = nnz(S)/prod(size(S));
%# estimate how many samples you need to get at least 4 hits
%# and multiply by 2 (or 3)
n = ceil(4 / density) * 2;
%# random linear indices w/ replacement
randIdx = randi(numel(S), n, 1);
%# keep the first four that land on non-zero elements
hits = randIdx(find(S(randIdx), 4, 'first'));
[row,col] = ind2sub(size(S), hits);
An m x n matrix (m rows, n columns) with nnz non-zero elements requires nnz + n + 1 integers to store the locations of its non-zero entries in the compressed sparse column format described above. For a logical matrix there is no need to store the values of the non-zero entries: these are all true. Correspondingly, you would do best to convert your logical sparse matrix into a list of the linear indices of its non-zero entries, together with m and n, which requires only nnz + 2 integers of storage. From these (and ind2sub) you can readily reconstruct the subscripts corresponding to any non-zero entry that you choose randomly using randi over the range 1..nnz.
find is the standard interface to get the non-zero elements in a sparse matrix. Have a look here http://www.mathworks.se/help/techdoc/math/f6-9182.html#f6-13040
[i,j,s] = find(S)
find returns the row indices of nonzero values in vector i, the column indices in vector j, and the nonzero values themselves in the vector s.
No need to get s. Just pick a random index in i,j.
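In code, that might look like (a small sketch):
[i, j] = find(S);          % subscripts of all non-zero entries
k = randi(numel(i));       % random position in that list
randRow = i(k);
randCol = j(k);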
By representing the entries in a 3 column format, aka a coordinate list (i, j, value), you can simply select the items from the list. To get this, you can either use your original method for creating the sparse matrix (i.e. the precursor to sparse()), or use the find command, a la [i,j,s] = find(S);
If you don't need the entries, and it seems you don't, you can just extract i and j.
If, for some reason, your matrix is massive and your RAM limitations are severe, you can simply divide the matrix into regions, and let the probability of selecting a given sub-matrix be proportional to the number of non-zero elements (using nnz) in that sub-matrix. You could go so far as to divide the matrix into individual columns, and the rest of the calculation is trivial. NB: by applying sum to the matrix, you can get the per-column counts (assuming your entries are just 1s).
In this way, you need not even bother with rejection sampling (which seems pointless to me in this case, since Matlab knows where all of the non-zero entries are).
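A minimal sketch of that column-partition idea (assuming S is logical, so the per-column sums are the non-zero counts):
colCounts = full(sum(S, 1));                            % non-zeros per column
c = find(rand*sum(colCounts) <= cumsum(colCounts), 1);  % pick a column with probability proportional to its count
rowsInCol = find(S(:, c));                              % non-zero rows of that column only
r = rowsInCol(randi(numel(rowsInCol)));                 % uniform choice within the column
% (r, c) is now a uniformly random non-zero location of S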