Null-aware covariance matrix in kdb

I'm working through Fun Q by Nick Psaris, but I need a null-aware function for building a covariance matrix. I will use it to create a Mahalanobis distance matrix.
He supplies some null-aware functions:
navg = null-aware avg
nvar = null-aware var
nsvar = null-aware sample var
Any suggestions?

Here's what I've come up with:
nSCovMatrix:{[matrix]
 //takes a table or a list of rows
 //null-aware sample covariance matrix
 matrix:$[98h=type matrix;"f"$flip matrix[cols matrix];"f"$matrix];
 corMatrix:u cor/:\:u:flip matrix;
 sd:sqrt nsvar matrix; //standard deviation of each variable
 diagMatrix:sd*{x=/:x}til count sd; //diagonal matrix of the standard deviations
 diagMatrix mmu corMatrix mmu diagMatrix}
I compute the correlation matrix and turn it into the covariance matrix by taking the (null-aware) standard deviation of each variable, building a diagonal matrix from those standard deviations, and multiplying it against the correlation matrix on both sides. It works, but it is far from optimized.
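Outside q, the same pairwise null-aware idea can be sketched with numpy masked arrays (an illustration of the concept, not the book's method; `np.ma.cov` ignores masked entries much as the n-prefixed functions ignore nulls):

```python
import numpy as np

def null_aware_cov(X):
    """Sample covariance of the columns of X, ignoring NaN entries."""
    return np.ma.cov(np.ma.masked_invalid(X), rowvar=False).filled(np.nan)

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # a null observation
              [3.0, 6.0],
              [4.0, 8.0]])
C = null_aware_cov(X)

# with no nulls present it agrees with the ordinary sample covariance
full = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
assert np.allclose(null_aware_cov(full), np.cov(full, rowvar=False))
```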

Access multiple elements and assign to each selected element a different value

I need to know if there is any efficient way of doing the following in MATLAB.
I have several big sparse matrices; the size of each one is roughly 9000000x9000000.
I need to access multiple elements of such a matrix and assign to each selected element a different value stored in another array. I'll give an example:
What I have:
SPARSE MATRIX of size 9000000x9000000
A matrix with the list of indices and values I want to assign, laid out like this:
[row1, col1, value1;
row2, col2, value2;
...
rowN, colN, valueN]
where N is the number of rows of that matrix.
What I need:
Assign to the SPARSE MATRIX the corresponding value to the corresponding index, this is:
SPARSE_MATRIX(row1, col1) = value1
SPARSE_MATRIX(row2, col2) = value2
...
SPARSE_MATRIX(rowN, colN) = valueN
Thanks in advance!
EDIT:
Thank you to both for answering. I think I did not explain myself well; I'll try again.
I already have a large SPARSE MATRIX of about 9000000 rows x 9000000 columns, it is a SPARSE MATRIX filled with zeros.
Then I have another array or matrix, call it M, with N rows (N can take values from 0 to 9000000) and 3 columns. The first two columns index an element of my SPARSE MATRIX, and the third column stores the value I want to transfer to it. That is, for a given row i of M:
SPARSE_MATRIX(M(i, 1), M(i, 2)) = M(i, 3)
The idea is to do that for all the rows. I have tried it with ordinary indexing:
SPARSE_MATRIX(M(:, 1), M(:, 2)) = M(:, 3)
Now I would like to do this assignment for all the rows in M as fast as possible, because with a loop or ordinary indexing it takes ages (I am using a 7th-gen i7 processor with 16 GB of RAM). And I also need to keep the zeros in the SPARSE_MATRIX.
EDIT 2: SOLVED! Thank you Metahominid, I was not thinking it through; yes, the sparse function does solve my problem. I think my brain circuits were short-circuited yesterday and I was unable to see it. Thank you to both anyway!
Regards!
You can construct a sparse matrix like this.
A = sparse(i,j,v)
S = sparse(i,j,v) generates a sparse matrix S from the triplets i, j,
and v such that S(i(k),j(k)) = v(k). The max(i)-by-max(j) output
matrix has space allotted for length(v) nonzero elements. sparse adds
together elements in v that have duplicate subscripts in i and j.
So you can simply construct the row vector, column vector and value vector.
I am answering in part because I cannot comment. Your question seems a little confusing to me. The sparse() function in MATLAB does just this.
You can enter your arrays of indices and values directly into the interface, or declare a sparse matrix of zeros and set each individually.
Given your data format, make three vectors: ROWS = [row1; ...; rowN], COLS = [col1; ...; colN], and DATA = [val1; ...; valN]. I am assuming that your size is the overall size of the full matrix and not just the sparse portion.
Then
A = sparse(ROWS, COLS, DATA) will do just what you want. You can even specify the original matrix size.
A = sparse(ROWS, COLS, DATA, 90...., 90....).
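The duplicate-summing behaviour of sparse(i,j,v) is the detail worth remembering. Here is a toy dict-of-keys sketch of the same triplet semantics in Python (illustrative only, not MATLAB's implementation):

```python
def triplets_to_sparse(rows, cols, vals):
    """Build a dict-of-keys sparse matrix from (row, col, value) triplets.
    As with MATLAB's sparse(i,j,v), values at duplicate indices are summed."""
    S = {}
    for r, c, v in zip(rows, cols, vals):
        S[(r, c)] = S.get((r, c), 0) + v
    return S

# M's three columns: row index, column index, value
M = [(1, 1, 2.0), (3, 2, 5.0), (1, 1, 1.5)]
S = triplets_to_sparse(*zip(*M))
assert S[(1, 1)] == 3.5  # 2.0 + 1.5: duplicates summed
assert S[(3, 2)] == 5.0
```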

Spark out of memory when reducing by key

I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n
For each vector, compute the pairwise Euclidean distances between its elements. Return two matrices E_u and E_v
For each entry in the two matrices, apply a function f. Return two matrices M_u, M_v
Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1
Compute the outer product of each eigenvector. Return the matrices O_i = ev_i*transpose(ev_i), i = 0,...,n-1
Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = (sum of all elements of the elementwise product of O_i and M_v)/(2*mu), and mu is a parameter
Finally, return the matrix A = elementwise sum of e_i * O_i over i = 0,...,n-1
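For small n the whole pipeline fits comfortably on one machine, which is handy for checking a distributed version. A dense numpy sketch of the steps above (f, mu, and the input vectors are illustrative placeholders):

```python
import numpy as np

n, mu = 6, 0.5
rng = np.random.default_rng(0)
u, v = rng.normal(size=n), rng.normal(size=n)
f = lambda x: np.exp(-x)  # placeholder for the entrywise function f

# steps 1-2: pairwise distances between elements, then f applied entrywise
E_u = np.abs(u[:, None] - u[None, :])
E_v = np.abs(v[:, None] - v[None, :])
M_u, M_v = f(E_u), f(E_v)

# step 3: eigendecomposition of the symmetric matrix M_u
e, EV = np.linalg.eigh(M_u)  # columns of EV are the eigenvectors ev_i

# steps 4-6: outer products, eigenvalue adjustment, weighted sum
A = np.zeros((n, n))
for i in range(n):
    O_i = np.outer(EV[:, i], EV[:, i])      # O_i = ev_i * ev_i^T
    delta_i = np.sum(O_i * M_v) / (2 * mu)  # elementwise product, summed
    A += (e[i] + delta_i) * O_i

assert np.allclose(A, A.T)  # A is symmetric by construction
```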
The issue I'm facing is mainly memory when n is large (15000 or more), since all the matrices here are dense. My current implementation may not be the best, and it only partially worked.
I used a RowMatrix for M_u and got the eigendecomposition using SVD.
The resulting U factor of the SVD is a row matrix whose columns are the ev_i's, so I have to transpose it manually so that its rows become the ev_i's. The resulting vector e holds the eigenvalues e_i.
Since a previous attempt to map each row ev_i directly to O_i failed due to out of memory, I'm currently doing
R = U.map{
  case(i, ev_i) => (i, ev_i.toArray.zipWithIndex)
} // index each element within its vector
.flatMapValues(x => x)
.join(U) // append the eigenvector columns
.map{ case(eigenVecId, ((vecElement, elementId), eigenVec)) =>
  (elementId, (eigenVecId, vecElement*eigenVec)) }
To compute the adjusted e_i's in step 5 above, M_v is stored as an RDD of tuples (i, denseVector). Then
deltaRdd = R.join(M_v)
.map{
case(j,((i,row_j_of_O_i),row_j_of_M_v))=>
(i,row_j_of_O_i.t*DenseVector(row_j_of_M_v.toArray)/(2*mu))
}.reduceByKey(_+_)
Finally, to compute A, again due to the memory issue, I have to first join rows from the different RDDs and then reduce by key. Specifically,
R_rearranged = R.map{case(j, (i, row_j_of_O_i))=>(i,(j,row_j_of_O_i))}
termsForA = R_rearranged.join(deltaRdd)
A = termsForA.map{
case(i, ((j, row_j_of_O_i), delta_i)) => (j, (delta_i + e(i))*row_j_of_O_i)
}
.reduceByKey(_+_)
The above implementation works up to termsForA: if I execute an action on termsForA, like termsForA.take(1).foreach(println), it succeeds. But if I execute an action on A, like A.count(), an OOM error occurs on the driver.
I tried tuning Spark's configuration to increase driver memory as well as the parallelism level, but both attempts failed.
Use IndexedRowMatrix instead of RowMatrix; it will help with conversions and the transpose.
Suppose your IndexedRowMatrix is Irm
svd = Irm.computeSVD(k, True)
U = svd.U
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.
I guess at some point Spark decided there was no need to carry out the operations on the executors and did all the work on the driver. Actually, termsForA would fail as well in an action like count. Somehow I made it work by broadcasting deltaRdd and e.

Matrix indices for creating sparse matrix

I want to create a 4-by-4 sparse matrix A. I want to assign a value (e.g. 1) to the following entries:
A(2,1), A(3,1), A(4,1)
A(2,2), A(3,2), A(4,2)
A(2,3), A(3,3), A(4,3)
A(2,4), A(3,4), A(4,4)
According to the manual page, I know that I should store the indices by row and column respectively. That is, for row indices,
r=[2,2,2,2,3,3,3,3,4,4,4,4]
Also, for column indices
c=[1,2,3,4,1,2,3,4,1,2,3,4]
Since I want to assign 1 to each of the entries, I use
value = ones(1,length(r))
Then, my sparse matrix will be
Matrix = sparse(r,c,value,4,4)
My problem is this:
What I actually want is to construct a square matrix of arbitrary dimension. Say it is a 10-by-10 matrix; then my column index vector will be
[1,2,..., 10, 1,2, ..., 10, 1,...,10, 1,...10]
and my row index vector will be
[2,2,...,2,3,3,...,3,...,10, 10, ...,10]
Is there a quick way to build these column and row vectors efficiently? Thanks in advance.
I think the question aims to create the vectors c and r in an easy way.
n = 4;
c = repmat(1:n,1,n-1);
r = reshape(repmat(2:n,n,1),1,[]);
value = ones(1,numel(r));
Matrix = sparse(r,c,value,n,n);
This will create your specified vectors for general n.
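The index pattern produced by the repmat/reshape lines is easy to sanity-check in another language. A small Python sketch of the same pattern (illustrative only):

```python
n = 4
c = [j for i in range(2, n + 1) for j in range(1, n + 1)]  # 1..n tiled n-1 times
r = [i for i in range(2, n + 1) for _ in range(n)]         # each of 2..n repeated n times

assert c == [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
assert r == [2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
```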
However, as pointed out by others, nearly-full sparse matrices are not very efficient due to overhead. If I recall correctly, a sparse matrix offers advantages if the density is lower than about 25%. Having everything set except the first row will result in slower performance.
You can also sparsify a matrix after creating its full version:
A = ones(10,10);
A(1,:) = 0;
B = sparse(A);

In MATLAB, how to generate a complex positive definite matrix whose diagonal elements belong to the interval (0,1]?

I am currently using the following code to generate a real positive definite matrix of size n.
A = (mvnrnd(zeros(n,1), eye(n), n))';
How do I generate one with complex entries, under the same constraint that all the diagonal elements lie in (0,1]?
I tried something and get this:
A = (mvnrnd(zeros(n,1), eye(n), n))'
A = A+A'
A = A + 4*n*eye(n)
C = rand(n)
C=C-C'
D = A+i*C
chol(D)
Using your distribution parameters, generate a random matrix A. Make it symmetric, add elements on the main diagonal, create the complex part, and sum them. This gives roughly a 4-sigma probability of producing a positive definite matrix.
My method has one weak point, though: it is based on symmetric and skew-symmetric matrices. Is that OK for you?
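A different, simpler construction than the symmetric-plus-skew-symmetric one above (a numpy sketch, not the answer's method): B·B^H is Hermitian positive semidefinite for any complex B, and rescaling by a positive scalar preserves definiteness while forcing the diagonal into (0,1].

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = B @ B.conj().T + 1e-6 * np.eye(n)  # Hermitian and positive definite
H = H / np.real(np.diag(H)).max()      # positive scaling: diagonal now in (0,1]

assert np.allclose(H, H.conj().T)         # Hermitian
assert np.all(np.linalg.eigvalsh(H) > 0)  # positive definite
d = np.real(np.diag(H))
assert np.all((d > 0) & (d <= 1))         # diagonal entries in (0,1]
np.linalg.cholesky(H)                     # succeeds, like chol(D) above
```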

MATLAB: Find largest singular values of a matrix and corresponding singular vectors

I have matrix A and matrices [U,S,V], such that [U, S, V] = svd(A).
How could I amend my script in MATLAB to get the 10 columns of U that correspond to the 10 largest singular values of A (i.e. the largest values in S)?
If you recall the definition of the SVD, it essentially solves an eigenvalue-like problem such that:
Av = su
v is a right singular vector (a column of the matrix V) and u is a left singular vector (a column of the matrix U). s is a singular value from the matrix S. You know that S is a diagonal matrix holding the singular values sorted in descending order. As such, if we take the first column of V and the first column of U, as well as the first singular value of S (top-left corner), and do the above computation, we should get the same output both ways.
As an example:
rng(123);
A = randn(4,4);
[U,S,V] = svd(A);
format long;
o1 = A*V(:,1);
o2 = S(1,1)*U(:,1);
disp(o1);
disp(o2);
-0.267557887773137
1.758696945035771
0.934255531699997
-0.978346339659143
-0.267557887773136
1.758696945035771
0.934255531699996
-0.978346339659143
Similarly, if we looked at the second columns of U and V with the second singular value:
o1 = A*V(:,2);
o2 = S(2,2)*U(:,2);
disp(o1);
disp(o2);
0.353422275717823
-0.424888938462465
1.543570300948254
0.613563185406719
0.353422275717823
-0.424888938462465
1.543570300948252
0.613563185406719
As such, the basis vectors are arranged so that the columns run from left to right in the order dictated by the singular values. You therefore simply grab the first 10 columns of U. This can be done by:
out = U(:,1:10);
out would contain the basis vectors, or columns of U that correspond to the 10 highest singular values of A.
First sort the singular values in descending order, saving the reindexing, then take the first 10 values:
[a, b]=sort(diag(S),'descend');
Umax10=U(:,b(1:10));
As mentioned by Rayryeng, svd outputs the singular values in decreasing order, so:
Umax10=U(:,1:10);
is enough.
Just remember that this is not the case for eig: even though eig may seem to output ordered eigenvalues, that is not always the case.
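The same descending-order guarantee holds for numpy's SVD, which makes the "take the first k columns" recipe easy to verify (a numpy sketch with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(123)
A = rng.normal(size=(20, 15))
U, s, Vt = np.linalg.svd(A)

# like MATLAB's svd, numpy returns the singular values sorted descending
assert np.all(np.diff(s) <= 0)

k = 10
U_top = U[:, :k]  # columns paired with the k largest singular values

# sanity check the defining relation A*v_i = s_i*u_i for each kept triple
for i in range(k):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])
```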