Spark Scala: compute Pearson correlation of 2 Vectors - scala

I have a DataFrame where each row has two vector columns, like this:
ratings1 | ratings2
Vector | Vector
I'm trying to understand how I can use the Statistics.corr method to compute a similarity score between those two columns for each row.
So far, I've realised that corr doesn't accept two Vectors as parameters. Should I create an RDD for each pair of Vectors?
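Building an RDD per row is likely to be expensive, since Statistics.corr is designed for distributed data rather than for two small local vectors. A minimal sketch of the alternative, assuming each vector can be converted to an Array[Double]: compute the Pearson coefficient locally. The object name PearsonSketch and the function pearson are hypothetical names chosen for illustration, not part of Spark.

```scala
object PearsonSketch {
  /** Pearson correlation of two equal-length arrays.
    * Returns NaN when either array has zero variance (0.0 / 0.0).
    */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.nonEmpty,
      "vectors must be non-empty and of equal length")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var cov = 0.0   // sum of products of deviations
    var varX = 0.0  // sum of squared deviations of x
    var varY = 0.0  // sum of squared deviations of y
    var i = 0
    while (i < n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      cov += dx * dy
      varX += dx * dx
      varY += dy * dy
      i += 1
    }
    cov / math.sqrt(varX * varY)
  }
}
```

If the columns hold org.apache.spark.ml.linalg.Vector values (an assumption, since the question does not say which Vector type is used), one option is to wrap this function in a UDF that calls toArray on both vectors and apply it with withColumn, so the score is computed row by row without creating any extra RDDs.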

Related

How to measure the pairwise cosine for a data matrix in MATLAB

Assume there is a data matrix (MATLAB)
X = [0.8147, 0.9134, 0.2785, 0.9649, 0.9572;
0.9058, 0.6324, 0.5469, 0.1576, 0.4854;
0.1270, 0.0975, 0.9575, 0.9706, 0.8003]
Each column represents a feature vector for a sample.
What is the fastest way to get the pairwise cosine similarity of the columns of X in MATLAB? That is, we want to compute the symmetric 5x5 matrix S, where the element S(3,4) is the cosine between the third column and the fourth column.
Note: The cosine measure cos(a,b) is the cosine of the angle between vectors a and b.
If you have the Statistics Toolbox, use pdist with the 'cosine' option, followed by squareform. Note that:
pdist considers rows, not columns, as observations. So you need to transpose the input.
The output is 1 minus the cosine similarity. So you need to subtract the result from 1.
To get the result in the form of a symmetric matrix apply squareform.
So, you can use
S = 1 - squareform(pdist(X.', 'cosine'));
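For reference, the quantity being computed can be written out as follows, where x_i denotes the i-th column of X (notation introduced here for illustration). The 'cosine' distance returned by pdist is d_ij, which is why the result is subtracted from 1:

```latex
S_{ij} = \frac{x_i^{\top} x_j}{\lVert x_i \rVert_2 \, \lVert x_j \rVert_2},
\qquad
d_{ij} = 1 - S_{ij}
```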

K-means on a m x n matrix having strings and numeric data in matlab

I have to perform k-means clustering in MATLAB on an m x n matrix containing both string and numeric data, and then display the result. How do I do this?
To cluster a matrix of numeric data you can use the following.
idx = kmeans(X,k)
performs k-means clustering to partition the observations of the n-by-p data matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.
More info: http://mathworks.com/help/stats/kmeans.html?requestedDomain=www.mathworks.com
Clustering numeric and string data together is generally not possible.
Can you post (a sample of) your matrix?

How to compute the outer product of two binary vectors

I am generating a random binary matrix with a specific number of ones in each row. Now, I want to take each row in the matrix and multiply it by its transpose (i.e row1'*row1).
So, I am using row1=rnd_mat(1,:) to get the first row. However, in the multiplication step I get this error
"Both logical inputs must be scalar. To compute elementwise TIMES, use TIMES (.*) instead."
Knowing that I don't want to compute element-wise, I want to generate a matrix using the outer product. I tried to write row1 manually using [0 0 1 ...], and tried to find the outer product. I managed to get the matrix I wanted.
So, does anyone have some ideas on how I can do this?
Matrix multiplication of logical matrices or vectors is not supported in MATLAB. That is the reason why you are getting that error. You need to convert your matrix into double or another valid numeric input before attempting to do that operation. Therefore, do something like this:
rnd_mat = double(rnd_mat); %// Cast to double
row1 = rnd_mat(1,:);
result = row1.'*row1;
What you are essentially computing is the outer product of two vectors. If you want to avoid casting to double, consider using bsxfun to do the job for you instead:
result = bsxfun(@times, row1.', row1);
This way, you don't need to cast your matrix before doing the outer product. Remember, the outer product of two vectors is simply an element-wise multiplication of two matrices: one matrix consists of the row vector stacked so that each row is a copy of it, while the other consists of the column vector repeated so that each column is a copy of it.
bsxfun automatically broadcasts each row vector and column vector so that we produce two matrices of compatible dimensions, and performs an element by element multiplication, thus producing the outer product.

How to produce a data set in MATLAB in which variables will correlate to a prespecified level?

I'm looking to create a data set with three columns and an arbitrary number of rows.
I'd like column 1 to have a Pearson correlation .20 with column 2, column 1 to correlate .24 with column 3, and column 2 to correlate .3 with column 3.
How do I produce this?
You can generate multivariate Gaussian data using mvnrnd and specify a covariance matrix that gives the desired Pearson correlations.
see the following documentation:
http://www.mathworks.com/help/stats/mvnrnd.html
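As a sketch of what that covariance could look like, assuming each column is given unit variance (so the covariance matrix equals the correlation matrix), the three target correlations from the question fill the off-diagonal entries. The symbol R is introduced here for illustration:

```latex
R =
\begin{pmatrix}
1    & 0.20 & 0.24 \\
0.20 & 1    & 0.30 \\
0.24 & 0.30 & 1
\end{pmatrix},
\qquad
\det R = 1 + 2(0.20)(0.24)(0.30) - 0.20^2 - 0.24^2 - 0.30^2 = 0.8412 > 0
```

The leading minors (1, 0.96, and 0.8412) are all positive, so R is positive definite and is a valid covariance matrix. For a finite number of rows, the sample correlations will only approximate these targets.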

Weighted sum of elements in matrix - Matlab?

I have two 50 x 6 matrices, say A and B. I want to assign weights to each element of columns in matrix - more weight to elements occurring earlier in a column and less weight to elements occurring later in the same column...likewise for all 6 columns. Something like this:
cumsum(weight(row)*(A(row,col)-B(row,col))); % cumsum is the cumulative sum
How can we do it efficiently without using loops?
If you have your weight vector w as a 50x1 vector, then you can rewrite your code as
cumsum(repmat(w,1,6).*(A-B))
BTW, I don't know why you have the cumsum operating on a scalar in a loop... it has no effect. I'm assuming that you meant that's what you wanted to do with the entire matrix. Calling cumsum on a matrix will operate along each column by default. If you need to operate along the rows, you should call it with the optional dimension argument as cumsum(x,2), where x is whatever matrix you have.
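Written out, the vectorized expression with the 50x1 weight vector w computes, for each row r and column c, the following running weighted sum (the name C for the result is introduced here for illustration):

```latex
C_{r,c} = \sum_{i=1}^{r} w_i \,\bigl(A_{i,c} - B_{i,c}\bigr),
\qquad r = 1,\dots,50,\quad c = 1,\dots,6
```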