Matrix multiplication in py-spark using RDD - pyspark

I have two matrices
# 3x3 matrix
X = [[10,7,3],[3 ,2,6],[5 ,8,7]]
# 3x4 matrix
Y = [[3,7,11,2],[2,7,4,10],[8,7,6,11]]
I want to multiply these two in Spark using RDDs. Can someone help me with this? The multiplication should not use any built-in function.
I was able to multiply the two using for loops in Python as follows:
# iterate through rows of X
for i in range(len(X)):
    # iterate through columns of Y
    for j in range(len(Y[0])):
        # iterate through rows of Y
        for k in range(len(Y)):
            Output[i][j] += X[i][k] * Y[k][j]
# Output is a 3x4 empty matrix
I am new to Spark and am using PySpark.

It is not so hard; you just have to write your matrix in a different notation, as a list of (row, column, value) tuples.
X = [[10,7,3],[3,2,6],[5,8,7]]
can be written as
X = (0,0,10),(0,1,7),(0,2,3)...
rdd_x = sc.parallelize([(0,0,10),(0,1,7),(0,2,3)...])
rdd_y = sc.parallelize([(0,0,3),(0,1,7),(0,2,11)...])
Now you can do the multiplication using either join or cartesian.
E.g.,
rdd_x.cartesian(rdd_y)\
.filter(lambda x: x [0][0] == x[1][1] and x[0][1] == x[1][0])\
.map(lambda x: (x[0][0],x[0][2] * x[1][2])).reduceByKey(lambda x,y: x+y).collect()

Your code works, but you should initialize Output, and only once:
Output = [[0] * 4 for _ in range(3)]
You're not using RDDs though, your teacher won't be happy.
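A note on the initialization: in Python, multiplying an outer list, as in [[0]*4]*3, repeats a reference to one inner list, so a write to any row shows up in every row. The sketch below contrasts that with a list comprehension, which builds each row separately. The variable names are illustrative only.

```python
# Outer multiplication repeats a reference to the same inner list.
aliased = [[0] * 4] * 3
aliased[0][0] = 99          # intended to change row 0 only

# A comprehension evaluates [0] * 4 once per row, giving independent rows.
independent = [[0] * 4 for _ in range(3)]
independent[0][0] = 99

print(aliased)
print(independent)
```

With the aliased version, the accumulation Output[i][j] += ... would add every row's contribution into the same shared list.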

Based on Andrea's answer, I came up with this solution:
rdd_x.cartesian(rdd_y)\
.filter(lambda x: (x[0][1] == x[1][0]))\
.map(lambda x: ((x[0][0],x[1][1]),x[0][2] * x[1][2])).reduceByKey(lambda x,y: x+y).collect()
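To see what the cartesian, filter, map, and reduceByKey chain computes, here is a minimal plain-Python sketch that mimics those steps on ordinary lists. It does not use Spark at all; to_coordinates and multiply_coordinates are hypothetical helper names introduced for illustration, and the dictionary stands in for the keyed reduction.

```python
from collections import defaultdict
from itertools import product

X = [[10, 7, 3], [3, 2, 6], [5, 8, 7]]
Y = [[3, 7, 11, 2], [2, 7, 4, 10], [8, 7, 6, 11]]

def to_coordinates(matrix):
    """Flatten a nested list into (row, column, value) tuples."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]

def multiply_coordinates(x_entries, y_entries):
    """Plain-Python stand-in for cartesian -> filter -> map -> reduceByKey."""
    sums = defaultdict(int)
    # cartesian: every entry of X paired with every entry of Y
    for (xi, xk, xv), (yk, yj, yv) in product(x_entries, y_entries):
        # filter: keep pairs where the column of X equals the row of Y
        if xk == yk:
            # map to ((i, j), product) and reduce by key with addition
            sums[(xi, yj)] += xv * yv
    return dict(sums)

result = multiply_coordinates(to_coordinates(X), to_coordinates(Y))
print(sorted(result.items()))
```

The key is the pair (row of X, column of Y), which is why each output cell gets its own sum.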

Trying to solve an error problem with Matlab v 9.13 when doing matrix multiplication

Here is the problem and current code I am using.
Use the svd() function in MATLAB to compute A2, the rank-2 approximation of A. Clearly state what A2 is, rounded to 4 decimal places. Also, compute the root-mean-square error (RMSE) between A and A2. Which approximation is better, or ? Explain.
Solution:
%code
[U , S, V] = svd(A)
k = 2
V = V(:,1:k)
V = transpose(V)
Ak = U(:,1:k) .* S(1:k,1:k) .* V
diffA = A - A2
fro_norm = norm(diffA,'fro')
RMSE2 = (fro_norm)/sqrt(m*n)
However, when I run it, the line Ak = ... keeps giving an error because the matrix sizes are not compatible. I understand that the matrix sizes need to match in order to do the multiplication, but I also know what the problem requires: when k = 2, U uses its first 2 columns, S uses its first 2 rows and first 2 columns, and V is the transpose of its first 2 columns.
I must be missing something in my understanding of the calculation, or in how I create the rank-k submatrices. The matrix I have to use is 3 x 3.
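For reference, the standard truncated-SVD formulas behind this exercise can be written out with their dimensions. This is a sketch under the assumption that A is m x n (here 3 x 3); the subscripted names U_k, \Sigma_k, and V_k are introduced here for illustration and stand for the first k columns of U, the leading k x k block of S, and the first k columns of V. The products are ordinary matrix products (the * operator in MATLAB), whereas .* is the element-wise product and expects operands of compatible array sizes.

```latex
A_k = U_k \, \Sigma_k \, V_k^{\mathsf{T}},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k^{\mathsf{T}} \in \mathbb{R}^{k \times n}

\mathrm{RMSE}_k = \frac{\lVert A - A_k \rVert_F}{\sqrt{m n}}
```

The inner dimensions (k and k) match at each product, so the result is m x n, the same shape as A.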

Reference elements in matrices in Julia while assigning values in other matrices

I have a vector X whose elements are zeros and ones. I want to create another vector Z of the same size as X, where each element of Z is 0 if the corresponding element of X is zero, and otherwise is a random draw from a uniform distribution. In Matlab I can easily do this by:
n = 1000;
X = randi([0, 1], [1, n]);
Z(X) = rand(); % Here, wherever X takes a value of 1, the element of Z is a draw from a uniform distribution.
I want to implement this in Julia. Is there a cleaner way of doing this than using if conditionals? Thanks!

Here's one way to do it:
julia> n = 1000;
julia> x = rand(Bool, n);
julia> z = zeros(n);
julia> using Distributions
julia> z[x] .= rand.(Uniform(-10, 10));
julia> z
1000-element Vector{Float64}:
-2.6946644136672004
0.0
0.0
⋮
You can adjust the parameters of the Uniform distribution to what you need, or leave that argument out if the default [0, 1) range is what you need.
The line z[x] .= rand.(Uniform(-10, 10)) uses Julia's logical indexing (same as MATLAB) and broadcasting features - for every x value that is true, the rand call is made and the result assigned to that element of z.
The advantage of using the broadcast (compared to, e.g., creating rand(Uniform(-10, 10), count(x)) and assigning that to z[x]) is that the values are assigned in place to their destination in z, so no extra memory is allocated (as mentioned by @DNF in the comments).

First of all, your Matlab code doesn't work in Matlab, for two reasons. Firstly, logical indices must be of logical (boolean) type; numeric 0 and 1 are not accepted. Secondly, Z(X) = rand() draws only a single random number and assigns it to all the corresponding elements of Z.
Instead, you may want something like this Matlab code:
X = rand(1, n) > 0.5
Z(X) = rand(sum(X), 1)
In Julia you could do
X = rand(Bool, n)
Z = float.(X) # you have to initialize Z
Z[X] .= rand.()
Edit: Here's an alternative with a comprehension, where you don't need to initialize Z:
X = rand(Bool, n)
Z = [x ? rand() : 0.0 for x in X]
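The same per-element idea can be sketched in Python for comparison: one fresh uniform draw for every true flag and 0.0 for every false flag. This is only an illustration of the logic, not Julia code; masked_uniform and its low/high parameters are hypothetical names chosen here.

```python
import random

def masked_uniform(flags, low=0.0, high=1.0):
    """Return one fresh uniform draw per True flag and 0.0 per False flag."""
    return [random.uniform(low, high) if flag else 0.0 for flag in flags]

n = 1000
x = [random.random() < 0.5 for _ in range(n)]  # random boolean mask
z = masked_uniform(x)

print(z[:5])
```

Because random.uniform is called inside the comprehension, each true position gets its own draw, which is the behaviour the single rand() call in the original Matlab line does not give.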

Technically, what you are sampling from here is a left-censored uniform distribution, equivalent to the mixture of a Dirac distribution at 0 and Uniform(0, 1). The next release of Distributions.jl will have an implementation for censored, which will remove the need to do any fancy assignment at all:
Z = rand(censored(Uniform(-1.0, 1.0), lower=0.0), n)
where the extent to the left is chosen so that the mixture components have equal weight.

Cosine similarity of two sparse vectors in Scala Spark

I have a dataframe with two columns where each row has a Sparse Vector. I try to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial to do it for Sparse vectors.
The only way I found is the following:
Create a k X n matrix, where n items are described as k-dimensioned vectors. For representing each item as a k dimension vector, you can use ALS which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k X n matrix can be represented as RDD[Vector].
Convert this k X n matrix to RowMatrix.
Use columnSimilarities() function to get a n X n matrix of similarities between n items.
I feel it is overkill to calculate the cosine similarity for every pair when I need it only for the specific pairs in my (quite big) dataframe.

In Spark 3 there is now a dot method on SparseVector objects, which takes another vector as its argument.
If you want to do this in earlier versions, you could create a user defined function that follows this algorithm:
Take intersection of your vectors' indices.
Get two subarrays of your vectors' values based on the indices from the intersection.
Do pairwise multiplication of the elements of those two subarrays.
Sum the values resulting from that pairwise multiplication.
Here's my implementation of it:
import org.apache.spark.ml.linalg.SparseVector
def dotProduct(vec: SparseVector, vecOther: SparseVector) = {
  val commonIndices = vec.indices intersect vecOther.indices
  commonIndices.map(x => vec(x) * vecOther(x)).reduce(_+_)
}
I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
And if you normalize your sparse vectors with org.apache.spark.ml.feature.Normalizer before computing your dot product, you'll get cosine similarity in the end (by definition).
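The four steps and the normalization remark can be sketched in plain Python on (indices, values) pairs. This is not Spark code: sparse_dot and sparse_cosine are hypothetical helper names, the vectors are assumed to have no duplicate indices, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    lookup_b = dict(zip(indices_b, values_b))
    # Only indices present in both vectors contribute to the sum;
    # with no common index the sum is simply 0.0.
    return sum((v * lookup_b[i] for i, v in zip(indices_a, values_a)
                if i in lookup_b), 0.0)

def sparse_cosine(indices_a, values_a, indices_b, values_b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    norm_a = math.sqrt(sum(v * v for v in values_a))
    norm_b = math.sqrt(sum(v * v for v in values_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return sparse_dot(indices_a, values_a, indices_b, values_b) / (norm_a * norm_b)

print(sparse_dot([0, 3], [1.0, 2.0], [3, 5], [4.0, 1.0]))
print(sparse_cosine([0, 3], [1.0, 2.0], [0, 3], [1.0, 2.0]))
```

Dividing by the norms at the end is equivalent to L2-normalizing both vectors first and then taking the dot product.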

Great answer above, @Sergey-Zakharov, +1.
A few additions:
The reduce doesn't work on empty sequences.
Make sure to compute the L2 normalization.
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{broadcast, col, udf}
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
val l2NormData = normalizer.transform(df_features)
and
val dotProduct = udf { (v1: SparseVector, v2: SparseVector) =>
  // reduceOption returns None when the vectors share no index, so fall back to 0.0
  v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}
and then
val df = dfA.crossJoin(broadcast(dfB))
  .withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))
If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Create a new table [cosine_vectors] that is a filter on the original table to only select the vectors you want the cosine similarities for. Broadcast join those two together and calculate.

Spark out of memory when reducing by key

I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n
1. For each vector, compute the pairwise Euclidean distance between the elements of the vector. Return two matrices E_u and E_v
2. For each entry in the two matrices, apply a function f. Return two matrices M_u, M_v
3. Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1
4. Compute the outer product of each eigenvector with itself. Return a matrix O_i = ev_i*transpose(ev_i), i = 0,...,n-1
5. Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = sum of all elements(elementwise product of O_i and M_v)/(2*mu), where mu is a parameter
6. Finally, return a matrix A = elementwise sum (e_i * O_i) over i = 0,...,n-1
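The last three steps (outer products, eigenvalue adjustment, final sum) can be sketched on small in-memory lists to pin down the arithmetic. This is a single-machine illustration, not a distributed implementation; outer and adjusted_reconstruction are hypothetical names, and the division by (2*mu) follows the poster's own code further down rather than the looser wording of the step.

```python
def outer(vec):
    """Outer product of a vector with itself, as a nested list."""
    return [[a * b for b in vec] for a in vec]

def adjusted_reconstruction(eigenvalues, eigenvectors, m_v, mu):
    """Sum over i of (e_i + delta_i) * O_i, with O_i = ev_i * transpose(ev_i)."""
    n = len(m_v)
    result = [[0.0] * n for _ in range(n)]
    for e_i, ev_i in zip(eigenvalues, eigenvectors):
        o_i = outer(ev_i)
        # delta_i: sum of the elementwise product of O_i and M_v, over 2*mu
        delta_i = sum(o_i[r][c] * m_v[r][c]
                      for r in range(n) for c in range(n)) / (2 * mu)
        # accumulate the adjusted eigenvalue times O_i into the result
        for r in range(n):
            for c in range(n):
                result[r][c] += (e_i + delta_i) * o_i[r][c]
    return result

# Toy input: eigenvectors are the standard basis vectors of R^2.
A = adjusted_reconstruction([2.0, 3.0], [[1.0, 0.0], [0.0, 1.0]],
                            [[4.0, 1.0], [1.0, 6.0]], mu=1.0)
print(A)
```

Each O_i is a dense n x n matrix, so for n = 15000 holding even one of them is about 1.8 GB in doubles, which is consistent with the memory pressure described.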
The main issue I'm facing is memory when n is large (15000 or more), since all the matrices here are dense. My current implementation may not be the best, and it only partially works.
I used a RowMatrix for M_u and got its eigendecomposition using SVD.
The resulting U factor of the SVD is a row matrix whose columns are the ev_i's, so I have to manually transpose it so that its rows become the ev_i's. The resulting e vector holds the eigenvalues e_i.
Since a previous attempt at directly mapping each row ev_i to O_i failed with an out-of-memory error, I'm currently doing
R = U.map {
    case (i, ev_i) => (i, ev_i.toArray.zipWithIndex)
  } // add an index to each element of the vector
  .flatMapValues(x => x)
  .join(U) // the eigenvector column is appended
  .map { case (eigenVecId, ((vecElement, elementId), eigenVec)) =>
    (elementId, (eigenVecId, vecElement * eigenVec))
  }
To compute adjusted e_i's in step 5 above, M_v is stored as rdd of tuples (i, denseVector). Then
deltaRdd = R.join(M_v)
  .map {
    case (j, ((i, row_j_of_O_i), row_j_of_M_v)) =>
      (i, row_j_of_O_i.t * DenseVector(row_j_of_M_v.toArray) / (2 * mu))
  }.reduceByKey(_ + _)
Finally, to compute A, again because of memory issues, I first have to join rows from different RDDs and then reduce by key. Specifically,
R_rearranged = R.map { case (j, (i, row_j_of_O_i)) => (i, (j, row_j_of_O_i)) }
termsForA = R_rearranged.join(deltaRdd)
A = termsForA.map {
    case (i, ((j, row_j_of_O_i), delta_i)) => (j, (delta_i + e(i)) * row_j_of_O_i)
  }
  .reduceByKey(_ + _)
The above implementation worked up to the termsForA step: if I execute an action on termsForA, like termsForA.take(1).foreach(println), it succeeds. But if I execute an action on A, like A.count(), an OOM error occurs on the driver.
I tried tuning Spark's configuration to increase driver memory as well as the parallelism level, but nothing worked.

Use IndexedRowMatrix instead of RowMatrix; it will help with the conversions and the transpose.
Suppose your IndexedRowMatrix is Irm:
svd = Irm.computeSVD(k, True)
U = svd.U
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.

I guess at some point Spark decided there was no need to carry out the operations on executors and did all the work on the driver. Actually, termsForA would fail as well on an action like count. Somehow I made it work by broadcasting deltaRdd and e.

Matlab double summation of series

I am trying to use a function in Matlab which will give me the following equation:
The x and a values are in two matrices. I have tried almost everything, but cannot get the correct answer. Can anyone help?
Thanks

Assuming A and X are vectors of size n x 1, you could construct that expression by writing transpose(X) * (sqrt(A * transpose(A)) .* (ones(n) - eye(n))) * X.

Another way to do this is
a = sqrt(ain); % ain is your input column vector
A = a*a.';
A = A-diag(diag(A));
aresult = x.'*A*x % x is your (other) input column vector
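Reading the two answers together, the quantity being computed appears to be the double sum over all i and j with i ≠ j of sqrt(a_i * a_j) * x_i * x_j. That reading is an assumption drawn from the answers, and it requires the a values to be non-negative. A small Python sketch checks the explicit double loop against a factored form; double_sum and double_sum_factored are hypothetical names.

```python
import math

def double_sum(a, x):
    """Sum of sqrt(a[i]*a[j]) * x[i] * x[j] over all pairs with i != j."""
    n = len(a)
    return sum(math.sqrt(a[i] * a[j]) * x[i] * x[j]
               for i in range(n) for j in range(n) if i != j)

def double_sum_factored(a, x):
    """Same value via (sum of s_i)^2 - sum of s_i^2, with s_i = sqrt(a[i]) * x[i]."""
    s = [math.sqrt(ai) * xi for ai, xi in zip(a, x)]
    return sum(s) ** 2 - sum(si * si for si in s)

a = [1.0, 4.0, 9.0]
x = [1.0, 2.0, 3.0]
print(double_sum(a, x), double_sum_factored(a, x))
```

Subtracting the diagonal terms in the factored form plays the same role as A - diag(diag(A)) in the second answer and the ones(n) - eye(n) mask in the first.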