Spark - correlation matrix from file of ratings - scala

I'm pretty new to Scala and Spark and I'm not able to create a correlation matrix from a file of ratings. It's similar to this question but I have sparse data in the matrix form. My data looks like this:
<user-id>, <rating-for-movie-1-or-null>, ... <rating-for-movie-n-or-null>
123, , , 3, , 4.5
456, 1, 2, 3, , 4
...
The code that is most promising so far looks like this:
val corTest = sc.textFile("data/collab_filter_data.txt").map(_.split(","))
Statistics.corr(corTest, "pearson")
(I know the user_ids in there are a defect, but I'm willing to live with that for the moment)
I'm expecting output like:
1, .123, .345
.123, 1, .454
.345, .454, 1
It's a matrix showing how each user is correlated to every other user. Graphically, it would be a correlogram.
It's a total noob problem but I've been fighting with it for a few hours and can't seem to Google my way out of it.

I believe this code should accomplish what you want:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._
...
val corTest = input.map { case (line: String) =>
  val split = line.split(",").drop(1)
  split.map(elem => if (elem.trim.isEmpty) 0.0 else elem.toDouble)
}.map(arr => Vectors.dense(arr))
val corrMatrix = Statistics.corr(corTest)
Here, we map each input line into a String array, drop the user-id element, replace empty fields with 0.0, and finally create a dense vector from the resulting array. Also, note that Pearson's method is used by default if no method is supplied.
When run in shell with some examples, I see the following:
scala> val input = sc.parallelize(Array("123, , , 3, , 4.5", "456, 1, 2, 3, , 4", "789, 4, 2.5, , 0.5, 4", "000, 5, 3.5, , 4.5, "))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:16
scala> val corTest = ...
corTest: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[20] at map at <console>:18
scala> val corrMatrix = Statistics.corr(corTest)
...
corrMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.9037378388935388 -0.9701425001453317 ... (5 total)
0.9037378388935388 1.0 -0.7844645405527361 ...
-0.9701425001453317 -0.7844645405527361 1.0 ...
0.7709910794438823 0.7273340668525836 -0.6622661785325219 ...
-0.7513578452729373 -0.7560667258329613 0.6195855517393626 ...
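As a sanity check on the numbers above, the Pearson coefficient that Statistics.corr computes for each pair of columns can be sketched in a few lines of plain Scala (illustrative only; the pearson helper below is not part of the MLlib API):

```scala
// A minimal sketch of the Pearson correlation that Statistics.corr
// computes for each pair of columns. Plain Scala, no Spark needed.
def pearson(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length && x.nonEmpty)
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  // covariance numerator and the two standard-deviation terms
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// Perfectly correlated and perfectly anti-correlated columns:
println(pearson(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0)))  // 1.0
println(pearson(Array(1.0, 2.0, 3.0), Array(3.0, 2.0, 1.0)))  // -1.0
```

Keep in mind that zero-filling the blanks treats "no rating" as a rating of 0, which shifts the column means and therefore the coefficients.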

Related

Adding Sparse Vectors 3.0.0 Apache Spark Scala

I am trying to create a function like the following to add
two org.apache.spark.ml.linalg.Vectors, i.e. two sparse vectors.
Such a vector could look like the following:
(28,[1,2,3,4,7,11,12,13,14,15,17,20,22,23,24,25],[0.13028398104008743,0.23648605632753023,0.7094581689825907,0.13028398104008743,0.23648605632753023,0.0,0.14218861229025295,0.3580566057240087,0.14218861229025295,0.13028398104008743,0.26056796208017485,0.0,0.14218861229025295,0.06514199052004371,0.13028398104008743,0.23648605632753023])
For e.g.
def add_vectors(x: org.apache.spark.ml.linalg.Vector,y:org.apache.spark.ml.linalg.Vector): org.apache.spark.ml.linalg.Vector = {
}
Let's look at a use case
val x = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val y = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want to output to be
Vectors.sparse(2, List(0,1), List(1,1))
Here's another case where they share the same indices
val x = Vectors.sparse(2, List(1), List(1))
val y = Vectors.sparse(2, List(1), List(1))
This output should be
Vectors.sparse(2, List(1), List(2))
I've realized doing this is harder than it seems. I looked into one possible solution: converting the vectors to Breeze, adding them there, and converting back, e.g. Addition of two RDD[mllib.linalg.Vector]'s. So I tried implementing this:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector) = {
  val dense_x = x.toDense
  val dense_y = y.toDense
  val bv1 = new DenseVector(dense_x.toArray)
  val bv2 = new DenseVector(dense_y.toArray)
  val vectout = Vectors.dense((bv1 + bv2).toArray)
  vectout
}
however this gave me an error in the last line
val vectout = Vectors.dense((bv1 + bv2).toArray)
Cannot resolve the overloaded method 'dense'.
I'm wondering why this error is occurring and how to fix it.
To answer my own question, I had to think about how sparse vectors are represented. A sparse vector requires three arguments: the number of dimensions, an array of indices, and an array of values. For example:
val indices: Array[Int] = Array(1,2)
val norms: Array[Double] = Array(0.5,0.3)
val num_int = 4
val vector: Vector = Vectors.sparse(num_int, indices, norms)
If I converted this SparseVector to an Array I would get the following.
code:
val choiced_array = vector.toArray
choiced_array.foreach(element => print(element + " "))
Output:
0.0 0.5 0.3 0.0
This is the dense representation of the vector. Once you convert both vectors to arrays, you can add them element-wise with the following code:
val add: Array[Double] = (vector.toArray, vector_2.toArray).zipped.map(_ + _)
This gives you an array holding the element-wise sum. Next, to create your new sparse vector, build the indices array:
val new_indices_pre = add.zipWithIndex.map { case (element, i) =>
  // use != 0.0 rather than > 0.0 so negative values are kept as well
  if (element != 0.0) i else -1
}
Then filter out all the -1 entries, which mark positions that hold a zero:
val new_indices = new_indices_pre.filter(element => element != -1)
Likewise, keep only the non-zero values from the summed array:
val final_add = add.filter(element => element != 0.0)
Lastly, we can make the new sparse Vector
Vectors.sparse(num_int,new_indices,final_add)
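For what it's worth, the steps above can be assembled into a single function. The sketch below is an illustrative stand-in, not the author's exact code: it works on plain (indices, values) arrays rather than ml.linalg vectors so it runs without Spark on the classpath, and wrapping the result in Vectors.sparse is then a one-liner.

```scala
// Sketch of sparse addition over (size, indices, values) triples,
// mirroring the steps above without needing Spark on the classpath.
// A mutable Map accumulates values per index, so shared indices add up.
def addSparse(size: Int,
              xIdx: Array[Int], xVal: Array[Double],
              yIdx: Array[Int], yVal: Array[Double]): (Array[Int], Array[Double]) = {
  val acc = scala.collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
  for ((i, v) <- xIdx.zip(xVal)) acc(i) += v
  for ((i, v) <- yIdx.zip(yVal)) acc(i) += v
  // drop entries that cancelled to exactly zero, keep indices sorted
  val (indices, values) = acc.toArray.filter(_._2 != 0.0).sortBy(_._1).unzip
  (indices, values)
}

// The two use cases from the question:
val (i1, v1) = addSparse(2, Array(0), Array(1.0), Array(1), Array(1.0))
println(i1.mkString(",") + " / " + v1.mkString(","))  // 0,1 / 1.0,1.0
val (i2, v2) = addSparse(2, Array(1), Array(1.0), Array(1), Array(1.0))
println(i2.mkString(",") + " / " + v2.mkString(","))  // 1 / 2.0
```

Accumulating through a Map handles the shared-index case without any special-casing.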

dot product of dense vectors with nulls

I am using spark with scala and trying to do the following.
I have two dense vectors (created using Vectors.dense), and I need to find their dot product. How can I accomplish this?
Also, I am creating the vectors from an input file which is comma-separated. However, some values are missing. Is there an easy way to read these values as zero instead of null when creating the vectors?
For example:
input file: 3,1,,,2
created vector: 3,1,0,0,2
Spark vectors are just wrappers around arrays; internally they are converted to Breeze vectors for vector/matrix operations. You can do that conversion manually to get the dot product:
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
val dv1: Vector = Vectors.dense(1.0, 0.0, 3.0)
val bdv1 = new BDV(dv1.toArray)
val dv2: Vector = Vectors.dense(2.0, 0.0, 0.0)
val bdv2 = new BDV(dv2.toArray)
scala> bdv1 dot bdv2
res3: Double = 2.0
For your second question, you can do something like this:
val v: String = "3,1,,,2"
scala> v.split("\\,").map(r => if (r == "") 0 else r.toInt)
res4: Array[Int] = Array(3, 1, 0, 0, 2)
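Putting the two parts together, here is a plain-Scala sketch (the helper names are made up) that parses comma-separated lines with missing values into zero-filled arrays and takes their dot product. Note the -1 limit passed to split, which keeps trailing empty fields that the default split silently drops:

```scala
// Parse a comma-separated line, treating empty fields as 0.0.
// split(",", -1) keeps trailing empty fields; the default split drops them.
def parseLine(s: String): Array[Double] =
  s.split(",", -1).map(r => if (r.trim.isEmpty) 0.0 else r.toDouble)

def dot(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (a, b) => a * b }.sum

val v1 = parseLine("3,1,,,2")   // [3.0, 1.0, 0.0, 0.0, 2.0]
val v2 = parseLine("1,2,3,4,5")
println(dot(v1, v2))            // 3*1 + 1*2 + 0 + 0 + 2*5 = 15.0
```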

in scala how do we aggregate an Array to determine the count per Key and the percentage vs total

I am trying to find an efficient way to compute the following:
Int1 = 1 or 0, Int2 = 1..k (where k = 3), and Double = 1.0
I want to count how many 1s and 0s there are for each value of k, and
I need to find the percentage each count represents of the total size of the array.
Input is :
val clusterAndLabel = sc.parallelize(Array((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
So in this example:
I have : 0,0 = 2 , 0,1 = 0
I have : 1,0 = 1 , 1,1 = 1
I have : 2,1 = 2 , 2,0 = 1
Total is 7 instances
I was thinking of doing some aggregation, but I am stuck on the thought that they are both considered a 2-key join.
If you want to find how many 1s and 0s there are, you can do:
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
this will give you an RDD[((Int,Int),Int)] containing exactly what you described, meaning: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them gathered by their first key, you can add this line:
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
this will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.
I don't really understand question 3; I can see two possible readings:
you want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do:
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
this will yield an RDD[(Int,Int)] which is a kind of frequency RDD: the key is the number of occurrences and the value is how many times that count appears. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0, but if it's not empty then it'll have one value and it will be your answer).
you want to know how many time the first key 3 occurs (once again 0 in your example). Then you start from rdd2 and do:
val rdd3 = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
once again it will yield either an empty array or an array of size 1 containing how many elements have 3 as their first key. Note that if you don't need to materialize rdd2, you can do it directly:
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
(for performance you might want to do the filter before reduceByKey also!)
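To check the logic without a cluster, the same aggregation can be run on a plain Scala collection, with groupBy standing in for reduceByKey (a sketch only; the value names are made up):

```scala
// The same aggregation on a plain Scala collection, with groupBy
// standing in for Spark's reduceByKey, so the logic is easy to check.
val clusterAndLabel = Seq((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0))

// Count per (cluster, label) pair -- the reduceByKey step
val counts: Map[(Int, Int), Int] =
  clusterAndLabel.groupBy(identity).map { case (k, v) => (k, v.size) }

// Percentage of the total per pair
val total = clusterAndLabel.size
val pct: Map[(Int, Int), Double] =
  counts.map { case (k, c) => (k, c.toDouble / total) }

println(counts((0, 0)))  // 2
println(total)           // 7
```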

How to declare a sparse Vector in Spark with Scala?

I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split into doubles, and then group the lines by their first column.
Test file
1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10
Code
val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy( _(0) )
This results in grouped having these values
(2.0,CompactBuffer([2.0,7.0,5.0], [2.0,8.0,4.0], [2.0,9.0,10.0]))
(1.0,CompactBuffer([1.0,2.0,4.0], [1.0,3.0,5.0], [1.0,4.0,8.0]))
But I can't seem to figure out the next step. I need to take each line of grouped and create a vector for it, so that each line of the new RDD has a vector holding the third value of each CompactBuffer entry at the index given by its second value. In short, I want the data from the example to look like this:
[0, 0, 0, 0, 0, 0, 5.0, 4.0, 10.0, 0]
[0, 4.0, 5.0, 8.0, 0, 0, 0, 0, 0, 0]
I know I need to use a sparse vector, and that there are three ways to construct it. I've tried using a Seq of tuple2(index, value), but I cannot understand how to create such a Seq.
One possible solution is something like the following. First let's convert the data to the expected types:
import org.apache.spark.rdd.RDD
val pairs: RDD[(Double, (Int, Double))] = data.map(_.split(" ") match {
  case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})
Next, find the maximum index (the size of the vectors):
val nCols = pairs.map{case (_, (i, _)) => i}.max + 1
group and convert:
import org.apache.spark.mllib.linalg.SparseVector
def makeVector(xs: Iterable[(Int, Double)]) = {
  val (indices, values) = xs.toArray.sortBy(_._1).unzip
  new SparseVector(nCols, indices.toArray, values.toArray)
}
val transformed: RDD[(Double, SparseVector)] = pairs
  .groupByKey
  .mapValues(makeVector)
Another way you can handle this, assuming that the first elements can be safely converted to and from integer, is to use CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = data.map(_.split(" ") match {
  case Array(label, idx, value) =>
    MatrixEntry(label.toInt, idx.toInt, value.toDouble)
})
import org.apache.spark.mllib.linalg.Vector

// IndexedRow exposes its values as a generic Vector, so type accordingly
val transformed: RDD[(Double, Vector)] = new CoordinateMatrix(entries)
  .toIndexedRowMatrix
  .rows
  .map(row => (row.index.toDouble, row.vector))
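The pipeline can likewise be checked without a cluster; in this sketch plain collections stand in for the RDDs and a zero-filled array stands in for the SparseVector:

```scala
// Plain-Scala check of the pairs -> grouped -> vector pipeline above;
// a zero-filled array stands in for the SparseVector.
val data = Seq("1 2 4", "1 3 5", "1 4 8", "2 7 5", "2 8 4", "2 9 10")

val pairs: Seq[(Double, (Int, Double))] = data.map(_.split(" ") match {
  case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})

val nCols = pairs.map { case (_, (i, _)) => i }.max + 1  // 10

val rows: Map[Double, Array[Double]] = pairs.groupBy(_._1).map { case (label, xs) =>
  val arr = Array.fill(nCols)(0.0)
  xs.foreach { case (_, (i, v)) => arr(i) = v }
  (label, arr)
}

println(rows(1.0).mkString(","))  // 0.0,0.0,4.0,5.0,8.0,0.0,0.0,0.0,0.0,0.0
```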

How to slice a sparse matrix in scala breeze?

I create a sparse matrix in scala breeze, i.e. using http://www.scalanlp.org/api/breeze/linalg/CSCMatrix.html. Now I want to get a column slice from it. How do I do this?
Edit: there are some further requirements:
It is important to me that I can actually do something useful with the slice, e.g. multiply it by a float:
X(::,n) * 3.0
It's also important to me that the resulting structure/matrix/vector remains sparse. Each column might have a dense dimension of several million, but in fact hold only 600 entries or so.
I need to be able to use this to mutate the matrix, e.g.:
X(::,0) = X(::,1)
Slicing works the same as for DenseMatrix, which is discussed in the Quickstart.
val m1 = CSCMatrix((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16))
val m2 = m1(1 to 2, 1 to 2)
println(m2)
This prints:
6 7
10 11
I wrote my own slicer method in the end. Use it like this:
val col = root.MatrixHelper.colSlice( sparseMatrix, columnIndex )
code:
// Copyright Hugh Perkins 2012
// You can use this under the terms of the Apache Public License 2.0
// http://www.apache.org/licenses/LICENSE-2.0
package root

import breeze.linalg._

object MatrixHelper {
  def colSlice(A: CSCMatrix[Double], colIndex: Int): SparseVector[Double] = {
    val size = A.rows
    val rowStartIndex = A.colPtrs(colIndex)
    val rowEndIndex = A.colPtrs(colIndex + 1) - 1
    val capacity = rowEndIndex - rowStartIndex + 1
    val result = SparseVector.zeros[Double](size)
    result.reserve(capacity)
    var i = 0
    while (i < capacity) {
      val thisindex = rowStartIndex + i
      val row = A.rowIndices(thisindex)
      val value = A.data(thisindex)
      result(row) = value
      i += 1
    }
    result
  }
}
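For anyone puzzling over the colPtrs/rowIndices/data walk in colSlice, here is the CSC layout in miniature for a hypothetical 3x3 matrix (plain Scala, no Breeze required):

```scala
// CSC (compressed sparse column) layout for the 3x3 matrix
//   [[1, 0, 2],
//    [0, 3, 0],
//    [0, 0, 4]]
// data holds the non-zeros column by column; rowIndices holds each value's
// row; colPtrs(j) until colPtrs(j+1) is the range of entries in column j.
val colPtrs    = Array(0, 1, 2, 4)
val rowIndices = Array(0, 1, 0, 2)
val data       = Array(1.0, 3.0, 2.0, 4.0)

def colEntries(j: Int): Seq[(Int, Double)] =
  (colPtrs(j) until colPtrs(j + 1)).map(k => (rowIndices(k), data(k)))

println(colEntries(2))  // Vector((0,2.0), (2,4.0))
```

This is exactly the walk the while loop in colSlice performs before copying the entries into a SparseVector.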