How to slice a sparse matrix in Scala Breeze?

I create a sparse matrix in Scala Breeze, i.e. using http://www.scalanlp.org/api/breeze/linalg/CSCMatrix.html. Now I want to take a column slice from it. How do I do this?
Edit: there are some further requirements:
It is important to me that I can actually do something useful with the slice, e.g. multiply it by a float:
X(::,n) * 3.
It's also important to me that the resulting structure/matrix/vector remains sparse. Each column might have a dense dimension of several million, but in fact has only 600 entries or so.
I need to be able to use this to mutate the matrix, e.g.:
X(::,0) = X(::,1)

Slicing works the same as for DenseMatrix, which is discussed in the Quickstart.
val m1 = CSCMatrix((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16))
val m2 = m1(1 to 2, 1 to 2)
println(m2)
This prints:
6 7
10 11

I wrote my own slicer method in the end. Use it like this:
val col = root.MatrixHelper.colSlice( sparseMatrix, columnIndex )
Code:
// Copyright Hugh Perkins 2012
// You can use this under the terms of the Apache Public License 2.0
// http://www.apache.org/licenses/LICENSE-2.0
package root
import breeze.linalg._
object MatrixHelper {
  // Extract one column of a CSCMatrix as a SparseVector, copying only the
  // stored (non-zero) entries of that column.
  def colSlice(A: CSCMatrix[Double], colIndex: Int): SparseVector[Double] = {
    val size = A.rows
    // Entries of column c live at positions colPtrs(c) until colPtrs(c + 1)
    // in rowIndices and data.
    val rowStartIndex = A.colPtrs(colIndex)
    val rowEndIndex = A.colPtrs(colIndex + 1) - 1
    val capacity = rowEndIndex - rowStartIndex + 1
    val result = SparseVector.zeros[Double](size)
    result.reserve(capacity)
    var i = 0
    while (i < capacity) {
      val thisindex = rowStartIndex + i
      val row = A.rowIndices(thisindex)
      val value = A.data(thisindex)
      result(row) = value
      i += 1
    }
    result
  }
}
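For example, here is a quick usage sketch (the matrix contents are made up, and it assumes a Breeze version where SparseVector supports scalar multiplication with *):
import breeze.linalg._

// Hypothetical data: a tall, mostly-empty matrix.
val X = CSCMatrix.zeros[Double](1000000, 10)
X(678, 0) = 2.0
X(12345, 0) = 1.5

val col0 = root.MatrixHelper.colSlice(X, 0) // SparseVector[Double] with 2 active entries
val scaled = col0 * 3.0                     // scalar multiply; the result stays sparse
println(scaled.activeIterator.toList)       // should print List((678,6.0), (12345,4.5))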

Related

How do I iterate a sequence with varying starting positions

Say I have an array:
[10,12,20,50]
I can iterate through this array as normal, which looks at position 0, then 1, 2, and 3.
What if I wanted to start at any arbitrary position in the array and then go through all the numbers in order?
So the other permutations would be:
10,12,20,50
12,20,50,10
20,50,10,12
50,10,12,20
Is there a general function that would allow me to do this type of sliding iteration?
So, looking at the index positions, the above would be:
0,1,2,3
1,2,3,0
2,3,0,1
3,0,1,2
It would be great if some languages have this built in, but I also want to know the algorithm for doing it so I understand.
Let's iterate over an array.
val arr = Array(10, 12, 20, 50)
for (i <- 0 to arr.length - 1) {
  println(arr(i))
}
With output:
10
12
20
50
Pretty basic.
What about:
val arr = Array(10, 12, 20, 50)
for (i <- 2 to (2 + arr.length - 1)) {
  println(arr(i))
}
Oops. Out of bounds. But what if we modulo that index by the length of the array?
val arr = Array(10, 12, 20, 50)
for (i <- 2 to (2 + arr.length - 1)) {
  println(arr(i % arr.length))
}
20
50
10
12
Now you just need to wrap it up in a function that replaces 2 in that example with an argument.
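For example, a minimal sketch of such a wrapper (the name printFrom is just for illustration):
// Print every element, starting from an arbitrary index and wrapping around.
def printFrom[A](arr: Array[A], start: Int): Unit = {
  for (i <- start until start + arr.length) {
    println(arr(i % arr.length))
  }
}

printFrom(Array(10, 12, 20, 50), 2) // prints 20, 50, 10, 12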
There is no language built-in. There is a similar method, permutations, but it generates all permutations without preserving the order, which doesn't really fit your need.
Your requirement can be implemented with a simple algorithm that just concatenates two slices:
def orderedPermutation(in: List[Int]): Seq[List[Int]] = {
  for (i <- 0 until in.size) yield
    in.slice(i, in.size) ++ in.slice(0, i)
}
orderedPermutation(List(10,12,20,50)).foreach(println)

List whose elements depend on the previous ones

Suppose I have a list of increasing integers. If the difference of 2 consecutive numbers is less than a threshold, then we index them by the same number, starting from 0. Otherwise, we increase the index by 1.
For example: for the list (1,2,5,7,8,11,15,16,20) and the threshold = 3, the output will be: (0, 0, 1, 1, 1, 2, 3, 3, 4).
Here is my code:
val listToGroup = List(1,2,5,7,8,11,15,16,20)
val diff_list = listToGroup.sliding(2,1).map{ case List(i, j) => j - i }.toList
val thres = 2
var j = 0
val output_ = for (i <- diff_list.indices) yield {
  if (diff_list(i) > thres) {
    j += 1
  }
  j
}
val output = List.concat(List(0), output_)
I'm new to Scala and I feel the list is not used efficiently. How can this code be improved?
You can avoid the mutable variable by using scanLeft, which gives more idiomatic code:
val output = diff_list.scanLeft(0) { (count, i) =>
  if (i > thres) count + 1
  else count
}
// for the example input this yields List(0, 0, 1, 1, 1, 2, 3, 3, 4)
Your code shows some constructs which are usually avoided in Scala but are common when coming from procedural languages; for example, for(i <- diff_list.indices) ... diff_list(i) can be replaced with for(i <- diff_list).
Other than that, I think your code is efficient - you need to traverse the list anyway and you do it in O(N). I would not worry about efficiency here, more about style and readability.
My rewrite of the whole code into what I think is more natural Scala would be:
val listToGroup = List(1,2,5,7,8,11,15,16,20)
val thres = 2
val output = listToGroup.zip(listToGroup.drop(1)).scanLeft(0) { case (count, (i, j)) =>
  if (j - i > thres) count + 1
  else count
}
My adjustments to your code:
I use scanLeft to perform the result collection construction
I prefer x.zip(x.drop(1)) over x.sliding(2, 1) (constructing tuples seems a bit more efficient than constructing collections). You could also use x.zip(x.tail), but that does not handle an empty x (see the short check after this list)
I avoid the temporary result diff_list
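To make the empty-list point concrete, a quick check (illustrative only):
val empty = List.empty[Int]
empty.drop(1)            // List() -- dropping from an empty list is fine
empty.zip(empty.drop(1)) // List() -- so the zip is safely empty
// empty.tail            // would throw UnsupportedOperationException: tail of empty list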
val listToGroup = List(1, 2, 5, 7, 8, 11, 15, 16, 20)
val thres = 2
// scanLeft's seed (0) already provides the index for the first element, so no
// trailing .tail is needed: this yields List(0, 0, 1, 1, 1, 2, 3, 3, 4)
listToGroup
  .sliding(2)
  .scanLeft(0)((a, b) => if (b.tail.head - b.head > thres) a + 1 else a)
  .toList
You don't need to use a mutable variable; you can achieve the same with scanLeft.

Spark aggregateByKey - sum and running average in the same call

I am learning Spark and do not have experience with Hadoop.
Problem
I am trying to calculate the sum and average in the same call to aggregateByKey.
Let me share what I have tried so far.
Setup the data
val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)
Attempt to calculate the average in the same call to aggregateByKey. This does not work.
val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
  aggregateByKey(zeroValue1)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      val newAverage = newSum / newCount
      (newCount, newSum, newAverage)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
      val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
      (newCount1, newSum1, newAverage1)
    }
  ).
  collect.
  foreach(println)
Result: Prints a different average each time
First time: (1,(4,100,70.0))
Second time: (1,(4,100,52.0))
Just do the sum first, and then calculate the average in a separate operation. This works.
val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
  aggregateByKey(zeroValue2)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      (newCount, newSum)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      (newCount1, newSum1)
    }
  ).
  map(rec => {
    val category = rec._1
    val count = rec._2._1
    val sum = rec._2._2
    (category, count, sum, sum / count)
  }).
  collect.
  foreach(println)
Prints the same result every time:
(1,4,100,25)
I think I understand the difference between seqOp and combOp. Given that an operation can split data across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and combOp then combines the results received from the different partitions. Please correct me if this is wrong.
However, there is something very basic that I am not understanding. It looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.
The computation related to your average aggregation in seqOp:
val newAverage = newSum/newCount
and in combOp:
val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
is incorrect.
Let's say the first three elements are in one partition and the last element in another. Your seqOp would generate the (count, sum, average) tuples as follows:
Partition #1: [20, 25, 10]
--> (1, 20, 20/1)
--> (2, 45, 45/2)
--> (3, 55, 55/3)
Partition #2: [45]
--> (1, 45, 45/1)
Next, the cross-partition combOp would combine the 2 tuples from the two partitions to give:
((55 * 3) + (45 * 1)) / 4
// Result: 52
As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
Your second approach works because the average is, by definition, the total sum over the total count, so it is best calculated after the sum and count have been fully aggregated.
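That said, if you really want a single aggregateByKey call to carry the average along, one option (a sketch, not part of the answer above) is to track (count, average) and weight the partial averages by their counts in combOp, which is order-independent:
val zeroValue = (0, 0.0) // (count, average)
categoryPricesRdd.
  aggregateByKey(zeroValue)(
    (acc, price) => {
      val newCount = acc._1 + 1
      // incremental mean update within a partition
      (newCount, acc._2 + (price - acc._2) / newCount)
    },
    (acc1, acc2) => {
      val total = acc1._1 + acc2._1
      // combine partial averages weighted by their counts
      (total, (acc1._2 * acc1._1 + acc2._2 * acc2._1) / total)
    }
  ).
  collect.
  foreach(println)
// should print (1,(4,25.0)) for the sample data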

How to declare a sparse Vector in Spark with Scala?

I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.
Test file
1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10
Code
val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy( _(0) )
This results in grouped having these values
(2.0,CompactBuffer([2.0,7.0,5.0], [2.0,8.0,4.0], [2.0,9.0,10.0]))
(1.0,CompactBuffer([1.0,2.0,4.0], [1.0,3.0,5.0], [1.0,4.0,8.0]))
But I can't seem to figure out the next step. I need to take each line of grouped and create a vector for it, so that each line of the new RDD has a vector with the third value of the CompactBuffer at the index specified by the second value. In short, I want my data from the example to look like this:
[0, 0, 0, 0, 0, 0, 5.0, 4.0, 10.0, 0]
[0, 4.0, 5.0, 8.0, 0, 0, 0, 0, 0, 0]
I know I need to use a sparse vector, and that there are three ways to construct it. I've tried using a Seq of Tuple2(index, value) elements, but I cannot understand how to create such a Seq.
One possible solution is something like the following. First, let's convert the data to the expected types:
import org.apache.spark.rdd.RDD
val pairs: RDD[(Double, (Int, Double))] = data.map(_.split(" ") match {
  case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})
Next, find the maximum index (the size of the vectors):
val nCols = pairs.map{case (_, (i, _)) => i}.max + 1
Group and convert:
import org.apache.spark.mllib.linalg.SparseVector
def makeVector(xs: Iterable[(Int, Double)]) = {
  val (indices, values) = xs.toArray.sortBy(_._1).unzip
  new SparseVector(nCols, indices.toArray, values.toArray)
}

val transformed: RDD[(Double, SparseVector)] = pairs
  .groupByKey
  .mapValues(makeVector)
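As an aside, since the question asks specifically about the Seq of (index, value) pairs: the mllib factory Vectors.sparse accepts one directly. A minimal sketch, using the same 0-based indices as the code above (size 10 = nCols for this data):
import org.apache.spark.mllib.linalg.Vectors

// The row for label 1.0, equivalent to what makeVector produces for that key.
val row1 = Vectors.sparse(10, Seq((2, 4.0), (3, 5.0), (4, 8.0)))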
Another way you can handle this, assuming that the first elements can be safely converted to and from an integer, is to use CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = data.map(_.split(" ") match {
  case Array(label, idx, value) =>
    MatrixEntry(label.toInt, idx.toInt, value.toDouble)
})

val transformed: RDD[(Double, SparseVector)] = new CoordinateMatrix(entries)
  .toIndexedRowMatrix
  .rows
  .map(row => (row.index.toDouble, row.vector.toSparse)) // .toSparse so the type matches SparseVector

Spark - correlation matrix from file of ratings

I'm pretty new to Scala and Spark, and I'm not able to create a correlation matrix from a file of ratings. It's similar to this question, but I have sparse data in matrix form. My data looks like this:
<user-id>, <rating-for-movie-1-or-null>, ... <rating-for-movie-n-or-null>
123, , , 3, , 4.5
456, 1, 2, 3, , 4
...
The code that is most promising so far looks like this:
val corTest = sc.textFile("data/collab_filter_data.txt").map(_.split(","))
Statistics.corr(corTest, "pearson")
(I know the user_ids in there are a defect, but I'm willing to live with that for the moment)
I'm expecting output like:
1, .123, .345
.123, 1, .454
.345, .454, 1
It's a matrix showing how each user is correlated to every other user. Graphically, it would be a correlogram.
It's a total noob problem but I've been fighting with it for a few hours and can't seem to Google my way out of it.
I believe this code should accomplish what you want:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._
...
val corTest = input.map { case (line: String) =>
  val split = line.split(",").drop(1)
  split.map(elem => if (elem.trim.isEmpty) 0.0 else elem.toDouble)
}.map(arr => Vectors.dense(arr))

val corrMatrix = Statistics.corr(corTest)
Here, we map each line of your input into a String array, drop the user id element, replace the empty entries with 0.0, and finally create a dense vector from the resulting array. Also note that Pearson's method is used by default if no method is supplied.
When run in the shell with some examples, I see the following:
scala> val input = sc.parallelize(Array("123, , , 3, , 4.5", "456, 1, 2, 3, , 4", "789, 4, 2.5, , 0.5, 4", "000, 5, 3.5, , 4.5, "))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:16
scala> val corTest = ...
corTest: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[20] at map at <console>:18
scala> val corrMatrix = Statistics.corr(corTest)
...
corrMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.9037378388935388 -0.9701425001453317 ... (5 total)
0.9037378388935388 1.0 -0.7844645405527361 ...
-0.9701425001453317 -0.7844645405527361 1.0 ...
0.7709910794438823 0.7273340668525836 -0.6622661785325219 ...
-0.7513578452729373 -0.7560667258329613 0.6195855517393626 ...