How to avoid for loop with Spark? - scala

I'm new to Spark and don't understand how the MapReduce mechanism works with it. I have one CSV file containing only doubles. What I want is to perform an operation (compute the Euclidean distance) between the first vector and the rest of the RDD, then iterate with the other vectors. Does another way exist besides this one? Maybe using the cartesian product wisely...
val rdd = sc.parallelize(Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...))
val array_vects = rdd.collect
val size = rdd.count
val emptyArray = Array((0, 0.0)).tail
var rdd_rez = sc.parallelize(emptyArray)
for (ind <- 0 to size.toInt - 1) {
  val vector = array_vects(ind)._2
  val rest = rdd.filter(x => x._1 != ind)
  val rdd_dist = rest.map(x => (x._1, Vectors.sqdist(x._2, vector)))
  rdd_rez = rdd_rez ++ rdd_dist
}
Thank you for your support.

The distances (between all pairs of vectors) can be calculated using rdd.cartesian:
val rdd = sc.parallelize(Array((1, Vectors.dense(1, 2)),
                               (2, Vectors.dense(3, 4)), ...))
val product = rdd.cartesian(rdd)
val result = product.filter { case ((a, b), (c, d)) => a != c }
                    .map { case ((a, b), (c, d)) => (a, Vectors.sqdist(b, d)) }
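If, as in the question, you then want the distances grouped per source vector, a small follow-up sketch (reusing the result RDD above, not tested) could be:
// collect all distances keyed by the id of the first vector in each pair
val distancesPerVector = result.groupByKey() // RDD[(Int, Iterable[Double])]
distancesPerVector.collect().foreach(println)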

I don't see why you were trying to do something like that; you can simply do it as follows:
val initialArray = Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...)
val firstVector = initialArray(0)._2
val initialRdd = sc.parallelize(initialArray)
val euclideanRdd = initialRdd.map { case (i, vec) => (i, euclidean(firstVector, vec)) }
where we define a function euclidean which takes two dense vectors and returns the Euclidean distance.
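One possible definition of that helper (a sketch, assuming the mllib Vector type) is:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Euclidean distance as the square root of the squared distance provided by mllib
def euclidean(v1: Vector, v2: Vector): Double = math.sqrt(Vectors.sqdist(v1, v2))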

Related

Matrix Vector multiplication in Scala

I have a Matrix of size D by D (implemented as List[List[Int]]) and a Vector of size D (implemented as List[Int]). Assuming D = 3, I can create the matrix and vector in the following way:
val Vector = List(1,2,3)
val Matrix = List(List(4,5,6) , List(7,8,9) , List(10,11,12))
I can multiply both of these as:
(Matrix,Vector).zipped.map((x,y) => (x,Vector).zipped.map(_*_).sum )
This code multiplies the matrix with the vector and returns the vector I need. Is there any faster or more optimal way to get the same result in a functional style? In my scenario I have a much bigger value of D.
What about something like this?
def vectorDotProduct[N : Numeric](v1: List[N], v2: List[N]): N = {
  import Numeric.Implicits._

  // You may replace this with a while loop over two iterators if you require even more speed.
  @annotation.tailrec
  def loop(remaining1: List[N], remaining2: List[N], acc: N): N =
    (remaining1, remaining2) match {
      case (x :: tail1, y :: tail2) =>
        loop(
          remaining1 = tail1,
          remaining2 = tail2,
          acc + (x * y)
        )
      case (Nil, _) | (_, Nil) =>
        acc
    }

  loop(
    remaining1 = v1,
    remaining2 = v2,
    acc = Numeric[N].zero
  )
}

def matrixVectorProduct[N : Numeric](matrix: List[List[N]], vector: List[N]): List[N] =
  matrix.map(row => vectorDotProduct(vector, row))
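For example, applied to the Matrix and Vector from the question (values worked out by hand):
matrixVectorProduct(Matrix, Vector)
// List(32, 50, 68)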

Spark - calculate max ocurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A appears 20 times on day 2 and B 10 times on day 3.)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD

val rdd = sqlContext.sparkContext.makeRDD(Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
))

val keys = Seq("A", "B")

val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map { str =>
  val split = str.split(":")
  (s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}

keys.map { key =>
  key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}.toMap
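Note that this runs one Spark job per key. A single-pass alternative over the same seqOfMaps (just a sketch, not tested) could be:
// flatten the per-day count maps and keep, for each key, the day with the highest count
val bestDayPerKey = seqOfMaps
  .flatMap { case (day, counts) => counts.map { case (k, c) => (k, (day, c)) } }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
  .mapValues(_._1)
  .collect()
  .toMap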
The processing that needs to be done consists in transforming the data into an RDD on which it is easy to apply functions like "find the maximum of a list".
I will try to explain it step by step.
I've used dummy data for the "A" and "B" chars.
The foo rdd from the question is the first step; it gives you an RDD[(String, Array[String])].
Let's count the occurrences of each char in the Array[String]:
val res3 = foo.map { case (d, s) => (d, s.toList.groupBy(c => c).map { case (x, xs) => (x, xs.size) }.toList) }
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the rdd so it reads better:
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
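Putting all the steps together into one chain (a sketch, reusing the foo rdd defined in the question):
val result = foo
  .map { case (d, s) => (d, s.toList.groupBy(c => c).map { case (x, xs) => (x, xs.size) }.toList) }
  .flatMapValues(list => list)
  .map { case (d, (s, c)) => (s, c, d) }
  .groupBy(_._1)
  .map { case (s, list) => (s, list.maxBy(_._2)) }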
Hope this helps.
The previous answers are good, but I prefer this solution:
val data = Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
)

val initialRDD = sparkContext.parallelize(data)

// to tuples like (1, 'A', 18)
val charCountRDD = initialRDD.flatMap { s =>
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
}

// group by character, and take the max value from each grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)

How to convert RowMatrix to BDM (Breeze Dense Matrix) and more questions

I'm trying to convert a RowMatrix into a BDM (Breeze Dense Matrix), but I'm not sure how to proceed.
I need to implement:
def getDenseMatrix(A: RowMatrix): BDM[Double] = {
//write code here
}
Additional questions:
How to convert a RowMatrix into a Matrix?
How to access a particular Row in the RowMatrix?
for(i <- 0 to (RowM.numCols().toInt-1)){
//How to access RowM.rows(i)
}
How to access a particular column in the RowMatrix?
for(i <- 0 to (RowM.numCols().toInt-1)){
//How to access RowM.rows.map(f=>f(i))
}
How to multiply 2 RowMatrices?
Note: RowMatrix has a 'multiply' API, but it needs an argument of type Matrix.
Say A and B are RowMatrices; then AB = A.multiply(B) will not work, as B is a RowMatrix and not a Matrix.
And lastly how to convert a BDM to a RowMatrix?
Read the source code of RowMatrix; the conversion you want is already there as a private method. The code is below:
def toBreeze(mat: RowMatrix): BDM[Double] = {
  val m = mat.numRows()
  val n = mat.numCols()
  val result = BDM.zeros[Double](m.toInt, n.toInt)
  var i = 0
  mat.rows.collect().foreach { vector =>
    vector.foreachActive { case (index, value) =>
      result(i, index) = value
    }
    i += 1
  }
  result
}
If you don't have foreachActive available, use this:
def toBreeze(X: RowMatrix): BDM[Double] = {
  val m = X.numRows().toInt
  val n = X.numCols().toInt
  val mat = BDM.zeros[Double](m, n)
  var i = 0
  X.rows.collect().map {
    case sp: SparseVector => (sp.indices, sp.values)
    case dp: DenseVector  => (Range(0, n).toArray, dp.values)
  }.foreach { case (indices, values) =>
    indices.zip(values).foreach { case (j, v) => mat(i, j) = v }
    i += 1
  }
  mat
}
This is parallel up to a certain point and does the same thing.
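For the opposite direction (a local BDM back to a RowMatrix), a rough sketch could simply distribute the rows again; the name fromBreeze is an assumption here, not part of the RowMatrix API, and the matrix obviously has to fit on the driver:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import breeze.linalg.{DenseMatrix => BDM}

def fromBreeze(sc: SparkContext, bdm: BDM[Double]): RowMatrix = {
  // turn each Breeze row into an mllib dense vector and parallelize the rows
  val rows = (0 until bdm.rows).map(i => Vectors.dense(bdm(i, ::).t.toArray))
  new RowMatrix(sc.parallelize(rows))
}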

Averaging values at same position in List

The following code averages the values at the same position across the arrays:
val toadd = List(Array(8.0, 4.0), Array(5.0, 8.0), Array(7.0, 5.0))
val a1 = toadd.map(m => m(0)).sum
val a2 = toadd.map(m => m(1)).sum
(a1/toadd.size , a2/toadd.size)
Currently this just works for arrays of length 2.
How can this be modified so that it works for arrays of arbitrary length?
How about using transpose:
toadd.transpose.map(xs => xs.sum / xs.size)
// List(6.666666666666667, 5.666666666666667)
I like the idea of using transpose, as suggested by dhg. If you wanted to use more primitive functions, you could do:
toadd reduce {
  (x, y) => (x zip y) map {
    case (a, b) => a + b
  }
} map { a => a / toadd.length }
Or more concisely:
toadd.reduce(_.zip(_).map(a=>a._1+a._2)).map(_/toadd.length)
You want something like
val innerSize = toadd.map(_.length).min
and then map from 0 until innerSize instead of doing it manually with a1, a2, etc.
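A quick sketch of that idea (using the toadd list from the question):
val innerSize = toadd.map(_.length).min
// average each column index, ignoring positions beyond the shortest array
val averages = (0 until innerSize).map(i => toadd.map(_(i)).sum / toadd.size).toList
// List(6.666666666666667, 5.666666666666667)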

Applying operation to corresponding elements of Array

I want to sum the corresponding elements of the list and multiply the results, while keeping the label associated with the array element, so
("a", Array((0.5,1.0), (0.667,2.0)))
becomes:
(a, (0.5 + 0.667) * (1.0 + 2.0))
Here is my code to express this for a single array element:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
//> data : Array[(String, Array[(Double, Double)])] = Array((a,Array((0.5,1.0),
//| (0.667,2.0))), (b,Array((0.6,2.0), (0.6,2.0))))
val v1 = (data(0)._1, data(0)._2.map(m => m._1).sum)
//> v1 : (String, Double) = (a,1.167)
val v2 = (data(0)._1, data(0)._2.map(m => m._2).sum)
//> v2 : (String, Double) = (a,3.0)
val total = (v1._1 , (v1._2 * v2._2)) //> total : (String, Double) = (a,3.5010000000000003)
I just want to apply this function to all elements of the array, so val "data" above becomes:
Map[(String, Double)] = ((a,3.5010000000000003),(b,4.8))
But I'm not sure how to combine the above code into a single function which maps over all the array elements?
Update: the inner Array can be of variable length, so this is also valid:
val data = Array(("a",Array((0.5,1.0,2.0),(0.667,2.0,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
Pattern matching is your friend! You can use it for tuples and arrays. If there are always two elements in the inner array, you can do it this way:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
data.map {
  case (s, Array((x1, x2), (x3, x4))) => s -> (x1 + x3) * (x2 + x4)
}
// Array[(String, Double)] = Array((a,3.5010000000000003), (b,4.8))
res6.toMap
// scala.collection.immutable.Map[String,Double] = Map(a -> 3.5010000000000003, b -> 4.8)
If the inner elements are variable length, you could do it this way (a for comprehension instead of explicit maps):
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).sum
  sum2 = tuples.map(_._2).sum
} yield s -> sum1 * sum2
Note that while this is a very clear solution, it's not the most efficient possible, because we're iterating over the tuples twice. You could use a fold instead, but it would be much harder to read (for me anyway. :)
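For reference, a single-pass version with foldLeft might look roughly like this (a sketch for the pair case, not tested):
for {
  (s, tuples) <- data
  (sum1, sum2) = tuples.foldLeft((0.0, 0.0)) { case ((acc1, acc2), (x, y)) => (acc1 + x, acc2 + y) }
} yield s -> sum1 * sum2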
Finally, note that .sum will produce zero on an empty collection. If that's not what you want, you could do this instead:
val emptyDefault = 1.0 // Or whatever, depends on your use case
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).reduceLeftOption(_ + _).getOrElse(emptyDefault)
  sum2 = tuples.map(_._2).reduceLeftOption(_ + _).getOrElse(emptyDefault)
} yield s -> sum1 * sum2
You can use the Algebird numeric library:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
import com.twitter.algebird.Operators._
def sumAndProduct(a: Array[(Double, Double)]) = {
  val sums = a.reduceLeft((m, n) => m + n)
  sums._1 * sums._2
}
data.map{ case (x, y) => (x, sumAndProduct(y)) }
// Array((a,3.5010000000000003), (b,4.8))
It will work fine for variable-size arrays as well.
val data = Array(("a",Array((0.5,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
// Array((a,0.5), (b,4.8))
Like this? Does your array always have only 2 pairs?
val m = data map { case (label, Array(a, b)) => (label, (a._1 + b._1) * (a._2 + b._2)) }
m.toMap