Spark Scala: How to work with each 3 elements of rdd?

Hello everyone.
I have the following problem:
I have a very big RDD (billions of elements) like:
Array[((Int, Int), Double)] = Array(((0,0),729.0), ((0,1),169.0), ((0,2),1.0), ((0,3),5.0), ...... ((34,45),34.0), .....)
I need to perform the following operation:
take the value of each element with key (i, j) and add to it
min(rdd_value[(i-1, j)], rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])
How can I do this without using collect()? After collect() I get a Java out-of-memory error because my RDD is very big.
Thank you very much!
I am trying to implement this algorithm from Python, where the time series are RDDs.
from math import sqrt

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])
    return sqrt(DTW[(len(s1)-1, len(s2)-1)])
Now I need to perform the last operation from the inner loop (the dist values are already calculated).
Example:
Input (like matrix):
4 5 1
7 2 3
9 0 1
The RDD looks like:
rdd.take(10)
Array(((1,1), 4), ((1,2), 5), ((1,3), 1), ((2,1), 7), ((2,2), 2), ((2,3), 3), ((3,1), 9), ((3,2), 0), ((3,3), 1))
I want to do this operation
rdd_value[(i, j)] = rdd_value[(i, j)] + min(rdd_value[(i-1, j)],rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])
For example:
((1, 1), 4) = 4 + min(infinity, infinity, 0) = 4 + 0 = 4
4 5 1
7 2 3
9 0 1
Then
((1, 2), 5) = 5 + min(infinity, 4, infinity) = 5 + 4 = 9
4 9 1
7 2 3
9 0 1
Then
....
Then
((2, 2), 2) = 2 + min(7, 9, 4) = 2 + 4 = 6
4 9 1
7 6 3
9 0 1
Then
.....
((3, 3), 1) = 1 + min(3, 0, 2) = 1 + 0 = 1

A short answer is that the problem you are trying to solve cannot be efficiently and concisely expressed using Spark. It doesn't really matter if you choose plain RDDs or distributed matrices.
To understand why, you'll have to think about the Spark programming model. A fundamental Spark concept is a graph of dependencies where each RDD depends on one or more parent RDDs. If your problem was defined as follows:
given an initial matrix M_0
for i <- 1..n
find matrix M_i where M_i(m,n) = M_{i-1}(m,n) + min(M_{i-1}(m-1,n), M_{i-1}(m-1,n-1), M_{i-1}(m,n-1))
then it would be trivial to express using Spark API (pseudocode):
rdd
.flatMap(lambda ((i, j), v):
[((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v)])
.reduceByKey(min)
.union(rdd)
.reduceByKey(add)
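The same thing sketched in Scala, under the assumption that rdd is an RDD[((Int, Int), Double)] (this covers only the easy, hypothetical case above, not the actual problem):
rdd
  .flatMap { case ((i, j), v) =>
    // each cell offers its value as a candidate neighbour to the cells
    // below, to the right, and diagonally below-right
    Seq(((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v))
  }
  .reduceByKey((a, b) => math.min(a, b))  // min over the candidate neighbours
  .union(rdd)                             // bring back the original values
  .reduceByKey(_ + _)                     // add the neighbour minimum to each value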
Unfortunately, you are trying to express dependencies between individual values in the same data structure. Spark aside, this is a problem which is much harder to parallelize, not to mention distribute.
This type of dynamic programming is hard to parallelize because at different points it is completely or almost completely sequential. When you try to compute, for example, M_i(0,0) or M_i(m,n), there is nothing to parallelize. It is hard to distribute because it can generate complex dependencies between blocks.
There are non-trivial ways to handle this in Spark, by computing individual blocks and expressing dependencies between these blocks, or by using iterative algorithms and propagating messages over an explicit graph (GraphX), but this is far from easy to do right.
At the end of the day there are tools which can be a much better choice for this type of computation than Spark.


scala mixing view and strict collection in for expression

This piece of Scala code mixes a view with a strict List in a for expression:
val list = List.range(1, 4)
def compute(n: Int) = {
  println("Computing " + n)
  n * 2
}
val view = for (n <- list.view; k <- List(1, 2)) yield compute(n)
val x = view(0)
The output is:
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
Computing 1
Computing 1
I expected it to print just the last two "Computing 1" lines. Why did it compute all the values eagerly? And why did it then recompute the values again?
Arguably, access by index forces the view to be computed. Also, notice that you flatMap the list with something which is not lazy (k is not a view).
Compare the following:
// 0) Your example
val v0 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v0(0) // Computing 1
// Computing 1
// Computing 2
// Computing 2
// Computing 3
// Computing 3
// Computing 1
// Computing 1
v0(0) // Computing 1
// Computing 1
// 1) Your example, but access by head and not by index
val v1 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v1.head // Computing 1
// Computing 1
// 2) Do not mix views and strict lists
val v2 = List.range(1, 4).view.flatMap(n => List(1,2).view.map(k => compute(n)))
v2(0) // Computing 1
Regarding example 0, notice that views are not like streams; while streams do cache their results, lazy views do not (they just compute lazily, i.e., by-need, on access). It seems that indexed-access requires computing the entire list, and then another computation is needed to actually access the element by index.
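As a small sketch of that difference (Scala 2.12, matching the stack traces below, and reusing the compute helper from the question): views recompute on every access, while Streams memoize what they have already evaluated.
def compute(n: Int) = { println("Computing " + n); n * 2 }

val v = List(1, 2, 3).view.map(compute)      // prints nothing: the view is lazy
v.head                                        // prints "Computing 1"
v.head                                        // prints "Computing 1" again: no caching

val s = List(1, 2, 3).toStream.map(compute)  // prints "Computing 1" (a Stream's head is strict)
s(2)                                          // prints "Computing 2" and "Computing 3"
s(2)                                          // prints nothing: the Stream memoized the values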
You may ask why indexed access in example 2 does not compute the entire list. This requires an understanding of how things work underneath; in particular, we can see the difference between the method calls of example 0 and example 2 in the following excerpts:
Example 0
java.lang.Exception scala.collection.SeqViewLike$FlatMapped.$anonfun$index$1(SeqViewLike.scala:75)
at scala.collection.SeqViewLike$FlatMapped.index(SeqViewLike.scala:74)
at scala.collection.SeqViewLike$FlatMapped.index$(SeqViewLike.scala:71)
at scala.collection.SeqViewLike$$anon$5.index$lzycompute(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$$anon$5.index(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.length(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$FlatMapped.length$(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$$anon$5.length(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:86)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
Example 2
java.lang.Exception scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:12)
at scala.collection.SeqViewLike$Mapped.apply(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$Mapped.apply$(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$$anon$4.apply(SeqViewLike.scala:196)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:88)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
In particular, you see that example 0 results in a call to FlatMapped.length (which needs to evaluate the entire list).
view here is a SeqView[Int,Seq[_]], which is immutable and recomputes every item when iterated over.
You could access just the first element by explicitly using .iterator:
# view.iterator.next
Computing 1
Computing 1
res11: Int = 2
Or explicitly make it a List (eg. if you need to reuse many entries):
# val view2: List[Int] = view.toList
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
view2: List[Int] = List(2, 2, 4, 4, 6, 6)
# view2(0)
res13: Int = 2

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each
partition for the op operator, and also the initial value for the
combine results from different partitions for the op operator -
this will typically be the neutral element (e.g. Nil for list
concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
zeroValue is added once for each partition and should be a neutral element - in the case of + it should be 0. The exact result will depend on the number of partitions, but it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and the whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for jobResult.
You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark that it needs to support 3 partitions in this RDD and that will enable it to run computations using 3 independent executors in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then fold the partition results from the 3 executors again with 5 as the initial value.
A plain Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList                   // List(List(1, 2), List(3, 4), List(5))
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))   // List(8, 12, 10)
val fold = listOfFolds.fold(5)(_ + _)                     // 5 + 8 + 12 + 10 = 35
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask - why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that this zero value for RDD[T] does not depend only on our type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate the "largest number greater than 15", or 15 itself, in our RDD:
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
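As a minimal sketch of the contrast (assuming rdd1 still holds List(1, 2, 3, 4, 5) as above):
// reduce can only combine elements that are actually in the RDD
val maxByReduce = rdd1.reduce({ case (acc, i) => Math.max(acc, i) })   // 5
// with fold, the zero value itself takes part in the computation
val maxOr15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })     // 15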

understanding aggregate in Scala

I am trying to understand aggregate in Scala. With one example I understood the logic, but the result of the second one I tried confused me.
Please let me know where I went wrong.
Code:
val list1 = List("This", "is", "an", "example");
val b = list1.aggregate(1)(_ * _.length(), _ * _)
1 * "This".length = 4
1 * "is".length = 2
1 * "an".length = 2
1 * "example".length = 7
4 * 2 = 8 , 2 * 7 = 14
8 * 14 = 112
the output also came as 112.
but for the below,
val c = list1.aggregate(1)(_ * _.length(), _ + _)
I thought it would be like this:
4, 2, 2, 7
4 + 2 = 6
2 + 7 = 9
6 + 9 = 15,
but the output still came as 112.
It seems to only be doing the operation I mentioned as the seqop, here _ * _.length().
Could you please explain, or correct me where I went wrong?
aggregate should be used to compute only associative and commutative operations. Let's look at the signature of the function :
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
B can be seen as an accumulator (and will be your output). You give an initial output value, then the first function tells how to add a value A to this accumulator and the second how to merge two accumulators. Scala "chooses" a way to aggregate your collection, but if your aggregation is not associative and commutative the output is not deterministic because the order matters. Look at this example:
val l = List(1, 2, 3, 4)
l.aggregate(0)(_ + _, _ * _)
If we create one accumulator and then aggregate all the values we get 1 + 2 + 3 + 4 = 10 but if we decide to parallelize the process by splitting the list in halves we could have (1 + 2) * (3 + 4) = 21.
So what happens in reality is that for a List, aggregate is the same as foldLeft, which explains why changing your second function didn't change the output. Where aggregate can be useful is in Spark, for example, or other distributed environments, where it may be useful to do the folding on each partition independently and then combine the results with the second function.
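To make that concrete, here is a sketch of a typical Spark use of aggregate - computing a sum and a count in a single pass (assuming a SparkContext named sc is in scope):
val nums = sc.parallelize(List(1, 2, 3, 4), numSlices = 2)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),       // seqop: fold one element into a partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)        // combop: merge accumulators from different partitions
)
val mean = sum.toDouble / count               // 2.5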

Using Scala Breeze to do numPy style broadcasting

Is there a generic way using Breeze to achieve what you can do using broadcasting in NumPy?
Specifically, if I have an operator I'd like to apply to two 3x4 matrices, I can apply that operation element-wise. However, what I have is a 3x4 matrix and a 3-element column vector. I'd like a function which produces a 3x4 matrix created from applying the operator to each element of the matrix with the element from the vector for the corresponding row.
So for a division:
2 4 6   /   2   =   1 2 3
3 6 9       3       1 2 3
If this isn't available, I'd be willing to look at implementing it.
You can use mapPairs to achieve what I 'think' you're looking for:
import breeze.linalg.{DenseMatrix, DenseVector}

val adder = DenseVector(1, 2, 3, 4)
val result = DenseMatrix.zeros[Int](3, 4).mapPairs {
  case ((row, col), value) => value + adder(col)
}
println(result)
1 2 3 4
1 2 3 4
1 2 3 4
I'm sure you can adapt what you want from simple 'adder' above.
Breeze now supports broadcasting of this sort:
scala> val dm = DenseMatrix( (2, 4, 6), (3, 6, 9) )
dm: breeze.linalg.DenseMatrix[Int] =
2 4 6
3 6 9
scala> val dv = DenseVector(2,3)
dv: breeze.linalg.DenseVector[Int] = DenseVector(2, 3)
scala> dm(::, *) :/ dv
res4: breeze.linalg.DenseMatrix[Int] =
1 2 3
1 2 3
The * operator says which axis to broadcast along. Breeze doesn't allow implicit broadcasting, except for scalar types.
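A small sketch of broadcasting along the other axis (my own illustration, not from the question: with dm(*, ::) the operation is applied to each row, so the vector length must match the number of columns):
dm(*, ::) + DenseVector(10, 20, 30)
// 12  24  36
// 13  26  39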

Understanding the scala substitution model through the use of sumInts method

I'm doing a Scala course and one of the examples given is the sumInts function, which is defined like:
def sumInts(a: Int, b: Int): Int =
  if (a > b) 0
  else a + sumInts(a + 1, b)
I've tried to understand this function better by outputting some values as it's being evaluated:
class SumInts {
  def sumInts(a: Int, b: Int): Int =
    if (a > b) 0
    else {
      println(a + " + sumInts(" + (a + 1) + " , " + b + ")")
      val res1 = sumInts(a + 1, b)
      val res2 = a
      val res3 = res1 + res2
      println("res1 is : " + res1 + ", res2 is " + res2 + ", res3 is " + res3)
      res3
    }
}
So the code:
object SumIntsMain {
  def main(args: Array[String]) {
    println(new SumInts().sumInts(3, 6))
  }
}
Returns the output:
3 + sumInts(4 , 6)
4 + sumInts(5 , 6)
5 + sumInts(6 , 6)
6 + sumInts(7 , 6)
res1 is : 0, res2 is 6, res3 is 6
res1 is : 6, res2 is 5, res3 is 11
res1 is : 11, res2 is 4, res3 is 15
res1 is : 15, res2 is 3, res3 is 18
18
Can someone explain how these values are computed? I've tried outputting all of the created variables, but I'm still confused.
manual-human-tracer on:
return sumInts(3, 6) | a = 3, b = 6
3 > 6 ? NO
return 3 + sumInts(3 + 1, 6) | a = 4, b = 6
4 > 6 ? NO
return 3 + (4 + sumInts(4 + 1, 6)) | a = 5, b = 6
5 > 6 ? NO
return 3 + (4 + (5 + sumInts(5 + 1, 6))) | a = 6, b = 6
6 > 6 ? NO
return 3 + (4 + (5 + (6 + sumInts(6 + 1, 6)))) | a = 7, b = 6
7 > 6 ? YEEEEES (return 0)
return 3 + (4 + (5 + (6 + 0))) = return 18.
manual-human-tracer off.
To understand what recursive code does, it's not necessary to analyze the recursion tree. In fact, I believe it's often just confusing.
Pretending it works
Let's think about what we're trying to do: We want to sum all integers starting at a until some integer b.
a + sumInts(a + 1 , b)
Let us just pretend that sumInts(a + 1, b) actually does what we want it to: Summing the integers from a + 1 to b. If we accept this as truth, it's quite clear that our function will handle the larger problem, from a to b correctly. Because clearly, all that is missing from the sum is the additional term a, which is simply added. We conclude that it must work correctly.
A foundation: The base case
However, this sumInts() must be built on something: The base case, where no recursion is involved.
if(a > b) 0
Looking closely at our recursive call, we can see that it makes certain assumptions: we expect a to be lower than b. This implies that the sum will look like this: a + (a + 1) + ... + (b - 1) + b. If a is bigger than b, this sum naturally evaluates to 0.
Making sure it works
Seeing that sumInts() always increases a by one in the recursive call guarantees that we will in fact hit the base case at some point.
Noticing further, that sumInts(b, b) will be called eventually, we can now verify that the code works: Since b is not greater than itself, the second case will be invoked: b + sumInts(b + 1, b). From here, it is obvious that this will evaluate to: b + 0, which means our algorithm works correctly for all values.
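A quick check of these claims, reusing the sumInts definition from the question:
def sumInts(a: Int, b: Int): Int =
  if (a > b) 0
  else a + sumInts(a + 1, b)

sumInts(7, 6)  // 0  : base case, a > b
sumInts(6, 6)  // 6  : b + sumInts(b + 1, b) = 6 + 0
sumInts(3, 6)  // 18 : 3 + 4 + 5 + 6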
You mentioned the substitution model, so let's apply it to your sumInts method:
We start by calling sumInts(3,4) (you've used 6 as the second argument, but I chose 4, so I can type less), so let's substitute 3 for a and 4 for b in the definition of sumInts. This gives us:
if(3 > 4) 0
else 3 + sumInts(3 + 1, 4)
So, what will the result of this be? Well, 3 > 4 is clearly false, so the end result will be equal to the else clause, i.e. 3 plus the result of sumInts(4, 4) (4 being the result of 3+1). Now we need to know what the result of sumInts(4, 4) will be. For that we can substitute again (this time substituting 4 for a and b):
if(4 > 4) 0
else 4 + sumInts(4 + 1, 4)
Okay, so the result of sumInts(4,4) will be 4 plus the result of sumInts(5,4). So what's sumInts(5,4)? To the substitutionator!
if(5 > 4) 0
else 5 + sumInts(5 + 1, 4)
This time the if condition is true, so the result of sumInts(5,4) is 0. So now we know that the result of sumInts(4,4) must be 4 + 0 which is 4. And thus the result of sumInts(3,4) must be 3 + 4, which is 7.