Operate on neighbor elements in RDD in Spark - scala

I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to subtract the current element from the next element.
But how can I implement this on an RDD in Spark?
I know that for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
there will be several partitions, and each partition is ordered. Can I do a similar operation on each partition and then combine the per-partition results?

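An efficient solution is to use the sliding method from MLlib's RDDFunctions:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)                     // order the elements first
  .sliding(2)                           // windows of two neighboring elements
  .map { case Array(x, y) => y - x }    // next element minus current element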
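For intuition, the same windowing can be tried on a local collection, where sliding is a standard method. This is only a sketch without Spark, and the names sortedValues and diffs are made up for illustration:

```scala
// Local (non-Spark) illustration of the sliding-window idea.
val sortedValues = List(1, 3, -1, 0, 2, -4, 6).sorted // List(-4, -1, 0, 1, 2, 3, 6)

// Each window holds two neighbors; subtract the first from the second.
// collect skips any window that does not have exactly two elements.
val diffs = sortedValues.sliding(2).collect { case List(x, y) => y - x }.toList
```

The RDD version behaves the same way, except that Spark also has to move the elements that straddle partition boundaries.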

Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's first create a collection with the index as key, like Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build collection shifted by one element.
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next, we can calculate the needed differences, ordered by index descending:
val diffs = original.join(shifted)
.sortBy(_._1, ascending = false)
.map { case (idx, (v1, v2)) => v2 - v1 }
Printing diffs.collect.toSeq gives:
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this could be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in the case of an RDD, the zip method requires both collections to contain the same number of elements in each partition. So if you implemented tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work: both collections need an equal element count in each partition, and here one element per partition would have to travel to the previous partition.
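To make the partition-boundary issue concrete, here is a small local model of it, not Spark code. Each inner list stands in for one sorted partition (the split is an assumption chosen for illustration), and the only extra data a partition needs is the first element of the partition that follows it:

```scala
// Hypothetical local model: each inner list plays the role of one sorted partition.
val partitions = List(List(-4, -1, 0), List(1, 2), List(3, 6))

// Differences between neighbors that live inside the same partition.
def within(p: List[Int]): List[Int] =
  p.zip(p.drop(1)).map { case (a, b) => b - a }

// For every partition, the first element of the following partition (None for the last one).
val nextHeads: List[Option[Int]] = partitions.drop(1).map(_.headOption) :+ None

// Per-partition differences plus the one difference that crosses the boundary.
val allDiffs: List[Int] = partitions.zip(nextHeads).flatMap { case (p, next) =>
  val boundary = for (last <- p.lastOption; head <- next) yield head - last
  within(p) ++ boundary.toList
}
```

This simple model assumes no empty partitions in the middle; a real implementation would have to skip those when looking up the next head.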


I have an RDD[(Int, ListBuffer[Byte])] and I would like to perform a "wordcount", but for each number in the list.
For instance, the RDD is:
(31000,ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1))
(21010,ListBuffer(0, 0, 0))
(23000,ListBuffer(1, 1, 1, 1, 1))
(01000,ListBuffer(1, 1))
And I want to get this:
(31000,(0,2),(1,7)) // this could be a Map[0=>2, 1=>7]
Any guidance? Thank you in advance
Edit: someone suggested my question was duplicated, but the thing is the suggested post was about only a List, but I wanted to apply on a Pair (Int, List).
The most idiomatic way to get a histogram of values in a Scala collection is to use groupBy followed by a map that takes the size of each resulting group:
scala> import collection.mutable.ListBuffer
import collection.mutable.ListBuffer
scala> val values = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
values: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
scala> values.groupBy(identity).mapValues(_.size)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 7, 0 -> 2)
In your case that part is completely independent from the Spark part—you just happen to be performing this operation on values in an RDD, but the complete solution would look like this:
scala> val counts = myRdd.mapValues(_.groupBy(identity).mapValues(_.size))
counts: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Map[Int,Int])] = MapPartitionsRDD[1] at mapValues at <console>:26
scala> counts.foreach(println)
(1000,Map(1 -> 2))
(21010,Map(0 -> 3))
(23000,Map(1 -> 5))
(34000,Map(0 -> 1))
(31000,Map(1 -> 7, 0 -> 2))
It's worth noting that the mapValues on Scala collections is lazy, which means that every time you use the maps in the RDD the values will be recomputed. This is probably fine, but if you're concerned, you can replace it with something like this:
values.groupBy(identity).map { case (k, v) => k -> v.size }
…which will return a strictly-evaluated map.
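As a rough local check of the strict variant, the sketch below applies it to a plain Seq of pairs instead of an RDD. The names rows, histogram and counts are illustrative only; with Spark, the histogram function could be passed to mapValues in the same way.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical local stand-in for the RDD[(Int, ListBuffer[Byte])] from the question.
val rows: Seq[(Int, ListBuffer[Byte])] = Seq(
  31000 -> ListBuffer[Byte](1, 1, 0, 1, 0, 1, 1, 1, 1),
  21010 -> ListBuffer[Byte](0, 0, 0)
)

// Strictly evaluated histogram: group equal values, then count each group.
def histogram(values: Iterable[Byte]): Map[Byte, Int] =
  values.groupBy(identity).map { case (k, v) => k -> v.size }

// One histogram per key.
val counts: Map[Int, Map[Byte, Int]] =
  rows.map { case (id, values) => id -> histogram(values) }.toMap
```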

Example of usage of a monoid for distributed computation with spark

I have user hobby data (an RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats on them and represent the stats as a Map[String, Array[Int]], where each array has size 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems to be the right solution, but an RDD does not offer it, and the data is too big to convert to a List/Array first. How could I do this job?
The trick is to replace the Array in your example with a class that contains the statistic you want for some part of the data, and that can be combined with another instance of the same statistic (covering another part of the data) to provide the statistic on the whole data.
For instance, if you have a statistic that covers the data 3, 3, 2 and 5, I gather it would look something like (0, 1, 2, 0, 1), and if you have another instance covering the data 3, 4, 4 it would look like (0, 0, 1, 2, 0). Now all you have to do is define a + operation that lets you combine (0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0, 1, 3, 2, 1), covering the data 3, 3, 2, 5 and 3, 4, 4.
Let's just do that, and call the class StatMonoid:
case class StatMonoid(flags: Seq[Int] = Seq(0, 0, 0, 0, 0)) {
  // element-wise sum of the two counter sequences
  def + (other: StatMonoid) =
    new StatMonoid((0 to 4).map { idx => flags(idx) + other.flags(idx) })
}
This class contains the sequence of 5 counters and defines a + operation that lets it be combined with other counters.
We also need a convenience method to build it. This could be a constructor in StatMonoid, a method in the companion object, or just a plain method, as you prefer:
def stat(value: Int): StatMonoid = value match {
  case 1 => new StatMonoid(Seq(1, 0, 0, 0, 0))
  case 2 => new StatMonoid(Seq(0, 1, 0, 0, 0))
  case 3 => new StatMonoid(Seq(0, 0, 1, 0, 0))
  case 4 => new StatMonoid(Seq(0, 0, 0, 1, 0))
  case 5 => new StatMonoid(Seq(0, 0, 0, 0, 1))
  // the s prefix is needed so that $value is interpolated
  case _ => throw new RuntimeException(s"illegal init value: $value")
}
This allows us to easily compute an instance of the statistic covering one single piece of data, for example:
scala> stat(4)
res25: StatMonoid = StatMonoid(List(0, 0, 0, 1, 0))
And it also allows us to combine them together simply by adding them:
scala> stat(1) + stat(2) + stat(2) + stat(5) + stat(5) + stat(5)
res18: StatMonoid = StatMonoid(Vector(1, 2, 0, 0, 3))
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
val rdd = sc.parallelize(List(Map("food" -> 3, "music" -> 1), Map("food" -> 2), Map("game" -> 5, "twitch" -> 3, "food" -> 3)))
All we need to do to find the stat for each key is to flatten the data to get (key -> value) tuples, transform each value into an instance of StatMonoid above, and finally combine them all together for each key:
import org.apache.spark.rdd.PairRDDFunctions
rdd.flatMap(_.toList).mapValues(stat).reduceByKey(_ + _).collect
Which yields:
res24: Array[(String, StatMonoid)] = Array((game,StatMonoid(List(0, 0, 0, 0, 1))), (twitch,StatMonoid(List(0, 0, 1, 0, 0))), (music,StatMonoid(List(1, 0, 0, 0, 0))), (food,StatMonoid(Vector(0, 1, 2, 0, 0))))
Now, as a side note, if you wonder why I call the class StatMonoid, it's simply because... it is a monoid :D, and a very common and handy one, called product. In short, monoids are just thingies that can be combined with each other in an associative fashion. They are super common when developing in Spark since they naturally define operations that can be executed in parallel on the distributed workers and gathered together into a final result.
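The associativity point can be checked locally without Spark. The sketch below uses a hypothetical stand-in class named Counts (not the StatMonoid above, just the same idea in self-contained form) and shows that reducing two chunks separately and then merging them gives the same result as one sequential pass, which is what reduceByKey relies on:

```scala
// Hypothetical stand-in for StatMonoid, kept self-contained for this sketch.
final case class Counts(flags: Vector[Int] = Vector.fill(5)(0)) {
  // Element-wise addition: associative, with Counts() as the neutral element.
  def +(other: Counts): Counts =
    Counts(flags.zip(other.flags).map { case (a, b) => a + b })
}

// Counter for a single value in 1..5.
def single(value: Int): Counts = {
  require(value >= 1 && value <= 5, s"illegal init value: $value")
  Counts(Vector.fill(5)(0).updated(value - 1, 1))
}

val data = List(3, 3, 2, 5, 3, 4, 4)

// One sequential pass over all the data.
val sequential = data.map(single).foldLeft(Counts())(_ + _)

// Two "partitions" reduced independently, then merged.
val left = data.take(4).map(single).foldLeft(Counts())(_ + _)
val right = data.drop(4).map(single).foldLeft(Counts())(_ + _)
val merged = left + right
```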

Transform a Scala Stream to a new Stream which is the sum of the current element and the previous element

How can I transform a Scala Stream of integers into a new Stream where each element is the sum of the current element and the previous element?
For example, if the input stream is 1, 2, 3, 4, ... then the output stream is 1, 3, 5, 7, ...
A second question: how would you make the sum use the previous element of the output stream, so that the output would be 1, (2+(1)), (3+(2+1)), (4+(3+(2+1)))?
Just zip your stream with a shifted version of itself and sum the two elements.
val s1 = Stream.from(0) // 0, 1, 2, 3, ...
val s2 = Stream.from(1) // 1, 2, 3, 4, ...
val sumOfTwo = s1.zip(s2).map{ case (a,b) => a+b } // 1, 3, 5, 7, ...
To compute the running total, use the scan function, which acts like a fold but returns the intermediate result at each step, starting with the initial value.
val totalSum = s1.scan(0)((ctr, el) => ctr + el) // 0, 0, 1, 3, 6, 10, ...
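Applied to the exact input from the question (1, 2, 3, 4, ...), both ideas might look like the sketch below. The value names are illustrative. On Scala 2.13 and later, Stream is deprecated in favor of LazyList, which supports the same calls.

```scala
val input = Stream.from(1) // 1, 2, 3, 4, ...

// Pair each element with its predecessor (0 stands in before the first element).
val withPrevious = (0 #:: input).zip(input).map { case (prev, cur) => prev + cur }

// Running total: scanLeft emits the seed first, so drop it with tail.
val running = input.scanLeft(0)(_ + _).tail

// Only a finite prefix of an infinite stream can be materialized.
val firstPairSums = withPrevious.take(4).toList
val firstRunning = running.take(4).toList
```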
This answer computes the cumulative sum by using a variable for the accumulated result instead of scan(). Example program:
import scala.collection.immutable.Stream
object Main extends App {
  // 1, 2, 3, ...
  val naturals = Stream.from(1)
  // cumulative sum (see https://stackoverflow.com/a/8567134/1071311)
  def sumUp(s: Stream[Int], acc: Int = 0): Stream[Int] =
    Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))
  val firstFive = sumUp(naturals, 0).take(5)
  firstFive.foreach(println _)
}

How to take the first distinct (until the moment) elements of a list?

I am sure there is an elegant/funny way of doing it,
but I can only think of a more or less complicated recursive solution.
Is there any standard lib (collections) method, or a simple combination of them, to take the first distinct elements of a list?
scala> val s = Seq(3, 5, 4, 1, 5, 7, 1, 2)
s: Seq[Int] = List(3, 5, 4, 1, 5, 7, 1, 2)
scala> s.takeWhileDistinct //Would return Seq(3,5,4,1), it should preserve the original order and ignore posterior occurrences of distinct values like 7 and 2.
If you want it to be fast-ish, then
{ val hs = scala.collection.mutable.HashSet[Int]()
s.takeWhile{ hs.add } }
will do the trick. (Extra braces prevent leaking the temp value hs.)
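Wrapped into a reusable function, that idea could look like the sketch below. The function name takeWhileDistinct is borrowed from the question, and the explicit lambda is just a spelled-out form of hs.add. It relies on the predicate being called once per element, in order, which holds for strict sequences.

```scala
import scala.collection.mutable

def takeWhileDistinct[T](xs: Seq[T]): Seq[T] = {
  // Local to the function, so the mutable set does not leak.
  val seen = mutable.HashSet.empty[T]
  // HashSet.add returns false when the element was already present,
  // which ends the takeWhile at the first repeated value.
  xs.takeWhile(x => seen.add(x))
}

val example = takeWhileDistinct(Seq(3, 5, 4, 1, 5, 7, 1, 2))
```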
This is a short approach, but note that each s.count call scans the whole sequence, so it is quadratic in the worst case rather than logarithmic.
implicit class ListOps[T](val s: Seq[T]) {
  def takeWhileDistinct: Seq[T] = {
    s.indexWhere(x => s.count(_ == x) > 1) match {
      case ind if ind > 0 =>
        s.take(s.indexWhere(x => s.count(_ == x) > 1, ind + 1) + ind).distinct
      case _ => s
    }
  }
}
val ex = Seq(3, 5, 4, 5, 7, 1)
val ex2 = Seq(3, 5, 4, 1, 5, 7, 1, 5)
println(ex.takeWhileDistinct.mkString(", ")) // 3, 5, 4
println(ex2.takeWhileDistinct.mkString(", ")) // 3, 5, 4, 1
Interesting problem. Here's an alternative. First, let's get the stream of s so we can avoid unnecessary work (though the overhead is likely to be greater than the saved work, sadly).
val s = Seq(3, 5, 4, 5, 7, 1)
val ss = s.toStream
Now we can build s again, but keeping track of whether there are repetitions or not, and stopping at the first one:
val newS = ss.scanLeft(Seq[Int]() -> false) {
  case ((seen, stop), current) =>
    if (stop || (seen contains current)) (seen, true)
    else (seen :+ current, false)
}
Now all that's left is to take the last element without repetition, and drop the flag:
val noRepetitionsS = newS.takeWhile(!_._2).last._1
A variation on Rex's (though I prefer his...)
This one is functional throughout, using the little-seen scanLeft method.
val prevs = xs.scanLeft(Set.empty[Int])(_ + _)
(xs zip prevs) takeWhile { case (x,prev) => !prev(x) } map {_._1}
And a lazy version (using iterators, for moar efficiency):
val prevs = xs.iterator.scanLeft(Set.empty[Int])(_ + _)
(prevs zip xs.iterator) takeWhile { case (prev,x) => !prev(x) } map {_._2}
Turn the resulting iterator back to a sequence if you want, but this'll also work nicely with iterators on both the input AND the output.
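For a concrete run, here are both variants with xs bound to the sequence from the question. The names prefix and lazyPrefix are illustrative only.

```scala
val xs = Seq(3, 5, 4, 1, 5, 7, 1, 2)

// prevs(i) is the set of all elements seen before position i.
val prevs = xs.scanLeft(Set.empty[Int])(_ + _)

// Strict version: keep elements until one is already in its "previously seen" set.
val prefix = xs.zip(prevs).takeWhile { case (x, prev) => !prev(x) }.map(_._1)

// Iterator version: the sets are only built as far as takeWhile needs them.
val lazyPrevs = xs.iterator.scanLeft(Set.empty[Int])(_ + _)
val lazyPrefix =
  lazyPrevs.zip(xs.iterator).takeWhile { case (prev, x) => !prev(x) }.map(_._2).toList
```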
The problem is simpler than the std lib function I was looking for (takeWhileConditionOverListOfAllAlreadyTraversedItems):
scala> val s = Seq(3, 5, 4, 1, 5, 7, 1, 2)
scala> s.zip(s.distinct).takeWhile{case(a,b)=>a==b}.map(_._1)
res20: Seq[Int] = List(3, 5, 4, 1)

How to compare a list element with the next element, to yield this element?

As noted in the title: how can I compare the element at index N with the element at index N+1 and, if the two are exactly the same, yield the element only once?
I know I can use toSet to get a set of unique elements, but this does not help me, because my list may contain duplicated elements; a duplicate just can't be the next element in the list.
val ll = List(1, 2, 3, 6, 3, 7, 5, 5, 6, 3)
// Desired output: List(1, 2, 3, 6, 3, 7, 5, 6, 3)
I got a "near working solution" using zipWithIndex.collect, but when I compare inside it, the index runs out of bounds. I could make this work if I could use two conditions inside: first check that the index is at most list.size - 1, then compare list(index) != list(index+1) and yield list(index).
What I have tried without success (because of OutOfBounds), is:
times.zipWithIndex.collect {
  case (element, index)
    // index+1 will be incremented out of my list
    if (times(index) != times(index+1)) => times(index)
}
This can work if I can use one more condition to limit index, but does not work with two conditions:
times.zipWithIndex.collect {
  case (element, index)
    if (index < times.size)
    if (times(index) != times(index+1)) => times(index)
}
I appreciate any kind of alternative.
How about:
ll.foldLeft(List[Int]())((acc, x) => acc match {case Nil => List(x) case y => if (y.last == x) y else y :+ x})
Here's my alternative using the sliding function:
val ll = List(1, 2, 3, 6, 3, 7, 5, 5, 6, 3)
ll.sliding(2)                                   // windows of two neighboring elements
  .filter( t => t.length > 1 && t(0) != t(1) )  // keep windows whose elements differ
  .map( t => t(0) )
  .toList :+ ll.last
You can zip the list with itself, dropping the first element, so that you compare the elements at index N with those at index N + 1. You only need to append the last element (you may want to use a ListBuffer, as appending the last element requires copying the list).
val r = times.zip(times.drop(1)).withFilter(t => t._1 != t._2).map(_._1) :+ times.last
scala> val times = List(1, 2, 3, 6, 3, 7, 5, 5, 6, 3)
times: List[Int] = List(1, 2, 3, 6, 3, 7, 5, 5, 6, 3)
scala> val r = times.zip(times.drop(1)).withFilter(t => t._1 != t._2).map(_._1) :+ times.last
r: List[Int] = List(1, 2, 3, 6, 3, 7, 5, 6, 3)
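One thing to watch for with the approaches that append the last element is the empty list, where calling last throws. A possible variant that avoids this, sketched here with a hypothetical helper name dedupeConsecutive, folds from the right and compares each element with the head of the result built so far:

```scala
// Drops an element when it equals the element that immediately follows it.
def dedupeConsecutive[A](xs: List[A]): List[A] =
  xs.foldRight(List.empty[A]) {
    case (x, acc @ (y :: _)) if x == y => acc // same as next kept element: skip
    case (x, acc)                      => x :: acc
  }

val deduped = dedupeConsecutive(List(1, 2, 3, 6, 3, 7, 5, 5, 6, 3))
```

Prepending with :: also avoids the repeated list copies that :+ causes.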