Example of using a monoid for distributed computation with Spark - Scala

I have user hobby data (RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats on them and represent the stats as Map[String, Array[Int]], where the array size is 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems to be the right solution, but RDD doesn't provide it, and the data is too big to convert to a List/Array just to use foldLeft. How can I do this?

The trick is to replace the Array in your example with a class that holds the statistic you want for some part of the data, and that can be combined with another instance of the same statistic (covering another part of the data) to yield the statistic on the whole data.
For instance, if you have a statistic that covers the data 3, 3, 2 and 5, I gather it would look something like (0, 1, 2, 0, 1), and if you have another instance covering the data 3, 4, 4, it would look like (0, 0, 1, 2, 0). Now all you have to do is define a + operation that lets you combine them: (0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0, 1, 3, 2, 1), covering the data 3, 3, 2, 5 and 3, 4, 4.
Let's just do that, and call the class StatMonoid:
case class StatMonoid(flags: Seq[Int] = Seq(0, 0, 0, 0, 0)) {
  def +(other: StatMonoid) =
    new StatMonoid((0 to 4).map { idx => flags(idx) + other.flags(idx) })
}
This class holds the sequence of 5 counters and defines a + operation that lets it be combined with other instances.
We also need a convenience method to build it; this could be a constructor in StatMonoid, an apply in the companion object, or just a plain method, as you prefer:
def stat(value: Int): StatMonoid = value match {
  case 1 => new StatMonoid(Seq(1, 0, 0, 0, 0))
  case 2 => new StatMonoid(Seq(0, 1, 0, 0, 0))
  case 3 => new StatMonoid(Seq(0, 0, 1, 0, 0))
  case 4 => new StatMonoid(Seq(0, 0, 0, 1, 0))
  case 5 => new StatMonoid(Seq(0, 0, 0, 0, 1))
  case _ => throw new RuntimeException(s"illegal init value: $value")
}
This allows us to easily compute an instance of the statistic covering a single piece of data, for example:
scala> stat(4)
res25: StatMonoid = StatMonoid(List(0, 0, 0, 1, 0))
And it also allows us to combine them together simply by adding them:
scala> stat(1) + stat(2) + stat(2) + stat(5) + stat(5) + stat(5)
res18: StatMonoid = StatMonoid(Vector(1, 2, 0, 0, 3))
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
val rdd = sc.parallelize(List(Map("food" -> 3, "music" -> 1), Map("food" -> 2), Map("game" -> 5, "twitch" -> 3, "food" -> 3)))
All we need to do to find the stat for each hobby is to flatten the data to get (hobby -> value) tuples, transform each value into an instance of StatMonoid above, and finally combine them all together for each hobby:
import org.apache.spark.rdd.PairRDDFunctions
rdd.flatMap(_.toList).mapValues(stat).reduceByKey(_ + _).collect
Which yields:
res24: Array[(String, StatMonoid)] = Array((game,StatMonoid(List(0, 0, 0, 0, 1))), (twitch,StatMonoid(List(0, 0, 1, 0, 0))), (music,StatMonoid(List(1, 0, 0, 0, 0))), (food,StatMonoid(Vector(0, 1, 2, 0, 0))))
Now, for the side story: if you wonder why I call the class StatMonoid, it's simply because... it is a monoid :D, and a very common and handy one, called the product monoid. In short, monoids are just thingies that can be combined with each other in an associative fashion, and that have an identity element (here, the default StatMonoid() with all counters at zero). They are super common when developing in Spark since they naturally define operations that can be executed in parallel on the distributed workers and gathered together into a final result.
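To make that concrete, here is a minimal sketch (assuming the rdd and stat defined above) that checks the two monoid laws and then exploits the identity element with Spark's foldByKey, which takes an explicit zero value:

// Quick sanity check of the monoid laws:
val (x, y, z) = (stat(1), stat(3), stat(5))
assert((x + y) + z == x + (y + z))   // associativity
assert(x + StatMonoid() == x)        // StatMonoid() is the identity

// foldByKey uses the identity element as its zero value:
rdd.flatMap(_.toList).mapValues(stat).foldByKey(StatMonoid())(_ + _).collect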


Scala Pairs: how to count the number of the occurrences in the value (list of numbers) [duplicate]

This question already has answers here:
Scala how can I count the number of occurrences in a list
(17 answers)
Closed 3 years ago.
I have an RDD[(Int, ListBuffer[Byte])] and I'd like to perform a "wordcount", but for each number in the List.
For instance, the RDD is:
(31000,ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1))
(21010,ListBuffer(0, 0, 0))
(23000,ListBuffer(1, 1, 1, 1, 1))
(01000,ListBuffer(1, 1))
(34000,ListBuffer(0))
And I want to get this:
(31000,(0,2),(1,7)) // this could be a Map[0=>2, 1=>7]
(21010,(0,3))
(23000,(1,5))
(01000,(1,2))
(34000,(0,1))
Any guidance? Thank you in advance
Edit: someone suggested my question was a duplicate, but the suggested post was about only a List, whereas I want to apply this to a Pair (Int, List).
The most idiomatic way to get a histogram of values in a Scala collection is to use groupBy followed by a map that takes the size of each resulting group:
scala> import collection.mutable.ListBuffer
import collection.mutable.ListBuffer
scala> val values = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
values: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
scala> values.groupBy(identity).mapValues(_.size)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 7, 0 -> 2)
In your case that part is completely independent from the Spark part; you just happen to be performing this operation on values in an RDD. The complete solution would look like this:
scala> val counts = myRdd.mapValues(_.groupBy(identity).mapValues(_.size))
counts: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Map[Int,Int])] = MapPartitionsRDD[1] at mapValues at <console>:26
scala> counts.foreach(println)
(1000,Map(1 -> 2))
(21010,Map(0 -> 3))
(23000,Map(1 -> 5))
(34000,Map(0 -> 1))
(31000,Map(1 -> 7, 0 -> 2))
It's worth noting that mapValues on Scala collections is lazy, which means that every time you use the maps in the RDD, the values will be recomputed. This is probably fine, but if you're concerned, you can replace it with something like this:
values.groupBy(identity).map { case (k, v) => k -> v.size }
…which will return a strictly-evaluated map.
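As a side note, on Scala 2.13 and later, groupMapReduce does the grouping and counting in a single pass and returns a strictly evaluated map; a sketch of the same idea, assuming 2.13 collections:

// key by the value itself, map each occurrence to 1, reduce by summing
val counts = myRdd.mapValues(_.groupMapReduce(identity)(_ => 1)(_ + _))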

Scala - scanLeft, returning values within iterations without using them in the next turn

I have a list, and what I would like to do is:
def someRandomMethod(...): ... = {
  val list = List(1, 2, 3, 4, 5)
  if (list.isEmpty) return list
  list.differentScanLeft(list.head)((a, b) => {
    a * b
  })
}
which returns
List(1, 2, 6, 12, 20) rather than List(1, 2, 6, 24, 120)
Is there such API?
Thank you,
Cosmir
scan is not really the right method for this; you want to use sliding to generate a list of adjacent pairs of values:
(1 :: list).sliding(2).map(l => l(0) * l(1)).toList  // List(1, 2, 6, 12, 20)
More generally, it is sometimes necessary to pass data on to the next iteration when using scan. The standard solution is to scan with a tuple (state, ans) and then project out the answers with another map at the end, as sketched below.
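A minimal sketch of that tuple trick, carrying the previous element as the state:

val list = List(1, 2, 3, 4, 5)
// state = current element, ans = previous * current
val products = list.tail
  .scanLeft((list.head, list.head)) { case ((prev, _), cur) => (cur, prev * cur) }
  .map(_._2)
// products == List(1, 2, 6, 12, 20)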
You probably want to use zip
val list = List(1, 2, 3, 4, 5, 6)
val result = list.zip(list.drop(1)).map { case (a, b) => a * b }
println(result)
> List(2, 6, 12, 20, 30)

Operate on neighbor elements in RDD in Spark

Say I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want each new element to be the next element minus the current one.
But how to implement this in RDD on Spark?
I know that for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
There will be some partitions of the collection, each of which is ordered. Can I do a similar operation on each partition and then collect the per-partition results together?
The most efficient solution is to use the sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .sliding(2)
  .map { case Array(x, y) => y - x }
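Collecting this should give the pairwise differences (a quick check against the example data above):

rdd.collect  // Array(3, 1, 1, 1, 1, 3)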
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's first create a collection with the index as key, as Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build a collection shifted by one element:
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we can calculate the needed differences, ordered by descending index:
val diffs = original.join(shifted)
.sortBy(_._1, ascending = false)
.map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this could be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in the case of an RDD, the zip method requires that both collections contain the same number of elements in each partition. So if you implemented tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work, since both collections need an equal count of elements in each partition, and here one element per partition would have to travel to the previous partition.

Scala trying to count instances of a digit in a number

This is my first day using Scala. I am trying to build a string of the number of times each digit appears in a number. For instance, the number 4310227 would return "1121100100" because 0 appears once, 1 appears once, 2 appears twice, and so on...
def pow(n: Int): String = {
  val cubed = (n * n * n).toString
  val digits = 0 to 9
  val str = ""
  for (a <- digits) {
    println(a)
    val b = cubed.count(_ == a.toString)
    println(b)
  }
  return cubed
}
and it doesn't seem to work. I would like some Scala-y reasons why, and to know whether I should even be going about it in this manner. Thanks!
When you iterate over strings, which is what you are doing when you call String#count(), you are working with Chars, not Strings. You don't want to compare these two with ==, since they aren't the same type of object.
One way to solve this problem is to call Char#toString() before performing the comparison, e.g., amend your code to read cubed.count(_.toString==a.toString).
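Putting that fix into the original method, a minimal sketch (note that I also build and return the counts string, since returning cubed looks like a leftover):

def pow(n: Int): String = {
  val cubed = (n * n * n).toString
  // count each digit 0-9 in the cubed value's decimal representation
  (0 to 9).map(a => cubed.count(_.toString == a.toString)).mkString
}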
As Rado and cheeken said, you're comparing a Char with a String, which will never be equal. An alternative to cheeken's answer of converting each character to a string is to create a range of chars, i.e. '0' to '9':
val digits = '0' to '9'
...
val b = cubed.count(_ == a)
Note that if you want the Int that a Char represents, you can call char.asDigit.
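For example, a one-liner (nothing assumed beyond the standard library):

'7'.asDigit  // 7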
Aleksey's, Ren's and Randall's answers are something you will want to strive towards as they separate out the pure solution to the problem. However, given that it's your first day with Scala, depending on what background you have, you might need a bit more context before understanding them.
Fairly simple:
scala> ("122333abc456xyz" filter (_.isDigit)).foldLeft(Map.empty[Char, Int]) ((histo, c) => histo + (c -> (histo.getOrElse(c, 0) + 1)))
res1: scala.collection.immutable.Map[Char,Int] = Map(4 -> 1, 5 -> 1, 6 -> 1, 1 -> 1, 2 -> 2, 3 -> 3)
This is perhaps not the fastest approach, because intermediate datatypes like String and Char are used, but it is one of the simplest:
def countDigits(n: Int): Map[Int, Int] =
  n.toString.groupBy(x => x) map { case (n, c) => (n.asDigit, c.size) }
Example:
scala> def countDigits(n: Int): Map[Int, Int] = n.toString.groupBy(x => x) map { case (n, c) => (n.asDigit, c.size) }
countDigits: (n: Int)Map[Int,Int]
scala> countDigits(12345135)
res0: Map[Int,Int] = Map(5 -> 2, 1 -> 2, 2 -> 1, 3 -> 2, 4 -> 1)
Where myNumAsString is a String, e.g. "15625":
myNumAsString.groupBy(x => x).map(x => (x._1, x._2.length))
Result = Map(2 -> 1, 5 -> 2, 1 -> 1, 6 -> 1)
ie. A map containing the digit with its corresponding count.
What this is doing is taking your string and grouping its characters by value (so for the initial string "15625", groupBy produces a map '1' -> "1", '2' -> "2", '6' -> "6", and '5' -> "55"). The second bit then maps each grouped value to its length, i.e. the count of how many times that digit occurs.
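To see that intermediate step on its own (a quick sketch; map ordering may vary):

"15625".groupBy(x => x)
// yields a Map[Char, String], e.g. Map(1 -> "1", 5 -> "55", 6 -> "6", 2 -> "2")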
The counts for these hundred random digits happen to fit into a hex digit each (r below is a scala.util.Random):
scala> val is = for (_ <- (1 to 100).toList) yield r.nextInt(10)
is: List[Int] = List(8, 3, 9, 8, 0, 2, 0, 7, 8, 1, 6, 9, 9, 0, 3, 6, 8, 6, 3, 1, 8, 7, 0, 4, 4, 8, 4, 6, 9, 7, 4, 6, 6, 0, 3, 0, 4, 1, 5, 8, 9, 1, 2, 0, 8, 8, 2, 3, 8, 6, 4, 7, 1, 0, 2, 2, 6, 9, 3, 8, 6, 7, 9, 5, 0, 7, 6, 8, 7, 5, 8, 2, 2, 2, 4, 1, 2, 2, 6, 8, 1, 7, 0, 7, 6, 9, 5, 5, 5, 3, 5, 8, 2, 5, 1, 9, 5, 7, 2, 3)
scala> (new Array[Int](10) /: is) { case (a, i) => a(i) += 1 ; a } map ("%x" format _) mkString
warning: there were 1 feature warning(s); re-run with -feature for details
res7: String = a8c879caf9
scala> (new Array[Int](10) /: is) { case (a, i) => a(i) += 1 ; a } sum
warning: there were 1 feature warning(s); re-run with -feature for details
res8: Int = 100
I was going to point out that no one used a char range, but now I see Kristian did.
def pow(n: Int): String = {
  val cubed = (n * n * n).toString
  val cnts = for (a <- '0' to '9') yield cubed.count(_ == a)
  (cnts map (c => ('0' + c).toChar)).mkString
}

Shuffling part of an ArrayBuffer in-place

I need to shuffle part of an ArrayBuffer, preferably in-place so no copies are required. For example, if an ArrayBuffer has 10 elements, and I want to shuffle elements 3-7:
// Unshuffled ArrayBuffer of ints numbered 0-9
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
// Region I want to shuffle is between the pipe symbols (3-7)
0, 1, 2 | 3, 4, 5, 6, 7 | 8, 9
// Example of how it might look after shuffling
0, 1, 2 | 6, 3, 5, 7, 4 | 8, 9
// Leaving us with a partially shuffled ArrayBuffer
0, 1, 2, 6, 3, 5, 7, 4, 8, 9
I've written something like what's shown below, but it requires copies and loops over the region a couple of times. It seems like there should be a more efficient way of doing this.
def shufflePart(startIndex: Int, endIndex: Int) {
  val part: ArrayBuffer[Int] = ArrayBuffer[Int]()
  for (i <- startIndex to endIndex) {
    part += this.children(i)
  }
  // Shuffle the part of the array we copied
  val shuffled = this.random.shuffle(part)
  var count: Int = 0
  // Overwrite the part of the array we chose with the shuffled version of it
  for (i <- startIndex to endIndex) {
    this.children(i) = shuffled(count)
    count += 1
  }
}
I could find nothing about partially shuffling an ArrayBuffer with Google. I assume I will have to write my own method, but in doing so I would like to prevent copies.
If you can use a subtype of ArrayBuffer you can access the underlying array directly, since ResizableArray has a protected member array:
import java.util.Collections
import java.util.Arrays
import collection.mutable.ArrayBuffer

val xs = new ArrayBuffer[Int]() {
  def shuffle(a: Int, b: Int) {
    Collections.shuffle(Arrays.asList(array: _*).subList(a, b))
  }
}
xs ++= (0 to 9) // xs = ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
xs.shuffle(3, 8) // xs = ArrayBuffer(0, 1, 2, 4, 6, 5, 7, 3, 8, 9)
Notes:
The upper bound for java.util.List#subList is exclusive
It's reasonably efficient because Arrays#asList doesn't create a new set of elements: it's backed by the array itself (hence no add or remove methods)
If using for real, you probably want to add bounds-checks on a and b
I'm not entirely sure why it must be in place - you might want to think that over. But, assuming it's the right thing to do, this could do it:
import scala.collection.mutable.ArrayBuffer

implicit def array2Shufflable[T](a: ArrayBuffer[T]) = new {
  def shufflePart(start: Int, end: Int) = {
    val seq = (start.max(0) until end.min(a.size - 1)).toSeq
    seq.zip(scala.util.Random.shuffle(seq)) foreach { t =>
      val x = a(t._1)
      a.update(t._1, a(t._2))
      a(t._2) = x
    }
    a
  }
}
You can use it like:
val a = ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)
println(a)
println(a.shufflePart(2, 7))
edit: If you're willing to pay the extra cost of an intermediate sequence, this is more reasonable, algorithmically speaking:
def shufflePart(start: Int, end: Int) = {
  val seq = (start.max(0) until end.min(a.size - 1)).toSeq
  seq.zip(scala.util.Random.shuffle(seq) map { i =>
    a(i)
  }) foreach { t =>
    a.update(t._1, t._2)
  }
  a
}
} // closes the implicit wrapper from above
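For completeness, the classic way to shuffle a sub-range in place, with no intermediate collections at all, is a Fisher-Yates shuffle restricted to that range. A minimal sketch (the names here are mine, not from the answers above):

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// In-place Fisher-Yates over indices [start, end] (both inclusive).
def shuffleRange[T](buf: ArrayBuffer[T], start: Int, end: Int): Unit = {
  for (i <- end until start by -1) {
    val j = start + Random.nextInt(i - start + 1) // pick a random index in [start, i]
    val tmp = buf(i); buf(i) = buf(j); buf(j) = tmp
  }
}

val a = ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
shuffleRange(a, 3, 7) // shuffles only elements 3-7, e.g. ArrayBuffer(0, 1, 2, 6, 3, 5, 7, 4, 8, 9)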