reduceByKey: How does it work internally? - scala

I am new to Spark and Scala. I was confused about the way reduceByKey function works in Spark. Suppose we have the following code:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The map function is clear: s is the key and it points to the line from data.txt and 1 is the value.
However, I didn't get how the reduceByKey works internally? Does "a" points to the key? Alternatively, does "a" point to "s"? Then what does represent a + b? how are they filled?

Let's break it down to discrete methods and types. That usually exposes the intricacies for new devs:
pairs.reduceByKey((a, b) => a + b)
becomes
pairs.reduceByKey((a: Int, b: Int) => a + b)
and renaming the variables makes it a little more explicit
pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)
So, we can now see that we are simply taking an accumulated value for the given key and summing it with the next value of that key. NOW, let's break it further so we can understand the key part. So, let's visualize the method more like this:
pairs.reduce((accumulatedValue: List[(String, Int)], currentValue: (String, Int)) => {
//Turn the accumulated value into a true key->value mapping
val accumAsMap = accumulatedValue.toMap
//Try to get the key's current value if we've already encountered it
accumAsMap.get(currentValue._1) match {
//If we have encountered it, then add the new value to the existing value and overwrite the old
case Some(value : Int) => (accumAsMap + (currentValue._1 -> (value + currentValue._2))).toList
//If we have NOT encountered it, then simply add it to the list
case None => currentValue :: accumulatedValue
}
})
So, you can see that the reduceByKey takes the boilerplate of finding the key and tracking it so that you don't have to worry about managing that part.
Deeper, truer if you want
All that being said, that is a simplified version of what happens as there are some optimizations that are done here. This operation is associative, so the spark engine will perform these reductions locally first (often termed map-side reduce) and then once again at the driver. This saves network traffic; instead of sending all the data and performing the operation, it can reduce it as small as it can and then send that reduction over the wire.

One requirement for the reduceByKey function is that is must be associative. To build some intuition on how reduceByKey works, let's first see how an associative associative function helps us in a parallel computation:
As we can see, we can break an original collection in pieces and by applying the associative function, we can accumulate a total. The sequential case is trivial, we are used to it: 1+2+3+4+5+6+7+8+9+10.
Associativity lets us use that same function in sequence and in parallel. reduceByKey uses that property to compute a result out of an RDD, which is a distributed collection consisting of partitions.
Consider the following example:
// collection of the form ("key",1),("key,2),...,("key",20) split among 4 partitions
val rdd =sparkContext.parallelize(( (1 to 20).map(x=>("key",x))), 4)
rdd.reduceByKey(_ + _)
rdd.collect()
> Array[(String, Int)] = Array((key,210))
In spark, data is distributed into partitions. For the next illustration, (4) partitions are to the left, enclosed in thin lines. First, we apply the function locally to each partition, sequentially in the partition, but we run all 4 partitions in parallel. Then, the result of each local computation are aggregated by applying the same function again and finally come to a result.
reduceByKey is an specialization of aggregateByKey aggregateByKey takes 2 functions: one that is applied to each partition (sequentially) and one that is applied among the results of each partition (in parallel). reduceByKey uses the same associative function on both cases: to do a sequential computing on each partition and then combine those results in a final result as we have illustrated here.

In your example of
val counts = pairs.reduceByKey((a,b) => a+b)
a and b are both Int accumulators for _2 of the tuples in pairs. reduceKey will take two tuples with the same value s and use their _2 values as a and b, producing a new Tuple[String,Int]. This operation is repeated until there is only one tuple for each key s.
Unlike non-Spark (or, really, non-parallel) reduceByKey where the first element is always the accumulator and the second a value, reduceByKey operates in a distributed fashion, i.e. each node will reduce it's set of tuples into a collection of uniquely-keyed tuples and then reduce the tuples from multiple nodes until there is a final uniquely-keyed set of tuples. This means as the results from nodes are reduced, a and b represent already reduced accumulators.

Spark RDD reduceByKey function merges the values for each key using an associative reduce function.
The reduceByKey function works only on the RDDs and this is a transformation operation that means it is lazily evaluated. And an associative function is passed as a parameter, which is applied to source RDD and creates a new RDD as a result.
So in your example, rdd pairs has a set of multiple paired elements like (s1,1), (s2,1) etc. And reduceByKey accepts a function (accumulator, n) => (accumulator + n), which initialise the accumulator variable to default value 0 and adds up the element for each key and return the result rdd counts having the total counts paired with key.

Simple if your input RDD data look like this:
(aa,1)
(bb,1)
(aa,1)
(cc,1)
(bb,1)
and if you apply reduceByKey on above rdd data then few you have to remember,
reduceByKey always takes 2 input (x,y) and always works with two rows at a time.
As it is reduceByKey it will combine two rows of same key and combine the result of value.
val rdd2 = rdd.reduceByKey((x,y) => x+y)
rdd2.foreach(println)
output:
(aa,2)
(bb,2)
(cc,1)

Related

Spark calculate the percentage of appearances of words

I've a PairRDD like this (word, wordCount). Now, I need to calculate for each word the percentage of appearances on the total of the words count, obtaining a final PairRDD like this (word, (wordCount, percentage)).
I tried with:
val pairs = .... .cache() // (word, wordCount)
val wordsCount = pairs.map(_._2).reduce(_+_)
pairs.map{
case (word, count) => (word, (count, BigDecimal(count/wordsCount.toDouble * 100).setScale(3, BigDecimal.RoundingMode.HALF_UP).toDouble))
}
But it does not seem very efficient (i'm a beginner). Is there a better way to do it?
When calculating wordsCount, map(_._2).reduce(_ + _) is more idiomatically expressed by the aggregate method:
val pairs = .... .cache() // (word, wordCount)
val wordsCount = pairs.aggregate(0)((sum, pair) => sum + pair._2, _ + _)
The first argument is the initial total count, which should be zero. The next two arguments are in a separate argument list, and are executed for each member of your pair RDD. The first of this second set of arguments ((sum, pair) => sum + pair._2, termed the sequence operator) adds each value in a partition to the current sum value for that partition; while the second (_ + _, termed the combination operator) combines counts from different partitions. The main benefit is that this operation is performed in parallel, by each partition, across the data, and guarantees no expensive repartitioning (a.k.a. shuffling) of that data. (Repartitioning occurs when data needs to be transferred between nodes in the cluster, and is very slow due to network communication.)
While aggregate does not affect partitioning of the data, you should note that, if your data is partitioned, a map operation will remove the partitioning scheme - because of the possibility that the keys in the pair RDD might get changed by the transformation. This might resulting in subsequent repartitioning if you perform any further transformations on the result of a map transformation. That said, a map operation followed by a reduce shouldn't result in repartitioning, but might be slightly less efficient because of the extra step (but not hugely so).
If you're concerned that the total word count might overflow an Int type, you could use a BigInt instead:
val wordsCount = pairs.aggregate(BigInt(0))((sum, pair) => sum + pair._2, _ + _)
As for producing a pair RDD with the word counts and percentages as the value, you should use mapValues instead of map. Again mapValues will retain any existing partitioning scheme (because it guarantees that the keys will not get changed by the transformation), while map will remove it. Furthermore, mapValues is simpler in that you do not need to process key values:
pairs.mapValues(c => (c, c * 100.0 / wordsCount))
That should provide sufficient accuracy for your purposes. I would only worry about rounding, and using BigDecimal, when retrieving and outputting values. However, if you really do need that, your code would look like this:
pairs.mapValues{c =>
(c, BigDecimal(c * 100.0 / wordsCount).setScale(3, BigDecimal.RoundingMode.HALF_UP).toDouble))
}

Count operation in reduceByKey in spark

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Here I have performed Sum operation But Is it possible to do count operation inside reduceByKey.
Like what i think,
reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))
But this is not working any suggestion please.
Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform.map { temp =>
((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
No, you can't do that. RDD provide iterator model for lazy computation. So every element will be visited only once.
If you really want to do sum as described, re-partition your rdd first, then use mapWithPartition, implement your calculation in closure( Keep in mind that elements in RDD is not in order).

Can only zip RDDs with same number of elements in each partition despite repartition

I load a dataset
val data = sc.textFile("/home/kybe/Documents/datasets/img.csv",defp)
I want to put an index on this data thus
val nb = data.count.toInt
val tozip = sc.parallelize(1 to nb).repartition(data.getNumPartitions)
val res = tozip.zip(data)
Unfortunately i have the following error
Can only zip RDDs with same number of elements in each partition
How can i modify the number of element by partition if it is possible ?
Why it doesn't work?
The documentation for zip() states:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
So we need to make sure we meet 2 conditions:
both RDDs have the same number of partitions
respective partitions in those RDDs have exactly the same size
You are making sure that you will have the same number of partitions with repartition() but Spark doesn't guarantee that you will have the same distribution in each partition for each RDD.
Why is that?
Because there are different types of RDDs and most of them have different partitioning strategies! For example:
ParallelCollectionRDD is created when you parallelise a collection with sc.parallelize(collection) it will see how many partitions there should be, will check the size of the collection and calculate the step size. I.e. you have 15 elements in the list and want 4 partitions, first 3 will have 4 consecutive elements last one will have the remaining 3.
HadoopRDD if I remember correctly, one partition per file block. Even though you are using a local file internally Spark first creates a this kind of RDD when you read a local file and then maps that RDD since that RDD is a pair RDD of <Long, Text> and you just want String :-)
etc.etc.
In your example Spark internally does create different types of RDDs (CoalescedRDD and ShuffledRDD) while doing the repartitioning but I think you got the global idea that different RDDs have different partitioning strategies :-)
Notice that the last part of the zip() doc mentions the map() operation. This operation does not repartition as it's a narrow transformation data so it would guarantee both conditions.
Solution
In this simple example as it was mentioned you can do simply data.zipWithIndex. If you need something more complicated then creating the new RDD for zip() should be created with map() as mentioned above.
I solved this by creating an implicit helper like so
implicit class RichContext[T](rdd: RDD[T]) {
def zipShuffle[A](other: RDD[A])(implicit kt: ClassTag[T], vt: ClassTag[A]): RDD[(T, A)] = {
val otherKeyd: RDD[(Long, A)] = other.zipWithIndex().map { case (n, i) => i -> n }
val thisKeyed: RDD[(Long, T)] = rdd.zipWithIndex().map { case (n, i) => i -> n }
val joined = new PairRDDFunctions(thisKeyed).join(otherKeyd).map(_._2)
joined
}
}
Which can then be used like
val rdd1 = sc.parallelize(Seq(1,2,3))
val rdd2 = sc.parallelize(Seq(2,4,6))
val zipped = rdd1.zipShuffle(rdd2) // Seq((1,2),(2,4),(3,6))
NB: Keep in mind that the join will cause a shuffle.
The following provides a Python answer to this problem by defining a custom_zip method:
Can only zip with RDD which has the same number of partitions error

Checking if an RDD element is in another using the map function

I'm new to Spark and was wondering about closures.
I have two RDDs, one containing a list of IDs and a values, and the other containing a list of selected IDs.
Using a map, I want to increase the value of the element, if the other RDD contains its ID, like so.
val ids = sc.parallelize(List(1,2,10,5))
val vals = sc.parallelize(List((1, 0), (2, 0), (3,0), (4,0)))
vals.map( v => {
if(ids.collect().contains(v._1)){
(v._1, 1)
}
})
However the job hangs and never completes.
What is the proper way to do this,
Thanks for your help!
Your implementation tries to use one RDD (ids) inside a closure used to map another - this isn't allowed in Spark applications: anything to be used in a closure must be serializable (and preferably small), since it will be serialized and sent to each worker.
a leftOuterJoin between these RDDs should get you what you want:
val ids = sc.parallelize(List(1,2,10,5))
val vals = sc.parallelize(List((1, 0), (2, 0), (3,0), (4,0)))
val result = vals
.leftOuterJoin(ids.keyBy(i => i))
.mapValues({
case (v, Some(matchingId)) => v + 1 // increase value if match found
case (v, None) => v // leave value as-is otherwise
})
The leftOuterJoin expects two key-value RDDs, hence we artificially extract a key from the ids RDD using the identity function. Then we map the values of each resulting (id: Int, (value: Int, matchingId: Option[Int])) record into either v or v+1.
Generally, you should always aim to minimize the use of actions like collect when using Spark, as such actions move data back from the distributed cluster into your driver application.

Spark closure argument binding

I am working with Apache Spark in Scala.
I have a problem when trying to manipulate one RDD with data from a second RDD. I am trying to pass the 2nd RDD as an argument to a function being 'mapped' against the first RDD, but seemingly the closure created on that function binds an uninitialized version of that value.
Following is a simpler piece of code that shows the type of problem I'm seeing. (My real example where I first had trouble is larger and less understandable).
I don't really understand the argument binding rules for Spark closures.
What I'm really looking for is a basic approach or pattern for how to manipulate one RDD using the content of another (which was previously constructed elsewhere).
In the following code, calling Test1.process(sc) will fail with a null pointer access in findSquare (as the 2nd arg bound in the closure is not initialized)
object Test1 {
def process(sc: SparkContext) {
val squaresMap = (1 to 10).map(n => (n, n * n))
val squaresRDD = sc.parallelize(squaresMap)
val primes = sc.parallelize(List(2, 3, 5, 7))
for (p <- primes) {
println("%d: %d".format(p, findSquare(p, squaresRDD)))
}
}
def findSquare(n: Int, squaresRDD: RDD[(Int, Int)]): Int = {
squaresRDD.filter(kv => kv._1 == n).first._1
}
}
Problem you experience has nothing to do with closures or RDDs which, contrary to popular belief, are serializable.
It is simply breaks a fundamental Spark rule which states that you cannot trigger an action or transformation from another action or transformation* and different variants of this question have been asked on SO multiple times.
To understand why that's the case you have to think about the architecture:
SparkContext is managed on the driver
everything that happens inside transformations is executed on the workers. Each worker have access only to its own part of the data and don't communicate with other workers**.
If you want to use content of multiple RDDs you have to use one of the transformations which combine RDDs, like join, cartesian, zip or union.
Here you most likely (I am not sure why you pass tuple and use only first element of this tuple) want to either use a broadcast variable:
val squaresMapBD = sc.broadcast(squaresMap)
def findSquare(n: Int): Seq[(Int, Int)] = {
squaresMapBD.value
.filter{case (k, v) => k == n}
.map{case (k, v) => (n, k)}
.take(1)
}
primes.flatMap(findSquare)
or Cartesian:
primes
.cartesian(squaresRDD)
.filter{case (n, (k, _)) => n == k}.map{case (n, (k, _)) => (n, k)}
Converting primes to dummy pairs (Int, null) and join would be more efficient:
primes.map((_, null)).join(squaresRDD).map(...)
but based on your comments I assume you're interested in a scenario when there is natural join condition.
Depending on a context you can also consider using database or files to store common data.
On a side note RDDs are not iterable so you cannot simply use for loop. To be able to do something like this you have to collect or convert toLocalIterator first. You can also use foreach method.
* To be precise you cannot access SparkContext.
** Torrent broadcast and tree aggregates involve communication between executors so it is technically possible.
RDD are not serializable, so you can't use an rdd inside an rdd trasformation.
Then I've never seen enumerate an rdd with a for statement, usually I use foreach statement that is part of rdd api.
In order to combine data from two rdd, you can leverage join, union or broadcast ( in case your rdd is small)