Can only zip RDDs with same number of elements in each partition despite repartition - scala

I load a dataset
val data = sc.textFile("/home/kybe/Documents/datasets/img.csv",defp)
I want to put an index on this data thus
val nb = data.count.toInt
val tozip = sc.parallelize(1 to nb).repartition(data.getNumPartitions)
val res = tozip.zip(data)
Unfortunately i have the following error
Can only zip RDDs with same number of elements in each partition
How can i modify the number of element by partition if it is possible ?

Why it doesn't work?
The documentation for zip() states:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
So we need to make sure we meet 2 conditions:
both RDDs have the same number of partitions
respective partitions in those RDDs have exactly the same size
You are making sure that you will have the same number of partitions with repartition() but Spark doesn't guarantee that you will have the same distribution in each partition for each RDD.
Why is that?
Because there are different types of RDDs and most of them have different partitioning strategies! For example:
ParallelCollectionRDD is created when you parallelise a collection with sc.parallelize(collection) it will see how many partitions there should be, will check the size of the collection and calculate the step size. I.e. you have 15 elements in the list and want 4 partitions, first 3 will have 4 consecutive elements last one will have the remaining 3.
HadoopRDD if I remember correctly, one partition per file block. Even though you are using a local file internally Spark first creates a this kind of RDD when you read a local file and then maps that RDD since that RDD is a pair RDD of <Long, Text> and you just want String :-)
etc.etc.
In your example Spark internally does create different types of RDDs (CoalescedRDD and ShuffledRDD) while doing the repartitioning but I think you got the global idea that different RDDs have different partitioning strategies :-)
Notice that the last part of the zip() doc mentions the map() operation. This operation does not repartition as it's a narrow transformation data so it would guarantee both conditions.
Solution
In this simple example as it was mentioned you can do simply data.zipWithIndex. If you need something more complicated then creating the new RDD for zip() should be created with map() as mentioned above.

I solved this by creating an implicit helper like so
implicit class RichContext[T](rdd: RDD[T]) {
def zipShuffle[A](other: RDD[A])(implicit kt: ClassTag[T], vt: ClassTag[A]): RDD[(T, A)] = {
val otherKeyd: RDD[(Long, A)] = other.zipWithIndex().map { case (n, i) => i -> n }
val thisKeyed: RDD[(Long, T)] = rdd.zipWithIndex().map { case (n, i) => i -> n }
val joined = new PairRDDFunctions(thisKeyed).join(otherKeyd).map(_._2)
joined
}
}
Which can then be used like
val rdd1 = sc.parallelize(Seq(1,2,3))
val rdd2 = sc.parallelize(Seq(2,4,6))
val zipped = rdd1.zipShuffle(rdd2) // Seq((1,2),(2,4),(3,6))
NB: Keep in mind that the join will cause a shuffle.

The following provides a Python answer to this problem by defining a custom_zip method:
Can only zip with RDD which has the same number of partitions error

Related

Compare two different RDDs using scala

I have two RDDs- one from hdfs file system and the other created from a string as shown below-
val txt=sc.textFile("/tmp/textFile.txt")
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str))
Now, I want two compare the data in these two RDDs:
OR
The result should be an empty RDD but that is not the case. Can someone please explain how I should compare the data of these two RDDs?
Values of the two rdds that you've created looks to be same but are not same. It is evident if you do the count of elements in both rdds as
txt.collect().count(!_.isEmpty)
//res0: Int = 3
strRDD.collect().count(!_.isEmpty)
//res1: Int = 1
The result should be an empty RDD but that is not the case.
Thats the reason the results of txt.subtract(strRDD) and strRDD.subtract(txt) are not same
val txt=sc.textFile("/tmp/textFile.txt") gives each line as separate element in txt RDD
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str)) gives one \n separated element in strRDD RDD
I hope the explanation is clear

Apache Spark's RDD splitting according to the particular size

I am trying to read strings from a text file, but I want to limit each line according to a particular size. For example;
Here is my representing the file.
aaaaa\nbbb\nccccc
When trying to read this file by sc.textFile, RDD would appear this one.
scala> val rdd = sc.textFile("textFile")
scala> rdd.collect
res1: Array[String] = Array(aaaaa, bbb, ccccc)
But I want to limit the size of this RDD. For example, if the limit is 3, then I should get like this one.
Array[String] = Array(aaa, aab, bbc, ccc, c)
What is the best performance way to do that?
Not a particularly efficient solution (not terrible either) but you can do something like this:
val pairs = rdd
.flatMap(x => x) // Flatten
.zipWithIndex // Add indices
.keyBy(_._2 / 3) // Key by index / n
// We'll use a range partitioner to minimize the shuffle
val partitioner = new RangePartitioner(pairs.partitions.size, pairs)
pairs
.groupByKey(partitioner) // group
// Sort, drop index, concat
.mapValues(_.toSeq.sortBy(_._2).map(_._1).mkString(""))
.sortByKey()
.values
It is possible to avoid the shuffle by passing data required to fill the partitions explicitly but it takes some effort to code. See my answer to Partition RDD into tuples of length n.
If you can accept some misaligned records on partitions boundaries then simple mapPartitions with grouped should do the trick at much lower cost:
rdd.mapPartitions(_.flatMap(x => x).grouped(3).map(_.mkString("")))
It is also possible to use sliding RDD:
rdd.flatMap(x => x).sliding(3, 3).map(_.mkString(""))
You will need to read all the data anyhow. Not much you can do apart from mapping each line and trim it.
rdd.map(line => line.take(3)).collect()

Why printing inside foreach doesn't reflect an order of elements

May be I am missing something but I expected the data to be sorted based on the key
scala> val x=sc.parallelize(Array( "cat", "ant", "1"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at <console>:22
scala> val xxx=x.map(v=> (v,v.length))
xxx: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[161] at map at <console>:26
scala> xxx.sortByKey().foreach(println)
(1,1)
(cat,3)
(ant,3)
scala> xxx.sortByKey().foreach(println)
(cat,3)
(1,1)
(ant,3)
It works if I tell spark to use only 1 partitions as below but how to make this work in a cluster or more than 1 workers?
scala> xxx.sortByKey(numPartitions=1).foreach(println)
(1,1)
(ant,3)
(cat,3)
UPDATE:
I think I got the answer. It is being sorted correctly as it works when I use the collect
scala> xxx.sortByKey().collect
res170: Array[(String, Int)] = Array((1,1), (ant,3), (cat,3))
Keeping the question open to validate my understanding.
That makes sense. foreach runs in parallel across the partitions which creates non-deterministic ordering. The order may be mixed. collect gives you an array of the partitions concatenated in their sorted order.
Have a look at spark documentation why collect() method fixed the issue for you.
e.g.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Calling collect() on the resulting RDD will return or output an ordered list of records
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Remember doing a collect() action operation on a very large distributed RDD can cause your driver program to run out of memory and crash. So, do not use collect() except for when you are prototyping your Spark program on a small dataset.
Have a look at this article for more details
EDIT:
sortByKey(): Sort the RDD by key, so that each partition contains a sorted range of the elements. Since all partitions may not reside in same Executor node, you will not get ordered set unless you call collect()

spark: how to zip an RDD with each partition of the other RDD

Let's say I have one RDD[U] that will always consist of only 1 partition. My task is fill this RDD up with the contents of another RDD[T] that resides over n number of partitions. The final output should be n number of partitions of RDD[U].
What I tried to do originally is:
val newRDD = firstRDD.zip(secondRDD).map{ case(a, b) => a.insert(b)}
But I got an error: Can't zip RDDs with unequal numbers of partitions
I can see in the RDD api documentation that there is a method called zipPartitions(). Is it possible, and if so how, to use this method to zip each partition from RDD[T] with a single and only partition of RDD[U] and perform a map on it as I tried above?
Something like this should work:
val zippedFirstRDD = firstRDD.zipWithIndex.map(_.swap)
val zippedSecondRDD = secondRDD.zipWithIndex.map(_.swap)
zippedFirstRDD.join(zippedSecondRDD)
.map{case (key, (valueU, valueT)) => {
valueU.insert(valueT)
}}

reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way reduceByKey function works in Spark. Suppose we have the following code:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The map function is clear: s is the key and it points to the line from data.txt and 1 is the value.
However, I didn't get how the reduceByKey works internally? Does "a" points to the key? Alternatively, does "a" point to "s"? Then what does represent a + b? how are they filled?
Let's break it down to discrete methods and types. That usually exposes the intricacies for new devs:
pairs.reduceByKey((a, b) => a + b)
becomes
pairs.reduceByKey((a: Int, b: Int) => a + b)
and renaming the variables makes it a little more explicit
pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)
So, we can now see that we are simply taking an accumulated value for the given key and summing it with the next value of that key. NOW, let's break it further so we can understand the key part. So, let's visualize the method more like this:
pairs.reduce((accumulatedValue: List[(String, Int)], currentValue: (String, Int)) => {
//Turn the accumulated value into a true key->value mapping
val accumAsMap = accumulatedValue.toMap
//Try to get the key's current value if we've already encountered it
accumAsMap.get(currentValue._1) match {
//If we have encountered it, then add the new value to the existing value and overwrite the old
case Some(value : Int) => (accumAsMap + (currentValue._1 -> (value + currentValue._2))).toList
//If we have NOT encountered it, then simply add it to the list
case None => currentValue :: accumulatedValue
}
})
So, you can see that the reduceByKey takes the boilerplate of finding the key and tracking it so that you don't have to worry about managing that part.
Deeper, truer if you want
All that being said, that is a simplified version of what happens as there are some optimizations that are done here. This operation is associative, so the spark engine will perform these reductions locally first (often termed map-side reduce) and then once again at the driver. This saves network traffic; instead of sending all the data and performing the operation, it can reduce it as small as it can and then send that reduction over the wire.
One requirement for the reduceByKey function is that is must be associative. To build some intuition on how reduceByKey works, let's first see how an associative associative function helps us in a parallel computation:
As we can see, we can break an original collection in pieces and by applying the associative function, we can accumulate a total. The sequential case is trivial, we are used to it: 1+2+3+4+5+6+7+8+9+10.
Associativity lets us use that same function in sequence and in parallel. reduceByKey uses that property to compute a result out of an RDD, which is a distributed collection consisting of partitions.
Consider the following example:
// collection of the form ("key",1),("key,2),...,("key",20) split among 4 partitions
val rdd =sparkContext.parallelize(( (1 to 20).map(x=>("key",x))), 4)
rdd.reduceByKey(_ + _)
rdd.collect()
> Array[(String, Int)] = Array((key,210))
In spark, data is distributed into partitions. For the next illustration, (4) partitions are to the left, enclosed in thin lines. First, we apply the function locally to each partition, sequentially in the partition, but we run all 4 partitions in parallel. Then, the result of each local computation are aggregated by applying the same function again and finally come to a result.
reduceByKey is an specialization of aggregateByKey aggregateByKey takes 2 functions: one that is applied to each partition (sequentially) and one that is applied among the results of each partition (in parallel). reduceByKey uses the same associative function on both cases: to do a sequential computing on each partition and then combine those results in a final result as we have illustrated here.
In your example of
val counts = pairs.reduceByKey((a,b) => a+b)
a and b are both Int accumulators for _2 of the tuples in pairs. reduceKey will take two tuples with the same value s and use their _2 values as a and b, producing a new Tuple[String,Int]. This operation is repeated until there is only one tuple for each key s.
Unlike non-Spark (or, really, non-parallel) reduceByKey where the first element is always the accumulator and the second a value, reduceByKey operates in a distributed fashion, i.e. each node will reduce it's set of tuples into a collection of uniquely-keyed tuples and then reduce the tuples from multiple nodes until there is a final uniquely-keyed set of tuples. This means as the results from nodes are reduced, a and b represent already reduced accumulators.
Spark RDD reduceByKey function merges the values for each key using an associative reduce function.
The reduceByKey function works only on the RDDs and this is a transformation operation that means it is lazily evaluated. And an associative function is passed as a parameter, which is applied to source RDD and creates a new RDD as a result.
So in your example, rdd pairs has a set of multiple paired elements like (s1,1), (s2,1) etc. And reduceByKey accepts a function (accumulator, n) => (accumulator + n), which initialise the accumulator variable to default value 0 and adds up the element for each key and return the result rdd counts having the total counts paired with key.
Simple if your input RDD data look like this:
(aa,1)
(bb,1)
(aa,1)
(cc,1)
(bb,1)
and if you apply reduceByKey on above rdd data then few you have to remember,
reduceByKey always takes 2 input (x,y) and always works with two rows at a time.
As it is reduceByKey it will combine two rows of same key and combine the result of value.
val rdd2 = rdd.reduceByKey((x,y) => x+y)
rdd2.foreach(println)
output:
(aa,2)
(bb,2)
(cc,1)