Coalescing has no effect on number of partitions in spark - scala

Here is the code:
val nouns = sc.textFile("/Users/kaiyin/IdeaProjects/learnSpark/src/main/resources/nouns")
val verbs = sc.textFile("/Users/kaiyin/IdeaProjects/learnSpark/src/main/resources/verbs")
val sentences = nouns.cartesian(verbs).take(10)
sentences.foreach(println _)
println(s"N partitions for nouns: ${nouns.partitions.size}")
nouns.coalesce(10, true)
println(s"N partitions for nouns after coalesce: ${nouns.partitions.size}")
Result:
N partitions for nouns: 2
N partitions for nouns after coalesce: 2
From spark 1.6.2 doc:
Note: With shuffle = true, you can actually coalesce to a larger
number of partitions. This is useful if you have a small number of
partitions, say 100, potentially with a few partitions being
abnormally large. Calling coalesce(1000, shuffle = true) will result
in 1000 partitions with the data distributed using a hash partitioner.
But apparently coalesce has no effect at all in this case. Why?
Whole script is here: https://github.com/kindlychung/learnSpark/blob/master/src/main/scala/RDDDemo.scala

coalesce doesn't modify the RDD in place but returns a new RDD. Since you check the number of partitions of the input RDD, this is the expected output.
val rdd = sc.parallelize(1 to 100, 10)
val coalesced = rdd.coalesce(200, true)
coalesced.partitions.size
// Int = 200
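Applied to the question's code, the fix is simply to keep the RDD that coalesce returns (a minimal sketch reusing the question's nouns RDD and path):
val nouns = sc.textFile("/Users/kaiyin/IdeaProjects/learnSpark/src/main/resources/nouns")
// coalesce returns a new RDD; assign it instead of discarding it
val coalescedNouns = nouns.coalesce(10, shuffle = true)
println(s"N partitions for nouns after coalesce: ${coalescedNouns.partitions.size}")
// prints 10; the original `nouns` still reports 2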

Related

Spark DataFrame: Is order of withColumn guaranteed?

Given the following code:
dataFrame
  .withColumn("A", myUdf1($"x")) // withColumn1 from x
  .withColumn("B", myUdf2($"y")) // withColumn2 from y
Is it guaranteed that withColumn1 will execute before withColumn2?
A better example:
dataFrame
  .withColumn("A", myUdf1($"x")) // withColumn1 from x
  .withColumn("B", myUdf2($"A")) // withColumn2 from A!!
Note that withColumn2 operates on A that is calculated from withColumn1.
I'm asking because I'm having inconsistent results over multiple runs of the same code and I started to think that this could be the source of the issue.
EDIT: Added more detailed code sample
val result = dataFrame
  .groupBy("key")
  .agg(
    collect_list($"itemList").as("A"), // all items
    collect_list(when($"click".isNotNull, $"itemList")).as("B") // subset of A
  )
  // create sparse item vector from all list of items A
  .withColumn("vectorA", aggToSparseUdf($"A"))
  // create sparse item vector from all list of items B (subset of A)
  .withColumn("vectorB", aggToSparseUdf($"B"))
  // calculate ratio vector B / A
  .withColumn("ratio", divideVectors($"vectorB", $"vectorA"))
val keys: Seq[String] = result.head.getAs[Seq[String]]("key")
val values: Seq[SparseVector] = result.head.getAs[Seq[SparseVector]]("ratio")
It IS guaranteed that for each specific record in dataFrame, myUdf1 will be applied before myUdf2; However:
It is NOT guaranteed that myUdf1 will be applied to all records of dataFrame before myUdf2 is applied to any record - in other words, myUdf2 might be applied to some records before myUdf1 has been applied to other records
This is true because Spark would likely combine both operations together into a single stage, and execute this stage (applying myUdf1 and myUdf2) on each record of each partition.
This shouldn't pose any problem if your UDFs are "purely functional", or "idempotent", or cause no side effects - and they should be, because Spark assumes all transformations are such. If they weren't, Spark wouldn't be able to optimize execution by "combining" transformations together, running them in parallel on different partitions, retrying transformations etc.
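You can see this stage-combining with explain(): both UDF calls typically end up in a single projection (a sketch using the question's names; the exact plan text varies by Spark version):
dataFrame
  .withColumn("A", myUdf1($"x"))
  .withColumn("B", myUdf2($"A"))
  .explain()
// The physical plan usually shows one Project evaluating both UDFs, e.g.
// *(1) Project [x#1, UDF(x#1) AS A#5, UDF(UDF(x#1)) AS B#8]
// i.e. there is no barrier forcing UDF1 to finish on all records first.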
EDIT: if you want to force UDF1 to be completely applied before applying UDF2 to any record, you'd have to force them into separate stages - this can be done, for example, by repartitioning the DataFrame:
// sample data:
val dataFrame = Seq("A", "V", "D").toDF("x")
// two UDFs with "side effects" (printing to console):
val myUdf1 = udf[String, String](x => {
  println("In UDF1")
  x.toLowerCase
})
val myUdf2 = udf[String, String](x => {
  println("In UDF2")
  x.toUpperCase
})
// repartitioning between UDFs
dataFrame
  .withColumn("A", myUdf1($"x"))
  .repartition(dataFrame.rdd.partitions.length + 1)
  .withColumn("B", myUdf2($"A"))
  .show()
// prints:
// In UDF1
// In UDF1
// In UDF1
// In UDF2
// In UDF2
// In UDF2
NOTE that this isn't bullet-proof either - if, for example, there are failures and retries, the order can once again become non-deterministic.
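Another option (also not bullet-proof, e.g. under cache eviction or retries) is to materialize the first column before computing the second, by caching and triggering an action in between; a sketch using the same names as above:
val step1 = dataFrame.withColumn("A", myUdf1($"x")).cache()
step1.count()                               // action: forces the cached plan (including column A) to be materialized
step1.withColumn("B", myUdf2($"A")).show()  // myUdf2 reads the already-computed column A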

Dropping empty DataFrame partitions in Apache Spark

I am trying to repartition a DataFrame according to a column. The DataFrame has N (let's say N=3) different values in the partition column x, e.g.:
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x") // create dummy data
What I would like to achieve is to repartition myDF by x without producing empty partitions. Is there a better way than doing this?
val numParts = myDF.select($"x").distinct().count.toInt
myDF.repartition(numParts,$"x")
(If I don't specify numParts in repartition, most of my partitions are empty, as repartition creates 200 partitions by default ...)
One solution is to iterate over the DataFrame's partitions and count the records in each to find the non-empty partitions.
val nonEmptyPart = sparkContext.longAccumulator("nonEmptyPart")
df.foreachPartition(partition =>
  if (partition.length > 0) nonEmptyPart.add(1))
Once we have the number of non-empty partitions (nonEmptyPart), we can clean up the empty partitions by using coalesce() (check coalesce() vs repartition()).
val finalDf = df.coalesce(nonEmptyPart.value.toInt) //coalesce() accepts only Int type
It may or may not be the best approach, but this solution avoids shuffling because we are not using repartition().
Example to address comment
val df1 = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x").repartition($"x")
val nonEmptyPart = sc.longAccumulator("nonEmptyPart")
df1.foreachPartition(partition =>
  if (partition.length > 0) nonEmptyPart.add(1))
val finalDf = df1.coalesce(nonEmptyPart.value.toInt)
println(s"nonEmptyPart => ${nonEmptyPart.value.toInt}")
println(s"df.rdd.partitions.length => ${df1.rdd.partitions.length}")
println(s"finalDf.rdd.partitions.length => ${finalDf.rdd.partitions.length}")
Output
nonEmptyPart => 3
df.rdd.partitions.length => 200
finalDf.rdd.partitions.length => 3
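As an alternative to the accumulator, you can compute the same count with a small RDD job; a sketch, assuming the same df1 as above (nonEmptyCount and finalDf2 are just names for this sketch):
val nonEmptyCount = df1.rdd
  .mapPartitions(iter => Iterator(if (iter.hasNext) 1 else 0))
  .sum()                                   // sum of 0/1 flags = number of non-empty partitions
  .toInt
val finalDf2 = df1.coalesce(math.max(nonEmptyCount, 1))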

Can only zip RDDs with same number of elements in each partition despite repartition

I load a dataset
val data = sc.textFile("/home/kybe/Documents/datasets/img.csv",defp)
I want to add an index to this data, so:
val nb = data.count.toInt
val tozip = sc.parallelize(1 to nb).repartition(data.getNumPartitions)
val res = tozip.zip(data)
Unfortunately I get the following error:
Can only zip RDDs with same number of elements in each partition
How can I modify the number of elements per partition, if that is possible?
Why doesn't it work?
The documentation for zip() states:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
So we need to make sure we meet 2 conditions:
both RDDs have the same number of partitions
respective partitions in those RDDs have exactly the same size
You make sure that you have the same number of partitions with repartition(), but Spark doesn't guarantee that the elements will be distributed across those partitions the same way in each RDD.
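A quick way to see this is to compare per-partition sizes with glom() (a sketch with small toy RDDs; the counts after the shuffle are not guaranteed):
val evenSplit = sc.parallelize(1 to 10, 4)                 // ParallelCollectionRDD: deterministic slicing
val reshuffled = sc.parallelize(1 to 10, 2).repartition(4) // ShuffledRDD behind repartition
evenSplit.glom().map(_.length).collect()   // Array(2, 3, 2, 3)
reshuffled.glom().map(_.length).collect()  // some other split, e.g. Array(3, 2, 3, 2) - not guaranteed to match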
Why is that?
Because there are different types of RDDs and most of them have different partitioning strategies! For example:
ParallelCollectionRDD is created when you parallelise a collection with sc.parallelize(collection). It determines how many partitions there should be, checks the size of the collection, and calculates the step size. For example, if you have 15 elements in a list and want 4 partitions, the first 3 partitions will have 4 consecutive elements each and the last one will have the remaining 3.
HadoopRDD: if I remember correctly, one partition per file block. Even though you are using a local file, Spark internally first creates this kind of RDD when you read a local file and then maps over it, since that RDD is a pair RDD of <Long, Text> and you just want String :-)
etc.etc.
In your example Spark does internally create different types of RDDs (CoalescedRDD and ShuffledRDD) while doing the repartitioning, but I think you get the general idea that different RDDs have different partitioning strategies :-)
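You can inspect this with toDebugString (a sketch; the exact lineage text depends on the Spark version):
sc.parallelize(1 to 10).repartition(4).toDebugString
// typically something like:
// (4) MapPartitionsRDD[4] at repartition ...
//  |  CoalescedRDD[3] at repartition ...
//  |  ShuffledRDD[2] at repartition ...
//  +-(n) MapPartitionsRDD[1] at repartition ...
//     |  ParallelCollectionRDD[0] at parallelize ...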
Notice that the last part of the zip() doc mentions the map() operation. map() does not repartition the data, since it is a narrow transformation, so it would guarantee both conditions.
Solution
In this simple example, as mentioned, you can simply use data.zipWithIndex. If you need something more complicated, then the new RDD for zip() should be created with map(), as mentioned above.
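For completeness, a sketch of that simpler route with the question's data RDD (zipWithIndex puts the index second, so swap the pair if you want the index first, like the original 1 to nb):
val indexed = data.zipWithIndex().map { case (line, i) => (i + 1, line) } // 1-based index paired with each line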
I solved this by creating an implicit helper like so
import scala.reflect.ClassTag
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

implicit class RichContext[T](rdd: RDD[T]) {
  def zipShuffle[A](other: RDD[A])(implicit kt: ClassTag[T], vt: ClassTag[A]): RDD[(T, A)] = {
    val otherKeyed: RDD[(Long, A)] = other.zipWithIndex().map { case (n, i) => i -> n }
    val thisKeyed: RDD[(Long, T)] = rdd.zipWithIndex().map { case (n, i) => i -> n }
    val joined = new PairRDDFunctions(thisKeyed).join(otherKeyed).map(_._2)
    joined
  }
}
Which can then be used like
val rdd1 = sc.parallelize(Seq(1,2,3))
val rdd2 = sc.parallelize(Seq(2,4,6))
val zipped = rdd1.zipShuffle(rdd2) // Seq((1,2),(2,4),(3,6))
NB: Keep in mind that the join will cause a shuffle.
The following provides a Python answer to this problem by defining a custom_zip method:
Can only zip with RDD which has the same number of partitions error

spark: how to zip an RDD with each partition of the other RDD

Let's say I have one RDD[U] that will always consist of only 1 partition. My task is to fill this RDD up with the contents of another RDD[T] that resides over n partitions. The final output should be RDD[U] with n partitions.
What I tried to do originally is:
val newRDD = firstRDD.zip(secondRDD).map{ case(a, b) => a.insert(b)}
But I got an error: Can't zip RDDs with unequal numbers of partitions
I can see in the RDD api documentation that there is a method called zipPartitions(). Is it possible, and if so how, to use this method to zip each partition from RDD[T] with a single and only partition of RDD[U] and perform a map on it as I tried above?
Something like this should work:
val zippedFirstRDD = firstRDD.zipWithIndex.map(_.swap)
val zippedSecondRDD = secondRDD.zipWithIndex.map(_.swap)
zippedFirstRDD.join(zippedSecondRDD)
  .map { case (key, (valueU, valueT)) =>
    valueU.insert(valueT)
  }

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example if my data is like
(1,1), (1,2), (1,3), (2,1), (2,2), (2,3)
So the partitioner would put these into different partitions, with the same keys falling in the same partition. However, I do not understand the significance of the constructor argument:
new HashPartitioner(numPartitions) // What does numPartitions do?
For the above dataset how would the results differ if I did
new HashPartitioner(1)
new HashPartitioner(2)
new HashPartitioner(10)
So how does HashPartitioner work actually?
Well, let's make your dataset marginally more interesting:
val rdd = sc.parallelize(for {
  x <- 1 to 3
  y <- 1 to 2
} yield (x, None), 8)
We have six elements:
rdd.count
Long = 6
no partitioner:
rdd.partitioner
Option[org.apache.spark.Partitioner] = None
and eight partitions:
rdd.partitions.length
Int = 8
Now let's define a small helper to count the number of elements per partition:
import org.apache.spark.rdd.RDD
def countByPartition(rdd: RDD[(Int, None.type)]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}
Since we don't have a partitioner, our dataset is distributed uniformly between partitions (Default Partitioning Scheme in Spark):
countByPartition(rdd).collect()
Array[Int] = Array(0, 1, 1, 1, 0, 1, 1, 1)
Now let's repartition our dataset:
import org.apache.spark.HashPartitioner
val rddOneP = rdd.partitionBy(new HashPartitioner(1))
Since the parameter passed to HashPartitioner defines the number of partitions, we expect one partition:
rddOneP.partitions.length
Int = 1
Since we have only one partition it contains all elements:
countByPartition(rddOneP).collect
Array[Int] = Array(6)
Note that the order of values after the shuffle is non-deterministic.
In the same way, if we use HashPartitioner(2)
val rddTwoP = rdd.partitionBy(new HashPartitioner(2))
we'll get 2 partitions:
rddTwoP.partitions.length
Int = 2
Since the rdd is partitioned by key, the data won't be distributed uniformly anymore:
countByPartition(rddTwoP).collect()
Array[Int] = Array(2, 4)
Because we have three keys and only two different values of hashCode mod numPartitions, there is nothing unexpected here:
(1 to 3).map((k: Int) => (k, k.hashCode, k.hashCode % 2))
scala.collection.immutable.IndexedSeq[(Int, Int, Int)] = Vector((1,1,1), (2,2,0), (3,3,1))
Just to confirm the above:
rddTwoP.mapPartitions(iter => Iterator(iter.map(_._1).toSet)).collect()
Array[scala.collection.immutable.Set[Int]] = Array(Set(2), Set(1, 3))
Finally with HashPartitioner(7) we get seven partitions, three non-empty with 2 elements each:
val rddSevenP = rdd.partitionBy(new HashPartitioner(7))
rddSevenP.partitions.length
Int = 7
countByPartition(rddSevenP).collect()
Array[Int] = Array(0, 2, 2, 2, 0, 0, 0)
Summary and Notes
HashPartitioner takes a single argument which defines the number of partitions
values are assigned to partitions using the hash of the keys. The hash function may differ depending on the language (Scala RDDs may use hashCode, Datasets use MurmurHash 3, PySpark uses portable_hash).
In a simple case like this, where the key is a small integer, you can assume that the hash is the identity (i = hash(i)).
The Scala API uses nonNegativeMod to determine the partition based on the computed hash.
If the distribution of keys is not uniform, you can end up in situations where part of your cluster is idle.
keys have to be hashable. You can check my answer for A list as a key for PySpark's reduceByKey to read about PySpark-specific issues. Another possible problem is highlighted by the HashPartitioner documentation:
Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.
In Python 3 you have to make sure that hashing is consistent. See What does Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED mean in pyspark?
Hash partitioner is neither injective nor surjective. Multiple keys can be assigned to a single partition and some partitions can remain empty.
Please note that currently hash based methods don't work in Scala when combined with REPL defined case classes (Case class equality in Apache Spark).
HashPartitioner (or any other Partitioner) shuffles the data. Unless the partitioning is reused between multiple operations, it doesn't reduce the amount of data to be shuffled (see the sketch below).
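To illustrate that last point: if you partition once, cache the result, and then run key-based operations with a matching partitioner, the later operations can reuse the existing layout instead of shuffling again (a sketch with hypothetical names; reduceByKey picks up the existing partitioner when the partition count matches):
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))
val prePartitioned = pairs.partitionBy(new HashPartitioner(4)).cache() // one shuffle here
prePartitioned.reduceByKey(_ + _)                // reuses the existing partitioning - no additional shuffle
prePartitioned.reduceByKey(_ + _).toDebugString  // lineage shows no new ShuffledRDD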
An RDD is distributed, which means it is split into some number of parts. Each of these partitions is potentially on a different machine. A hash partitioner with argument numPartitions chooses the partition in which to place a pair (key, value) in the following way:
Creates exactly numPartitions partitions.
Places (key, value) in partition with number Hash(key) % numPartitions
The HashPartitioner.getPartition method takes a key as its argument and returns the index of the partition which the key belongs to. The partitioner has to know what the valid indices are, so it returns numbers in the right range. The number of partitions is specified through the numPartitions constructor argument.
The implementation returns roughly key.hashCode() % numPartitions. See Partitioner.scala for more details.
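A minimal sketch of what that amounts to (it mirrors Spark's Utils.nonNegativeMod and HashPartitioner.getPartition; partitionFor is just a stand-in name here; null keys go to partition 0):
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)  // keep the result in [0, mod)
}

def partitionFor(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case _    => nonNegativeMod(key.hashCode, numPartitions)
}

partitionFor(1, 2) // 1
partitionFor(2, 2) // 0
partitionFor(3, 2) // 1 -> matches the Array(Set(2), Set(1, 3)) layout seen earlier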