Dropping empty DataFrame partitions in Apache Spark - scala

I am trying to repartition a DataFrame according to a column. The DataFrame has N (say N = 3) different values in the partition column x, e.g.:
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x") // create dummy data
What I'd like to achieve is to repartition myDF by x without producing empty partitions. Is there a better way than doing this?
val numParts = myDF.select($"x").distinct().count.toInt
myDF.repartition(numParts,$"x")
(If I don't specify numParts in repartition, most of my partitions are empty (as repartition creates 200 partitions) ...)

One solution is to iterate over the DataFrame's partitions and count the records in each one to find the non-empty partitions.
val nonEmptyPart = sparkContext.longAccumulator("nonEmptyPart")
df.foreachPartition(partition =>
  // hasNext is enough to tell whether the partition contains any rows
  if (partition.hasNext) nonEmptyPart.add(1))
Once we have the number of non-empty partitions (nonEmptyPart), we can drop the empty ones with coalesce() (check coalesce() vs repartition()).
val finalDf = df.coalesce(nonEmptyPart.value.toInt) //coalesce() accepts only Int type
It may or may not be the best approach, but this solution avoids a shuffle because we are not using repartition().
Example to address comment
val df1 = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x").repartition($"x")
val nonEmptyPart = sc.longAccumulator("nonEmptyPart")
df1.foreachPartition(partition =>
  if (partition.hasNext) nonEmptyPart.add(1))
val finalDf = df1.coalesce(nonEmptyPart.value.toInt)
println(s"nonEmptyPart => ${nonEmptyPart.value.toInt}")
println(s"df.rdd.partitions.length => ${df1.rdd.partitions.length}")
println(s"finalDf.rdd.partitions.length => ${finalDf.rdd.partitions.length}")
Output
nonEmptyPart => 3
df1.rdd.partitions.length => 200
finalDf.rdd.partitions.length => 3
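As an aside, the 200 partitions come from the spark.sql.shuffle.partitions setting, so another option is to lower that setting before repartitioning instead of counting the distinct values first. A minimal sketch, assuming a Spark 2.x SparkSession named spark (on 1.6 the equivalent is sqlContext.setConf):
// Hedged sketch: repartition($"x") produces spark.sql.shuffle.partitions partitions
// (200 by default), which is why most of them end up empty
spark.conf.set("spark.sql.shuffle.partitions", "3")
val repartitioned = myDF.repartition($"x")
println(repartitioned.rdd.partitions.length) // 3
Note that this changes the setting for every subsequent shuffle in the session, not just this one repartition.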

Related

How can I deal with each adjoin two element difference greater than threshold from Spark RDD

I have a problem in Spark with Scala: I need to get the values for which the difference between two adjacent elements is greater than a threshold. I create an RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each pair of adjacent elements like this:
a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), ..., a.apply(a.length-1)-a.apply(a.length-2)
If the result is greater than the threshold of 10, then output those elements, like this:
[19,89]
How can I do this in Scala on an RDD?
If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create an RDD as
val rdd = sc.parallelize(data)
What you desire can be achieved by doing the following:
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
  .sliding(2)                        // adjacent pairs as two-element arrays
  .map(x => (x(1), x(1) - x(0)))     // (second element, difference)
  .filter(y => y._2 > 10)            // keep differences above the threshold
  .map(z => z._1)
Doing
finalrdd.foreach(println)
should print
19
89
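For intuition, the intermediate sliding(2) step on the sample data yields the adjacent pairs before any mapping or filtering; reusing the rdd defined above:
// Collect the window contents on the driver just to inspect them (small data only)
rdd.sliding(2).collect().foreach(a => println(a.mkString("(", ",", ")")))
// (2,3) (3,5) (5,8) (8,19) (19,3) (3,5) (5,89) (89,20) (20,17), printed one per line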
You can create a second RDD from the original one, shifted by one element, and zip the two, which creates tuples like (2,3), (3,5), (5,8); then filter on the difference and keep the second element when it is greater than 10. Because zip requires the same number of elements in every partition, both sides are aligned by index and coalesced to a single partition first:
val rdd = spark.sparkContext.parallelize(Seq(2, 3, 5, 8, 19, 3, 5, 89, 20, 17))
val indexed = rdd.zipWithIndex()
val n = rdd.count()
// coalesce(1) keeps the order and gives zip the same element count on both sides
val current = indexed.filter(_._2 < n - 1).keys.coalesce(1)
val next = indexed.filter(_._2 > 0).keys.coalesce(1)
current.zip(next)
  .map { case (a, b) => (b - a, b) }
  .filter { case (diff, _) => diff > 10 }
  .map { case (_, b) => b }
  .foreach(println)
Hope this helps!

How to split a spark dataframe with equal records

I am using df.randomSplit() but it is not splitting into equal numbers of rows. Is there any other way I can achieve this?
In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; as far as I know there is still no transformation that splits a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations so it should be easy to adapt to Python:
val npartitions = 3
// inputRDD, seed and m_classIndex come from the surrounding code: the dataset to
// split, a random seed, and the index of the class-label column of each instance
val foldedRDD = inputRDD
  // Pair each instance with a random number
  .zipWithIndex
  .map(t => (t._1, t._2, new scala.util.Random(t._2 * seed).nextInt()))
  // Random ordering (class label first, then the random number)
  .sortBy(t => (t._1(m_classIndex), t._3))
  // Assign each instance to a fold
  .zipWithIndex
  .map(t => (t._1, t._2 % npartitions))

val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter(_._2 == f)
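A minimal sketch of how the folds might then be rotated for the cross-validation mentioned above, reusing balancedRDDList (the model-training step itself is left out):
for (f <- 0 until npartitions) {
  // fold f is held out for testing, the remaining folds form the training set;
  // _._1._1 drops the fold id and the bookkeeping indices, keeping the instance
  val test = balancedRDDList(f).map(_._1._1)
  val train = (0 until npartitions)
    .filter(_ != f)
    .map(balancedRDDList)
    .reduce(_ union _)
    .map(_._1._1)
  // train / evaluate on (train, test) here
}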

Coalescing has no effect on number of partitions in spark

Here is the code:
val nouns = sc.textFile("/Users/kaiyin/IdeaProjects/learnSpark/src/main/resources/nouns")
val verbs = sc.textFile("/Users/kaiyin/IdeaProjects/learnSpark/src/main/resources/verbs")
val sentences = nouns.cartesian(verbs).take(10)
sentences.foreach(println _)
println(s"N partitions for nouns: ${nouns.partitions.size}")
nouns.coalesce(10, true)
println(s"N partitions for nouns after coalesce: ${nouns.partitions.size}")
Result:
N partitions for nouns: 2
N partitions for nouns after coalesce: 2
From spark 1.6.2 doc:
Note: With shuffle = true, you can actually coalesce to a larger
number of partitions. This is useful if you have a small number of
partitions, say 100, potentially with a few partitions being
abnormally large. Calling coalesce(1000, shuffle = true) will result
in 1000 partitions with the data distributed using a hash partitioner.
But apparently coalesce has no effect at all in this case. Why?
Whole script is here: https://github.com/kindlychung/learnSpark/blob/master/src/main/scala/RDDDemo.scala
coalesce doesn't modify the RDD in place but returns a new RDD. Since you check the number of partitions of the input RDD, this is the expected output.
val rdd = sc.parallelize(1 to 100, 10)
val coalesced = rdd.coalesce(200, true)
coalesced.partitions.size
// Int = 200
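Applied to the snippet in the question, assigning the result of coalesce instead of discarding it gives the expected count:
// coalesce returns a new RDD; the original nouns keeps its 2 partitions
val nounsCoalesced = nouns.coalesce(10, true)
println(s"N partitions for nouns after coalesce: ${nounsCoalesced.partitions.size}")
// N partitions for nouns after coalesce: 10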

How to get a subset of a RDD?

I am new to Spark. If I have an RDD consisting of key-value pairs, what is an efficient way to return a subset of this RDD containing the keys that appear more than a certain number of times in the original RDD?
For example, if my original data RDD is like this:
val dataRDD=sc.parallelize(List((1,34),(5,3),(1,64),(3,67),(5,0)),3)
I want to get a new RDD containing only the tuples whose keys appear more than once in dataRDD. The new RDD should contain these tuples: (1,34),(5,3),(1,64),(5,0). How can I get this new RDD? Thank you very much.
Count keys and filter infrequent:
val counts = dataRDD.keys.map((_, 1)).reduceByKey(_ + _)
val infrequent = counts.filter(_._2 == 1)
If the number of infrequent keys is too large to be handled in memory, you can use PairRDDFunctions.subtractByKey:
dataRDD.subtractByKey(infrequent)
otherwise use a broadcast variable:
val infrequentKeysBd = sc.broadcast(infrequent.keys.collect.toSet)
dataRDD.filter{ case(k, _) => !infrequentKeysBd.value.contains(k)}
If the number of frequent keys is very low, you can filter the frequent keys and use a broadcast variable as above:
val frequent = counts.filter(_._2 > 1)
val frequentKeysBd = sc.broadcast(frequent.keys.collect.toSet) // as before, but with the frequent keys
dataRDD.filter{case(k, _) => frequentKeysBd.value.contains(k)}
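A quick check against the sample dataRDD from the question (a sketch; collect order may vary):
// Keys 1 and 5 appear twice, key 3 only once, so its tuple is dropped
dataRDD.subtractByKey(infrequent).collect().foreach(println)
// prints (1,34), (1,64), (5,3), (5,0) in some order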

spark: how to zip an RDD with each partition of the other RDD

Let's say I have one RDD[U] that will always consist of only 1 partition. My task is to fill this RDD with the contents of another RDD[T] that resides over n partitions. The final output should be n partitions of RDD[U].
What I tried to do originally is:
val newRDD = firstRDD.zip(secondRDD).map{ case(a, b) => a.insert(b)}
But I got an error: Can't zip RDDs with unequal numbers of partitions
I can see in the RDD api documentation that there is a method called zipPartitions(). Is it possible, and if so how, to use this method to zip each partition from RDD[T] with a single and only partition of RDD[U] and perform a map on it as I tried above?
Something like this should work:
val zippedFirstRDD = firstRDD.zipWithIndex.map(_.swap)
val zippedSecondRDD = secondRDD.zipWithIndex.map(_.swap)
zippedFirstRDD.join(zippedSecondRDD)
  .map { case (key, (valueU, valueT)) => valueU.insert(valueT) }
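On the zipPartitions question: like zip, zipPartitions requires both RDDs to have the same number of partitions, so it cannot pair an n-partition RDD[T] with a single-partition RDD[U] directly. If the single-partition firstRDD is small enough to collect, a hedged alternative (not zipPartitions, and using the same hypothetical insert method as the question) is to broadcast its contents and combine them with each partition of secondRDD via mapPartitions:
// Hedged sketch: broadcast the single partition of U values so every partition of
// secondRDD can see it, then pair each T positionally with a U via insert.
// Iterator.zip stops at the shorter of the two sequences within each partition.
val uValues = sc.broadcast(firstRDD.collect())
val newRDD = secondRDD.mapPartitions { ts =>
  ts.zip(uValues.value.iterator).map { case (t, u) => u.insert(t) }
}
// newRDD keeps the same number of partitions as secondRDD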