Spark aggregateByKey partitioner order - Scala

If I apply a hash partitioner to Spark's aggregateByKey function, i.e. myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp),
does myRDD get repartitioned first, before its key/value pairs are aggregated using combOp and mergeOp? Or does myRDD go through combOp and mergeOp first, and only then is the resulting RDD repartitioned using the HashPartitioner?

aggregateByKey applies map-side aggregation before the eventual shuffle. Since every partition is processed sequentially, the only operations applied in this phase are initialization (creating the zeroValue) and combOp. The goal of mergeOp is to combine aggregation buffers, so it is not used before the shuffle.
If the input RDD is a ShuffledRDD with the same partitioner as the one requested for aggregateByKey, the data is not shuffled at all and is aggregated locally using mapPartitions.
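To make the ordering concrete, here is a minimal spark-shell style sketch (the tiny sample data and the 20-partition HashPartitioner are placeholders; the combOp/mergeOp names follow the question):

import org.apache.spark.HashPartitioner

// Hypothetical pair RDD of (word, count) pairs.
val myRDD = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// "combOp" in the question's naming: folds a value into the per-partition buffer.
// This is the part that runs map-side, before the shuffle.
val combOp = (buf: Int, v: Int) => buf + v

// "mergeOp": merges two buffers coming from different partitions.
// This only runs after the shuffle.
val mergeOp = (b1: Int, b2: Int) => b1 + b2

// The HashPartitioner(20) describes the layout of the result; the input is only
// re-shuffled if its current partitioner differs from it.
val aggregated = myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)
aggregated.partitioner   // Some(org.apache.spark.HashPartitioner@...)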

Related

RDD persist mechanism (what happens when I persist an RDD and then use take(10) instead of count())

What happens when I persist an RDD and then use take(10) instead of count()?
I have read some comments saying that if I use take() instead of count(), only some of the partitions might be persisted, not all of them.
But if my dataset is big enough, using count() is very time-consuming.
Is there any other action I can use to trigger persistence of all partitions?
foreachPartition is an action and it needs data from all partitions; can I use it after persist()?
Ex:
val rdd1 = sc.textFile("src/main/resources/").persist()
rdd1.foreachPartition(partition => partition.take(1))
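For what it's worth, a common sketch for forcing every partition to be materialized (assuming the RDD has already been marked with persist(), as above) is a no-op foreachPartition:

// Touch every partition: each one is computed once and, because the RDD is
// marked as persisted, stored by the block manager as it is computed.
rdd1.foreachPartition(_ => ())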

How RDDs are created in Structured streaming in spark?

How are RDDs created in Structured Streaming in Spark? In a DStream we have one RDD for every batch; is it created as soon as data is available, or when the trigger happens? And how are the RDDs physically distributed across executors?
Internally, a DStream is represented as a sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval.
In the word count example:
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So an RDD is created on the driver for the blocks generated during the batchInterval; those blocks become the partitions of the RDD, and each partition is a task in Spark. blockInterval == batchInterval would mean that a single partition is created, and it is probably processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
A DStream executes as soon as the trigger happens: if your batch interval is 2 seconds, a job is triggered every 2 seconds. The triggering point is the batch duration, not data availability; if data is present at that time, the DStream contains it, otherwise the batch will be empty.
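As a rough sketch of that triggering (the socket source on localhost:9999 and the local[2] master are assumptions for illustration), the 2-second batch duration is set when the StreamingContext is created:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
// The batch duration (2 seconds here) is the trigger: one job per interval,
// whether or not any data arrived during it.
val ssc = new StreamingContext(conf, Seconds(2))

// Hypothetical receiver-based source; the word-count pipeline above builds on this words stream.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

ssc.start()            // jobs are generated every 2 seconds from here on
ssc.awaitTermination()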
A DStream is actually a sequence of RDDs, as the DStream source code shows:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
The number of executors depends on the partitions as well as on the configuration provided. There are normally two types of allocation in the configuration: static allocation and dynamic allocation. You can read about them here:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
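For illustration, a hedged sketch of the two modes (the concrete numbers are placeholders, and dynamic allocation assumes the external shuffle service is available on your cluster):

import org.apache.spark.SparkConf

// Static allocation: a fixed number of executors for the lifetime of the application.
val staticConf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")

// Dynamic allocation: executors are added and removed based on the task backlog.
val dynamicConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")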

Split an RDD into multiple RDDs

I have a pair RDD[String, String] where the key is a string and the value is HTML. I want to split this RDD into n RDDs based on the n keys and store them in HDFS.
htmlRDD = [key1,html
key2,html
key3,html
key4,html
........]
I want to split this RDD based on the keys and store the HTML from each resulting RDD individually on HDFS. Why do I want to do that? When I try to store the HTML from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator.
I'm doing this in Scala:
htmlRDD.saveAsHadoopFile("hdfs:///Path/", classOf[String], classOf[String], classOf[Formatter])
You can also try this instead of splitting the RDD:
htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/")
I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object very well.
Spark saves each RDD partition into one HDFS file, so to achieve good parallelism your source RDD should have many partitions (how many depends on the size of the whole dataset). So I think you want not to split your RDD into several RDDs, but rather to have one RDD with many partitions.
You can do that with repartition() or coalesce().
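A rough sketch of that idea (the partition count and output path are placeholders to adjust to your data size):

// Keep a single RDD, but give it enough partitions for parallel writes.
val balanced = htmlRDD.repartition(200)

// Each partition is written as one part-* file under the output directory.
balanced.saveAsTextFile("hdfs:///Path/html_output")

// coalesce(n) reduces the number of partitions, avoiding a full shuffle when shrinking.
// val fewer = htmlRDD.coalesce(50)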

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same keys to the same machine?

I want to do a join between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce data shuffle, I wish I could add a pre-distribution phase so that partitions with the same keys end up on the same machine, hopefully reducing the shuffle time.
I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself?
I know that when I join two RDDs and one is preprocessed with partitionBy, Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what will happen if I use partitionBy on both RDDs and then do the join.
If you use the same partitioner for both RDDs you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same nodes.
Nevertheless, the performance should be better than if the two RDDs had different partitioners.
I have found the section "Speeding Up Joins by Assigning a Known Partitioner" helpful for understanding the effect of using the same partitioner for both RDDs:
Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
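Putting that advice together, a minimal sketch might look like this (leftRaw and rightRaw are hypothetical pair RDDs standing in for your two inputs, and the 100-partition count is a placeholder):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val partitioner = new HashPartitioner(100)   // same partitioner for both sides

// Pre-distribute both RDDs with the same partitioner and persist them.
val left  = leftRaw.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
val right = rightRaw.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

// Because both sides report the same partitioner, the join does not need to
// shuffle either side again.
val joined = left.join(right)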

Apache Spark lookup function

Reading the definition of the lookup method from https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.PairRDDFunctions:
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
How can I ensure that the RDD has a known partitioner? I understand that an RDD is partitioned across nodes in a cluster, but what is meant by the statement "only searching the partition that the key maps to"?
A number of operations (especially on key-value pairs) automatically set up a partitioner when they are executed, since it can increase efficiency by cutting down on network traffic. For example (from PairRDDFunctions):
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
}
Note the creation of a HashPartitioner. You can check rdd.partitioner if you want to see whether your RDD has one, and you can also set one via partitionBy.
A Partitioner maps keys to partition indexes. If a key-value RDD is partitioned by a Partitioner, it means that each key is placed in the partition that is assigned to it by the Partitioner.
This is great for lookup! You can use the Partitioner to tell you the partition that this key belongs to, and then you only need to look at that partition of the RDD. (This can mean that the rest of the RDD does not even need to be computed!)
How can I ensure that the RDD has a known partitioner?
You can check that rdd.partitioner is not None. (Operations that need to locate keys, like groupByKey and join, partition the RDD for you.) You can use rdd.partitionBy to assign your own Partitioner and re-shuffle the RDD by it.
Each RDD can, optionally, define a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).
Indeed, in some PairRDDFunctions you can specify the partitioner, frequently as the last parameter.
Or, if your RDD has no partitioner, you can use the partitionBy method to set one.
The lookup method goes directly to the right partition if your RDD already has a partitioner, or scans all the partitions in parallel if it doesn't.
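To tie the pieces together, a small spark-shell style sketch (the sample data and the 8-partition HashPartitioner are placeholders):

import org.apache.spark.HashPartitioner

// parallelize() alone does not set a partitioner.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.partitioner                       // None -> lookup scans all partitions

// Assign a known partitioner and cache the shuffled result.
val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
partitioned.partitioner                 // Some(HashPartitioner) -> lookup reads one partition

partitioned.lookup("b")                 // Seq(2)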