How are RDDs created in Structured Streaming in Spark? - spark-structured-streaming

How are RDDs created in Structured Streaming in Spark? In a DStream we have one RDD for every batch: is the RDD created as soon as data is available, or when the trigger happens? And how does Spark physically distribute RDDs across executors?

Internally, a DStream is represented as a sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval.
In the word count example:
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// `words` is a DStream[String] built earlier in the example (e.g. lines.flatMap(_.split(" ")))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks generated during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD, and each partition is a task in Spark. blockInterval == batchInterval would mean that a single partition is created and it is probably processed locally.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
A DStream executes as soon as the trigger happens: if your batch interval is 2 seconds, a job is triggered every 2 seconds. The triggering point is not data availability but the batch duration; if data is present at that time, the DStream for that batch contains it, otherwise the RDD will be empty.
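As a rough sketch of how the batch interval (the trigger period) and spark.streaming.blockInterval interact; the socket source, port 9999, and the 500 ms block interval below are illustrative choices, not taken from the question:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BatchVsBlockInterval")
  .setMaster("local[2]")
  .set("spark.streaming.blockInterval", "500ms") // 2000 ms batch / 500 ms block = about 4 blocks, i.e. 4 RDD partitions, per receiver
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = trigger period
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  // A job runs every 2 seconds whether or not data arrived; an empty batch simply yields an empty RDD.
  println(s"Batch RDD has ${rdd.getNumPartitions} partitions")
}
ssc.start()
ssc.awaitTermination()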
A DStream is in fact a sequence of RDDs, as can be seen in the DStream source code:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
The number of executors depends on the partitioning as well as on the configuration provided.
There are normally two types of allocation in the configuration: static allocation and dynamic allocation.
You can read about them here:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
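For reference, this is roughly how the two modes look when expressed as SparkConf settings; the concrete numbers are placeholders, not recommendations:
import org.apache.spark.SparkConf

// Static allocation: a fixed number of executors is requested up front.
val staticConf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")

// Dynamic allocation: Spark scales the executor count with the workload.
val dynamicConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  .set("spark.shuffle.service.enabled", "true") // the external shuffle service is required for dynamic allocation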

Related

Creating Global Dataframe for Spark Streaming job

I have a Spark Streaming job that continuously receives messages from Kafka. The streaming job will:
1. Filter the newly received messages.
2. Append the messages that pass the filter to the global dataframe.
3. Once the global dataframe has collected 1000 rows of records, perform a sum operation.
My questions are:
How do I create the global dataframe? Is it simply a matter of creating a dataframe outside, before the loop of
directKafkaStream.foreachRDD { .... }
How do I efficiently handle the global dataframe operation, i.e. step 3 above? Do I have to embed that operation inside the foreachRDD loop?
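One possible approach, sketched under the assumption that the messages are plain strings and that directKafkaStream is the (key, value) stream from the question: keep a driver-side Option[DataFrame], union each filtered batch into it inside foreachRDD, and aggregate and reset once 1000 rows have accumulated. The isNotNull filter and the summed column are placeholders.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

// Driver-side holder for the accumulated ("global") dataframe; the body of foreachRDD runs on the driver.
var globalDF: Option[DataFrame] = None

directKafkaStream.foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  // Step 1: filter the newly received messages (the predicate is a placeholder).
  val batchDF = rdd.map(_._2).toDF("value").filter($"value".isNotNull)

  // Step 2: append the filtered rows to the global dataframe.
  globalDF = globalDF.map(_.union(batchDF)).orElse(Some(batchDF))

  // Step 3: once 1000 rows have accumulated, run the aggregation and reset.
  globalDF.foreach { df =>
    if (df.count() >= 1000) {
      df.agg(sum($"value".cast("double"))).show()
      globalDF = None
    }
  }
}
Note that the union lineage grows with every batch, so in practice you would want to checkpoint or cache the accumulated dataframe periodically.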

How to convert DStream of number of RDDs to Single RDD

Basically I am consuming data from multiple Kafka topics using a single Spark Streaming consumer [Direct Approach].
val dStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet).map(_._2)
The batch interval is 30 seconds.
I have a couple of questions here.
Will the DStream contain multiple RDDs instead of a single RDD when I call foreachRDD on the DStream? Will each topic create a separate RDD?
If yes, I want to union all the RDDs into a single RDD and then process the data. How do I do that?
If my processing time is longer than the batch interval, will the DStream contain more than one RDD?
I tried to union the DStream RDDs into a single RDD using the approach below. First of all, is my understanding correct? If the DStream always returns a single RDD, then the code below is not necessary.
Sample Code:
import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

var dStreamRDDList = new ListBuffer[RDD[String]]
dStream.foreachRDD(rdd => {
  dStreamRDDList += rdd
})
val joinedRDD = ssc.sparkContext.union(dStreamRDDList).cache()
//THEN PROCESS USING joinedRDD
//Convert joinedRDD to DF, then apply aggregate operations using DF API.
Will the DStream contain multiple RDDs instead of a single RDD when I call foreachRDD on the DStream? Will each topic create a separate RDD?
No. Even though you have multiple topics, you'll have a single RDD at any given batch interval.
If my processing time is longer than the batch interval, will the DStream contain more than one RDD?
No. If your processing time is longer than the batch interval, all that happens is that the topic offsets keep being read; processing of the next batch only begins once the previous job has finished.
As a side note, make sure you actually need foreachRDD, or whether perhaps you're misusing the DStream API (disclaimer: I am the author of that post).
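To make that concrete, here is a minimal sketch of handling each batch's single RDD directly inside foreachRDD, converting it to a DataFrame there instead of collecting RDDs into a list; the column name and the aggregation are placeholders:
import org.apache.spark.sql.SparkSession

// Each batch interval produces exactly one RDD covering all subscribed topics,
// so the per-batch work can live inside foreachRDD; no cross-batch union is needed.
dStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val df = rdd.toDF("value")          // one DataFrame per 30-second batch
    df.groupBy($"value").count().show() // placeholder aggregation using the DataFrame API
  }
}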

Split an RDD into multiple RDDs

I have a pair RDD[String, String] where the key is a string and the value is HTML. I want to split this RDD into n RDDs based on the n keys and store them in HDFS.
htmlRDD = [key1,html
key2,html
key3,html
key4,html
........]
Split this RDD based on the keys and store the HTML from each RDD individually on HDFS. Why do I want to do that? When I try to store the HTML from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator.
I'm doing this in Scala.
htmlRDD.saveAsHadoopFile("hdfs:///Path/",classOf[String],classOf[String], classOf[Formatter])
You can also try this instead of splitting the RDD:
htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/");
I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object just fine.
Spark saves each RDD partition into one HDFS file partition. So to achieve good parallelism your source RDD should have many partitions (how many actually depends on the size of the whole data set). So I think you want to split your RDD not into several RDDs, but rather to have an RDD with many partitions.
You can do that with repartition() or coalesce().
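If you genuinely need one output location per key, one common technique is a custom MultipleTextOutputFormat that routes each record to a file named after its key, written in a single pass instead of materializing n RDDs. A sketch follows; KeyBasedOutput and numKeys are illustrative names, not part of your code:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Route each record to a sub-directory named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"  // one sub-directory per key
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()        // do not repeat the key inside the written records
}

htmlRDD
  .partitionBy(new HashPartitioner(numKeys)) // numKeys: expected number of distinct keys (assumed)
  .saveAsHadoopFile(
    "hdfs:///Path/",
    classOf[String],
    classOf[String],
    classOf[KeyBasedOutput])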

Apache Spark + RDD + persist() doubts

I am new to Apache Spark and am using the Scala API. I have two questions regarding RDDs.
How do I persist only some partitions of an RDD, instead of the entire RDD? (The core RDD implementation provides the rdd.persist() and rdd.cache() methods, but I do not want to persist the whole RDD; I am interested in persisting only some of its partitions.)
How do I create one empty partition while creating each RDD? (I am using the repartition and textFile transformations. In these cases I can get the expected number of partitions, but I also want one empty partition for each RDD.)
Any help is appreciated.
Thanks in advance
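Neither of these exists as a direct API, but both can be approximated. The sketch below assumes an RDD[String] named rdd (for example from textFile); interestingPartitions is an illustrative name:
// Spark cannot persist individual partitions of an RDD, but you can carve the
// partitions you care about into a separate RDD and persist only that subset.
val interestingPartitions = Set(0, 3) // partition indices to keep cached (illustrative)
val subset = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (interestingPartitions.contains(idx)) iter else Iterator.empty
}
subset.persist() // only the data of the selected partitions ends up in the cache

// One extra empty partition: union with an empty RDD that has exactly one partition.
val withEmptyPartition = rdd.union(sc.parallelize(Seq.empty[String], numSlices = 1))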

Spark aggregateByKey partitioner order

If I apply a hash partitioner to Spark's aggregateByKey function, i.e. myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp),
does myRDD get repartitioned first, before its key/value pairs are aggregated using combOp and mergeOp? Or does myRDD go through combOp and mergeOp first, with the resulting RDD then repartitioned using the HashPartitioner?
aggregateByKey applies map-side aggregation before the eventual shuffle. Since every partition is processed sequentially, the only operations applied in this phase are the initialization (creating the zeroValue) and combOp. The goal of mergeOp is to combine aggregation buffers, so it is not used before the shuffle.
If the input RDD is a ShuffledRDD with the same partitioner as the one requested for aggregateByKey, then the data is not shuffled at all and is aggregated locally using mapPartitions.
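As a small illustration of that ordering, with invented data and a per-key sum (the 20-partition HashPartitioner matches the question):
import org.apache.spark.HashPartitioner

val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 4)

// combOp folds values into the per-partition buffer before the shuffle;
// mergeOp only combines buffers coming from different partitions after it.
val combOp  = (acc: Int, v: Int) => acc + v
val mergeOp = (a: Int, b: Int)   => a + b

val aggregated = myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)
aggregated.collect().foreach(println) // (a,4) and (b,6), spread over 20 partitions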