Creating Global Dataframe for Spark Streaming job - scala

I have a Spark Streaming job that continuously receives messages from Kafka. The streaming job will:
1. Filter each newly received message.
2. Append the messages that pass the filter to a global DataFrame.
3. When the global DataFrame reaches 1000 rows, run a sum operation.
My questions are:
1. How do I create the global DataFrame? Is it simply a matter of creating a DataFrame outside, before the loop of
directKafkaStream.foreachRDD { .... }
2. How do I handle the operation on the global DataFrame efficiently (step 3 above)? Do I have to embed the operation in the foreachRDD loop?
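The three steps can be sketched without any Spark types. This is a minimal sketch, not a definitive answer: `ThresholdSummer` is a hypothetical helper (not part of Spark), the filtered values are assumed to be `Long`s, and the sum is assumed to cover everything buffered so far, after which the buffer is reset.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of Spark: buffers filtered values on the
// driver and emits a sum each time at least `threshold` values are buffered.
final class ThresholdSummer(threshold: Int) {
  private val buffer = ArrayBuffer.empty[Long]

  // Adds one batch of already-filtered values.
  // Returns Some(sum) and clears the buffer once the threshold is reached,
  // otherwise returns None and keeps buffering.
  def add(batch: Seq[Long]): Option[Long] = {
    buffer ++= batch
    if (buffer.size >= threshold) {
      val total = buffer.sum
      buffer.clear()
      Some(total)
    } else None
  }
}
```

Since the function passed to foreachRDD is invoked on the driver once per batch, an instance created before the loop could be fed from inside it, assuming the filtered rows of each batch are small enough to bring back with `rdd.collect()`.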

Related

Pyspark - Job getting bogged down when writing dataframe to kafka

I have a DataFrame that I'm transforming, and I want to write it to Kafka. But the job gets bogged down when writing the DataFrame to Kafka. There are close to 400 GB of shuffle reads.
How can I fix this?
last clogged code ;
yarn stuck job ;

Spark Streaming: Aggregating values across batches

We have a Spark Streaming job that calculates some values in each batch. What we need to do now is aggregate values across ALL batches. What is the best strategy for doing this in Spark Streaming? Should we use Spark accumulators for this?
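One commonly used option in the DStream API is `updateStateByKey`, which carries a per-key state from batch to batch. A minimal sketch of the update function it expects, assuming the per-batch values are `Long`s that should simply be added up; `RunningTotals` and `updateTotal` are hypothetical names:

```scala
object RunningTotals {
  // Shape expected by updateStateByKey:
  // (values seen for a key in the current batch, previous state) => new state.
  // Here the state is the running total across all batches so far.
  def updateTotal(newValues: Seq[Long], previous: Option[Long]): Option[Long] =
    Some(previous.getOrElse(0L) + newValues.sum)
}
```

On a keyed stream of type `DStream[(String, Long)]` this could be passed as `updateStateByKey(RunningTotals.updateTotal _)`. Note that `updateStateByKey` requires a checkpoint directory to be set on the StreamingContext.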

How RDDs are created in Structured streaming in spark?

How are RDDs created in Structured Streaming in Spark? With DStreams, an RDD is created for every batch; is it created as soon as data is available, or when the trigger happens? How are the RDDs physically distributed across executors?
Internally, a DStream is represented as a sequence of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval.
In the word count example:
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are the partitions of the RDD, and each partition becomes a task in Spark. If blockInterval == batchInterval, a single partition is created, and it is probably processed locally.
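The relationship between the two intervals can be written out as simple arithmetic. A minimal sketch for receiver-based streams, where each block becomes one partition; `BlockMath` and `blocksPerBatch` are hypothetical names, and the result is an approximation per receiver:

```scala
object BlockMath {
  // Approximate number of blocks (and therefore partitions and tasks)
  // that one receiver contributes to each batch.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "blockIntervalMs must be positive")
    math.max(1L, batchIntervalMs / blockIntervalMs)
  }
}
```

With a 2-second batch interval and the default `spark.streaming.blockInterval` of 200 ms, this gives 10 partitions per receiver per batch; with equal intervals it gives 1. The direct Kafka approach does not use receivers, so there the RDD partitions map one-to-one to Kafka partitions instead.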
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions.
A DStream job runs as soon as the trigger fires. If your batch interval is 2 seconds, a job is triggered every 2 seconds. The trigger is the batch duration, not the availability of data: if data arrived during the interval, the batch's RDD contains it; otherwise the RDD is empty.
A DStream is actually a sequence of RDDs, as the DStream source code shows:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
The number of executors used depends on the number of partitions as well as on the configuration provided.
There are normally two types of allocation in the configuration: static allocation and dynamic allocation.
You can read about them here:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

How to convert DStream of number of RDDs to Single RDD

Basically, I am consuming data from multiple Kafka topics using a single Spark Streaming consumer (direct approach).
val dStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet).map(_._2)
The batch interval is 30 seconds.
I have a couple of questions:
1. Will the DStream contain multiple RDDs instead of a single RDD when I call foreachRDD on it? Will each topic create a separate RDD?
2. If yes, I want to union all the RDDs into a single RDD and then process the data. How do I do that?
3. If my processing time is longer than the batch interval, will the DStream contain more than one RDD?
I tried to union the DStream's RDDs into a single RDD as shown below. First of all, is my understanding correct? If the DStream always returns a single RDD, then the code below is not necessary.
Sample Code:
import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD
var dStreamRDDList = new ListBuffer[RDD[String]]
dStream.foreachRDD { rdd =>
  dStreamRDDList += rdd
}
val joinedRDD = ssc.sparkContext.union(dStreamRDDList).cache()
// Then process using joinedRDD:
// convert joinedRDD to a DataFrame, then apply aggregate operations using the DataFrame API.
Will the DStream contain multiple RDDs instead of a single RDD when I call foreachRDD on it? Will each topic create a separate RDD?
No. Even though you have multiple topics, you'll have a single RDD at any given batch interval.
If my processing time is longer than the batch interval, will the DStream contain more than one RDD?
No. If your processing time is longer than the batch interval, the only thing that happens in the meantime is that the topic offsets are read. Processing of the next batch begins only once the previous job has finished.
As a side note, make sure you actually need to use foreachRDD and are not misusing the DStream API (disclaimer: I am the author of that post).
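Because each batch already arrives as a single RDD, the per-batch work can be done directly inside foreachRDD rather than by collecting RDDs into a list first. A minimal sketch, assuming Spark 2.x or later with a SparkSession available; `PerBatchProcessing`, `countMessages`, `attach`, and the column name `message` are hypothetical names chosen for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

object PerBatchProcessing {
  // Hypothetical per-batch step: converts one batch's RDD to a DataFrame
  // and counts how often each message occurs in that batch.
  def countMessages(spark: SparkSession, rdd: RDD[String]): DataFrame = {
    import spark.implicits._
    rdd.toDF("message").groupBy("message").count()
  }

  // Registers the step as an output operation. The function passed to
  // foreachRDD is called once per batch interval with that batch's RDD.
  def attach(spark: SparkSession, dStream: DStream[String]): Unit =
    dStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) countMessages(spark, rdd).show()
    }
}
```

Keeping the DataFrame logic in its own function makes it possible to try it on an ordinary RDD without starting a streaming context.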

Apache spark + RDD + persist() doubts

I am new to Apache Spark and am using the Scala API. I have two questions regarding RDDs.
1. How can I persist only some partitions of an RDD instead of the entire RDD? (The core RDD implementation provides the rdd.persist() and rdd.cache() methods, but I do not want to persist the entire RDD. I am only interested in persisting some partitions.)
2. How can I create one empty partition when creating each RDD? (I am using the repartition and textFile transformations. In these cases I get the expected number of partitions, but I also want one empty partition for each RDD.)
Any help is appreciated.
Thanks in advance.