Spark Streaming: Aggregating values across batches - scala

We have a Spark Streaming job that calculates some values in each batch. What we need to do now is aggregate those values across ALL batches. What is the best strategy for this in Spark Streaming? Should we use 'Spark Accumulators' for this?
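One commonly used DStream mechanism for carrying an aggregate across batches is updateStateByKey; below is a minimal sketch, assuming an existing StreamingContext named ssc and a per-batch keyed DStream named perBatchCounts (both names are illustrative, not from the original job).

// Sketch: updateStateByKey keeps a running aggregate across batches.
// Stateful DStream operations require checkpointing; the path is illustrative.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

// perBatchCounts: DStream[(String, Long)] computed in each batch (illustrative name)
val runningTotals = perBatchCounts.updateStateByKey[Long] {
  (newValues: Seq[Long], runningTotal: Option[Long]) =>
    Some(runningTotal.getOrElse(0L) + newValues.sum)   // fold this batch into the running total
}
runningTotals.print()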

Related

Measure time taken for repartitioning of spark structured streaming dataframe

I want to do a performance comparison in which I measure the time taken to repartition a Spark Structured Streaming DataFrame, e.g. the time taken for the following step:
df.repartition($"colname")
Any ideas on how this can be achieved?
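One way to approach this, sketched below under the assumption of an existing SparkSession named spark: repartition is a lazy transformation, so it only runs as part of each micro-batch; a StreamingQueryListener reports per-batch phase durations (durationMs), while stage-level shuffle timings for the repartition itself are visible in the Spark UI.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs breaks each trigger down into phases such as addBatch and queryPlanning
    println(s"batch=${event.progress.batchId} durations(ms)=${event.progress.durationMs}")
  }
})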

Ingesting unique records in Kafka-Spark Streaming

I have a Kafka topic receiving 10K events per minute and a Spark Streaming 2.3 consumer written in Scala that receives them and ingests them into Cassandra. Incoming events are JSON with a 'userid' field among others. However, if an event with the same userid comes along again (even with a different message body), I don't want it to be ingested into Cassandra. The Cassandra table grows every minute and day, so looking up all userids encountered so far by retrieving the table into an in-memory Spark DataFrame is not feasible, as the table will become huge. How can I best ingest only unique records?
Can updateStateByKey work? And how long can state be maintained? Because if the same userid comes back after one year, I still don't want to ingest it into Cassandra.
Use an external low-latency DB like Aerospike, or, if the rate of duplicates is low, an in-memory bloom/cuckoo filter (roughly 4 GB for 1 year at a 10K-per-minute rate), rechecking matches against Cassandra so that events are not discarded in case of false positives.
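As a rough illustration of the bloom-filter-plus-recheck idea above, here is a sketch using Guava's BloomFilter. The sizing numbers and the existsInCassandra lookup are illustrative assumptions, and in a real Spark job the filter would have to live in one place (for example on the driver, or behind an external service) rather than independently on each executor.

import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// ~5.3 billion userids in a year at 10K events/min; 1% false-positive rate (illustrative sizing).
val expectedIds = 10000L * 60 * 24 * 365
val seenUserIds = BloomFilter.create(
  Funnels.stringFunnel(StandardCharsets.UTF_8), expectedIds, 0.01)

// Hypothetical lookup against Cassandra, only used to confirm possible matches.
def existsInCassandra(userid: String): Boolean = ???

def shouldIngest(userid: String): Boolean = {
  if (!seenUserIds.mightContain(userid)) {
    seenUserIds.put(userid)
    true                        // definitely not seen before
  } else {
    !existsInCassandra(userid)  // might be a false positive, so recheck the real store
  }
}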

Creating Global Dataframe for Spark Streaming job

I have a Spark Streaming job that continuously receives messages from Kafka. The streaming job will:
Filter the newly received messages.
Append the messages that pass the filter to a global DataFrame.
When the global DataFrame has received 1000 rows of records, do a sum operation.
My questions are:
How do I create the global DataFrame? Is it simply a matter of creating a DataFrame outside, before the loop of
directKafkaStream.foreachRDD { .... }
How do I effectively handle the global DataFrame operation from step 3? Do I have to embed it inside the foreachRDD loop?
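For reference, a rough sketch of the driver-side pattern the question describes: a DataFrame variable declared outside foreachRDD, unioned with each batch's filtered rows, and summed once 1000 rows have accumulated. This assumes a SparkSession named spark, a Kafka value that is a JSON string, and an illustrative numeric column named value; note also that the union lineage grows with every batch, so resetting (or checkpointing) it periodically matters in practice.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

var globalDF: Option[DataFrame] = None   // the "global" DataFrame lives on the driver

directKafkaStream.foreachRDD { rdd =>
  import spark.implicits._
  val batchDF  = spark.read.json(rdd.map(_.value()).toDS())   // parse the JSON messages (kafka010 ConsumerRecord assumed)
  val filtered = batchDF.filter($"value" > 0)                  // step 1: filter (condition is illustrative)

  globalDF = Some(globalDF.map(_.union(filtered)).getOrElse(filtered))  // step 2: append

  if (globalDF.exists(_.count() >= 1000)) {                    // step 3: sum once 1000 rows have accumulated
    globalDF.get.agg(sum($"value")).show()
    globalDF = None                                            // start a new accumulation
  }
}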

Connecting Spark streaming to streamsets input

I was wondering if it would be possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not supported as a destination in the StreamSets connectors list: https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.
The best way to process data coming in from StreamSets Data Collector (SDC) in Apache Spark Streaming would be to write the data out to a Kafka topic and read it from there. This lets you decouple Spark Streaming from SDC, so that each can proceed at its own processing rate.
SDC microbatches are defined by record count, while Spark Streaming microbatches are dictated by time. This means that each SDC batch may not (and probably will not) correspond to a Spark Streaming batch (most likely a Spark Streaming batch will contain data from several SDC batches). SDC "commits" each batch once it is sent to the destination, so writing batches directly to Spark Streaming would require each SDC batch to correspond to a Spark Streaming batch in order to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches - so to recover from a situation like this, you'd really have to write to something like Kafka that allows you to re-process the batches. So having a direct connector that writes from SDC to Spark Streaming would be complex and likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.
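For completeness, a sketch of the Spark Streaming side of that SDC -> Kafka -> Spark Streaming pipeline, using the kafka010 direct stream. The broker address, topic name (sdc-output) and group id are placeholders, and ssc is assumed to be an existing StreamingContext.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka-broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "sdc-spark-poc",
  "auto.offset.reset"  -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("sdc-output"), kafkaParams))

stream.map(_.value()).foreachRDD { rdd =>
  // each micro-batch now carries the records SDC wrote to the topic
  rdd.take(5).foreach(println)
}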

Apache spark + RDD + persist() doubts

I am new to Apache Spark and am using the Scala API. I have 2 questions regarding RDDs.
How can I persist only some partitions of an RDD, instead of the entire RDD? (The core RDD implementation provides the rdd.persist() and rdd.cache() methods, but I do not want to persist the entire RDD; I am interested in persisting only some partitions.)
How can I create one empty partition while creating each RDD? (I am using the repartition and textFile transformations. In these cases I get the expected number of partitions, but I also want one empty partition for each RDD.)
Any help is appreciated.
Thanks in advance
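For reference, a rough sketch of workarounds sometimes used for both points; the path, partition indices, and element type are illustrative, and note that core Spark only persists at RDD granularity, so "persisting some partitions" here means deriving and caching a sub-RDD.

// 1) Derive an RDD that keeps only the partitions of interest, and persist that.
val rdd = sc.textFile("hdfs:///data/input.txt", 4)
val interesting = Set(0, 2)                               // indices of partitions to keep (illustrative)
val subset = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (interesting.contains(idx)) iter else Iterator.empty
}
subset.persist()   // only the selected partitions contribute data to the cache

// 2) Add one extra, empty partition by unioning with an empty single-partition RDD.
val withEmptyPartition = rdd.union(sc.parallelize(Seq.empty[String], 1))
println(withEmptyPartition.getNumPartitions)              // original partition count + 1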