Writing RDD to Elasticsearch taking time in Spark Streaming (Scala)

I developed a Spark Streaming application (receiver approach) that reads data from Kafka, processes it, and writes it into Elasticsearch.
The same logic was originally developed in Java (we are now rewriting it in Spark/Scala), and when we compare against the Java version, Spark is not doing well.
What I have observed is that the write to ES is what takes the time.
Here is my high-level code:
val kafkaStreams: util.List[DStream[String]] = new util.ArrayList[DStream[String]]
for (i <- 0 until topic_threads) {
  val data = KafkaUtils.createStream(ssc, kafkaConf, topic).map(line => line._2)
  kafkaStreams.add(data)
}

// the union below improves performance, as per the Spark 1.6.2 documentation
val unifiedStream = ssc.union(kafkaStreams)
unifiedStream.persist(StorageLevel.MEMORY_ONLY)

if (flagY) {
  val dataES = unifiedStream.map(rdd => processData(rdd))
  dataES.foreachRDD(rdd => {
    ElasticUtils.saveToEs(rdd, index_Name, index_Type)
  })
}
In the processData method I am just parsing the data that we have read from Kafka.
Could anyone share their experience or suggestions to improve the performance of the Spark Streaming (Scala) receiver approach?
Because of this low performance, batches are piling up and the batch scheduling delay keeps increasing.
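For what it's worth, here is a minimal sketch of how the write could be done with the elasticsearch-hadoop (elasticsearch-spark) connector. ElasticUtils.saveToEs is the poster's own helper, so the connector calls, configuration keys and values below are assumptions meant to illustrate the usual first knobs to try (bulk sizing and write parallelism), not the original implementation:

import org.apache.spark.streaming.dstream.DStream
import org.elasticsearch.spark.rdd.EsSpark

// hypothetical replacement for ElasticUtils.saveToEs, assuming JSON documents
def writeToEs(stream: DStream[String], indexName: String, indexType: String): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val cfg = Map(
        "es.nodes"               -> "es-host:9200",  // assumption: point at your cluster
        "es.batch.size.entries"  -> "5000",          // documents per bulk request
        "es.batch.size.bytes"    -> "5mb",           // bytes per bulk request
        "es.batch.write.refresh" -> "false"          // skip index refresh after each bulk
      )
      // more partitions = more concurrent bulk writers
      EsSpark.saveJsonToEs(rdd.repartition(8), s"$indexName/$indexType", cfg)
    }
  }
}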

Related

Spark streaming slow down when using large broadcast objects in UDF

I do stream processing from Event Hub using Spark and faced the following problem. For each incoming message I need to do some calculations (stateless). The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object size is about 50 MB, but in the future it could be larger. In order not to send the object to the workers each time, I broadcast it and then register a UDF. But it doesn't help: the batch duration grows significantly beyond the latency we can tolerate. I figured out that the batch duration depends solely on the object size. For testing purposes I made the object smaller while keeping the computation complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC red (more than 10% of the work is garbage collection). This contradicts my understanding of broadcasting: when an object is broadcast, it should be downloaded into the workers' memory and persisted there without additional overhead.
I managed to write a business-domain-agnostic example. Here, when n is small, the batch duration is about 0.3 seconds, but when n = 6000 (144 MB) the batch duration becomes 1.5 seconds (×5), and 4 seconds when n = 10000. The computation complexity does not depend on the size of the object, so it means that using the broadcast object has a huge overhead. Please help me find a solution.
// emulate a large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray

// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)

// register the UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // emulate a very efficient algorithm that requires the large data structure
  val i = (num + 1) / (num + 1)
  objBd.value(i)(i)
})

// do stream processing
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()
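One commonly suggested alternative, sketched here as an assumption (it is not part of the original post), is to build the large structure lazily inside a singleton object so that each executor JVM constructs it once and the UDF closure captures only a small reference instead of the broadcast value; PrecomputedData and myUdfLazy are hypothetical names:

// hedged sketch: construct the structure once per executor JVM, no broadcast involved
object PrecomputedData {
  lazy val table: Array[Array[Int]] = {
    val n = 10000
    (1 to n).map(_ => (1 to n).toArray).toArray  // same emulated object as above
  }
}

val myUdfLazy = spark.udf.register("myUdfLazy", (num: Int) => {
  val i = (num + 1) / (num + 1)
  PrecomputedData.table(i)(i)  // only the object reference is captured by the closure
})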

Spark sliding window performance

I've set up a pipeline for incoming events from a stream in Apache Kafka.
Spark connects to Kafka, gets the stream from a topic and processes some "simple" aggregation tasks.
As I'm trying to build a service that should refresh with low latency (below 1 second), I've built a simple Spark Streaming app in Scala.
val windowing = events.window(Seconds(30), Seconds(1))

val spark = SparkSession
  .builder()
  .appName("Main Processor")
  .getOrCreate()

import spark.implicits._

// go into the RDDs of the DStream
windowing.foreachRDD(rdd => {
  // convert the RDD of JSON strings into a DataFrame
  val df = spark.read.json(rdd)
  // process only if the received DataFrame is not empty
  if (!df.head(1).isEmpty) {
    // create a view for Spark SQL
    val rdf = df.select("user_id", "page_url")
    rdf.createOrReplaceTempView("currentView")
    val countDF = spark.sql("select count(distinct user_id) as sessions from currentView")
    countDF.show()
  }
})
It works as expected. My concern at this point is about performance. Spark is running on a 4-CPU Ubuntu server for testing purposes.
The CPU usage is about 35% all the time. I'm wondering: if the incoming data from the stream reaches, say, 500 msg/s, how will the CPU usage evolve? Will it grow exponentially or linearly?
If you can share your experience with Apache Spark in that kind of situation I'd appreciate it.
A last open question: if I set the sliding window interval to 500 ms (as I'd like to), will this blow up? I mean, Spark Streaming is still fairly young and its micro-batch architecture may be a limitation for truly real-time data processing, isn't it?
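Purely as an illustration (not part of the original question), the same sliding distinct-user count can also be expressed as a Structured Streaming windowed aggregation, which avoids converting every micro-batch RDD to a DataFrame by hand; the eventsDf name, the timestamp/user_id columns and the use of the approx_count_distinct approximation are assumptions:

import org.apache.spark.sql.functions.{approx_count_distinct, col, window}

// hedged sketch: 30-second window sliding every 1 second over an event-time column,
// assuming `eventsDf` is a streaming DataFrame with "timestamp" and "user_id" columns
val sessionsPerWindow = eventsDf
  .withWatermark("timestamp", "1 minute")
  .groupBy(window(col("timestamp"), "30 seconds", "1 second"))
  .agg(approx_count_distinct("user_id").as("sessions"))

sessionsPerWindow.writeStream
  .outputMode("update")
  .format("console")
  .start()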

How to count number of rows per window in streaming queries?

Scenario: working with Spark Structured Streaming. I have to implement an "info" dataset about how many rows I've processed in the last "window".
A little bit of code.
val invalidData: Dataset[String] =
  parsedData.filter(record => !record.isValid).map(record => record.rawInput)

val validData: Dataset[FlatOutput] = parsedData
  .filter(record => record.isValid)
I have two Datasets. But since I'm working on a streaming source I cannot perform a .count (the error raised is: Queries with streaming sources must be executed with writeStream.start()).
val infoDataset = validData
  .select(count("*") as "valid")
But then a new error occurs: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark, and I don't want to set the outputMode to complete since I don't want the total count from the beginning, just the last "windowed" batch.
Unfortunately I don't have any column that I could register as a watermark for these datasets.
Is there a way to know how many rows are processed in each iteration?
It seems like StreamingQueryStatus could be of some help.
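A sketch of that direction, offered as an assumption rather than taken from the original post: the per-trigger row count is reported through the query progress, so a StreamingQueryListener can log how many rows each micro-batch processed without adding a streaming aggregation:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// hedged sketch: log the rows ingested by the last trigger from the progress events
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"rows processed in last trigger: ${event.progress.numInputRows}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})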

What is LocalTableScan in Spark Structured Streaming for?

Does anyone know what LocalTableScan corresponds to in Spark Structured Streaming?
I'm trying to understand a strange behavior that I observed in my Spark Structured Streaming application, which runs in local[*] mode.
I have 8 cores on my machine. While the majority of my batches have 8 partitions, every once in a while I get 16, 32, 56 and so on partitions/tasks. I notice it is always a multiple of 8. I have noticed, when opening the stages tab, that when this happens it is because there are multiple LocalTableScans.
That is, if I have 2 LocalTableScans then the mini-batch job will have 16 tasks/partitions, and so on.
To give a bit of context, because I suspect the behavior might come from it: I am using a MemoryStream.
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
val rdf = df.mapPartitions{ it => {.....}}(RowEncoder.apply(StructType(List(StructField("blob", StringType, false)))))
I have a future that feeds my memory stream as such right after:
Future {
  blocking {
    for (i <- 1 to 100000) {
      rows.addData(maps)
      Thread.sleep(3000)
    }
  }
}
and then my query:
rdf.writeStream
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .format("console")
  .outputMode("append")
  .queryName("SourceConvertor1")
  .start()
  .awaitTermination()
Please, any suggestions? Hints ?
LocalTableScan indicates a scan of a table held in memory on the driver, which is what your code produces by feeding the query from a MemoryStream.
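If the variable task count per micro-batch is itself a problem, one possible mitigation, sketched here as an assumption rather than a confirmed fix, is to pin the partition count by repartitioning the derived DataFrame before writing it out:

// hedged sketch: fix the number of tasks per micro-batch regardless of how many
// in-memory batches were added between triggers
rdf.repartition(8)
  .writeStream
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .format("console")
  .outputMode("append")
  .queryName("SourceConvertor1")
  .start()
  .awaitTermination()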

Spark streaming multiple sources, reload dataframe

I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.
I can load the postgres table with something like:
val sqlContext = new SQLContext(sc)
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))
...
val broadcasted = sc.broadcast(data.collect())
And later I can join against it like this:
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform { rdd => rdd.leftOuterJoin(db) }
I would like to keep my current datastream running and still reload this table every 6 hours. Since Apache Spark doesn't currently support multiple running contexts, how can I accomplish this? Is there any workaround? Or will I need to restart the server each time I want to reload the data? It seems like such a simple use case... :/
In my humble opinion, reloading another data source during the transformations on DStreams is not recommended by design.
Compared to traditional stateful stream-processing models, DStreams are designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.
Because the transformations on DStreams are deterministic, this design enables quick recovery from faults by recomputation; refreshing the data would introduce side effects into that recovery/recomputation.
One workaround is to postpone the query to the output operations, for example inside foreachRDD(func).
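A minimal sketch of that workaround, using hypothetical helper names of my own (it is not code from the answer): keep the lookup data in a reference that a background task refreshes on a schedule, and resolve it inside foreachRDD so every batch joins against the latest snapshot:

import java.util.concurrent.atomic.AtomicReference
import java.util.concurrent.{Executors, TimeUnit}

// loadFromPostgres() is a hypothetical helper that runs the JDBC query shown
// above and collects the result to the driver as key/value pairs
val lookup = new AtomicReference[Seq[(String, String)]](loadFromPostgres())

// refresh the lookup data every 6 hours on a background thread
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
  new Runnable { def run(): Unit = lookup.set(loadFromPostgres()) },
  6, 6, TimeUnit.HOURS)

stream_data.foreachRDD { rdd =>
  val db = sc.parallelize(lookup.get())  // snapshot used by this batch only
  val joined = rdd.leftOuterJoin(db)
  // ... output operations on `joined` go here
}

This keeps a single streaming context running; only the driver-side snapshot changes every 6 hours.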