Spark streaming slows down when using large broadcast objects in a UDF

I am doing stream processing from Event Hub with Spark and have run into the following problem. For each incoming message I need to do some stateless calculations. The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object is about 50MB now and could be larger in the future. To avoid sending the object to the workers each time, I broadcast it and then register a UDF. But it doesn't help: the batch duration grows significantly beyond the latency we can tolerate. I figured out that the batch duration depends solely on the object size. For testing purposes I made the object smaller while keeping the computational complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC time red (more than 10% of the work is garbage collection). This contradicts my understanding of broadcasting: once an object is broadcast, it should be downloaded into the workers' memory and kept there without additional overhead.
I managed to write a business-domain-agnostic example. When n is small, the batch duration is about 0.3 seconds, but when n = 6000 (144MB) it becomes 1.5 seconds (x5), and 4 seconds when n = 10000. The computational complexity doesn't depend on the size of the object, so using the broadcast object appears to carry a huge overhead. Please help me find a solution.
// emulate large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray
// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)
// register UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // emulate very efficient algorithm that requires large data structure
  val i = (num + 1) / (num + 1)
  objBd.value(i)(i)
})
// do stream processing
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()
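For reference, the sizes quoted in the question follow from the array dimensions. The sketch below is only a rough estimate: it counts 4 bytes per Int element and ignores JVM array headers and references, and the object and method names are made up for illustration.

```scala
// Rough payload estimate for an n x n Array[Array[Int]].
// Assumption: only the 4-byte Int elements are counted; JVM array
// headers and the outer array of references are ignored.
object PayloadSize {
  def intMatrixBytes(n: Int): Long = n.toLong * n.toLong * 4L

  def main(args: Array[String]): Unit = {
    for (n <- Seq(6000, 10000)) {
      val megabytes = intMatrixBytes(n) / 1000000L
      println(s"n = $n -> about $megabytes MB")
    }
  }
}
```

With n = 6000 this gives the 144MB mentioned above, and n = 10000 corresponds to roughly 400MB.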

Related

Iterative caching vs checkpointing in Spark

I have an iterative application running on Spark that I simplified to the following code:
var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000))
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  // Manipulate the RDD and cache the new RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
  // Actually compute the RDD and spawn a new job
  c = anRDD.count()
  println(s"Iteration: $iteration, Values: $c")
}
What happens to the memory allocation across consecutive jobs?
Does the current anRDD "override" the previous ones, or are they all kept in memory? In the long run, this could throw a memory exception.
Do localCheckpoint and cache behave differently? If localCheckpoint is used in place of cache, then since localCheckpoint truncates the RDD lineage, I would expect the previous RDDs to be overridden.
Unfortunately, it seems that Spark is not well suited to this kind of thing.
Your original implementation is not viable because on each iteration the newer RDD holds an internal reference to the older one, so all the RDDs pile up in memory.
localCheckpoint is an approximation of what you are trying to achieve. It does truncate the RDD's lineage, but you lose fault tolerance. This is clearly stated in the documentation for the method.
checkpoint is also an option. It is safe, but it dumps the data to HDFS on each iteration.
Consider redesigning the approach. Such hacks could bite sooner or later.
RDDs are immutable, so each transformation returns a new RDD.
All of the anRDD instances will be kept in memory. Running two iterations of your code shows that the id is different for each RDD.
So yes, in the long run this can throw a memory exception, and you should unpersist an RDD once you are done processing it.
localCheckpoint has a different use case than cache: it is used to truncate the lineage of an RDD. It does not write the RDD to a reliable distributed file system, which improves performance but reduces fault tolerance.
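The advice to unpersist could look roughly like the sketch below. It is a minimal illustration, assuming a local SparkSession; the object name and the `previous`/`current` variables are not from the question. Note that it only releases the cached blocks of the previous iteration. The lineage still grows with every iteration, which is the concern raised in the first answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IterativeUnpersist {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("iterative-unpersist")
      .getOrCreate()
    val sc = spark.sparkContext

    var current: RDD[Int] = sc.parallelize(0 to 1000)
    var c: Long = Int.MaxValue
    var iteration = 0
    while (c > 0) {
      iteration += 1
      val previous = current
      // Build and cache the RDD for this iteration.
      current = previous.zipWithIndex.filter(_._2 % 2 == 1).map(_._1).cache()
      // The action materializes and caches the new RDD.
      c = current.count()
      // The new RDD is cached now, so the previous cached blocks can be released.
      previous.unpersist()
      println(s"Iteration: $iteration, Values: $c")
    }
    spark.stop()
  }
}
```

Unpersisting after the count, rather than before it, lets the new RDD be computed from the cached copy of the previous one.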

checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an rdd as detailed in 'learning spark' book

In learning Spark, I read the following:
In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this case and just begin computing based on the persisted RDD. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persist()ed. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
So, I decided to try to see this in action with a simple program (below):
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
x.collect()
x.toDebugString // after checkpoint
I did not see what I expected after reading the above paragraph from the book. I saw the exact same output from toDebugString each time I invoked it, each time indicating two stages (I would have expected only one stage once the checkpoint had truncated the lineage), like this:
scala> x.toDebugString // after collect
res5: String =
(8) ShuffledRDD[1] at groupByKey at <console>:25 []
+-(8) ParallelCollectionRDD[0] at parallelize at <console>:23 []
I am wondering if the key thing I overlooked might be the word "may", as in "the scheduler MAY truncate the lineage". Is this truncation something that might happen, under other circumstances, with the same program I wrote above? Or is my little program not doing the right thing to force the lineage truncation? Thanks in advance for any insight you can provide!
I think you should call persist/checkpoint before the first collect.
The output you get looks correct to me for that code, since when Spark runs the first collect it does not know that it should persist or save anything.
You probably also need to keep the result of x.persist and then use it.
I suggest trying this:
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
// the checkpoint directory must be set before checkpoint() is called
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
// **Also maybe do val xx = x.persist(...) and use xx later.**
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
x.collect()
x.toDebugString // after second collect

Why does Spark repeat transformations after persist operations?

I have the following code. I call count to trigger the persist and materialize the transformations above it. But I noticed that the DAG and stages for the two count jobs show the first persist twice (whereas I expected the second persist to be used in the second count call).
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // So repartition here to be able perform shuffle operations later
  // another transformations and minor filtration
  .repartition(nrInputPartitions)
  // Firstly persist here since objects not fit in memory (Persist 67)
  .persist(StorageLevel.MEMORY_AND_DISK)
LOG.info(s"First count = " + df.count)
val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: " + df.rdd.getNumPartitions)
val rddPoints = df
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)
// Since we will perform count and partitionBy actions, compute all above transformations/ Second persist
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")
When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data into memory and spills to disk whatever doesn't fit.
You are doing multiple persists here. In Spark the memory cache is LRU, so later persists will overwrite previously cached data.
Even if you specify StorageLevel.MEMORY_AND_DISK, when data is evicted from cache memory by other cached data, Spark doesn't spill it to disk. So when you do the next count, it needs to re-evaluate the DAG to retrieve the partitions that aren't present in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid such recomputation.
Here is the whole scenario.
persist and cache are lazy, like transformations in Spark. After applying either of them, you need to run an action in order to actually cache the RDD or DataFrame in memory.
Secondly, the unit of cache or persist is the partition. When cache or persist is executed, it saves only those partitions that can be held in memory. For the remaining partitions that could not be saved in memory, the whole DAG is executed again when a new action is encountered.
Try:
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // So repartition here to be able perform shuffle operations later
  // another transformations and minor filtration
  .repartition(nrInputPartitions)
// Firstly persist here since objects not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)
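The point that persist is lazy and only takes effect once an action runs can be seen in a small sketch. This is an illustration only, assuming a local SparkSession and a synthetic Dataset from `spark.range` in place of the Parquet data from the question; the object name is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LazyPersistSketch {
  def main(args: Array[String]): Unit = {
    // Local session and synthetic data, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-persist-sketch")
      .getOrCreate()

    // persist() only marks the Dataset for caching; no job runs here.
    val ds = spark.range(0L, 1000000L)
      .repartition(8)
      .persist(StorageLevel.MEMORY_AND_DISK)
    println(s"Requested storage level: ${ds.storageLevel}")

    // The first action computes the partitions and fills the cache.
    println(s"First count = ${ds.count()}")

    // Later actions read the cached partitions where they are available.
    println(s"Second count = ${ds.count()}")

    ds.unpersist()
    spark.stop()
  }
}
```

The Storage tab of the Spark UI shows which fraction of the partitions is actually cached after the first count, which is one way to check whether recomputation is to be expected.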

Spark + Scala transformations, immutability & memory consumption overheads

I have gone through some videos on YouTube about the Spark architecture.
Even though lazy evaluation, resilient re-creation of data after failures, and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead caused by multiple transformations, since data is immutable.
If I understand the concept correctly, every transformation creates a new data set, so the memory requirement is multiplied by the number of transformations. If I use 10 transformations in my code, 10 data sets will be created and my memory consumption will increase tenfold.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The above example has three transformations: flatMap, map and reduceByKey. Does this imply that I need 3X memory for data of size X?
Is my understanding correct? Is caching the RDD the only solution to this issue?
Once I start caching, it may spill over to disk because of the large size, and performance would suffer from disk I/O. In that case, is the performance of Hadoop and Spark comparable?
EDIT:
From the answer and comments, I now understand lazy initialization and the pipelined process. My assumption of 3X memory, where X is the initial RDD size, is not accurate.
But is it possible to cache a 1X RDD in memory and update it over the pipeline? How does cache() work?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You see that it processes the line, then it processes the output of that line, followed by the reduction. So, there are not separate states stored for each transformation as you suggested. Instead, each piece of data is looped through the entire transformation up until a shuffle is needed, as can be seen by the DAG visualization from the UI:
That is the win from the laziness. As to Spark vs. Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then, there are a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
The memory requirements of Spark are not 10 times higher if you have 10 transformations in your Spark job. When you specify the transformation steps in a job, Spark builds a DAG that allows it to execute all the steps in the job. It then breaks the job down into stages. A stage is a sequence of transformations that Spark can execute on a dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It applies all the transformations in a stage together until it hits the end of the stage, so the memory pressure is unlikely to be 10 times higher unless each transformation leads to a shuffle (in which case it is probably a badly written job).
I would recommend watching this talk and going through the slides.
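The pipelining described in these answers can be mimicked without Spark using plain Scala iterators, which are also lazy. The sketch below is only an analogy, not how Spark is implemented; the object name, the `trace` helper, and the log messages are made up for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Runs a flatMap + map chain over a lazy iterator and records the order
  // in which the two functions are invoked.
  def trace(lines: Seq[String]): Seq[String] = {
    val log = ArrayBuffer.empty[String]
    val pairs = lines.iterator
      .flatMap { line => log += s"line: $line"; line.split(" ").iterator }
      .map { word => log += s"word: $word"; (word, 1) }
    // Nothing has run so far; consuming the iterator drives the whole chain.
    pairs.foreach(_ => ())
    log.toSeq
  }

  def main(args: Array[String]): Unit =
    trace(Seq("This is a test", "This is another test")).foreach(println)
}
```

Each line is split and its words are mapped before the next line is read, so no intermediate collection of all the words is ever built. This mirrors the interleaved println output in the first answer.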

Cassandra insert performance using spark-cassandra connector

I am a newbie to spark and cassandra. I am trying to insert into cassandra table using spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) = {
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
  }
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It takes only 20-40 milliseconds.
Can anyone tell me why the Spark connector takes this much time for the insert? Am I doing something wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
There are some serious problems with your "benchmark":
Your data set is so small that you're measuring mostly only the job setup time. Saving 100 entities should be of order of single milliseconds on a single node, not seconds. Also saving 100 entities gives JVM no chance to compile the code you run to optimized machine code.
You included spark context initialization in your measurement. JVM loads classes lazily, so the code for spark initialization is really called after the measurement is started. This is an extremely costly element, typically performed only once per whole spark application, not even per job.
You're performing the measurement only once per launch. This means you're even incorrectly measuring spark ctx setup and job setup time, because the JVM has to load all the classes for the first time and Hotspot has probably no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same spark context and discard a few initial ones, until you reach steady state performance.
BTW If you enable debug logging level, the connector logs the insert times for every partition in the executor logs.
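The measurement steps listed above can be sketched as a small timing harness. This is a generic illustration in plain Scala, not part of the connector: `SteadyStateTimer`, `timeMillis` and `measure` are made-up names, and the summed range is a stand-in workload where the saveToCassandra call on an already created SparkContext would go.

```scala
object SteadyStateTimer {
  /** Runs `body` once and returns the elapsed wall-clock time in milliseconds. */
  def timeMillis[A](body: => A): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  /** Discards `warmupRuns` executions, then times `measuredRuns` executions. */
  def measure[A](warmupRuns: Int, measuredRuns: Int)(body: => A): Seq[Double] = {
    (1 to warmupRuns).foreach(_ => body)
    (1 to measuredRuns).map(_ => timeMillis(body))
  }

  def main(args: Array[String]): Unit = {
    // One-time setup (for example creating the SparkContext) belongs here,
    // outside the timed region.
    val samples = measure(warmupRuns = 5, measuredRuns = 20) {
      (1 to 100000).map(_.toLong * 2).sum // stand-in workload
    }
    val sorted = samples.sorted
    println(f"median = ${sorted(sorted.length / 2)}%.3f ms, min = ${sorted.head}%.3f ms")
  }
}
```

Reporting the median of the measured runs rather than a single run keeps class loading and JIT warm-up out of the result.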