Do skipped stages have any performance impact on a Spark job? - scala

I am running a Spark Structured Streaming job that creates an empty DataFrame and updates it with each micro-batch, as shown below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I persist the updated staticDF in memory after each update inside the loop. This lets Spark skip the additional stages that are created with every new micro-batch.
My questions:
1) The number of completed stages stays the same because the additional stages are always skipped, but can this still cause a performance problem, given that there could eventually be millions of skipped stages?
2) What happens if part or all of the cached RDD becomes unavailable (node/executor failure)? The Spark documentation says that it doesn't materialise the whole data received from multiple micro-batches so far. Does that mean it will need to read all events again from Kafka to regenerate staticDF?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}

// one-time creation of an empty static (not streaming) DataFrame
val staticDF_schema = new StructType()
  .add("product_id", LongType)
  .add("created_at", LongType)
var staticDF = sparkSession
  .createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)

// Note: streamingDF was created from a Kafka source
streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>
    // fetch max created_at for each product_id in the current micro-batch
    // (the alias must match the staticDF schema, otherwise unionByName fails)
    val staging_df = micro_batch_DF.groupBy("product_id")
      .agg(max("created_at").alias("created_at"))
    // update staticDF using the current micro-batch
    staticDF = staticDF.unionByName(staging_df)
    staticDF = staticDF
      .withColumn("rnk",
        row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at"))))
      .filter("rnk = 1")
      .drop("rnk")
      .cache()
  }
  .start()

Even though skipped stages don't require any computation, my job started failing after a certain number of batches. The DAG grew with every batch execution until it became unmanageable and threw a stack overflow exception.
To avoid this, I had to break the Spark lineage so that the number of stages doesn't increase with every run (even if they are skipped).
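The answer does not show how the lineage was broken. As a minimal sketch, assuming the truncation is done with Dataset.localCheckpoint() (which materializes the data and returns a DataFrame whose plan no longer contains the earlier batches), the update step could look like the following. The names LineageBreakSketch, mergeAndTruncate and demo are hypothetical, and the in-memory batches stand in for the Kafka micro-batches of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

object LineageBreakSketch {
  // Hypothetical helper: merge one batch into the accumulated state,
  // keep the newest row per product_id, then truncate the logical plan.
  def mergeAndTruncate(state: DataFrame, batch: DataFrame): DataFrame = {
    val latestFirst = Window.partitionBy("product_id").orderBy(desc("created_at"))
    state.unionByName(batch)
      .withColumn("rnk", row_number().over(latestFirst))
      .filter(col("rnk") === 1)
      .drop("rnk")
      .localCheckpoint() // eager by default: computes the data and drops the lineage
  }

  // Simulates five micro-batches and returns the final state, sorted.
  def demo(): List[(Long, Long)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lineage-break-sketch")
      .getOrCreate()
    import spark.implicits._

    var state = Seq.empty[(Long, Long)].toDF("product_id", "created_at")
    for (i <- 1 to 5) {
      val batch = Seq((1L, i.toLong), (2L, i * 10L)).toDF("product_id", "created_at")
      state = mergeAndTruncate(state, batch)
    }
    val rows = state.collect().map(r => (r.getLong(0), r.getLong(1))).toList.sorted
    spark.stop()
    rows
  }
}
```

Local checkpoints live in executor storage, so they are lost together with the executor. Where that matters, Dataset.checkpoint() with a directory set through SparkContext.setCheckpointDir writes to reliable storage instead, at the cost of extra I/O per batch.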

Related

In Spark, how are objects and variables kept in memory across different executors?

I am using:
Spark 3.0.0
Scala 2.12
I am writing a Spark Structured Streaming job with a custom stream source. Before the Spark query executes, I create a bunch of metadata that is used by my streaming job.
I am trying to understand how this metadata is kept in memory across different executors.
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs: List[JobConfig] = build() // build the job configs
val query = spark
  .readStream
  .format("custom-streaming")
  .load
query
  .writeStream
  .trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // CustomJobExecutor does data frame transformations and saves the data in PostgreSQL.
    CustomJobExecutor.start(jobConfigs)
  }
  .outputMode(OutputMode.Append()).start().awaitTermination()
I need help understanding the following:
In the sample code, how will Spark keep "jobConfigs" in memory across different executors?
Is there any added advantage to broadcasting?
What is an efficient way of keeping variables which can't be deserialized?
Local variables are copied for each task, whereas broadcast variables are copied only once per executor. From the docs:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This means that broadcast variables can make a difference if your jobConfigs is large enough and the number of tasks and stages that use it is significantly larger than the number of executors, or if deserialization is time-consuming. In other cases, they don't.
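As a rough illustration of the explicit broadcast described above, the sketch below ships a small lookup map to the executors once and reads it inside tasks through .value. The object name BroadcastSketch, the sample JobConfig values and the demo method are made up for illustration; only SparkContext.broadcast, Broadcast.value and Broadcast.destroy come from Spark.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  case class JobConfig(fieldName: String, displayName: String, castTo: String)

  // Resolves field names to display names on the executors using a broadcast map.
  def demo(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Sample metadata built on the driver (values are illustrative only).
    val jobConfigs = List(
      JobConfig("price", "Price", "double"),
      JobConfig("zip", "Zip code", "string"))

    // Sent to each executor once and reused by all tasks that run there.
    val byField = sc.broadcast(jobConfigs.map(c => c.fieldName -> c).toMap)

    val result = sc.parallelize(Seq("price", "zip", "unknown"), 2)
      .map(f => byField.value.get(f).map(_.displayName).getOrElse(f))
      .collect()
      .toSeq

    byField.destroy() // release the broadcast data once it is no longer needed
    spark.stop()
    result
  }
}
```

Without the broadcast, the same map would simply be captured in the closure and serialized with each task, which is fine for small metadata like this.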

Transform and persist Spark DStream into several separate locations, in parallel?

I have a use case with a DStream that contains data with several levels of nesting, and I need to persist different elements of that data into separate HDFS locations. I managed to work this out using Spark SQL, as below:
val context = new StreamingContext(sparkConf, Seconds(duration))
val stream = context.receiverStream(receiver)
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.drop("childRecords").write.parquet("ParentTable")
}
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.select(explode(col("childRecords")).as("children"))
    .select("children.*").write.parquet("ChildTable")
}
// repeat as necessary if the parent table has more kinds of child records,
// or if the child table itself also has child records
The code works, but the only issue I have with it is that the persistence runs sequentially: the first stream.foreachRDD has to complete before the second one starts, and so on. Ideally, the persistence job for ChildTable would start without waiting for ParentTable to finish, as they write to different locations and would not conflict. In reality, I have about 10 different jobs all waiting to complete sequentially, and I would probably see a big improvement in execution time if I could run them all in parallel.

Why does Spark repeat transformations after persist operations?

I have the following code. I call count to trigger the persist operation and materialize the transformations above it. But I noticed that the DAG and stages for the 2 different count jobs go through the first persist twice (whereas I expected the second count call to use the second persist).
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // So repartition here to be able to perform shuffle operations later
  // other transformations and minor filtering
  .repartition(nrInputPartitions)
  // First persist here, since the objects do not fit in memory (Persist 67)
  .persist(StorageLevel.MEMORY_AND_DISK)
LOG.info(s"First count = " + df.count)
val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: " + df.rdd.getNumPartitions)
val rddPoints = df
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)
// Since we will perform count and partitionBy actions, compute all the transformations above / second persist
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")
When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data into memory and, if it doesn't fit, spills to disk.
Now, you are doing multiple persists here. In Spark the memory cache is LRU, so the later persists will overwrite the previously cached data.
Even if you specify StorageLevel.MEMORY_AND_DISK, when the data is evicted from cache memory by other cached data, Spark doesn't spill it to disk. So when you do the next count, it needs to re-evaluate the DAG so that it can retrieve the partitions which aren't present in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid such re-computation.
Here is the whole scenario.
persist and cache are also transformations in Spark. After applying either of them, you must run an action for the RDD or DataFrame to actually be cached in memory.
Secondly, the unit of cache or persist is the partition. When cache or persist is executed, it saves only those partitions that can be held in memory. For the remaining partitions that cannot be kept in memory, the whole DAG is executed again when a new action is encountered.
Try:
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // So repartition here to be able to perform shuffle operations later
  // other transformations and minor filtering
  .repartition(nrInputPartitions)
// First persist here, since the objects do not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming. Structured Streaming treats all the arriving data as an unbounded input table, in which every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
  .format("console")
  .start()
What happens if I dump a 1GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, this is how the execution of this query is performed.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM, since the RDD (DF/DS) is lazily initialized. Of course, you need to repartition before processing to ensure an equal number of partitions and a uniform spread of data across executors...
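On the "is it configurable" part, the file source does have a documented option, maxFilesPerTrigger, which caps how many new files go into one micro-batch. The sketch below is a configuration example only: it assumes the ./csvFolder/ directory from the question exists, shortens the schema to two columns for brevity, and picks a 10-second trigger arbitrarily. The option limits the number of files, not bytes, so a single 1GB file still lands in one micro-batch, where it is read in partitions by tasks rather than loaded whole.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

object FileSourceRateLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("file-source-rate-limit-sketch")
      .getOrCreate()

    // Shortened schema for illustration; the question's schema has more columns.
    val csvSchema = new StructType().add("street", "string").add("city", "string")

    val streamingDF = spark.readStream
      .schema(csvSchema)
      .option("maxFilesPerTrigger", "1") // at most one new file per micro-batch
      .csv("./csvFolder/")               // assumed to exist, as in the question

    val query = streamingDF.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds")) // arbitrary interval for the sketch
      .start()

    query.awaitTermination()
  }
}
```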

Cassandra insert performance using spark-cassandra connector

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) = {
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
  }
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It takes only 20-40 milliseconds.
Can anyone tell me why the Spark connector takes this much time for an insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
There are some serious problems with your "benchmark":
Your data set is so small that you're mostly measuring the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included spark context initialization in your measurement. JVM loads classes lazily, so the code for spark initialization is really called after the measurement is started. This is an extremely costly element, typically performed only once per whole spark application, not even per job.
You're performing the measurement only once per launch. This means you're even incorrectly measuring spark ctx setup and job setup time, because the JVM has to load all the classes for the first time and Hotspot has probably no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use a larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same Spark context and discard the first few, until you reach steady-state performance.
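The measurement advice above can be sketched as a small timing helper in plain Scala. The names WarmupBenchmarkSketch, measure and median are hypothetical, and the number of warm-up and measured runs is something you would tune for your own setup.

```scala
object WarmupBenchmarkSketch {
  /** Runs `work` warmupRuns times without timing it, then measuredRuns times with timing.
    * Returns the duration of each measured run in milliseconds. */
  def measure(warmupRuns: Int, measuredRuns: Int)(work: () => Unit): Seq[Double] = {
    require(warmupRuns >= 0 && measuredRuns > 0, "need at least one measured run")
    // Warm-up: lets class loading and JIT compilation happen outside the measurement.
    (1 to warmupRuns).foreach(_ => work())
    (1 to measuredRuns).map { _ =>
      val start = System.nanoTime()
      work()
      (System.nanoTime() - start) / 1e6
    }
  }

  /** Median of the measured durations; less sensitive to outliers than the mean. */
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "median of an empty sequence is undefined")
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}
```

For the benchmark in the question, the work function would wrap only the saveToCassandra call on an RDD created beforehand, with the SparkContext initialized before measure is called.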
BTW, if you enable the debug logging level, the connector logs the insert times for every partition in the executor logs.