Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0

Description
We have a Spark Streaming 1.5.2 application in Scala that reads JSON events from a Kinesis Stream, does some transformations/aggregations and writes the results to different S3 prefixes. The current batch interval is 60 seconds. We have 3000-7000 events/sec. We’re using checkpointing to protect us from losing aggregations.
It’s been working well for a while, recovering from exceptions and even cluster restarts. We recently recompiled the code for Spark Streaming 1.6.0, only changing the library dependencies in the build.sbt file. After running the code in a Spark 1.6.0 cluster for several hours, we’ve noticed the following:
“Input Rate” and “Processing Time” volatility has increased substantially (see the screenshots below) in 1.6.0.
Every few hours, there’s an “Exception thrown while writing record: BlockAdditionEvent … to the WriteAheadLog. java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]” exception (see complete stack trace below), coinciding with a drop to 0 events/sec for specific batches (minutes).
After doing some digging, I think the second issue looks related to this Pull Request. The initial goal of the PR: “When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don’t get blocked by the driver for too long.”
We are checkpointing to S3 in Spark 1.5.2 and there are no performance/reliability issues. We’ve tested checkpointing in Spark 1.6.0 to both S3 and a local NAS, and in both cases we’re receiving this exception. It looks like this exception arises whenever it takes more than 5 seconds to checkpoint a batch, and we’ve verified that the events for that batch are lost forever.
Questions
Is the increase in “Input Rate” and “Processing Time” volatility expected in Spark Streaming 1.6.0 and is there any known way of improving it?
Do you know of any workaround apart from these two?
1) To guarantee that it takes less than 5 seconds for the checkpointing sink to write all files. In my experience, you cannot guarantee that with S3, even for small batches. For local NAS, it depends on who’s in charge of infrastructure (difficult with cloud providers).
2) Increase the spark.streaming.driver.writeAheadLog.batchingTimeout property value.
Would you expect to lose any events in the described scenario? I'd think that if batch checkpointing fails, the shard/receiver Sequence Numbers wouldn't be increased and it would be retried at a later time.
Spark 1.5.2 Statistics - Screenshot
Spark 1.6.0 Statistics - Screenshot
Full Stack Trace
16/01/19 03:25:03 WARN ReceivedBlockTracker: Exception thrown while writing record: BlockAdditionEvent(ReceivedBlockInfo(0,Some(3521),Some(SequenceNumberRanges(SequenceNumberRange(StreamEventsPRD,shardId-000000000003,49558087746891612304997255299934807015508295035511636018,49558087746891612304997255303224294170679701088606617650), SequenceNumberRange(StreamEventsPRD,shardId-000000000004,49558087949939897337618579003482122196174788079896232002,49558087949939897337618579006984380295598368799020023874), SequenceNumberRange(StreamEventsPRD,shardId-000000000001,49558087735072217349776025034858012188384702720257294354,49558087735072217349776025038332464993957147037082320914), SequenceNumberRange(StreamEventsPRD,shardId-000000000009,49558088270111696152922722880993488801473174525649617042,49558088270111696152922722884455852348849472550727581842), SequenceNumberRange(StreamEventsPRD,shardId-000000000000,49558087841379869711171505550483827793283335010434154498,49558087841379869711171505554030816148032657077741551618), SequenceNumberRange(StreamEventsPRD,shardId-000000000002,49558087853556076589569225785774419228345486684446523426,49558087853556076589569225789389107428993227916817989666))),BlockManagerBasedStoreResult(input-0-1453142312126,Some(3521)))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:321)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:500)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1230)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:498)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Source Code Extract
...
// Function to create a new StreamingContext and set it up
def setupContext(): StreamingContext = {
  ...
  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
  // Create a Kinesis DStream
  val data = KinesisUtils.createStream(ssc,
    kinesisAppName, kinesisStreamName,
    kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName(),
    InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds),
    StorageLevel.MEMORY_AND_DISK_SER_2, awsAccessKeyId, awsSecretKey)
  ...
  ssc.checkpoint(checkpointDir)
  ssc
}
// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(checkpointDir, setupContext)
ssc.start()
ssc.awaitTermination()

Following zero323's suggestion about posting my comment as an answer:
Increasing spark.streaming.driver.writeAheadLog.batchingTimeout solved the checkpointing timeout issue. We did it after making sure we had room for it. We have been testing it for a while now. So I only recommend increasing it after careful consideration.
DETAILS
We used these 2 settings in $SPARK_HOME/conf/spark-defaults.conf:
spark.streaming.driver.writeAheadLog.allowBatching true
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
Originally, we only had spark.streaming.driver.writeAheadLog.allowBatching set to true.
Before the change, we had reproduced the issue mentioned in the question ("...ReceivedBlockTracker: Exception thrown while writing record...") in a testing environment. It occurred every few hours. After the change, the issue disappeared. We ran it for several days before moving to production.
We had found that the getBatchingTimeout() method of the WriteAheadLogUtils class had a default value of 5000ms, as seen here:
def getBatchingTimeout(conf: SparkConf): Long = {
  conf.getLong(DRIVER_WAL_BATCHING_TIMEOUT_CONF_KEY, defaultValue = 5000)
}
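To illustrate why a slow write shows up as the TimeoutException in the question: the stack trace shows BatchedWriteAheadLog.write going through scala.concurrent.Await.result, which throws once the wait exceeds the configured timeout. The following is only a standalone sketch of that mechanism, not Spark's code; slowWrite, awaitWrite and the durations are made-up names and values for illustration.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object WalTimeoutSketch {
  // Hypothetical stand-in for a write to slow storage (e.g. S3):
  // the future completes after `latencyMs` milliseconds.
  def slowWrite(latencyMs: Long): Future[Boolean] = Future {
    Thread.sleep(latencyMs)
    true
  }

  // Waits for the write for at most `timeoutMs` milliseconds.
  // Right(result) if it finished in time, Left(message) if the wait timed out.
  def awaitWrite(latencyMs: Long, timeoutMs: Long): Either[String, Boolean] =
    try Right(Await.result(slowWrite(latencyMs), timeoutMs.millis))
    catch { case e: TimeoutException => Left(e.getMessage) }
}
```

Here WalTimeoutSketch.awaitWrite(10, 2000) returns a Right, while WalTimeoutSketch.awaitWrite(2000, 50) returns a Left carrying the timeout message. Raising the timeout only helps when the write does eventually finish, which is why we made sure we had room for it first.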

Related

Do skipped stages have any performance impact on a Spark job?

I am running a Spark Structured Streaming job that creates an empty DataFrame and updates it with each micro-batch, as shown below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I persist the updated staticDF in memory after each update inside the loop. This lets Spark skip the additional stages that are created with every new micro-batch.
My questions:
1) The number of completed stages stays the same because the additional stages are always skipped, but can this still cause a performance issue, given that there could eventually be millions of skipped stages?
2) What happens when part or all of the cached RDD becomes unavailable (node/executor failure)? The Spark documentation says that it doesn't materialise the whole data received from multiple micro-batches so far, so does that mean it will need to read all events again from Kafka to regenerate staticDF?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}
// one-time creation of an empty static (not streaming) DataFrame
val staticDF_schema = new StructType()
  .add("product_id", LongType)
  .add("created_at", LongType)
var staticDF = sparkSession
  .createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)
// Note: streamingDF was created from a Kafka source
streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>
    // fetch the max created_at for each product_id in the current micro-batch
    // (the column must be named created_at so that unionByName matches staticDF_schema)
    val staging_df = micro_batch_DF.groupBy("product_id")
      .agg(max("created_at").alias("created_at"))
    // update staticDF with the current micro-batch, keeping the latest row per product_id
    staticDF = staticDF.unionByName(staging_df)
    staticDF = staticDF
      .withColumn("rnk",
        row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at"))))
      .filter("rnk = 1")
      .drop("rnk")
      .cache()
  }
  .start()

Even though the skipped stages don't need any computation, my job started failing after a certain number of batches. This was because the DAG grew with every batch execution until it became unmanageable and threw a stack overflow exception.
To avoid this, I had to break the Spark lineage so that the number of stages doesn't increase with every run (even if they are skipped).
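For anyone wondering what breaking the lineage can look like, here is a minimal sketch. It assumes Dataset.localCheckpoint() (available since Spark 2.3) as the mechanism, which is one of several options and not necessarily the one used in the job above; LineageSketch and truncateLineage are made-up names.

```scala
import org.apache.spark.sql.DataFrame

object LineageSketch {
  // Materializes `df` and returns a DataFrame whose plan no longer refers to the
  // transformations that produced it, so the plan stops growing batch after batch.
  // Assumption: localCheckpoint() keeps the data on the executors, so it is not
  // fault tolerant. df.checkpoint() is the reliable variant, but it requires
  // sparkContext.setCheckpointDir(...) to have been called first.
  def truncateLineage(df: DataFrame): DataFrame =
    df.localCheckpoint() // eager by default: the data is computed right here
}
```

Inside foreachBatch this would replace the final cache() call, e.g. staticDF = LineageSketch.truncateLineage(updatedDF), where updatedDF is the deduplicated union for the current micro-batch.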

Datasource V2 Reader (Spark Structured Streaming) - offsets out of order

I am currently implementing two custom readers using the V2 api for a spark structured streaming job. After the job runs for ~30-60 minutes, it bombs with:
Caused by: java.lang.RuntimeException: Offsets committed out of order: 608799 followed by 2982
I am repurposing the examples found here and it is bombing at line: 206.
Instead of using the twitter stream that is provided in the example I am implementing it for JMS & SQS.
My question is: has anyone encountered this issue? Or is there something wrong with that implementation?
Code snippet:
override def commit(end: Offset): Unit = {
  internalLog(s"** commit($end) lastOffsetCommitted: $lastOffsetCommitted")
  val newOffset = TwitterOffset.convert(end).getOrElse(
    sys.error(s"TwitterStreamMicroBatchReader.commit() received an offset ($end) that did not " +
      s"originate with an instance of this class")
  )
  val offsetDiff = (newOffset.offset - lastOffsetCommitted.offset).toInt
  if (offsetDiff < 0) {
    sys.error(s"Offsets committed out of order: $lastOffsetCommitted followed by $end")
  }
  tweetList.trimStart(offsetDiff)
  lastOffsetCommitted = newOffset
}
I can't find an answer through my usual outlets. I did, however, see this. One point that was made is to delete the checkpoint data, which doesn't seem like a viable solution in a production system. The other was that the source system doesn't maintain offset information. I was under the impression that Spark would handle the offset information by itself. If this second point is the problem, how can I ensure that the source system handles this paradigm?
Please let me know if I can provide more information.
Edit: Looking at the MicroBatchReader interface, the documentation for commit says:
/**
* Informs the source that Spark has completed processing all data for offsets less than or
* equal to `end` and will only request offsets greater than `end` in the future.
*/
void commit(Offset end);
So the question becomes: why is Spark sending me commit offsets that have already been committed?
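To make the contract quoted above concrete, here is a standalone sketch with no Spark classes involved. OffsetBuffer and its methods are made-up names, and offsets are plain Long values for illustration. Committing the same offset twice is harmless, while a commit below the last committed offset means that Spark and the source disagree about what has already been processed.

```scala
// Hypothetical in-memory source buffer that honours the commit contract.
final class OffsetBuffer {
  private var pending = Vector.empty[(Long, String)] // (offset, payload)
  private var lastCommitted = -1L

  def add(offset: Long, payload: String): Unit =
    pending = pending :+ ((offset, payload))

  // Drops every entry at or below `end`. A commit below the last committed
  // offset violates the contract, so it fails loudly instead of being ignored.
  def commit(end: Long): Unit = {
    require(end >= lastCommitted,
      s"Offsets committed out of order: $lastCommitted followed by $end")
    pending = pending.filter(_._1 > end)
    lastCommitted = end
  }

  // Offsets that are still buffered, i.e. not yet committed.
  def buffered: List[Long] = pending.map(_._1).toList
}
```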

Answering my own question in case it helps someone:
I should have added more information to the question - this job is running on EMR and is using EFS to checkpoint data.
The problem occurred when I used Amazon's amazon-efs-utils to mount EFS. For some reason each worker was not able to see the other workers' reads and writes - as if EFS didn't mount.
The solution was to switch to nfs-utils to mount EFS (per AWS instructions) so that each worker could accurately read the checkpoint data.

End-to-end exactly once semantics in spark structured streaming

I am trying to understand if end-to-end exactly once semantics is compromised in spark structured streaming in the below scenario.
Scenario: A Structured Streaming job with a Kafka source and a file sink is started. Kafka has 16 partitions and I am reading with 16 executors. I interrupted the job at a moment when a particular batch was incomplete: 8 out of 16 tasks had completed and 8 output files had been generated. When I run the job again, the batch starts over and reads the data from the same offset range as the previous incomplete batch, producing 16 output files. The 8 output files of the incomplete batch therefore result in duplicates, which has been confirmed by data comparison.

Regarding end-to-end exactly-once in streaming, I recommend reading this post on Flink (a framework similar to Spark).
Briefly: store the source/sink state whenever a checkpoint event occurs.
The rest of this answer is quoted from the Flink post.
So let’s put all of these different pieces together:
Once all of the operators complete their pre-commit, they issue a commit.
If at least one pre-commit fails, all others are aborted, and we roll back to the previous successfully-completed checkpoint.
After a successful pre-commit, the commit must be guaranteed to eventually succeed — both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, the application restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
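The pre-commit/commit steps quoted above can be sketched in plain Scala as follows. This is not Flink or Spark code: Participant, TwoPhaseCommit and run are made-up names, and the retry of a failed commit after an application restart is deliberately left out.

```scala
import scala.util.Try

// Hypothetical participant in a two-phase commit, e.g. a transactional sink.
trait Participant {
  def preCommit(): Try[Unit] // stage the data; may fail
  def commit(): Unit         // make the staged data visible
  def abort(): Unit          // discard the staged data
}

object TwoPhaseCommit {
  // Returns true if the checkpoint was committed, false if it was rolled back.
  def run(participants: Seq[Participant]): Boolean = {
    // Phase 1: every participant attempts its pre-commit.
    val results = participants.map(_.preCommit())
    if (results.forall(_.isSuccess)) {
      // Phase 2: all pre-commits succeeded, so everyone commits.
      participants.foreach(_.commit())
      true
    } else {
      // At least one pre-commit failed: abort everyone (for simplicity this
      // includes the participant that failed) and report the rollback.
      participants.foreach(_.abort())
      false
    }
  }
}
```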

SparkException: Could not execute broadcast in time

I am using spark structured streaming to write some transformed dataframes using function:
def parquetStreamWriter(dataPath: String, checkpointPath: String)(df: DataFrame): Unit = {
  df.writeStream
    .trigger(Trigger.Once)
    .format("parquet")
    .option("checkpointLocation", checkpointPath)
    .start(dataPath)
}
When I call this function only a few times in the code (1 or 2 DataFrames written), it works fine, but when I call it more often (e.g. writing 15 to 20 DataFrames in a loop), I get the following exception and some of the jobs fail in Databricks:
Caused by: org.apache.spark.SparkException: Could not execute broadcast in
time. You can disable broadcast join by setting
spark.sql.autoBroadcastJoinThreshold to -1.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:191)
My transformation has one broadcast join. I tried removing the broadcast from the join in the code, but got the same error.
I also tried setting the Spark conf spark.sql.autoBroadcastJoinThreshold to -1, as mentioned in the error, but got the same exception again.
Can you suggest where I am going wrong?

It's difficult to judge w/o seeing the execution plan (esp. not sure about broadcasted volume), but increasing the spark.sql.broadcastTimeout could help (please find full configuration description here).
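A minimal sketch of raising that timeout at runtime, assuming a SparkSession is available (on Databricks the predefined spark session would be used instead of building one). The app name, the object name and the value 1200 are arbitrary placeholders; the property is expressed in seconds and defaults to 300.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder session; reuse the existing one where the platform provides it.
    val spark = SparkSession.builder().appName("broadcast-timeout-sketch").getOrCreate()
    // Timeout for the broadcast wait in broadcast joins, in seconds (default 300).
    spark.conf.set("spark.sql.broadcastTimeout", "1200")
  }
}
```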

This can be solved by setting spark.sql.autoBroadcastJoinThreshold to a higher value.
If you have no idea about the execution time for that particular DataFrame, you can directly set spark.sql.autoBroadcastJoinThreshold to -1 (i.e. spark.sql.autoBroadcastJoinThreshold -1). As the error message itself says, this disables broadcast joins, so the broadcast time limit no longer applies to the execution of that DataFrame.

Cassandra insert performance using spark-cassandra connector

I am a newbie to spark and cassandra. I am trying to insert into cassandra table using spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._
case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)
object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}
object TestRepo {
  def insertList(list: List[TestEntity]) = {
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
  }
}
object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It takes only 20-40 milliseconds.
Can anyone tell me why spark connector is taking this much time for insert? Am I doing anything wrong in my code or is it not advisable to use spark-cassandra connector for insert operations?

It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.

There are some serious problems with your "benchmark":
Your data set is so small that you're mostly measuring the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included spark context initialization in your measurement. JVM loads classes lazily, so the code for spark initialization is really called after the measurement is started. This is an extremely costly element, typically performed only once per whole spark application, not even per job.
You're performing the measurement only once per launch. This means you're even incorrectly measuring spark ctx setup and job setup time, because the JVM has to load all the classes for the first time and Hotspot has probably no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same spark context and discard a few initial ones, until you reach steady state performance.
BTW If you enable debug logging level, the connector logs the insert times for every partition in the executor logs.
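The measurement advice above could be sketched as a small harness like the one below. It is plain Scala with no Spark dependency; InsertBenchmark, measure and the run counts are made-up for illustration. The Spark context and the data to insert are expected to be created before calling it, so that one-time setup stays outside the timed section.

```scala
object InsertBenchmark {
  // Runs `op` (warmups + runs) times and returns the wall-clock duration in
  // milliseconds of the last `runs` executions; warm-up timings are discarded.
  def measure(warmups: Int, runs: Int)(op: () => Unit): Seq[Double] = {
    val timings = (1 to warmups + runs).map { _ =>
      val start = System.nanoTime()
      op()
      (System.nanoTime() - start) / 1e6
    }
    timings.drop(warmups)
  }
}
```

With an already created context and RDD, a call could look like InsertBenchmark.measure(5, 20)(() => rdd.saveToCassandra("testKeySpace", "testColumnFamily")), after which the median or mean of the returned timings is what should be compared.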