spark "java.io.IOException: No space left on device" - pyspark

I am running a PySpark job on an EC2 cluster with 4 workers.
I get this error:
2018-07-05 08:20:44 WARN TaskSetManager:66 - Lost task 1923.0 in stage 18.0 (TID 21385, 10.0.5.97, executor 3): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:260)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:190)
at org.apache.spark.serializer.DummySerializerInstance$1.close(DummySerializerInstance.java:65)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:194)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
I looked at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
and tried increasing the number of shuffle partitions - same issue.
My data looks fairly evenly partitioned across executors.
I want to try the workaround of assigning Null or None to dataframes. The question is whether it will indeed remove the intermediate shuffle files, and whether the lineage will be dropped.
For instance, if my code looks like this:
df1 = sqlContext.read.parquet(...)
df2 = df1.filter(...)
df3 = df2.groupBy(*groupList).agg(....)
and I put
df1 = None
after line 1 - will it save shuffle space? Isn't df1 still needed, and won't it be re-computed for df2 and df3?
Second question -
will checkpointing df1 or df2 help by breaking the lineage?
What is a feasible solution when dealing with data larger than my storage (around 400GB of raw data processed)?
UPDATE
Removing the cache of a dataframe between 2 phases that need that dataframe helped, and I got no errors.
I wonder how it helps with the intermediate shuffle files.

I did face a similar situation. The reason is that group by operations and joins cause data to be shuffled. Since this shuffle data is temporary data produced while the Spark application executes, it is stored in the directory that spark.local.dir in the spark-defaults.conf file points to, which is normally a tmp directory with little space.
In general, to avoid this error, update spark.local.dir in the spark-defaults.conf file to point to a location that has more disk space.
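For example, if the workers have a larger volume mounted at /mnt (the path here is only an illustration), the entry in $SPARK_HOME/conf/spark-defaults.conf could look like the line below. Note that on YARN this setting is typically overridden by the node manager's local directories.
spark.local.dir /mnt/spark-tmp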

Related

How to tune save operation with partitionBy

I need to partition my dataset by 6 columns: region/year/month/day/id/quadkey
At the top level I have just a binary region value, and only at the very bottom level does the data actually split into many partitions.
So let's say we have 2 regions / usually 1 year / usually 1 month / 3-4 days / 100-150 ids / 50-200 quadkeys
When I perform this write I get a really unbalanced shuffle, and sometimes executors fail due to exceeding memory limits.
I've also noticed from the History UI that some tasks in that phase are very big (~15 GB) while others are much smaller (~1 GB).
I've tried to play with
sqlContext.setConf("spark.sql.shuffle.partitions", "3000")
I've also tried to increase the number of executors, but with the same memory settings. These are the errors that I get:
19/04/10 09:47:36 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
...
// stage: DataFrame
val partitionColumns = List("region", "year", "month", "day", "id", "quadkey")
stage.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)): _*)
  .write.partitionBy(partitionColumns: _*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
I expected to have balanced tasks at the save stage. What shuffle settings should I set for this? Or do I have to have executors with more than 20-25 GB of memory? What should the scaling approach be in such a case?
One approach could be to add more columns to the repartition, and to make sure that column has high cardinality (an id of the records or some random values).
If the number of files becomes large, then try setting numPartitions followed by the partitioning columns:
df.repartition(numPartitions, partition_cols_including_high_cardinality_column:_*).write........
===========================================================================
Edit:
In scenarios where the data is skewed, with some partition combinations having more data than others, repartitioning by the same columns might not be a good idea.
In repartition, all data matching a partition-key combination is first collected on the same executor, and one file is produced if your partitionBy and repartition have the same column arguments. So in this case a few partition combinations will produce files of ~15 GB and some of ~1 GB, which is not ideal for datasources like HDFS.
So what I'm suggesting here is to use repartition columns that distribute the data evenly across executors. Consider this: we have repartitioned the data on some column combination E, which produces, let's say, 400 rows for each executor to work on; each executor then writes its data based on the partitionBy spec. When you check your final output, each partition will have a number of files equal to the number of executors that received rows with the same partitionBy spec. The number of executors is decided by the repartition column spec.
What I suggested above is to use a different set of columns for repartition, which will help distribute the data evenly across executors. If for some reason that is not possible with your data, then add some random column (a technique called salting). Adding numPartitions fixes an upper bound on the number of executors working on the data, thereby fixing the number of files written to a partition directory. Setting numPartitions is extremely helpful when your repartition column has high cardinality, as that can create many files in your output directories.
import org.apache.spark.sql.functions.rand
df.repartition(numPartitions, $"some_col_1", rand())
  .write.partitionBy("some_col")
  .parquet("partitioned_lake")
Here, by fixing numPartitions, we are sure that the output for every partitionBy spec will have at most numPartitions files.
Helpful link - http://tantusdata.com/spark-shuffle-case-2-repartitioning-skewed-data/
Hope this helps

SparkException: Could not execute broadcast in time

I am using Spark Structured Streaming to write some transformed dataframes using this function:
def parquetStreamWriter(dataPath: String, checkpointPath: String)(df: DataFrame): Unit = {
  df.writeStream
    .trigger(Trigger.Once)
    .format("parquet")
    .option("checkpointLocation", checkpointPath)
    .start(dataPath)
}
When I call this function a small number of times (1 or 2 dataframes written) it works fine, but when I call it more times (like writing 15 to 20 dataframes in a loop), I get the following exception and some of the jobs fail in Databricks:
Caused by: org.apache.spark.SparkException: Could not execute broadcast in
time. You can disable broadcast join by setting
spark.sql.autoBroadcastJoinThreshold to -1.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:191)
My transformation has one broadcast join, but I tried removing the broadcast from the join in my code and got the same error.
I tried setting the Spark conf spark.sql.autoBroadcastJoinThreshold to -1, as mentioned in the error, but got the same exception again.
Can you suggest where I am going wrong?
It's difficult to judge without seeing the execution plan (especially the broadcasted volume), but increasing spark.sql.broadcastTimeout could help (please find the full configuration description here).
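A minimal sketch of raising it, assuming spark is the active SparkSession and using a purely illustrative value:
// Default is 300 seconds; allow more time for building the broadcast relation
spark.conf.set("spark.sql.broadcastTimeout", "1200")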
This can be solved by setting spark.sql.autoBroadcastJoinThreshold to a higher value.
If you have no idea about that particular dataframe, you can directly set spark.sql.autoBroadcastJoinThreshold to -1; this disables automatic broadcast joins entirely, so the query no longer has to build the broadcast within the time limit.
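As a sketch, both variants look like this (the 100 MB figure is only an illustration; spark is the active SparkSession):
// Disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// ...or keep them but raise the size threshold, e.g. to 100 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)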

Spark : Size exceeds Integer.MAX_VALUE When Joining 2 Large DFs

Folks,
I am running into this issue while trying to join 2 large dataframes (100 GB+ each) in Spark on a single key identifier per row.
I am using Spark 1.6 on EMR and here is what I am doing:
val df1 = sqlContext.read.json("hdfs:///df1/")
val df2 = sqlContext.read.json("hdfs:///df2/")
// clean up and filter steps later
df1.registerTempTable("df1")
df2.registerTempTable("df2")
val df3 = sql("select df1.*, df2.col1 from df1 left join df2 on df1.col3 = df2.col4")
df3.write.json("hdfs:///df3/")
This is basically the gist of what I am doing, among other clean-up and filter steps in between, to finally join df1 and df2.
The error I am seeing is:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Configuration and References:
I am using a 13-node cluster with 60 GB per node, with executor and driver memory set accordingly, including overheads. Things I have tried adjusting:
spark.sql.broadcastTimeout
spark.sql.shuffle.partitions
I have also tried using a bigger cluster, but it did not help. This link says that if a shuffle partition exceeds 2GB, this error is thrown. But I have tried increasing the number of partitions to a much higher value, still no luck.
I suspect this could be something related to lazy evaluation. When I do 10 operations on a DF, they are only executed at the last step. I tried adding .persist() with various storage levels for the DFs, but it still doesn't succeed. I have also tried dropping the temp tables and emptying all the earlier DFs for clean-up.
However, the code works if I break it down into 2 parts: writing the final temp data (2 dataframes) to disk and exiting, then restarting to only join the two DFs.
I was earlier getting this error :
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.toJSON(DataFrame.scala:1724)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
But when I adjusted spark.sql.broadcastTimeout, I started getting the first error.
Would appreciate any help in this case. I can add more info if needed.
In Spark you can't have a shuffle block larger than 2GB. This is because
Spark stores shuffle blocks as a ByteBuffer. Here's how it is allocated:
ByteBuffer.allocate(int capacity)
Since ByteBuffers are limited to Integer.MAX_VALUE (2GB), so are shuffle blocks. The solution is to increase the number of partitions, either by using spark.sql.shuffle.partitions in Spark SQL or by rdd.repartition() or rdd.coalesce() for RDDs, so that each partition's size is <= 2GB.
You mentioned that you tried to increase the number of partitions and it still failed. Can you check whether the partition size was > 2GB? Just make sure that the number of partitions specified is high enough to make each block size < 2GB.
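A minimal sketch of both routes in Spark 1.6 (the partition counts are purely illustrative; pick a value large enough that each shuffle partition stays well under 2GB):
// DataFrame/SQL route: raise the number of shuffle partitions before running the join
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
// RDD route: repartition explicitly before the wide operation (someRdd is a placeholder)
val morePartitions = someRdd.repartition(2000)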

Stack overflow error when loading a large table from mongodb to spark

All,
I have a table which is about 1TB in MongoDB. I tried to load it into Spark using the mongo connector, but I keep getting a stack overflow after 18 minutes of execution.
java.lang.StackOverflowError:
at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
....
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
16/06/29 08:42:22 INFO YarnAllocator: Driver requested a total number of 54692 executor(s).
16/06/29 08:42:22 INFO YarnAllocator: Will request 46501 executor containers, each with 4 cores and 5068 MB memory including 460 MB overhead
Is it because I didn't provide enough memory? Or should I provide more storage?
I have tried adding a checkpoint, but it doesn't help.
I have changed some values in my code because they relate to my company's database, but the whole code is still valid for this question.
val sqlContext = new SQLContext(sc)
val builder = MongodbConfigBuilder(Map(Host -> List("mymongodurl:mymongoport"), Database -> "mymongoddb", Collection ->"mymongocollection", SamplingRatio -> 0.01, WriteConcern -> "normal"))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
mongoRDD.registerTempTable("mytable")
val dataFrame = sqlContext.sql("SELECT u_at, c_at FROM mytable")
val deltaCollect = dataFrame.filter("u_at is not null and c_at is not null and u_at != c_at").rdd
val mapDelta = deltaCollect.map {
  case Row(u_at: Date, c_at: Date) => {
    if (u_at.getTime == c_at.getTime) {
      (0.toString, 0L)
    } else {
      val delta = (u_at.getTime - c_at.getTime) / 1000 / 60 / 60 / 24
      (delta.toString, 1L)
    }
  }
}
val reduceRet = mapDelta.reduceByKey(_+_)
val OUTPUT_PATH = s"./dump"
reduceRet.saveAsTextFile(OUTPUT_PATH)
As you know, Apache Spark does in-memory processing while executing a job, i.e. it loads the data it works on into memory. Here, as per your question and comments, you have a dataset as large as 1TB and the memory available to Spark is around 8GB per core. Hence your Spark executors will always run out of memory in this scenario.
To avoid this you can follow either of the two options below:
Change your RDD storage level to MEMORY_AND_DISK, as in the sketch after this list. This way Spark will not load the full data into memory; rather, it will try to spill the extra data to disk. But performance will decrease because of the traffic between memory and disk. Check out RDD persistence.
Increase your executor memory so that Spark can load even 1TB of data fully into memory. This way performance will be good, but infrastructure cost will increase.
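For the first option, a minimal sketch against the DataFrame from the question (only the import and the persist call are added; everything else is the question's own code):
import org.apache.spark.storage.StorageLevel
// Spill partitions that do not fit in memory to local disk instead of keeping everything in memory
val dataFrame = sqlContext.sql("SELECT u_at, c_at FROM mytable")
  .persist(StorageLevel.MEMORY_AND_DISK)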
I added another Java option, "-Xss32m", to the Spark driver to raise the stack size for every thread, and this exception is not thrown any more. How stupid of me, I should have tried it earlier. But another problem has shown up and I will have to investigate further. Still, great thanks for your help.
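For reference, that option can be passed either on the command line or in the defaults file (both are standard Spark settings; whether 32 MB is the right stack size depends on the job):
# On the command line:
spark-submit --driver-java-options "-Xss32m" <your usual arguments>
# Or in $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -Xss32m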

Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0

Description
We have a Spark Streaming 1.5.2 application in Scala that reads JSON events from a Kinesis Stream, does some transformations/aggregations and writes the results to different S3 prefixes. The current batch interval is 60 seconds. We have 3000-7000 events/sec. We’re using checkpointing to protect us from losing aggregations.
It’s been working well for a while, recovering from exceptions and even cluster restarts. We recently recompiled the code for Spark Streaming 1.6.0, only changing the library dependencies in the build.sbt file. After running the code in a Spark 1.6.0 cluster for several hours, we’ve noticed the following:
“Input Rate” and “Processing Time” volatility has increased substantially (see the screenshots below) in 1.6.0.
Every few hours, there's an "Exception thrown while writing record: BlockAdditionEvent … to the WriteAheadLog. java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]" exception (see complete stack trace below) coinciding with the drop to 0 events/sec for specific batches (minutes).
After doing some digging, I think the second issue looks related to this Pull Request. The initial goal of the PR: “When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don’t get blocked by the driver for too long.”
We are checkpointing in S3 in Spark 1.5.2 and there are no performance/reliability issues. We’ve tested checkpointing in Spark 1.6.0 in S3 and local NAS and in both cases we’re receiving this exception. It looks like when it takes more than 5 seconds to checkpoint a batch, this exception arises and we’ve checked that the events for that batch are lost forever.
Questions
Is the increase in “Input Rate” and “Processing Time” volatility expected in Spark Streaming 1.6.0 and is there any known way of improving it?
Do you know of any workaround apart from these 2?:
1) To guarantee that it takes less than 5 seconds for the checkpointing sink to write all files. In my experience, you cannot guarantee that with S3, even for small batches. For local NAS, it depends on who’s in charge of infrastructure (difficult with cloud providers).
2) Increase the spark.streaming.driver.writeAheadLog.batchingTimeout property value.
Would you expect to lose any events in the described scenario? I'd think that if batch checkpointing fails, the shard/receiver Sequence Numbers wouldn't be increased and it would be retried at a later time.
Spark 1.5.2 Statistics - Screenshot
Spark 1.6.0 Statistics - Screenshot
Full Stack Trace
16/01/19 03:25:03 WARN ReceivedBlockTracker: Exception thrown while writing record: BlockAdditionEvent(ReceivedBlockInfo(0,Some(3521),Some(SequenceNumberRanges(SequenceNumberRange(StreamEventsPRD,shardId-000000000003,49558087746891612304997255299934807015508295035511636018,49558087746891612304997255303224294170679701088606617650), SequenceNumberRange(StreamEventsPRD,shardId-000000000004,49558087949939897337618579003482122196174788079896232002,49558087949939897337618579006984380295598368799020023874), SequenceNumberRange(StreamEventsPRD,shardId-000000000001,49558087735072217349776025034858012188384702720257294354,49558087735072217349776025038332464993957147037082320914), SequenceNumberRange(StreamEventsPRD,shardId-000000000009,49558088270111696152922722880993488801473174525649617042,49558088270111696152922722884455852348849472550727581842), SequenceNumberRange(StreamEventsPRD,shardId-000000000000,49558087841379869711171505550483827793283335010434154498,49558087841379869711171505554030816148032657077741551618), SequenceNumberRange(StreamEventsPRD,shardId-000000000002,49558087853556076589569225785774419228345486684446523426,49558087853556076589569225789389107428993227916817989666))),BlockManagerBasedStoreResult(input-0-1453142312126,Some(3521)))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:321)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:500)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1230)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:498)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Source Code Extract
...
// Function to create a new StreamingContext and set it up
def setupContext(): StreamingContext = {
  ...
  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
  // Create a Kinesis DStream
  val data = KinesisUtils.createStream(ssc,
    kinesisAppName, kinesisStreamName,
    kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName(),
    InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds),
    StorageLevel.MEMORY_AND_DISK_SER_2, awsAccessKeyId, awsSecretKey)
  ...
  ssc.checkpoint(checkpointDir)
  ssc
}
// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(checkpointDir, setupContext)
ssc.start()
ssc.awaitTermination()
Following zero323's suggestion about posting my comment as an answer:
Increasing spark.streaming.driver.writeAheadLog.batchingTimeout solved the checkpointing timeout issue. We did it after making sure we had room for it. We have been testing it for a while now. So I only recommend increasing it after careful consideration.
DETAILS
We used these 2 settings in $SPARK_HOME/conf/spark-defaults.conf:
spark.streaming.driver.writeAheadLog.allowBatching true
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
Originally, we only had spark.streaming.driver.writeAheadLog.allowBatching set to true.
Before the change, we had reproduced the issue mentioned in the question ("...ReceivedBlockTracker: Exception thrown while writing record...") in a testing environment. It occurred every few hours. After the change, the issue disappeared. We ran it for several days before moving to production.
We had found that the getBatchingTimeout() method of the WriteAheadLogUtils class had a default value of 5000ms, as seen here:
def getBatchingTimeout(conf: SparkConf): Long = {
  conf.getLong(DRIVER_WAL_BATCHING_TIMEOUT_CONF_KEY, defaultValue = 5000)
}