Shutdown hook for Spark batch application - Scala

I have a Spark Scala batch application. It commits its run status to MariaDB when it completes or fails.
I want to handle an edge case: when the application is killed, say with "yarn application -kill [appid]", I want to update the status to failed in the MariaDB table.
I planned to use "ShutdownHookManager" for this, but I see it is private to Spark, and Scala's sys.ShutdownHookThread does not work either.
Can somebody guide me on shutdown hook handling for a killed Spark batch application? There are not many resources on this.

You can create a custom SparkListener that reacts to the onApplicationEnd event:
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

class MyListener extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println("Shutting down...")
  }
}
This listener can then be added to the SparkContext:
spark.sparkContext.addSparkListener(new MyListener())
When the Spark application terminates, the string Shutting down... is printed on the console.
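For the MariaDB use case, here is a minimal sketch of such a listener, assuming a hypothetical run_status table and placeholder JDBC connection details (adapt these to your schema):
import java.sql.DriverManager
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Sketch only: the JDBC URL, credentials, and run_status table are placeholders.
class StatusListener(runId: String) extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    val conn = DriverManager.getConnection("jdbc:mariadb://dbhost:3306/jobs", "user", "password")
    try {
      // onApplicationEnd also fires on normal completion, so only flip runs
      // still marked RUNNING to FAILED; a successful run should have
      // committed its status before the context stopped.
      val stmt = conn.prepareStatement(
        "UPDATE run_status SET status = 'FAILED' WHERE run_id = ? AND status = 'RUNNING'")
      stmt.setString(1, runId)
      stmt.executeUpdate()
    } finally {
      conn.close()
    }
  }
}
Bear in mind that "yarn application -kill" may tear the JVM down before the listener bus drains, so treat this as best-effort.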

Related

Flink job can't use savepoint in a batch job

Let me start in a generic fashion to see if I somehow missed some concepts: I have a streaming Flink job from which I created a savepoint. A simplified version of this job looks like this
Pseudo-code:
val flink = StreamExecutionEnvironment.getExecutionEnvironment
val stream = if (batchMode) {
  flink.readFile(path)
} else {
  flink.addKafkaSource(topicName)
}
val processed = stream
  .keyBy(key)
  .process(new ProcessorWithKeyedState())
CassandraSink.addSink(processed)
This works fine as long as I run the job without a savepoint. If I start the job from a savepoint, I get an exception that looks like this:
Caused by: java.lang.UnsupportedOperationException: Checkpoints are not supported in a single key state backend
at org.apache.flink.streaming.api.operators.sorted.state.NonCheckpointingStorageAccess.resolveCheckpoint(NonCheckpointingStorageAccess.java:43)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1623)
at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
I could work around this if I set the option:
execution.batch-state-backend.enabled: false
but this eventually results in another error:
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:673)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:526)
Of course I tried to set the config key taskmanager.memory.managed.consumer-weights (I used DATAPROC:70,PYTHON:30), but this doesn't seem to have any effect.
So I wonder if I have a conceptual error and can't reuse savepoints from a streaming job in a batch job, or if I simply have a problem in my configuration. Any hints?
After a hint from the Flink user group, it turned out that it is NOT possible to reuse a savepoint from the streaming job (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/execution_mode/#state-backends--state). So instead of running the job in batch mode (flink.setRuntimeMode(RuntimeExecutionMode.BATCH)), I just run it in the default execution mode (STREAMING). This has the minor downside that it will run forever and has to be stopped by someone once all data has been processed.
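A minimal sketch of that workaround, assuming the same environment setup as the pseudo-code above:
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.scala._

val flink = StreamExecutionEnvironment.getExecutionEnvironment
// Keep the default STREAMING mode rather than BATCH: the batch runtime uses a
// non-checkpointing single-key state backend that cannot restore a savepoint.
flink.setRuntimeMode(RuntimeExecutionMode.STREAMING)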

Why is a streaming query still up and running after StreamingQueryManager.awaitAnyTermination?

I want to terminate the Spark mapping after a specific time. I'm using sqlContext.streams.awaitAnyTermination(timeoutMs) for that, but the mapping does not stop after the given timeout.
I have tried reading from Azure Event Hub and provided 2 minutes (120000 ms) as the timeout for the awaitAnyTermination method, but the mapping does not stop on the Azure Databricks cluster.
Below is my code: I read from Azure Event Hub, write to the console, and pass 120000 ms to awaitAnyTermination.
import org.apache.spark.eventhubs._
import org.apache.spark.eventhubs.ConnectionStringBuilder

// Event hub configurations -- replace values below with yours
val connStr = ConnectionStringBuilder()
  .setNamespaceName("iisqaeventhub")
  .setEventHubName("devsource")
  .setSasKeyName("RootManageSharedAccessKey")
  .setSasKey("saskey")
  .build
val customEventhubParameters = EventHubsConf(connStr)
  .setMaxEventsPerTrigger(5)
  .setStartingPosition(EventPosition.fromEndOfStream)

// reading from the Azure event hub
val incomingStream = spark.readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()

// write to console
val query = incomingStream.writeStream
  .outputMode("append")
  .format("console")
  .start()

// awaitAnyTermination for shutting down the query
sqlContext.streams.awaitAnyTermination(120000)
I expect the mapping to end after the timeout. There is no error, but the mapping does not stop.
tl;dr Works as designed.
From the official documentation:
awaitAnyTermination(timeoutMs: Long): Boolean
Returns whether any query has terminated or not (multiple may have terminated).
In other words, no streaming query is going to be terminated at any point in time (before or after timeoutMs expires) unless there is an exception or the query is explicitly stopped.
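If the goal is nevertheless to end the mapping after the timeout, one option (a sketch, not part of the original answer) is to stop the active queries explicitly once awaitAnyTermination returns false:
// Returns true if any query terminated within the timeout, false otherwise.
val anyTerminated = spark.streams.awaitAnyTermination(120000)
if (!anyTerminated) {
  // Nothing terminated on its own, so stop the remaining active queries.
  spark.streams.active.foreach(_.stop())
}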
When using Databricks and prototyping, this is what I use to stop Spark Structured Streaming apps from a separate notebook pane:
import org.apache.spark.streaming._
StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }

Is FAIR available for Spark Standalone cluster mode?

I have a 2-node cluster with the Spark standalone cluster manager. I trigger more than one job using the same sc with Scala multithreading. What I found is that my jobs are scheduled one after another because of the scheduler's FIFO nature, so I tried to use FAIR scheduling:
conf.set("spark.scheduler.mode", "FAIR")
conf.set("spark.scheduler.allocation.file", sys.env("SPARK_HOME") + "/conf/fairscheduler.xml")
val job1 = Future {
val job = new Job1()
job.run()
}
val job2 =Future {
val job = new Job2()
job.run()
}
class Job1{
def run()
sc.setLocalProperty("spark.scheduler.pool", "mypool1")
}
}
class Job2{
def run()
sc.setLocalProperty("spark.scheduler.pool", "mypool2")
}
}
<pool name="mypool1">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="mypool2">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
Job1 and Job2 will be triggered from a launcher class. Even after setting these properties, my jobs are handled in FIFO order.
Is FAIR available for Spark standalone cluster mode? Is there a page where it's described in more detail? I can't seem to find much about FAIR and standalone in Job Scheduling. I'm following this SO question. Am I missing anything here?
I don't think standalone is the problem. You described creating only one pool, so I think your problem is that you need at least one more pool, and to assign each job to a different pool.
FAIR scheduling is done across pools; anything within the same pool will run in FIFO mode anyway.
This is based on the documentation here:
https://spark.apache.org/docs/latest/job-scheduling.html#default-behavior-of-pools
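Also note that spark.scheduler.pool is a thread-local property (set via sc.setLocalProperty), so it must be set inside the thread that actually triggers the job's actions. A sketch of the launcher with each job assigned to its own pool, based on the snippet in the question:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// spark.scheduler.pool is thread-local: each Future's thread sets its own
// pool before running any Spark actions.
val job1 = Future {
  sc.setLocalProperty("spark.scheduler.pool", "mypool1")
  new Job1().run()
}
val job2 = Future {
  sc.setLocalProperty("spark.scheduler.pool", "mypool2")
  new Job2().run()
}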

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write-ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, due to what looks like data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are testing is a driver failure scenario. For this to work, depending on the cluster you are running on, you have to follow the instructions below to monitor the driver process and relaunch it if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to a non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
You also need to configure write-ahead logs as below; there are special instructions for S3 that you need to follow:
While using S3 (or any file system that does not support flushing) for write-ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See the Spark Streaming configuration for more details.
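A sketch of that configuration in code, assuming an S3-backed (non-flushing) file system for the logs:
import org.apache.spark.SparkConf

// Write-ahead log settings for file systems, such as S3, that do not
// support flushing.
val conf = new SparkConf()
  .setAppName("mything")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")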
The issue looks more like a Kryo serializer problem than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured, and since it is not configured, the KryoException should not be possible.
When using write-ahead logs and restoring from a checkpoint directory, all Spark configuration is read from there; in your example, the createContext method is not called when starting from the checkpoint.
I assume the issue is that another application, with the Kryo serializer configured, was tested before with the same checkpoint directory, and the current application fails to restore from that checkpoint.

Spark streaming context hangs on stop

I am trying to write a Spark Streaming program in which I want to shut down my application gracefully when it receives a shutdown signal. I wrote the following snippet to accomplish this:
sys.ShutdownHookThread {
  println("Gracefully stopping MyStreamJob")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Streaming stopped")
  sys.exit(0)
}
When this hook runs, only the first println is executed; the second println, Streaming stopped, is never seen. The last messages I receive on the console are:
39790 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming,null}
39791 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming/batch,null}
39792 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
15/10/19 19:59:43 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
I am using Spark 1.4.1. I have to kill the job manually with kill -9 for Spark to end. Is this the intended behaviour, or am I doing something wrong?
Spark added its own call to stop the StreamingContext; see this email thread.
Your code would have worked prior to 1.4, but now it hangs, as you are experiencing. You can simply remove your hook and the graceful shutdown will happen automatically.
You can now use the following configuration parameter to specify whether the shutdown should be graceful:
spark.streaming.stopGracefullyOnShutdown
The SparkContext will be stopped after the graceful shutdown. See:
"Do not stop SparkContext, let its own shutdown hook stop it"