Spark StreamingContext hangs on stop - Scala

I am trying to write a Spark Streaming program where I want to shut down my application gracefully when it receives a shutdown signal. I wrote the following snippet to accomplish this:
sys.ShutdownHookThread {
  println("Gracefully stopping MyStreamJob")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Streaming stopped")
  sys.exit(0)
}
When this code runs, only the first println is executed; the second println ("Streaming stopped") is never seen. The last message I receive on the console is:
39790 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming,null}
39791 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming/batch,null}
39792 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
15/10/19 19:59:43 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
I am using Spark 1.4.1. I have to kill the job manually with kill -9 for Spark to end. Is this the intended behaviour, or am I doing something wrong?

Spark now registers its own shutdown hook that stops the StreamingContext. See this email thread.
Your code would have worked prior to 1.4; now it hangs, as you are experiencing. You can simply remove your hook and the shutdown will happen automatically.
You can now use the following configuration parameter to specify whether the shutdown should be graceful:
spark.streaming.stopGracefullyOnShutdown
The SparkContext will be stopped after the graceful shutdown. See:
"Do not stop SparkContext, let its own shutdown hook stop it"

Related

Flink job can't use savepoint in a batch job

Let me start in a generic fashion in case I have missed some concepts: I have a streaming Flink job from which I created a savepoint. A simplified version of this job looks like this
Pseudo-code:
val flink = StreamExecutionEnvironment.getExecutionEnvironment
val stream = if (batchMode) {
  flink.readFile(path)
} else {
  flink.addKafkaSource(topicName)
}
val processed = stream
  .keyBy(key)
  .process(new ProcessorWithKeyedState())
CassandraSink.addSink(processed)
This works fine as long as I run the job without a savepoint. If I start the job from a savepoint, I get an exception that looks like this:
Caused by: java.lang.UnsupportedOperationException: Checkpoints are not supported in a single key state backend
at org.apache.flink.streaming.api.operators.sorted.state.NonCheckpointingStorageAccess.resolveCheckpoint(NonCheckpointingStorageAccess.java:43)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1623)
at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
I could work around this if I set the option:
execution.batch-state-backend.enabled: false
but this eventually results in another error:
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:673)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:526)
Of course I tried to set the config key taskmanager.memory.managed.consumer-weights (used DATAPROC:70,PYTHON:30) but this doesn't seem to have any effect.
So I wonder whether I have a conceptual error and can't reuse savepoints from a streaming job in a batch job, or whether I simply have a problem in my configuration. Any hints?
After a hint from the Flink user group it turned out that it is NOT possible to reuse a savepoint from the streaming job (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/execution_mode/#state-backends--state). So instead of running the job in batch mode (flink.setRuntimeMode(RuntimeExecutionMode.BATCH)) I just run it in the default execution mode (STREAMING). This has the minor downside that it runs forever and has to be stopped by someone once all the data has been processed.
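As a minimal sketch of that workaround (assuming a Flink version with the runtime-mode API, 1.12+):
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.scala._

// Keep the default STREAMING runtime mode so the savepoint's state backend stays compatible.
val flink = StreamExecutionEnvironment.getExecutionEnvironment
flink.setRuntimeMode(RuntimeExecutionMode.STREAMING) // i.e. do not switch to BATCH when restoring from the savepoint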

Custom event handling for KafkaAdminClient

My goal is to do something when the broker is down, but I couldn't manage to do it.
The code is simple:
val properties = new Properties()
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val client = AdminClient.create(properties)
//Suppose that the App just runs from here without consuming/producing
It starts up, then I manually shut down Kafka.
Logs arrive:
2021-06-23T13:51:16,681+02:00 WARN [kafka-admin-client-thread | adminclient-1] org.apache.kafka.clients.NetworkClient: [AdminClient clientId=adminclient-1] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
How can I handle this? Basically I just want to invoke a custom method when the broker is down.
There is nothing I can 'catch',
and I couldn't even find an event listener in AdminClient/KafkaAdminClient (or I am just looking in the wrong place).
Edit: and of course I would also like to invoke my custom code when the broker comes back to life.
You can't issue a command to a server that isn't running... You would need to run a check for the Kafka Java process on the broker server itself that does not use Kafka-related tools (e.g. jps, systemctl, or checking some /var/run/kafka.pid).
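If you still want to react from the client side, one pattern is to probe the cluster periodically and call your own handlers on a state change. This is only a sketch: onBrokerDown/onBrokerUp are hypothetical names, and the timeouts are arbitrary.
import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import scala.util.Try

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val client = AdminClient.create(props)

def onBrokerDown(): Unit = println("broker down")  // hypothetical handler
def onBrokerUp(): Unit = println("broker back up") // hypothetical handler

var wasUp = true
while (true) {
  // describeCluster() times out when no broker is reachable
  val isUp = Try(client.describeCluster().nodes().get(3, TimeUnit.SECONDS)).isSuccess
  if (isUp && !wasUp) onBrokerUp()
  if (!isUp && wasUp) onBrokerDown()
  wasUp = isUp
  Thread.sleep(5000)
}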

Shutdown Hook for spark batch application

I have a Spark Scala batch application. It commits the run status to MariaDB when it completes or fails.
I want to handle an edge case: when the application is killed, say by "yarn application -kill [appid]", I want to update the status as failed in the MariaDB table.
I planned to use "ShutdownHookManager" for this, but I see it is private in Spark, and the Scala sys.ShutdownHookThread does not work either.
Can somebody guide me on shutdown hook handling for a killed Spark batch application? There are not many resources on this.
You can create a custom SparkListener that reacts to the onApplicationEnd event:
class MyListener extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println("Shutting down...")
  }
}
This listener can then be added to the SparkContext:
spark.sparkContext.addSparkListener(new MyListener())
When the Spark application terminates, the string Shutting down... is printed on the console.
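To cover the original requirement of marking the run as failed in MariaDB, the listener could issue the update itself. A rough sketch, assuming a hypothetical run_status table and that a successful run writes its SUCCESS status at the end of main before this fires:
import java.sql.DriverManager
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

class StatusListener(jdbcUrl: String, user: String, password: String, runId: String) extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    // Mark the run as failed in the (hypothetical) run_status table.
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val stmt = conn.prepareStatement("UPDATE run_status SET status = 'FAILED' WHERE run_id = ?")
      stmt.setString(1, runId)
      stmt.executeUpdate()
    } finally {
      conn.close()
    }
  }
}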

Unable to submit Pyspark code via ExecuteSparkInteractive processor in Apache NiFi

I am new to Python and the Apache ecosystem. I am trying to submit PySpark code via the ExecuteSparkInteractive processor in Apache NiFi. I do not have detailed knowledge of any of the components being used here; I am only Googling and using trial and error.
This way I have successfully configured and started Spark, NiFi, and Livy on EMR, and I am able to submit PySpark code via Livy in an interactive session.
However, nothing happens when I configure ExecuteSparkInteractive to submit PySpark code via Livy. The Livy session manager shows nothing, and there are no errors visible in the ExecuteSparkInteractive processor.
This is my configuration for LivySessionController:
This is the sample code I submit under properties in ExecuteSparkInteractive.
import random
from pyspark import SparkConf, SparkContext

# create SparkContext using standalone mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Here is the code that works for me in an interactive session:
import json, pprint, requests, textwrap

host = 'http://localhost:8998'
data = {'kind': 'pyspark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)

# Get the session URL
session_url = host + r.headers['Location']
sn_r = requests.get(session_url, headers=headers)
statements_url = session_url + '/statements'

data = {
    'code': textwrap.dedent("""
        import random
        from pyspark import SparkConf, SparkContext
        # create SparkContext using standalone mode
        conf = SparkConf().setMaster("local").setAppName("SimpleETL")
        sc = SparkContext.getOrCreate(conf)
        NUM_SAMPLES = 100000
        def sample(p):
            x, y = random.random(), random.random()
            return 1 if x*x + y*y < 1 else 0
        count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
        print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
        """)
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
These are the log excerpts from nifi-app.log:
#After starting the processor
2018-07-18 06:38:11,768 INFO [NiFi Web Server-112] o.a.n.c.s.StandardProcessScheduler Starting ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:38:11,770 INFO [Monitor Processore Lifecycle Thread-1] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run with 1 threads
2018-07-18 06:38:11,883 INFO [Flow Service Tasks Thread-1] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController#36fb0996 // Another save pending = false
2018-07-18 06:38:57,106 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#12830e23 checkpointed with 0 Records and 0 Swap Files in 7 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID -1
#After stopping the processor
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.c.s.StandardProcessScheduler Stopping ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.livy.ExecuteSparkInteractive
2018-07-18 06:39:09,838 INFO [Timer-Driven Process Thread-9] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run
2018-07-18 06:39:09,917 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController#36fb0996 // Another save pending = false
Interestingly, when I enable LivySessionController in NiFi, the Livy UI shows two new sessions - the one created first shows in "idle" state, while the latter (the one with the greater Session Id) keeps showing in the "starting" state even after several refreshes. Let's give them Session Ids 1 and 2, respectively. Interestingly, Session Id 2 changes state from "starting" to "shutting_down" to "dead". As soon as it is dead, a new session (Session Id 3) is created with state "starting", which later becomes "idle". Below are log excerpts from these 3 sessions:
#Livy 1st session:
18/07/18 06:33:58 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
18/07/18 06:33:58 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-84-145.ec2.internal:4040
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Shutting down all executors
18/07/18 06:33:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/18 06:33:58 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Stopped
18/07/18 06:33:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/18 06:33:59 INFO MemoryStore: MemoryStore cleared
18/07/18 06:33:59 INFO BlockManager: BlockManager stopped
18/07/18 06:33:59 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/18 06:33:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/18 06:33:59 INFO SparkContext: Successfully stopped SparkContext
#Livy 2nd session:
18/07/18 06:34:30 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
#Livy 3rd session:
18/07/18 06:36:15 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
A few things here:
Livy session controller:
Make sure you see 2 sessions per node when you enable the controller service, and both sessions must be in running state on the Spark UI (but not performing any operation until Python code runs from NiFi).
If you see unusual behavior, then focus on getting that fixed first.
A possible action: add a StandardSSLContextService controller and set up a keystore and truststore, and use it in LivySessionController (under the property SSL Context Service).
Within the Python code:
I think you don't have to import SparkConf or SparkContext, and you don't need to create conf and sc either. You only need to import SparkSession as below:
from pyspark.sql import SparkSession
and you can simply use spark (it's available by default as the Spark session variable),
e.g. spark.sql(""" ...sql-statement... """) or spark.sparkContext for sc.
The last thing you mentioned: "Livy session manager shows nothing, and there are no errors visible in ExecuteSparkInteractive processor."
For this you can add a dummy processor like UpdateAttribute after the ExecuteSparkInteractive processor and keep it in disabled mode. You also have to direct the output of the ExecuteSparkInteractive processor to UpdateAttribute in all 3 states (success, failure, wait). This way you will be able to see what the outcome is after the PySpark code runs within NiFi. Refer to the diagram below for a sample.
I hope this helps you fix your issues.
Sample Nifi template to test PySpark code

Kafka Stream Startup Issue - org.apache.kafka.streams.errors.LockException

I have a Kafka Streams application (version 0.11) which takes data from a few topics, joins the data, and puts it in another topic.
Kafka Configuration:
5 Kafka brokers - version 0.11
Kafka topics - 15 partitions and replication factor 3.
A few million records are consumed/produced every hour. Whenever I take any Kafka broker down, it throws the exception below:
org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:99)
at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:80)
at org.apache.kafka.streams.processor.internals.StandbyTask.<init>(StandbyTask.java:62)
at org.apache.kafka.streams.processor.internals.StreamThread.createStandbyTask(StreamThread.java:1325)
at org.apache.kafka.streams.processor.internals.StreamThread.access$2400(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$StandbyTaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:254)
at org.apache.kafka.streams.processor.internals.StreamThread.addStandbyTasks(StreamThread.java:1366)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1200(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsAssigned(StreamThread.java:185)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:265)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:363)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
I have read in a few JIRA issues that cleaning up the streams might help fix the issue. But is cleaning up the streams every time we start the Kafka Streams application a proper solution or just a patch? Also, the stream cleanup will delay the application startup, right?
Note: Do I need to call streams.cleanUp() before calling streams.start() each time I start the Kafka Streams application?
Seeing an org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10 is actually expected and should resolve itself. The thread will back off, wait until another thread releases the lock, and retry later. Thus, you might even see this WARN message in the logs multiple times in case the retry happens before the second thread has released the lock.
However, eventually the lock should be released by the second thread, and the first thread will be able to acquire it. Afterwards, Streams should just move forward. Note that it's a WARN message and not an error.
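For completeness, regarding the note above: if you ever do want the full local-state reset (e.g. while developing), the call order is cleanUp() before start(). This is only a reference sketch against a recent Kafka Streams API (the 0.11 builder classes differ) and is not needed for the LockException discussed here:
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, Topology}

// Reference only: wipes the local state directory for this application.id before starting,
// forcing state to be restored from the changelog topics.
def startWithReset(topology: Topology, props: Properties): KafkaStreams = {
  val streams = new KafkaStreams(topology, props)
  streams.cleanUp() // must be called before start()
  streams.start()
  streams
}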