Unable to submit PySpark code via ExecuteSparkInteractive processor in Apache NiFi

I am new to Python and the Apache ecosystem. I am trying to submit PySpark code via the ExecuteSparkInteractive processor in Apache NiFi. I do not have detailed knowledge of any of the components being used here; I am working from Google searches and trial and error.
This way I have successfully configured and started Spark, NiFi, and Livy on EMR, and I am able to submit PySpark code via Livy in an interactive session.
However, nothing happens when I configure ExecuteSparkInteractive to submit PySpark code via Livy. The Livy session manager shows nothing, and there are no errors visible in the ExecuteSparkInteractive processor.
This is my configuration for LivySessionController:
This is the sample code I submit under Properties in ExecuteSparkInteractive:
import random
from pyspark import SparkConf, SparkContext

# create SparkContext using standalone mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Here is the code that works for me in interactive session:
import json, pprint, requests, textwrap
host = 'http://localhost:8998'
data = {'kind': 'pyspark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
#Get the session URL
session_url = host + r.headers['Location']
sn_r = requests.get(session_url, headers=headers)
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""
        import random
        from pyspark import SparkConf, SparkContext

        # create SparkContext using standalone mode
        conf = SparkConf().setMaster("local").setAppName("SimpleETL")
        sc = SparkContext.getOrCreate(conf)

        NUM_SAMPLES = 100000

        def sample(p):
            x, y = random.random(), random.random()
            return 1 if x*x + y*y < 1 else 0

        count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
        print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
        """)
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
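Note that this final POST only queues the statement; to actually see the result, I poll the statement URL until Livy reports it has finished. A minimal sketch, continuing the variables above (field names per the Livy REST API; the exact statement states can vary slightly between Livy versions):
import time

# The POST above only queues the statement; poll it until Livy finishes running it.
statement_url = statements_url + '/' + str(r.json()['id'])
while True:
    st = requests.get(statement_url, headers=headers).json()
    if st['state'] not in ('waiting', 'running'):
        break
    time.sleep(1)

pprint.pprint(st['output'])  # stdout of the snippet, including the Pi estimate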
These are the log excerpts from nifi-app.log:
#After starting the processor
2018-07-18 06:38:11,768 INFO [NiFi Web Server-112] o.a.n.c.s.StandardProcessScheduler Starting ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:38:11,770 INFO [Monitor Processore Lifecycle Thread-1] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run with 1 threads
2018-07-18 06:38:11,883 INFO [Flow Service Tasks Thread-1] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
2018-07-18 06:38:57,106 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@12830e23 checkpointed with 0 Records and 0 Swap Files in 7 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID -1
#After stopping the processor
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.c.s.StandardProcessScheduler Stopping ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.livy.ExecuteSparkInteractive
2018-07-18 06:39:09,838 INFO [Timer-Driven Process Thread-9] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run
2018-07-18 06:39:09,917 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
Interestingly, when I enable LivySessionController in NiFi, the Livy UI shows two new sessions: the one created first reaches the "idle" state, while the latter (the one with the greater Session Id) keeps showing the "starting" state even after several refreshes. Let's give them Session Ids 1 and 2, respectively. Session Id 2 then changes state from "starting" to "shutting_down" to "dead". As soon as it is dead, a new session (Session Id 3) is created in the "starting" state, which later becomes "idle". Below are log excerpts from these 3 sessions:
#Livy 1st session:
18/07/18 06:33:58 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
18/07/18 06:33:58 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-84-145.ec2.internal:4040
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Shutting down all executors
18/07/18 06:33:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/18 06:33:58 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Stopped
18/07/18 06:33:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/18 06:33:59 INFO MemoryStore: MemoryStore cleared
18/07/18 06:33:59 INFO BlockManager: BlockManager stopped
18/07/18 06:33:59 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/18 06:33:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/18 06:33:59 INFO SparkContext: Successfully stopped SparkContext
#Livy 2nd session:
18/07/18 06:34:30 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
#Livy 3rd session:
18/07/18 06:36:15 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

A few things here -
Livy session controller:
Make sure you see 2 sessions per node when you enable the controller service, and both sessions must be in the running state on the Spark UI (but not performing any operation until the Python code runs from NiFi).
If you see unusual behavior, focus on getting that fixed first.
Possible action - add a StandardSSLContextService controller and set up a keystore and truststore, then use the same in LivySessionController (under the property: SSL Context Service).
Within the Python code:
I think you don't have to import SparkConf or SparkContext, and you don't need to create conf and sc either. You only need to import SparkSession, as below -
from pyspark.sql import SparkSession
and you can simply use spark (it's available by default as the Spark session variable),
e.g. spark.sql(""" ...sql-statement... """), or spark.sparkContext for sc.
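Put differently, the snippet submitted through ExecuteSparkInteractive could shrink to something like this (a sketch, assuming Livy's default pyspark session, where sc and spark are already bound):
import random

NUM_SAMPLES = 100000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

# sc comes from the Livy session itself; no SparkConf/SparkContext needed
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)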
The last thing: you mentioned "Livy session manager shows nothing, and there are no errors visible in ExecuteSparkInteractive processor."
For this you can add a dummy processor like UpdateAttribute after the ExecuteSparkInteractive processor and keep it in disabled mode. You also have to direct the output from the Spark interactive processor to UpdateAttribute for all 3 relationships (success, failure, wait). This way you will be able to see what the outcome is after the PySpark code runs within NiFi. Refer to the diagram below for a sample.
I hope this will help you fix your issues.
Sample NiFi template to test PySpark code

Related

Hudi: Access to timeserver times out in embedded mode

I am testing Hudi 0.5.3 (supported by AWS Athena) by running it with Spark in embedded mode, i.e. with unit tests. At first the test succeeded, but now it's failing due to a timeout when accessing Hudi's timeserver.
The following is based on the Hudi: Getting Started guide.
Spark Session setup:
private val spark = addSparkConfigs(SparkSession.builder()
    .appName("spark testing")
    .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()
Code which causes timeout exception:
val inserts = convertToStringList(dataGen.generateInserts(10))
var df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
The timeout and the exception thrown:
170762 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating remote view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661. Server=xxx:59520
170766 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating InMemory based view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661
170769 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://xxx:59520/v1/hoodie/view/datafiles/beforeoron/latest/?partition=americas%2Funited_states%2Fsan_francisco&maxinstant=20201221180946&basepath=%2Fvar%2Ffolders%2Fz9%2F_9mf84p97hz1n45b0gnpxlj40000gp%2FT%2FHudiQuickStartSpec-hudi_trips_cow2193648737745630661&lastinstantts=20201221180946&timelinehash=70f7aa073fa3d86033278a59cbda71c6488f4883570d826663ebb51934a25abf)
246649 [Executor task launch worker for task 47] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesFromParams(RemoteHoodieTableFileSystemView.java:223)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesBeforeOrOn(RemoteHoodieTableFileSystemView.java:230)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:134)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$19c2c1bb$1(HoodieBloomIndex.java:201)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
I wasn't able to experiment with different port settings for the Hudi timeserver, as I wasn't able to find the config setting that controls the port.
Any ideas why access to the timeserver times out?
The problem turned out to be rooted in the way Hudi resolves the Spark driver host. It seems that although it starts and binds its web server to localhost, Hudi's client subsequently uses the IP address to make calls to the server it started.
5240 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Starting Javalin ...
5348 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Listening on http://localhost:59520/
...
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
The solution is to configure the "spark.driver.host" setting explicitly. The following worked for me:
private val spark = addSparkConfigs(SparkSession.builder()
    .appName("spark testing")
    .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.driver.host", "localhost")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()

Spark worker nodes timeout

When I run my Spark app using sbt run, with the configuration pointing to the master of a remote cluster, nothing useful gets executed by the workers, and the following warning is printed repeatedly in the sbt run log.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This is what my Spark config looks like:
@transient lazy val conf: SparkConf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "12g")

@transient lazy val sc: SparkContext = new SparkContext(conf)

val lines = sc.textFile("hdfs://master-public-dns:9000/test/1000.csv")
I know this warning usually appears when the cluster is misconfigured and the workers either don't have the resources or aren't started in the first place. However, according to my Spark UI (on master-ip:8080), the worker nodes seem to be alive with sufficient RAM and CPU cores; they even try to execute my app, but they exit and leave this in the stderr log:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled;
users with view permissions: Set(ubuntu, myuser);
groups with view permissions: Set(); users with modify permissions: Set(ubuntu, myuser); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
... 8 more
ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Any ideas?
Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
Could you telnet to this port on this IP from the worker? Maybe your driver machine has multiple network interfaces; try setting SPARK_LOCAL_IP in $SPARK_HOME/conf/spark-env.sh.
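For example (a sketch; the address below is taken from the error message, use whichever interface the workers can actually reach):
# in $SPARK_HOME/conf/spark-env.sh on the driver machine
export SPARK_LOCAL_IP=192.168.0.11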

SparkException: Matlab worker did not connect back in time

I am using the following code from the MATLAB documentation of the matlab.compiler.mlspark.RDD class.
%% Connect to Spark
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf('AppName','myApp', ...
    'Master','local[1]','SparkProperties',sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

%% flatMap
inRDD = sc.parallelize({'A','B'});
flatRDD = inRDD.flatMap(@(x)({x,1}));
viewRes = flatRDD.collect()

%% Delete Spark Context
delete(sc)
When I execute the code, I am able to connect to Spark using MATLAB, and the MATLAB worker also starts. As soon as the MATLAB worker starts, I get the following exception and the worker shuts down.
LOGS:
17/03/15 21:20:02 INFO MatlabWorkerFactory: Matlab worker factory create simple worker
17/03/15 21:20:02 INFO MatlabWorkerFactory: Launching MATLAB
17/03/15 21:20:02 INFO MatlabWorkerFactory: Matlab worker process started
17/03/15 21:21:02 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Matlab worker did not connect back in time
Looks like a configuration issue to me. Having said that, I am very new to both Spark and MATLAB, so I would appreciate any help with it.

Spark Streaming gets stopped without errors after ~1 minute

When I run spark-submit for the Spark Streaming job, I can see that it runs for approximately 1 minute, and then it is stopped with the final status SUCCEEDED:
16/11/16 18:58:16 INFO yarn.Client: Application report for application_XXXX_XXX (state: RUNNING)
16/11/16 18:58:17 INFO yarn.Client: Application report for application_XXXX_XXX (state: FINISHED)
I don't understand why it gets stopped, when I expect it to run indefinitely, triggered by messages received from the Kafka queue. In the logs I can see all the println outputs, and there are no errors.
This is a short extract from the code:
val conf = new SparkConf().setAppName("MYTEST")
val sc = new SparkContext(conf)
sc.setCheckpointDir("~/checkpointDir")
val ssc = new StreamingContext(sc, Seconds(batch_interval_seconds))
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
println("Dividing the topic into partitions.")
val inputKafkaTopicMap = inputKafkaTopic.split(",").map((_, kafkaNumThreads)).toMap
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, inputKafkaTopicMap).map(_._2)
messages.foreachRDD(msg => {
  msg.foreach(s => {
    if (s != null) {
      //val result = ... processing goes here
      //println(result)
    }
  })
})
// Start the streaming context in the background.
ssc.start()
This is my spark-submit command:
/usr/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --num-executors 2 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:+AlwaysPreTouch" --class org.test.StreamingRunner test.jar param1 param2
When I open the Resource Manager, I see that no job is RUNNING and the Spark Streaming job is marked as FINISHED.
Your code is missing a call to ssc.awaitTermination to block the driver thread.
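For example, the end of the job should read:
ssc.start()
ssc.awaitTermination()
so that the driver blocks until the streaming job is stopped or fails.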
Unfortunately there's no easy way to see the printouts from inside your map function on the console, since those function calls are happening inside YARN executors. Cloudera Manager provides a decent look at the logs though, and if you really need them collected on the driver you can write to a location in HDFS and then scrape the various logs from there yourself. If the information that you want to track is purely numeric you might also consider using an Accumulator.
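For instance, counting non-null messages with an accumulator (a sketch in PySpark for brevity; the Scala API is analogous, and the names here are hypothetical):
processed = sc.accumulator(0)

def count_batch(rdd):
    # updates from the executors are merged back into the accumulator
    rdd.foreach(lambda s: processed.add(1) if s is not None else None)
    print("messages processed so far: %d" % processed.value)  # read on the driver

messages.foreachRDD(count_batch)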

Spark streaming context hangs on stop

I am trying to write a Spark Streaming program where I want to gracefully shut down my application in case it receives a shutdown hook. I wrote the following snippet to accomplish this:
sys.ShutdownHookThread {
  println("Gracefully stopping MyStreamJob")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Streaming stopped")
  sys.exit(0)
}
On calling this code, only the first println is executed; the second println, Streaming stopped, is never seen. The last message I receive on the console is:
39790 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming,null}
39791 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/streaming/batch,null}
39792 [shutdownHook1] INFO org.spark-project.jetty.server.handler.ContextHandler - stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
15/10/19 19:59:43 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/streaming,null}
I am using Spark 1.4.1. I have to manually kill the job using kill -9 for Spark to end. Is this the intended behaviour, or am I doing something wrong?
Spark added its own shutdown hook to stop the StreamingContext. See this email thread.
Your code would have worked prior to 1.4; now it will hang, as you are experiencing. You can simply remove your hook, and the graceful shutdown should happen automatically.
You can now use the following configuration parameter to specify if the shutdown should be graceful:
spark.streaming.stopGracefullyOnShutdown
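For example, it can be passed at submit time, keeping the rest of the spark-submit arguments unchanged:
spark-submit --conf spark.streaming.stopGracefullyOnShutdown=true ...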
The SparkContext will be stopped after the graceful shutdown. See:
"Do not stop SparkContext, let its own shutdown hook stop it"