What does one enter on the command line to run spark in a bokeh serve app? Do I simply separate the two command line entries by &&? - pyspark

My effort does not work:
/usr/local/spark/spark-2.3.2-bin-hadoop2.7/bin/spark-submit --driver-memory 6g --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 runspark.py && bokeh serve --show bokeh_app
runspark.py contains the instantiation of Spark, and bokeh_app is the folder of the Bokeh server app. Spark is being used to update a streaming dask dataframe.
WHAT HAPPENS:
The Spark instance starts running and loads as it normally would without the Bokeh server. However, as soon as the Bokeh server app kicks in (i.e., the web page opens), the Spark instance shuts down. No errors appear in the console output.
OUTPUT BELOW:
2018-11-26 21:04:05 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f0492c9{/static/sql,null,AVAILABLE,@Spark}
2018-11-26 21:04:06 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-11-26 21:04:06 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-11-26 21:04:06 INFO AbstractConnector:318 - Stopped Spark@4f3c4272{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2018-11-26 21:04:06 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.1.25:4041
2018-11-26 21:04:06 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-11-26 21:04:06 INFO MemoryStore:54 - MemoryStore cleared
2018-11-26 21:04:06 INFO BlockManager:54 - BlockManager stopped
2018-11-26 21:04:06 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-11-26 21:04:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-11-26 21:04:07 INFO SparkContext:54 - Successfully stopped SparkContext
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Shutdown hook called
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-c42ce0b3-d49e-48ce-962c-277b42166267
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f/pyspark-117d2a10-7cb9-4eb3-b4d0-f92f9046522c
2018-11-26 21:04:08,542 Starting Bokeh server version 0.13.0 (running on Tornado 5.1.1)
2018-11-26 21:04:08,547 Bokeh app running at: http://localhost:5006/aion_analytics
2018-11-26 21:04:08,547 Starting Bokeh server with process id: 10769

OK, I found the answer. The idea is simply to embed the Bokeh server in the PySpark code instead of running the Bokeh server from the command line. Submit the PySpark script with spark-submit as normal.
https://github.com/bokeh/bokeh/blob/1.0.1/examples/howto/server_embed/standalone_embed.py
I did exactly what is shown in the link above.
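A minimal sketch of that approach, modeled on the linked standalone_embed.py example. It helps to remember that in a shell, A && B starts B only after A has exited successfully, so a separately launched bokeh serve cannot share a live Spark session with runspark.py. The function names, the app name, and the toy spark.range query are placeholders rather than anything from the original app, and the sketch assumes bokeh and pyspark are installed.

```python
def make_bokeh_app(spark):
    """Build a Bokeh app function that closes over the live SparkSession."""

    def bkapp(doc):
        # Imported lazily so this module still loads where Bokeh is absent.
        from bokeh.models import Div

        # Placeholder query; the real app would read its streaming data here.
        row_count = spark.range(100).count()
        doc.add_root(Div(text=f"Rows counted by Spark: {row_count}"))

    return bkapp


def run_spark_with_bokeh(port=5006):
    """Start Spark, then serve the Bokeh app from the same driver process."""
    from bokeh.server.server import Server
    from pyspark.sql import SparkSession

    # Hypothetical app name; any SparkSession configuration would go here.
    spark = SparkSession.builder.appName("bokeh_embed_sketch").getOrCreate()
    server = Server({"/": make_bokeh_app(spark)}, port=port, num_procs=1)
    server.start()
    # Open a browser tab once the event loop is running.
    server.io_loop.add_callback(server.show, "/")
    try:
        # Blocks here, which keeps the driver (and therefore Spark) alive.
        server.io_loop.start()
    finally:
        spark.stop()
```

To try it, call run_spark_with_bokeh() at the end of runspark.py and launch with the same spark-submit command as before, without the trailing && bokeh serve part.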

Related

zeppelin spark context closed after one paragraph

I have a notebook in Zeppelin containing multiple paragraphs which were running fine earlier; suddenly, after a cluster restart, it has started behaving weirdly.
The first paragraph runs fine while anything that runs afterwards says Connection Refused.
I checked the log file zeppelin-interpreter-spark-root-mn.log in the $ZEPPELIN_HOME/logs folder (where mn is the machine name):
INFO [2018-02-21 21:42:43,301] ({dispatcher-event-loop-15} Logging.scala[logInfo]:54) - Removed broadcast_12_piece0 on mn5:45284 in memory (size: 88.2 KB, free: 2004.5 MB)
INFO [2018-02-21 21:42:43,401] ({Thread-3} Logging.scala[logInfo]:54) - Invoking stop() from shutdown hook
INFO [2018-02-21 21:42:43,412] ({Thread-3} AbstractConnector.java[doStop]:310) - Stopped Spark@7de3e842{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
INFO [2018-02-21 21:42:43,416] ({Thread-3} Logging.scala[logInfo]:54) - Stopped Spark web UI at http://10.28.37.82:4040
INFO [2018-02-21 21:42:43,440] ({Yarn application state monitor} Logging.scala[logInfo]:54) - Interrupting monitor thread
INFO [2018-02-21 21:42:43,442] ({Thread-3} Logging.scala[logInfo]:54) - Shutting down all executors
INFO [2018-02-21 21:42:43,443] ({dispatcher-event-loop-4} Logging.scala[logInfo]:54) - Asking each executor to shut down
INFO [2018-02-21 21:42:43,447] ({Thread-3} Logging.scala[logInfo]:54) - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
INFO [2018-02-21 21:42:43,450] ({Thread-3} Logging.scala[logInfo]:54) - Stopped
INFO [2018-02-21 21:42:43,454] ({dispatcher-event-loop-9} Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped!
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - MemoryStore cleared
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - BlockManager stopped
INFO [2018-02-21 21:42:43,467] ({Thread-3} Logging.scala[logInfo]:54) - BlockManagerMaster stopped
INFO [2018-02-21 21:42:43,471] ({dispatcher-event-loop-0} Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped!
INFO [2018-02-21 21:42:43,472] ({Thread-3} Logging.scala[logInfo]:54) - Successfully stopped SparkContext
INFO [2018-02-21 21:42:43,473] ({Thread-3} Logging.scala[logInfo]:54) - Shutdown hook called
So the shutdown hook is getting called. I checked other posts on SO, but they didn't help. The logs are not much help either.
Do I need to tweak the code to add additional logging to fix this problem? Has anyone already faced and resolved the same issue?
It turned out to be a case of bad logging. I had checked the YARN logs as well but couldn't find anything. It turns out that the second paragraph threw a RuntimeException, which wasn't clear from any of the logs; when I ran the same command in spark-shell, I realized what the problem was and fixed it.
Run the Scala command in spark-shell, then see what exception it throws.

Hortonworks, Eclipse and Kerberos Client (Authentication, HOW?)

Hello everybody, we have a kerberized HDP (Hortonworks) cluster. We can run Spark jobs from spark-submit (CLI) and Talend Big Data, but not from Eclipse.
We have a Windows client machine where Eclipse is installed and the MIT Kerberos client for Windows is configured (TGT configuration). The goal is to run a Spark job from Eclipse. The Spark-related portion of the Java code works and has been tested via the CLI. The relevant part of the job's code is shown below.
private void setConfigurationProperties()
{
    sConfig.setAppName("abcd-name");
    sConfig.setMaster("yarn-client");
    sConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    // YARN and MapReduce endpoints
    sConfig.set("spark.hadoop.yarn.resourcemanager.address", "rs.abcd.com:8032");
    sConfig.set("spark.hadoop.yarn.resourcemanager.scheduler.address", "rs.abcd.com:8030");
    sConfig.set("spark.hadoop.mapreduce.jobhistory.address", "rs.abcd.com:10020");
    sConfig.set("spark.hadoop.yarn.app.mapreduce.am.staging-dir", "/dir");
    // Executor and application-master resources
    sConfig.set("spark.executor.memory", "2g");
    sConfig.set("spark.executor.cores", "4");
    sConfig.set("spark.executor.instances", "24");
    sConfig.set("spark.yarn.am.cores", "24");
    sConfig.set("spark.yarn.am.memory", "16g");
    sConfig.set("spark.eventLog.enabled", "true");
    sConfig.set("spark.eventLog.dir", "hdfs:///spark-history");
    sConfig.set("spark.shuffle.memoryFraction", "0.4");
    sConfig.set("spark.hadoop." + "mapreduce.application.framework.path", "/hdp/apps/version/mapreduce/mapreduce.tar.gz#mr-framework");
    sConfig.set("spark.local.dir", "/tmp");
    // Kerberos principals of the cluster services
    sConfig.set("spark.hadoop.yarn.resourcemanager.principal", "rm/_HOST@ABCD.COM");
    sConfig.set("spark.hadoop.mapreduce.jobhistory.principal", "jhs/_HOST@ABCD.COM");
    sConfig.set("spark.hadoop.dfs.namenode.kerberos.principal", "nn/_HOST@ABCD.COM");
    sConfig.set("spark.hadoop.fs.defaultFS", "hdfs://hdfs.abcd.com:8020");
    sConfig.set("spark.hadoop.dfs.client.use.datanode.hostname", "true");
}
When we run the code the following error pops up:
17/04/05 23:37:06 INFO Remoting: Starting remoting
17/04/05 23:37:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@1.1.1.1:54356]
17/04/05 23:37:06 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54356.
17/04/05 23:37:06 INFO SparkEnv: Registering MapOutputTracker
17/04/05 23:37:06 INFO SparkEnv: Registering BlockManagerMaster
17/04/05 23:37:06 INFO DiskBlockManager: Created local directory at C:\tmp\blockmgr-baee2441-1977-4410-b52f-4275ff35d6c1
17/04/05 23:37:06 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
17/04/05 23:37:06 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/05 23:37:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/04/05 23:37:07 INFO SparkUI: Started SparkUI at http://1.1.1.1:4040
17/04/05 23:37:07 INFO RMProxy: Connecting to ResourceManager at rs.abcd.com/1.1.1.1:8032
17/04/05 23:37:07 ERROR SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
17/04/05 23:37:07 INFO SparkUI: Stopped Spark web UI at http://1.1.1.1:4040
Please guide us on how to specify the Kerberos authentication method instead of SIMPLE in Java code, or how to instruct the client to request Kerberos authentication. What should the whole process look like, and what would be the right approach?
Thank you

Spark RDD method "saveAsTextFile" throwing exception Even after deleting the output directory. org.apache.hadoop.mapred.FileAlreadyExistsException

I am calling this method on an RDD[String] with destination in the arguments. (Scala)
Even after deleting the directory before starting, the process gives this error.
I am running this process on an EMR cluster with the output location on AWS S3.
Below is the command used:
spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g --driver-cores 4 s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d 3 -p 100 --search-bucket s3a://hda-prod-business.hotwire.hotel.search --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
Log:
16/07/07 11:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/07 11:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/07 11:27:47 INFO SparkContext: Successfully stopped SparkContext
16/07/07 11:27:47 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput already exists)
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/07 11:27:47 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/07/07 11:27:47 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1467889642439_0001
16/07/07 11:27:47 INFO ShutdownHookManager: Shutdown hook called
16/07/07 11:27:47 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1467889642439_0001/spark-7f836950-a040-4216-9308-2bb4565c5649
It creates a "_temporary" directory at that location, which contains empty part files.
In short:
Make sure the Scala versions of spark-core and scala-library are consistent.
I encountered the same problem.
When I saved the file to HDFS, it threw an exception: org.apache.hadoop.mapred.FileAlreadyExistsException
I then checked the HDFS target directory and found an empty temporary folder: TARGET_DIR/_temporary/0.
Submit the job with verbose output enabled: ./spark-submit --verbose.
Then look at the full context and log; an earlier error is likely to be the real cause.
While my job was in the RUNNING state, this was the first error thrown:
17/04/23 11:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
The job is then retried and re-executed. On the retry, it finds that the output directory has already been created by the failed first attempt, so it throws the "directory already exists" exception.
The first error turned out to be a version compatibility issue.
The Spark version is 2.1.0, whose spark-core artifact is built for Scala 2.11, while the scala-library dependency was version 2.12.xx.
Once the two Scala versions are made consistent (usually by changing the scala-library version), the first exception goes away and the job reaches the FINISHED state normally.
pom.xml example:
<!-- Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<!-- scala -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.7</version>
</dependency>
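The consistency check described above can be automated. The following is a rough sketch, not part of the original answer: it assumes the pom.xml lists literal version numbers (no ${...} properties) and that Scala-dependent artifacts carry the usual _2.xx suffix. The function name scala_mismatches is made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.rsplit("}", 1)[-1]


def scala_mismatches(pom_xml):
    """Return artifactIds whose _2.xx suffix disagrees with scala-library."""
    root = ET.fromstring(pom_xml)
    suffixed = {}          # artifactId -> Scala binary version from its suffix
    library_binary = None  # binary version (e.g. "2.11") of scala-library
    for elem in root.iter():
        if _local(elem.tag) != "dependency":
            continue
        fields = {_local(child.tag): (child.text or "").strip() for child in elem}
        artifact = fields.get("artifactId", "")
        version = fields.get("version", "")
        if artifact == "scala-library":
            # Keep only major.minor; ignore non-literal versions.
            match = re.match(r"(\d+\.\d+)", version)
            library_binary = match.group(1) if match else None
        else:
            match = re.search(r"_(\d+\.\d+)$", artifact)
            if match:
                suffixed[artifact] = match.group(1)
    if library_binary is None:
        return []  # nothing to compare against
    return sorted(a for a, v in suffixed.items() if v != library_binary)
```

Passing the contents of a pom.xml that combines spark-core_2.11 with scala-library 2.12.x should report spark-core_2.11 as a mismatch; with scala-library 2.11.7, as in the example above, the list comes back empty.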

Unable to connect to Spark master

I start my DataStax cassandra instance with Spark:
dse cassandra -k
I then run this program (from within Eclipse):
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Start {
  def main(args: Array[String]): Unit = {
    println("***** 1 *****")
    val sparkConf = new SparkConf().setAppName("Start").setMaster("spark://127.0.0.1:7077")
    println("***** 2 *****")
    val sparkContext = new SparkContext(sparkConf)
    println("***** 3 *****")
  }
}
And I get the following output
***** 1 *****
***** 2 *****
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/29 15:27:50 INFO SparkContext: Running Spark version 1.5.2
15/12/29 15:27:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/29 15:27:51 INFO SecurityManager: Changing view acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: Changing modify acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nayan); users with modify permissions: Set(nayan)
15/12/29 15:27:52 INFO Slf4jLogger: Slf4jLogger started
15/12/29 15:27:52 INFO Remoting: Starting remoting
15/12/29 15:27:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.1.88:55126]
15/12/29 15:27:53 INFO Utils: Successfully started service 'sparkDriver' on port 55126.
15/12/29 15:27:53 INFO SparkEnv: Registering MapOutputTracker
15/12/29 15:27:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/29 15:27:53 INFO DiskBlockManager: Created local directory at /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/blockmgr-21a96671-c33e-498c-83a4-bb5c57edbbfb
15/12/29 15:27:53 INFO MemoryStore: MemoryStore started with capacity 983.1 MB
15/12/29 15:27:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/spark-fce0a058-9264-4f2c-8220-c32d90f11bd8/httpd-2a0efcac-2426-49c5-982a-941cfbb48c88
15/12/29 15:27:53 INFO HttpServer: Starting HTTP Server
15/12/29 15:27:53 INFO Utils: Successfully started service 'HTTP file server' on port 55127.
15/12/29 15:27:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/29 15:27:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/29 15:27:53 INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
15/12/29 15:27:54 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/29 15:27:54 INFO AppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
15/12/29 15:27:54 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@127.0.0.1:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/29 15:28:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1f22aef0 rejected from java.util.concurrent.ThreadPoolExecutor@176cb4af[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
So something is going wrong during the creation of the Spark context.
When I look in $DSE_HOME/logs/spark, it is empty. I am not sure where else to look.
It turns out that the problem was the Spark library version AND the Scala version. DataStax was running Spark 1.4.1 and Scala 2.10.5, while my Eclipse project was using 1.5.2 and 2.11.7 respectively.
Note that BOTH the Spark library version and the Scala version appear to have to match. I tried other combinations, but it only worked when both matched.
I am getting pretty familiar with this part of your posted error:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://...
It can have numerous causes, pretty much all related to misconfigured IPs. First I would do whatever zero323 says; then here are my two cents: I solved my own problems recently by using IP addresses instead of hostnames, and the only config I use in a simple standalone cluster is SPARK_MASTER_IP.
Setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on your master should then make the master web UI show the IP address you set:
spark://your.ip.address.numbers:7077
And your SparkConf setup can refer to that.
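Before pointing SparkConf at that address, it can be worth confirming that something is actually listening there. A small sketch using only the standard library follows; the function name master_reachable is made up for this illustration, and 7077 is simply the port used in the question. A successful connection only shows that the port is open; it says nothing about Spark or Scala version compatibility.

```python
import socket


def master_reachable(host, port=7077, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, master_reachable("127.0.0.1") tests the master URL from the question; replace the host with the IP address shown in the master web UI.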
Having said that, I am not familiar with your specific implementation but I notice in the error two occurrences containing:
/private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/
Have you looked there to see if there's a logs directory? Is that where $DSE_HOME points? Alternatively, connect to the driver at the address where it creates its web UI:
INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
and you should see a link to an error log there somewhere.
More on the IP vs. hostname thing, this very old bug is marked as Resolved but I have not figured out what they mean by Resolved, so I just tend toward IP addresses.

An HTTP server starts when launching a Spark jar on a machine; what is it?

I want to use machine A to submit my Spark job to the cluster. A has no Spark environment, just Java. When I launch the jar, an HTTP server starts:
[steven@bj-230 ~]$ java -jar helloCluster.jar SimplyApp
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/06/10 16:54:54 INFO SparkEnv: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/10 16:54:54 INFO SparkEnv: Registering BlockManagerMaster
14/06/10 16:54:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140610165454-4393
14/06/10 16:54:54 INFO MemoryStore: MemoryStore started with capacity 1055.1 MB.
14/06/10 16:54:54 INFO ConnectionManager: Bound socket to port 59981 with id = ConnectionManagerId(bj-230,59981)
14/06/10 16:54:54 INFO BlockManagerMaster: Trying to register BlockManager
14/06/10 16:54:54 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager bj-230:59981 with 1055.1 MB RAM
14/06/10 16:54:54 INFO BlockManagerMaster: Registered BlockManager
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO HttpBroadcast: Broadcast server started at http://10.10.10.230:59233
14/06/10 16:54:54 INFO SparkEnv: Registering MapOutputTracker
14/06/10 16:54:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-bfdd02f1-3c02-4233-854f-af89542b9acf
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO SparkUI: Started Spark Web UI at http://bj-230:4040
14/06/10 16:54:54 INFO SparkContext: Added JAR hdfs://master:8020/tmp/helloCluster.jar at hdfs://master:8020/tmp/helloCluster.jar with timestamp 1402390494838
14/06/10 16:54:54 INFO AppClient$ClientActor: Connecting to master spark://master:7077...
So, what is this server for? And if I am behind a NAT, is it possible to use machine A to submit my job to a remote cluster?
By the way, this execution failed. Error log:
14/06/10 16:55:05 INFO SparkDeploySchedulerBackend: Executor app-20140610165321-0005/7 removed: Command exited with code 1
14/06/10 16:55:05 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/10 16:55:05 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/10 16:55:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The Spark driver starts a few HTTP endpoints:
It provides a web console that shows the job progress. This HTTP endpoint has a default port of 4040, which can be changed with the configuration option spark.ui.port. You connect to it with your browser at http://your_host:4040 and can then follow the job. It is only alive while the driver runs.
There is an additional HTTP endpoint that provides a file download service for the jars declared as dependencies. The workers contact the driver to download the list of dependencies. This is a randomly assigned port. Therefore, the driver must be on a network that is routable from the Spark workers.
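As an illustration of how such driver-side settings could be passed on the command line, here is a small sketch that builds --conf arguments. Only spark.ui.port comes from the answer above; spark.driver.host, spark.driver.port, and spark.blockManager.port are standard Spark properties added here as assumptions, and which ports matter depends on the Spark version in use. The helper names are made up. Pinning ports does not by itself make a driver behind NAT reachable; the workers still need a route to the driver address.

```python
def driver_network_conf(driver_host, ui_port=4040, driver_port=None,
                        block_manager_port=None):
    """Collect driver-side network settings as Spark property strings."""
    conf = {
        "spark.driver.host": driver_host,  # address workers use to reach the driver
        "spark.ui.port": str(ui_port),     # web console port (default 4040)
    }
    # Only pin the otherwise random ports when the caller asks for it.
    if driver_port is not None:
        conf["spark.driver.port"] = str(driver_port)
    if block_manager_port is not None:
        conf["spark.blockManager.port"] = str(block_manager_port)
    return conf


def to_submit_args(conf):
    """Turn a property dict into repeated --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

The resulting list can be appended to a spark-submit invocation, for instance via subprocess, or joined with spaces and pasted into a shell command.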