spark-submit hangs in local mode - Kerberos

I am trying to test a jar using spark-submit (Spark 1.6.0) on a Cloudera cluster that has Kerberos enabled.
If I launch this command:
spark-submit --master local --class myDriver myApp.jar -c myConfig.conf
in local or local[*] mode, the process hangs after a couple of stages. However, if I use the yarn-client or yarn-cluster master modes, the process finishes correctly. The process reads and writes some files in HDFS.
Furthermore, these traces appear:
17/07/05 16:12:51 WARN spark.SparkContext: Requesting executors is only supported in coarse-grained mode
17/07/05 16:12:51 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
It is surely a matter of configuration, but I don't know what is happening. Any ideas? Which configuration options should I change?
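Those warnings come from dynamic executor allocation, which has no cluster manager to talk to in local mode. One hedged thing to experiment with, assuming the cluster-wide spark-defaults.conf turns dynamic allocation on, is disabling it explicitly for the local run:
spark-submit --master local[*] \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--class myDriver myApp.jar -c myConfig.conf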

Related

Memory Allocation In Spark-Scala Application:

I am executing a Spark/Scala job using the spark-submit command. I have written my code in Spark SQL, joining two tables and loading the result into a third Hive table.
The code works fine, but sometimes I get issues such as an OutOfMemoryError (Java heap space) or a timeout error.
So I want to control my job manually by passing the number of executors, cores, and memory. When I used 16 executors, 1 core, and 20 GB of executor memory, my Spark application got stuck.
Can someone please suggest how I should control my Spark application manually by providing the correct parameters, and whether there are any other Hive- or Spark-specific parameters I can use for faster execution?
Below is the configuration of my cluster:
Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB
Spark submit command:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 16 \
--executor-memory 20g \
--executor-cores 1 \
examples/jars/spark-examples.jar
It depends on the volume of your data. You can make the parameters dynamic. This link has a very nice explanation:
How to tune spark executor number, cores and executor memory?
You can enable spark.shuffle.service.enabled and set:
spark.sql.shuffle.partitions=400
hive.exec.compress.intermediate=true
hive.exec.reducers.bytes.per.reducer=536870912
hive.exec.compress.output=true
hive.output.codec=snappy
mapred.output.compression.type=BLOCK
If your data is larger than 700 MB, you can also enable the spark.speculation property.
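As a rough, hedged illustration of the sizing approach from that link applied to the cluster above (5 nodes, 6 cores and 125 GB RAM each): leaving about 1 core and a few GB per node for the OS and Hadoop daemons gives roughly 5 usable cores per node, or 25 in total; with 5 cores per executor that is about one executor per node, each of which can be given far more memory than the 20 GB in the question while still leaving room for spark.yarn.executor.memoryOverhead. The class and jar below are the ones from the question; the numbers are only a starting point, not a tuned configuration:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 5 \
--executor-cores 5 \
--executor-memory 20g \
examples/jars/spark-examples.jar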

Exceptions while running Spark job on EMR cluster "java.io.IOException: All datanodes are bad"

We have an AWS EMR setup to process jobs written in Scala. We are able to run the jobs on a small dataset, but when running the same job on a large dataset I get the exception "java.io.IOException: All datanodes are bad."
Setting spark.shuffle.service.enabled to true resolved this issue for me.
The default AWS EMR configuration sets spark.dynamicAllocation.enabled to true, but leaves spark.shuffle.service.enabled set to false.
spark.dynamicAllocation.enabled allows Spark to assign executors dynamically to different tasks. When spark.shuffle.service.enabled is set to false, the external shuffle service is disabled and shuffle data is stored only on the executors. When an executor is removed or reassigned, that data is lost, and the exception "java.io.IOException: All datanodes are bad." is thrown when the data is requested.
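A hedged sketch of what enabling both properties might look like when passed on the command line (the application class and jar names here are placeholders; the same settings can instead go into spark-defaults.conf or an EMR configuration classification):
spark-submit --master yarn \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--class com.example.MyJob my-job.jar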

Launching Spark job with Oozie fails (Error MetricsSystem)

I have a Spark jar that I launch with spark-submit and it works fine (reading files, generating RDDs, storing to HDFS). However, when I try to launch the same jar within an Oozie job (oozie:spark-action), the Spark job fails.
When I looked at the logs, the first error to show up is:
Error MetricsSystem: Sink class
org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated.
Furthermore, when I started playing with the Spark script, I found out that the problem has to do with the saveAsText function. When I launch the same Spark job without writing to HDFS, the whole workflow works fine.
Any suggestions?
The problem was on the side of the cluster where I am executing the Oozie jobs. I needed to explicitly add arguments in the job workflow, simply because they weren't being taken into consideration:
<spark-opts>--queue HQ_IBNF --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/application/Hadoop/current/lib/native"</spark-opts>
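For context, a hedged sketch of where that element sits inside an Oozie spark action; the action name, class, and jar path below are placeholders, not taken from the original workflow:
<action name="spark-job">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<name>my-spark-job</name>
<class>com.example.MyApp</class>
<jar>${appPath}/myApp.jar</jar>
<spark-opts>--queue HQ_IBNF --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/application/Hadoop/current/lib/native"</spark-opts>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>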

How to use yarn to run a self-contained Spark app remotely

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through YARN.
I need my Spark job to load an HDFS file located on a Hadoop cluster that is not directly accessible from my local machine. So, I create a SOCKS proxy through an SSH tunnel by including these properties in hdfs-site.xml:
<property>
<name>hadoop.socks.server</name>
<value>localhost:7070</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader</name>
<value>true</value>
</property>
where 7070 is the dynamic port forwarded over SSH to the Hadoop gateway machine:
ssh -fCND 7070 <hadoop-gateway-machine>
This allows me to access HDFS files locally when I am using Spark with the local[*] master configuration for testing.
However, when I run a real Spark job on YARN deployed on the same Hadoop cluster (configured by yarn-site.xml, hdfs-site.xml, and core-site.xml in the classpath), I see errors like:
java.lang.IllegalStateException: Library directory '<project-path>/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
So, I set the spark.yarn.jars property directly on SparkConf. This at least starts a YARN application. When I go to the application URL, I just keep seeing this message in one of the worker logs:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
And this message in another Hadoop worker log (apparently the Spark worker that could not connect to the driver):
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:187)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:653)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:651)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
My question is: what is the right way of running self-contained Spark apps on a YARN cluster? How do you do it so you don't have to specify spark.yarn.jars and other properties? Should you include spark-defaults.conf in the classpath as well?
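The "Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" error usually means the Spark runtime jars never reached the YARN containers. A hedged sketch of one common remedy, assuming the Spark jars have been uploaded to a hypothetical HDFS directory hdfs:///spark/jars/ (the class and application jar names below are placeholders as well):
spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
--class com.example.Main my-app.jar
Pointing spark.yarn.jars (or spark.yarn.archive) at a location readable by the cluster avoids Spark falling back to the local assembly/target/scala-2.11/jars directory, which only exists in a source build.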

Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1: Spark Master
Machine 2: Spark Worker + Cassandra node
Machine 3: Spark Worker + Cassandra node
Machine 4: Spark Worker + Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or does the Driver talk to Cassandra not to read data, but to find out how to partition it, and then instruct the Executors to read "their" part of the data?
If someone could enlighten me, that would be much appreciated.
The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To be able to do that, it has to have access to the data source schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking it means that the data source should be accessible from all nodes, including the application master.
At the end of the day, your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
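A minimal hedged sketch with the spark-cassandra-connector that illustrates this split of responsibilities; the host name, keyspace, and table below are placeholders:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-read-example")
      // The contact point must be reachable from the driver as well as from the executors:
      // the driver asks Cassandra for schema and token-range metadata to plan the partitions.
      // 127.0.0.1 only works if a Cassandra node runs on the driver host itself.
      .set("spark.cassandra.connection.host", "machine-2")

    val sc = new SparkContext(conf)

    // The executors then fetch the actual rows of "their" partitions, ideally from
    // the Cassandra node co-located on the same machine.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())

    sc.stop()
  }
}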