How to deploy Scala files used in spark-shell on a cluster?

I'm using the spark-shell for learning purposes, and for that I created several Scala files containing frequently used code, like class definitions. I use the files by calling the ":load" command within the shell.
Now I would like to use the spark-shell in yarn-cluster mode. I start it using spark-shell --master yarn --deploy-mode client.
The shell starts without any issues, but when I try to run the code loaded by ":load", I get execution errors.
17/05/04 07:59:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1493271022021_0168_01_000002 on host: xxxw03.mine.de. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e68_1493271022021_0168_01_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I think I have to share the code loaded in the shell with the workers. But how do I do this?

The spark-shell is useful for quick tests, but once you have an idea of what you want to do and put together a complete program, its usefulness plummets.
You probably want to move on to using the spark-submit command now.
See the docs on submitting applications: https://spark.apache.org/docs/latest/submitting-applications.html
Using this command, you provide a JAR file instead of individual class files.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
<main-class> is the Java-style path to your class, e.g. com.example.MyMainClass.
<application-jar> is the path to the JAR file containing the classes in your project. The other parameters are documented at the link above, but these two are the key differences in how you supply your code to the cluster.
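For illustration, here is a minimal sketch of what such a main class could look like; the package, object name and body are hypothetical placeholders, and the fully qualified name just has to match what you pass to --class.
package com.example

import org.apache.spark.sql.SparkSession

// Hypothetical main class packaged into the application JAR and referenced
// on the command line as: --class com.example.MyMainClass
object MyMainClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("my-app")
      .getOrCreate()

    // The code you previously pasted into spark-shell with ":load" goes here.
    val count = spark.range(0, 100).count()
    println(s"count = $count")

    spark.stop()
  }
}
You would typically build this into a JAR with your build tool (for example, sbt package) and pass that JAR's path as <application-jar>.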

Related

spark-submit --py-files gives warning RuntimeWarning: Failed to add file <abc.py> speficied in 'spark.submit.pyFiles' to Python path:

We have a PySpark-based application and we are doing a spark-submit as shown below. The application is working as expected; however, we are seeing a weird warning message. Is there any way to handle this, or why is it coming up?
Note: The cluster is Azure HDI Cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
The warning seen is:
/usr/hdp/current/spark3-client/python/pyspark/context.py:256:
RuntimeWarning: Failed to add file
[file:///home/sshuser/project/pyFiles/abc.py] speficied in
'spark.submit.pyFiles' to Python path:
/mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
  warnings.warn(
The above warning comes for all the files, i.e. abc.py, abd.py, etc. (whichever are passed to --py-files).

Spark files not found in cluster deploy mode

I'm trying to run a Spark job in cluster deploy mode by issuing the following on the EMR cluster master node:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at ccom.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:
import java.util.Properties
import java.nio.file.{Files, Paths, StandardOpenOption}

// Loads a java.util.Properties file from the given local path.
def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties = new Properties()
  properties.load(inputStream)
  properties
}
Invoked as:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.
Additionally, when I try to configure this job as an EMR step, it does not even work in client mode. Any clue on how the resource files passed through the --files option are managed/persisted/made available in EMR?
Option 1: Put those files in S3 and pass the S3 path.
Option 2: Copy those files to each node in a specific location (using a bootstrap action) and pass the absolute path of the files.
Solved with suggestions from the above comments:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I was previously assuming that in cluster deploy mode the files under --files were also shipped alongside the driver deployed to a worker node (and thereby available in the working directory), as long as they were accessible from the machine where spark-submit is issued.
Bottom line: Regardless of where you issue spark-submit from and the availability of the files on that machine, in cluster mode you must ensure that the files are accessible from every worker node.
It is now working by pointing the file locations to S3.
Thank you all!

How to read a file present on the edge node when submitting a Spark application in deploy mode = cluster

I have a Spark Scala application (Spark 2.4). I am passing a file present on my edge node as an argument to my driver (main) program, and I read this file using scala.io.Source. Now when I do a spark-submit and specify --deploy-mode client, the application runs fine and it can read the file. But when I use deploy-mode cluster, the application fails saying the file is not found. Is there a way I can read the file from the edge node in cluster mode?
Thanks.
Edit:
I tried giving file:// before the file path, but that is not working either...
This is how I am giving the file path as an argument to my main class:
spark2-submit --jars spark-avro_2.11-2.4.0.jar --master yarn --deploy-mode cluster --driver-memory 4G --executor-memory 4G --executor-cores 4 --num-executors 6 --conf spark.executor.memoryOverhead=4096 --conf spark.driver.memoryOverhead=4096 --conf spark.executor.instances=150 --conf spark.shuffle.service.enabled=true --class com.citi.gct.main.StartGCTEtl global-consumer-etl-0.0.1-SNAPSHOT-jar-with-dependencies.jar file://home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event dev Y
But I am still getting the same error in cluster mode.
20/05/07 06:27:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.io.FileNotFoundException: file:/home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event (No such file or directory)
In cluster mode, you can use the --files option of spark-submit.
Example: https://cloud.ibm.com/docs/services/AnalyticsforApacheSpark?topic=AnalyticsforApacheSpark-specify-file-path
Another option for you is to place the file in a distributed file system like HDFS or DBFS.
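As a rough illustration, here is a hedged sketch of reading such a file from the driver in cluster mode; the file name, path and application structure below are hypothetical, and it assumes the file is shipped with --files and its bare name is passed as a program argument.
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

import scala.io.Source

// Hypothetical sketch, assuming a submit command along the lines of:
//   spark-submit --master yarn --deploy-mode cluster \
//     --files /path/on/edge/node/trigger.event ... app.jar trigger.event
// YARN ships the file into the application's containers, and SparkFiles.get
// resolves the bare file name to the localized copy.
object ReadShippedFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-shipped-file").getOrCreate()

    val fileName  = args(0)                  // e.g. "trigger.event"
    val localPath = SparkFiles.get(fileName) // where the file was localized
    val source    = Source.fromFile(localPath)
    try source.getLines().foreach(println)
    finally source.close()

    spark.stop()
  }
}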

Spark Scala Jaas configuration

I'm executing Spark code in the Scala shell using Kafka jars, and my intention is to stream messages from a Kafka topic. My Spark object is created, but can anyone help me with how I can pass a JAAS configuration file while starting the spark-shell? My error points me to a missing JAAS configuration.
Assuming you have a spark-kafka.jaas file in the current folder you are running spark-submit from, you pass it with --files and also reference it in the driver and executor Java options:
spark-submit \
...
--files "spark-kafka.jaas#spark-kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=./spark-kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark-kafka.jaas"
You might also need to set "security.protocol" within the Spark code's Kafka properties to one of the supported Kafka SASL protocols.
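For reference, here is a hedged sketch of what those Kafka properties might look like in Structured Streaming code; the bootstrap servers, topic and SASL mechanism are placeholders to replace with your cluster's values, and it assumes the spark-sql-kafka package is on the classpath. The kafka.-prefixed options are passed through to the underlying Kafka consumer.
import org.apache.spark.sql.SparkSession

// Hedged sketch: requires the spark-sql-kafka-0-10 package on the classpath
// and the JAAS file shipped via --files as shown above.
object KafkaSaslReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-sasl-read").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093") // placeholder
      .option("kafka.security.protocol", "SASL_SSL")     // or SASL_PLAINTEXT, per your setup
      .option("kafka.sasl.mechanism", "GSSAPI")          // assumption: adjust to your cluster
      .option("subscribe", "my-topic")                   // placeholder topic
      .load()

    stream.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}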
I had an issue like yours. I'm using this startup script to run my spark-shell (Spark 2.3.0):
export HOME=/home/alessio.palma/scala_test
spark2-shell \
--verbose \
--principal hdp_ud_appadmin \
--files "jaas.conf" \
--keytab $HOME/hdp_ud_app.keytab \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1 \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--driver-java-options spark.driver.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--driver-java-options spark.executor.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--queue=root.Global.UnifiedData.hdp_global_ud_app
Every attempt failed with this error:
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
:
.
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
It looks like spark.driver.extraJavaOptions and spark.executor.extraJavaOptions were not working. Everything kept failing until I added this line at the top of my startup script:
export SPARK_SUBMIT_OPTS='-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf'
And magically the jaas.conf file was found. Another thing I suggest adding to your startup script is:
export SPARK_KAFKA_VERSION=0.10

pyspark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space

Folks,
I am running PySpark code to read a 500 MB file from HDFS and construct a NumPy matrix from the content of the file.
Cluster info:
9 datanodes
128 GB memory / 48 vCore CPU per node
Job config:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test') \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '72g') \
    .set('spark.driver.memory', '16g') \
    .set('spark.yarn.executor.memoryOverhead', 4096) \
    .set('spark.dynamicAllocation.enabled', 'true') \
    .set('spark.shuffle.service.enabled', 'true') \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .set('spark.driver.maxResultSize', 10000) \
    .set('spark.kryoserializer.buffer.max', 2044)

sc = SparkContext(conf=conf)

fileRDD = sc.textFile("/tmp/test_file.txt")
fileRDD.cache()  # note: cache() needs parentheses, otherwise it is a no-op
# collect() pulls every line of the file back to the driver
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
Error
The collect() call is throwing an out-of-memory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
Any help is much appreciated.
A little background on this issue
I was having this issue while running the code through a Jupyter Notebook that runs on an edge node of a Hadoop cluster.
Finding in Jupyter
Since you can only submit the code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver always runs on the edge node, which is already packed with other long-running daemon processes, and its available memory is always less than the memory required for fileRDD.collect() on my file.
Worked fine in spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa!! It ran in seconds there, the reason being that spark-submit can choose the driver node from one of the cluster nodes that has the required memory free.
spark-submit --name "test_app" --master yarn --deploy-mode cluster --conf spark.executor.cores=4 --conf spark.executor.memory=72g --conf spark.driver.memory=72g --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2044 --conf spark.driver.maxResultSize=1g --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' test.py
Next step:
Our next step is to see if the Jupyter notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.