I am getting this error from airflow:
java.sql.SQLException: No suitable driver
The same code works fine in Oozie. The command I use to run this job from Airflow is below.
I added the part shown in bold, but still no luck.
/usr/hdp/current/spark-client/bin/spark-submit --master yarn --deploy-mode cluster --queue batch --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml --class com.bcp.test.SimpleBcpJob --executor-memory 1G hdfs://<server_name>/edw/libs/bcptest.jar **hdfs://<server_name>/edw/libs/sqljdbc41.jar** ${nameNode}/edw/data.properties ${nameNode}/edw/test/simple-bcp-job.properties reload
I'd appreciate any suggestions.
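One thing worth checking, offered as a guess rather than a confirmed diagnosis: spark-submit treats everything after the application jar as arguments to the main class, so a jar listed there is not placed on the classpath. The Python sketch below (for example, for assembling the command inside an Airflow task) puts the JDBC driver jar into --jars instead. build_submit_command is a hypothetical helper, and the placeholder paths are copied from the command above.

```python
def build_submit_command(jdbc_jar, app_jar, app_args, extra_jars=()):
    """Assemble a spark-submit argv list with the JDBC jar under --jars.

    Hypothetical helper: nothing here is part of Spark or Airflow.
    """
    jars = ",".join([*extra_jars, jdbc_jar])
    return [
        "/usr/hdp/current/spark-client/bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", "batch",
        "--jars", jars,
        "--files", "/usr/hdp/current/spark-client/conf/hive-site.xml",
        "--class", "com.bcp.test.SimpleBcpJob",
        "--executor-memory", "1G",
        app_jar,    # everything after this element goes to the main class
        *app_args,
    ]


command = build_submit_command(
    jdbc_jar="hdfs://<server_name>/edw/libs/sqljdbc41.jar",
    app_jar="hdfs://<server_name>/edw/libs/bcptest.jar",
    app_args=["reload"],  # the properties paths are left out of this sketch
    extra_jars=[
        "/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar",
        "/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar",
        "/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar",
    ],
)
print(" ".join(command))
```

Note also that ${nameNode} is an Oozie variable. If it is not defined in the environment of the Airflow task, a shell would expand it to an empty string, so those two paths may need to be spelled out.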
We have a PySpark-based application that we launch with spark-submit as shown below. The application works as expected, but we see a strange warning message. Why does it appear, and is there a way to handle it?
Note: the cluster is an Azure HDInsight (HDI) cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
The warning is:
warnings.warn(
/usr/hdp/current/spark3-client/python/pyspark/context.py:256:
RuntimeWarning: Failed to add file
[file:///home/sshuser/project/pyFiles/abc.py] speficied in
'spark.submit.pyFiles' to Python path:
/mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
The warning above appears for every file passed to --py-files (abc.py, abd.py, and so on).
I'm trying to run a Spark job in cluster deploy mode by issuing the following on the EMR cluster master node:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at com.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:
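import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

// Loads a Java .properties file from a local path and closes the stream afterwards.
def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties = new Properties()
  try properties.load(inputStream)
  finally inputStream.close()
  properties
}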
Invoked as:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.
Additionally, when I try to configure this job as an EMR step, it does not work even in client mode. Any clue how resource files passed through the --files option are managed, persisted, and made available in EMR?
Option 1: Put those files in S3 and pass the S3 paths.
Option 2: Copy those files to a specific location on each node (using a bootstrap action) and pass the absolute paths of the files.
Solved with suggestions from the above comments:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I had previously assumed that in cluster deploy mode the files listed under --files were also shipped alongside the driver deployed to a worker node (and thereby available in its working directory), as long as they were accessible from the machine where spark-submit is issued.
Bottom line: regardless of where you issue spark-submit from and whether the files are available on that machine, in cluster mode you must ensure that the files are accessible from every worker node.
It is now working with the file locations pointing to S3.
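One way to make the lookup less dependent on where a file ends up is to try several candidate directories in order. The sketch below is in Python for brevity (the program in the question is Scala) and uses only the standard library. resolve_shipped_file and its parameters are hypothetical names, and the idea that files listed under --files are also placed by name in the YARN container's working directory is an assumption to verify on your own cluster.

```python
import os
import tempfile


def resolve_shipped_file(name, candidate_dirs):
    """Return the first existing path for `name` among candidate_dirs.

    Hypothetical helper; raises FileNotFoundError listing what was tried.
    """
    for directory in candidate_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"{name} not found in any of: {list(candidate_dirs)}")


# Demo with two temporary directories standing in for the real candidates.
with tempfile.TemporaryDirectory() as empty_dir, \
        tempfile.TemporaryDirectory() as work_dir:
    with open(os.path.join(work_dir, "program.properties"), "w") as handle:
        handle.write("key=value\n")
    found = resolve_shipped_file("program.properties", [empty_dir, work_dir])
    print(found)
```

In a real job the candidates might be the current working directory followed by the directory returned by SparkFiles.getRootDirectory(), so the same code runs in both client and cluster mode.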
Thank you all!
I have a Spark Scala application (Spark 2.4). I pass a file that lives on my edge node as an argument to my driver (main) program and read it with scala.io.Source. When I run spark-submit with --deploy-mode client, the application runs fine and can read the file. But with --deploy-mode cluster, the application fails with a file-not-found error. Is there a way to read a file from the edge node in cluster mode?
Thanks.
Edit:
I tried putting file:// before the file path, but that is not working either.
This is how I pass the file path as an argument to my main class:
spark2-submit --jars spark-avro_2.11-2.4.0.jar --master yarn --deploy-mode cluster --driver-memory 4G --executor-memory 4G --executor-cores 4 --num-executors 6 --conf spark.executor.memoryOverhead=4096 --conf spark.driver.memoryOverhead=4096 --conf spark.executor.instances=150 --conf spark.shuffle.service.enabled=true --class com.citi.gct.main.StartGCTEtl global-consumer-etl-0.0.1-SNAPSHOT-jar-with-dependencies.jar file://home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event dev Y
But I still get the same error in cluster mode:
20/05/07 06:27:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.io.FileNotFoundException: file:/home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event (No such file or directory)
In cluster mode, you can use the --files option of spark-submit.
Example: https://cloud.ibm.com/docs/services/AnalyticsforApacheSpark?topic=AnalyticsforApacheSpark-specify-file-path
Another option is to place the file in a distributed file system such as HDFS or DBFS.
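To make the --files suggestion concrete: a file shipped this way is typically read by its bare name rather than by its edge-node path. The Python sketch below (the application in the question is Scala; shipped_file_name and the sample paths are hypothetical) derives that name from a path or URI. It also shows why a file://home/... spelling is risky: with two slashes, standard URI parsing treats home as the host rather than as part of the path.

```python
import posixpath
from urllib.parse import urlparse


def shipped_file_name(path_or_uri: str) -> str:
    """Name under which a file passed to --files would be looked up.

    Assumption: a trailing '#alias' (the YARN-style renaming syntax)
    takes precedence over the base name of the path.
    """
    parsed = urlparse(path_or_uri)
    return parsed.fragment or posixpath.basename(parsed.path)


print(shipped_file_name("/home/user/trigger.event"))
print(shipped_file_name("hdfs:///data/trigger.event#input.event"))

# Two slashes: 'home' is parsed as the host, and the path loses it.
two_slashes = urlparse("file://home/user/trigger.event")
print(two_slashes.netloc, two_slashes.path)

# Three slashes: empty host, full absolute path.
three_slashes = urlparse("file:///home/user/trigger.event")
print(three_slashes.netloc, three_slashes.path)
```

With this approach the edge-node path goes to --files, and only the bare name is passed to the main class as its argument.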
Folks,
I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from its contents.
Cluster Info:
9 datanodes
128 GB memory / 48 vCores per node
Job config
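from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test') \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '72g') \
    .set('spark.driver.memory', '16g') \
    .set('spark.yarn.executor.memoryOverhead', 4096) \
    .set('spark.dynamicAllocation.enabled', 'true') \
    .set('spark.shuffle.service.enabled', 'true') \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .set('spark.driver.maxResultSize', 10000) \
    .set('spark.kryoserializer.buffer.max', 2044)
# Reuse the running context if there is one (as in a notebook), else create it.
sc = SparkContext.getOrCreate(conf=conf)
fileRDD = sc.textFile("/tmp/test_file.txt")
fileRDD.cache()  # cache is a method; without the parentheses nothing is cached
# collect() pulls every split line back into the driver process
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()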
Error
The collect() call fails with an out-of-memory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
Any help is much appreciated.
A little background on this issue
I hit this issue while running the code from a Jupyter Notebook, which runs on an edge node of a Hadoop cluster.
Finding in Jupyter
Since code from Jupyter can only be submitted in client mode (equivalent to launching spark-shell from the edge node), the Spark driver always runs on the edge node. That node is already packed with other long-running daemon processes, so its available memory is always less than what fileRDD.collect() needs for my file.
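To see why a 500 MB file can need far more than 500 MB on the driver, the rough sketch below estimates the CPython footprint of a list of split lines and compares it with the size of the raw text. It uses only the standard library and hypothetical sample data. The real ratio depends on the file's contents and the Python version, and any JVM-side buffering during collect() is not counted.

```python
import sys


def split_lines_footprint(lines):
    """Approximate bytes used by [line.split(" ") for line in lines] in CPython."""
    rows = [line.split(" ") for line in lines]
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row)                          # the inner list
        total += sum(sys.getsizeof(token) for token in row)  # each str object
    return total


# Hypothetical sample: 1,000 lines of four space-separated numbers.
sample = ["0.5 1.25 3.0 4.75"] * 1000
raw_bytes = sum(len(line) + 1 for line in sample)  # +1 for the newline
estimated = split_lines_footprint(sample)
print(f"raw text: {raw_bytes} bytes, as Python objects: {estimated} bytes")
```

Every token becomes its own string object with per-object overhead, which is why the collected structure is much larger than the file on disk.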
Worked fine in spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa! It ran in seconds there. The reason is that in cluster deploy mode the driver is placed on one of the cluster nodes that has the required memory free.
spark-submit --name "test_app" --master yarn --deploy-mode cluster --conf spark.executor.cores=4 --conf spark.executor.memory=72g --conf spark.driver.memory=72g --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2044 --conf spark.driver.maxResultSize=1g --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' test.py
Next step:
Our next step is to see whether Jupyter Notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.
I deployed a Hadoop (YARN + Spark) cluster on Google Compute Engine with one master and two slaves. When I run the following shell script:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/spark-install/lib/spark-examples-1.1.0-hadoop2.4.0.jar 10
the job just keeps running & every second I get a message similar to this:
15/02/06 22:47:12 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1423247324488_0008
appId: 8
clientToAMToken: null
appDiagnostics:
appMasterHost: hadoop-w-zrem.c.myapp.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1423261517468
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://hadoop-m-xxxx:8088/proxy/application_1423247324488_0008/
appUser: achitre
Instead of --master yarn-cluster use --master yarn-client
After adding the following line to my script, it worked:
export SPARK_JAVA_OPTS="-Dspark.yarn.executor.memoryOverhead=1024 -Dspark.local.dir=/tmp -Dspark.executor.memory=1024"
I guess we shouldn't use suffixes such as 'm' or 'g' when specifying memory; otherwise we get a NumberFormatException.
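As a rough illustration of the point above, the helper below converts a suffixed memory string such as '1g' into a plain number of mebibytes, the unsuffixed form used in the export line. to_mib is a hypothetical name and not part of Spark. Whether a given setting accepts suffixes depends on the Spark version and on the setting itself, so treat this as a sketch. Reading a value without a suffix as MiB is likewise an assumption.

```python
import re

# Multipliers to mebibytes for the suffixes handled by this sketch.
_UNIT_TO_MIB = {"m": 1, "g": 1024}


def to_mib(value: str) -> int:
    """Convert '512m', '1g' or '2048' (assumed to be MiB) to whole MiB."""
    match = re.fullmatch(r"\s*(\d+)\s*([mg])?b?\s*", value.lower())
    if match is None:
        raise ValueError(f"not a recognised memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MIB[unit or "m"]


overhead = to_mib("1g")
print(f"-Dspark.yarn.executor.memoryOverhead={overhead}")
```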