Spark Scala JAAS configuration

I'm running Spark code in the Scala shell with the Kafka jars, and my goal is to stream messages from a Kafka topic. My Spark object is created, but how can I pass a JAAS configuration file when starting the spark-shell? The error points to a missing JAAS configuration.

Assuming you have a spark-kafka.jaas file in the folder you are running spark-submit from, ship it with --files and also reference it in the driver and executor Java options:
spark-submit \
...
--files "spark-kafka.jaas#spark-kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=./spark-kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark-kafka.jaas"
You might also need to set "security.protocol" in the Kafka properties of your Spark code to one of the SASL protocols Kafka supports (SASL_PLAINTEXT or SASL_SSL).
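As a rough sketch of that last point, the helper below builds Kafka client properties with security.protocol restricted to the two SASL protocols. It is written in Python for brevity; the function name, the broker address, and the Kerberos service name "kafka" are assumptions, not values from the question. The optional prefix is there because Structured Streaming's Kafka source expects client settings to be prefixed with "kafka.".

```python
# The two Kafka security protocols that use SASL.
SASL_PROTOCOLS = ("SASL_PLAINTEXT", "SASL_SSL")


def kafka_sasl_properties(bootstrap_servers, protocol="SASL_PLAINTEXT",
                          service_name="kafka", prefix=""):
    """Build Kafka client properties for a SASL-secured cluster.

    Hypothetical helper: service_name="kafka" is an assumed default and
    must match the service principal your brokers actually use.
    """
    if protocol not in SASL_PROTOCOLS:
        raise ValueError("expected one of %s, got %r" % (SASL_PROTOCOLS, protocol))
    properties = {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": protocol,
        "sasl.kerberos.service.name": service_name,
    }
    # Prepend the prefix (e.g. "kafka.") to every key when one is requested.
    return {prefix + key: value for key, value in properties.items()}


if __name__ == "__main__":
    # "broker1:9092" is a placeholder address.
    for key, value in kafka_sasl_properties("broker1:9092", prefix="kafka.").items():
        print(key, "=", value)
```

The JVM still needs -Djava.security.auth.login.config as shown above; these properties only tell the Kafka client which protocol to use.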

I had an issue like yours. I'm using Spark 2.3.0 and start my spark-shell with this startup script:
export HOME=/home/alessio.palma/scala_test
spark2-shell \
--verbose \
--principal hdp_ud_appadmin \
--files "jaas.conf" \
--keytab $HOME/hdp_ud_app.keytab \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1 \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--driver-java-options spark.driver.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--driver-java-options spark.executor.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--queue=root.Global.UnifiedData.hdp_global_ud_app
Every attempt failed with this error:
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
:
.
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
It looks like spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are not taking effect. Everything failed until I added this line at the top of my startup script:
export SPARK_SUBMIT_OPTS='-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf'
And magically the jaas.conf file was found. Another thing I suggest adding to your startup script is:
export SPARK_KAFKA_VERSION=0.10
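The thread never shows the contents of jaas.conf itself. As a minimal sketch, assuming a Kerberos keytab login, the hypothetical Python helper below renders the KafkaClient section that the Kafka client looks up in that file. The keytab path and principal in the usage example are the ones from the startup script above; the useKeyTab and storeKey options are assumptions about a typical keytab setup, so adjust them to your cluster.

```python
def render_kafka_client_jaas(keytab_path, principal):
    """Return the text of a JAAS file with a single KafkaClient section.

    Hypothetical helper: the options shown are a common keytab-based
    Krb5LoginModule configuration, not taken from the original post.
    """
    return (
        "KafkaClient {\n"
        "  com.sun.security.auth.module.Krb5LoginModule required\n"
        "  useKeyTab=true\n"
        "  storeKey=true\n"
        f'  keyTab="{keytab_path}"\n'
        f'  principal="{principal}";\n'
        "};\n"
    )


if __name__ == "__main__":
    # Values copied from the startup script in this answer.
    print(render_kafka_client_jaas(
        "/home/alessio.palma/scala_test/hdp_ud_app.keytab",
        "hdp_ud_appadmin",
    ))
```

Write the rendered text to the path that -Djava.security.auth.login.config points at.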

Upgrading to spark_Cassandra_Connector_2_12:3.2.0 from spark_Cassandra_connector_2_11:2.5.1 is breaking

My code has been working well in production for over a year with
spark_Cassandra_connector_2_11:2.5.1 on EMR 5.29.
We want to upgrade to Scala 2.12 and EMR 6.6.0, and upgraded the dependencies accordingly:
spark - 3.2.0
datastax_cassandra_connector - 3.2.0
scala - 2.12.11
EMR - 6.6.0
There is no config file on the classpath; only the config below is passed to Spark:
--conf spark.cassandra.connection.host=${casHostIps} \
--conf spark.cassandra.auth.username=${casUser} \
--conf spark.cassandra.auth.password=${casPassword} \
--conf spark.cassandra.output.ignoreNulls=true \
--conf spark.cassandra.output.concurrent.writes=$SPARK_CASSANDRA_OUTPUT_CONCURRENT_WRITES \
The upgraded code no longer works on the new EMR with the above config options. It always fails with the following:
Caused by: com.typesafe.config.ConfigException$Missing: withValue(advanced.reconnection-policy.base-delay): No configuration setting found for key 'advanced.session-leak'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:157)
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:177)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:181)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:194)
at com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:235)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getCached(TypesafeDriverExecutionProfile.java:291)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getInt(TypesafeDriverExecutionProfile.java:88)
at com.datastax.oss.driver.internal.core.session.DefaultSession.<init>(DefaultSession.java:103)
at com.datastax.oss.driver.internal.core.session.DefaultSession.init(DefaultSession.java:88)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildDefaultSessionAsync(SessionBuilder.java:903)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildAsync(SessionBuilder.java:817)
at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:835)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:143)
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:167)
I ended up referring to the DataStax Java driver config (see https://docs.datastax.com/en/developer/java-driver/4.14/manual/core/configuration/reference/) on my classpath, and with that it finally works:
--conf spark.files=s3://${efSparkJarBucket}/${efSparkJarPrefix}reference.conf \
--conf spark.cassandra.connection.config.profile.path=reference.conf \
But there are not many tuning options at the Java driver level compared with what I can do at the Spark level, e.g.:
--conf spark.cassandra.output.concurrent.writes=100 \
Two questions:
Why is the 3.2.0 Spark Cassandra connector a breaking change compared to 2.5.1?
If referring to a config file (reference.conf) is the only option, how can I tune the throttling? My load runs, but very slowly, and Cassandra cluster nodes keep dropping out and rejoining.

Spark files not found in cluster deploy mode

I'm trying to run a Spark job in cluster deploy mode by issuing the following on the EMR cluster master node:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file can not be found at the Spark executor working directory:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at ccom.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:
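import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

// Loads a Java properties file from the given filesystem path.
def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties = new Properties()
  try properties.load(inputStream)
  finally inputStream.close() // release the file handle even if load fails
  properties
}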
Invoked as:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.
Additionally, when I try to configure this job as an EMR step, it does not even work in client mode. Any clue on how resource files passed through the --files option are managed, persisted, and made available in EMR?
Option 1: Put those files in S3 and pass the S3 path.
Option 2: Copy those files to a specific location on each node (using a bootstrap action) and pass the absolute paths of the files.
Solved with suggestions from the above comments:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I was previously assuming that in cluster deploy mode the files under --files were also shipped alongside the driver deployed to a worker node (and thereby available in the working directory), if accessible from the machine where spark-submit is issued.
Bottom line: Regardless of where you issue spark-submit from and the availability of the files in that machine, in cluster mode, you must ensure that files are accessible from every worker node.
It now works with the file locations pointing to S3.
Thank you all!
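To illustrate how the names line up in the working command: --files carries full S3 URIs, while the application arguments are bare file names. The sketch below is in Python rather than the thread's Scala, and its function names are hypothetical. It assumes YARN localizes each --files entry into the container working directory under its base name (or under the alias after "#" when one is given), and it reads the file with a deliberately simplified key=value parser that ignores escapes and line continuations.

```python
import posixpath


def shipped_file_name(files_entry):
    """Name a --files entry is assumed to have in the container working dir."""
    if "#" in files_entry:
        # "path#alias" syntax: the part after '#' is the link name.
        return files_entry.rsplit("#", 1)[1]
    return posixpath.basename(files_entry)


def load_properties(path):
    """Parse simple key=value lines; comments start with '#' or '!'."""
    properties = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith(("#", "!")):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            properties[key.strip()] = value.strip()
    return properties


if __name__ == "__main__":
    entry = "s3://someBucket/resources/program.properties"
    # The relative name the job would open from its working directory.
    print(shipped_file_name(entry))
```

Under that assumption, the job opens the file by the relative name returned from shipped_file_name instead of building an absolute path.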

PySpark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space

Folks,
I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from its contents.
Cluster info:
9 datanodes
128 GB memory / 48 vCores per node
Job config:
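from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test') \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '72g') \
    .set('spark.driver.memory', '16g') \
    .set('spark.yarn.executor.memoryOverhead', 4096) \
    .set('spark.dynamicAllocation.enabled', 'true') \
    .set('spark.shuffle.service.enabled', 'true') \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .set('spark.driver.maxResultSize', 10000) \
    .set('spark.kryoserializer.buffer.max', 2044)
# Assumption: sc was created from this conf (the original post does not show it).
sc = SparkContext.getOrCreate(conf=conf)
fileRDD = sc.textFile("/tmp/test_file.txt")
fileRDD.cache()  # cache is a method; without the parentheses nothing is cached
# collect() pulls every split line back into the driver process.
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()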
Error
The collect() step throws an out-of-memory error:
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from Host/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
Any help is much appreciated.
A little background on this issue
I had this issue while running the code through a Jupyter Notebook, which runs on an edge node of a Hadoop cluster.
Finding in Jupyter
Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver always runs on the edge node. That node is already packed with other long-running daemon processes, so the available memory is always less than what fileRDD.collect() needs for my file.
Worked fine in spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa! It ran in seconds. The reason is that in cluster deploy mode the driver runs on one of the cluster nodes that has the required memory free.
spark-submit --name "test_app" \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.cores=4 \
--conf spark.executor.memory=72g \
--conf spark.driver.memory=72g \
--conf spark.yarn.executor.memoryOverhead=8192 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=2044 \
--conf spark.driver.maxResultSize=1g \
--conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
--conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
test.py
Next step:
Our next step is to see whether the Jupyter notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.

Spark Error : executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue:
Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. Even when I set the number of executors to 250, only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it.
Please help me understand what is causing the executors to be killed.
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
Not sure why YARN is killing my executors.
I faced a similar issue, where investigating the NodeManager logs led me to the root cause.
You can access them via the web interface:
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
Collect the logs (yarn logs ... command)
Identify the node and container (in these logs) emitting the error
Search the NodeManager logs around the timestamp of the error for a root cause
Btw: you can access the aggregated collection (XML) of all configurations affecting a node at the same port with:
nodeManagerAddress:PORT/conf
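The second step of the workflow above can be scripted. The Python sketch below scans aggregated log text for the "RECEIVED SIGNAL TERM" marker and reports the timestamp together with the container and node it appeared under. It assumes each container's section starts with a header of the form "Container: <id> on <node>"; that header format, the function name, and the sample container and node names are assumptions, so adapt the pattern to what your yarn logs output actually looks like.

```python
import re

# Assumed header that opens each container's section in the aggregated logs.
CONTAINER_HEADER = re.compile(r"^Container: (\S+) on (\S+)")
SIGTERM_MARKER = "RECEIVED SIGNAL TERM"


def find_sigterm_events(log_text):
    """Return (timestamp, container, node) for every SIGTERM line found."""
    events = []
    container = node = None
    for line in log_text.splitlines():
        header = CONTAINER_HEADER.match(line)
        if header:
            container, node = header.group(1), header.group(2)
        elif SIGTERM_MARKER in line:
            # The first two fields of the log line are the date and time.
            timestamp = " ".join(line.split()[:2])
            events.append((timestamp, container, node))
    return events


if __name__ == "__main__":
    # Hypothetical container id and node name; log line from the question.
    sample = (
        "Container: container_0001_01_000002 on node1.example.com_8041\n"
        "17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: "
        "RECEIVED SIGNAL TERM\n"
    )
    for event in find_sigterm_events(sample):
        print(event)
```

With the node and timestamp in hand, open nodeManagerAddress:PORT/logs on that node and look at the entries around that time.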
I believe this issue has more to do with memory and the dynamic allocation timeouts at the executor/container level. Make sure you are able to change the config parameters at the executor/container level.
One way to resolve this issue is to change this config value, either in your spark-shell or in your Spark job:
spark.dynamicAllocation.executorIdleTimeout
This thread has more detailed information on how to resolve this issue, which worked for me:
https://jira.apache.org/jira/browse/SPARK-21733
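For context, spark.dynamicAllocation.executorIdleTimeout controls how long an executor may sit idle before dynamic allocation releases it (60s by default). As a small sketch of passing it on the command line, the hypothetical Python helper below turns a dict of settings into spark-submit --conf arguments. The 300s value is purely illustrative, not a recommendation from this thread.

```python
def conf_args(settings):
    """Turn a dict of Spark settings into '--conf key=value' arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    settings = {
        "spark.dynamicAllocation.enabled": "true",
        # Illustrative value: keep idle executors for five minutes.
        "spark.dynamicAllocation.executorIdleTimeout": "300s",
    }
    print(" ".join(conf_args(settings)))
    # prints: --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.executorIdleTimeout=300s
```

The resulting list can be appended to a spark-submit or spark-shell invocation.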
I had the same issue: my Spark job was using only 1 task node and killing the other provisioned nodes. This also happened when switching to EMR Serverless, where my job ran on only one "thread". The following fixed it for me:
spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar

Why does spark-submit fail with `spark.yarn.stagingDir` with master yarn and deploy-mode cluster?

I came across a scenario where spark-submit starts failing as soon as I supply spark.yarn.stagingDir, without giving any clue about the root cause. It took me quite a long time to figure out that the spark.yarn.stagingDir parameter was responsible. Why does spark-submit fail when this parameter is supplied?
Check related question here for more details
Command which fails:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
When I remove spark.yarn.stagingDir, it starts working:
spark-submit \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
Exception stacktrace:
Application application_1506717704791_145448 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
I encountered exactly the same problem when I set spark.yarn.stagingDir to /tmp (it worked fine once I removed this configuration entry).
My solution was to specify the full HDFS path, such as hdfs://hdfs_server_name/tmp, instead of merely /tmp. Hope it works for you.
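A path without a scheme is resolved against whatever default filesystem the client configuration names, which may not be the one you expect. The Python sketch below shows one way to check whether a staging directory is fully qualified and to qualify it explicitly before passing it to spark-submit. The function names are hypothetical, and hdfs://hdfs_server_name is the placeholder from the answer above.

```python
from urllib.parse import urlparse


def is_fully_qualified(path):
    """True when the path carries both a scheme and an authority."""
    parsed = urlparse(path)
    return bool(parsed.scheme) and bool(parsed.netloc)


def qualify(path, default_fs):
    """Prefix a scheme-less path with the given filesystem URI."""
    if is_fully_qualified(path):
        return path
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(is_fully_qualified("/tmp"))                  # False
    print(qualify("/tmp", "hdfs://hdfs_server_name"))  # hdfs://hdfs_server_name/tmp
```

The qualified value is what goes into --conf "spark.yarn.stagingDir=...".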