ERROR: java.lang.IllegalStateException: User did not initialize spark context - scala

Scala version: 2.11.12
Spark version: 2.4.0
emr-5.23.0
Get the following when running the below command to create an Amazon EMR cluster
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar
Exception
ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
How I create my spark session (where sparkMaster = yarn)
lazy val spark: SparkSession = {
val logger: Logger = Logger.getLogger("etl");
val sparkAppName = EnvConfig.ETL_NAME
val sparkMaster = EnvConfig.ETL_SPARK_MASTER
val sparkInstance = SparkSession
.builder()
.appName(sparkAppName)
.master(sparkMaster)
.getOrCreate()
val hadoopConf = sparkInstance.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
hadoopConf.set("fs.s3a.secret.key", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)
logger.info("Created My SparkSession")
logger.info(s"Spark Application Name: $sparkAppName")
logger.info(s"Spark Master: $sparkMaster")
sparkInstance
}
UPDATE:
I determined that due to the application logic, in certain cases, we did not initialize the spark session. Because of this, it seems that when the cluster terminates, it also tries to do something with the session (perhaps close it) and is thus failing. Now that I have figured out this issue, the application runs but never actually completes. Currently, it seems to be hanging in a particular part involving spark when running in cluster mode:
val data: DataFrame = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(s"s3://$csvPath/$fileKey")
.toDF()
20/03/16 18:38:35 INFO Client: Application report for application_1584324418613_0031 (state: RUNNING)

AFAIK EnvConfig.ETL_AWS_ACCESS_KEY_ID and ETL_AWS_SECRET_ACCESS_KEY are not getting populated due to which sparksession cant be instanciated with null or empty values . try to print and debug the values.
also reading the properties from --conf spark.xxx
should be like this example. I hope you are following this...
spark.sparkContext.getConf.getOption("spark. ETL_AWS_ACCESS_KEY_ID")
once you check that, this example way should work...
/**
* Hadoop-AWS Configuration
*/
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem
another thing is, use
--master yarn or --master local[*] you can use instead of
-conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn
UPDATE :
--conf spark.driver.port=20002 may solve this issue. where 20002 is orbitary port.. seems like its waiting for the particular port for some time and its retrying for some time and its failing with the exception you got.
I got this idea by walking through the Sparks application master code from here
and comment This a bit hacky, but we need to wait until the spark.driver.port property has been set by the Thread executing the user class.
you can try this and let me know.
Further reading : Apache Spark : How to change the port the Spark driver listens to

In my case (after resolving the application issues), I needed to include core AND task node types when deploying in cluster mode.

Related

Spark Structured Streaming terminates immediately with spark-submit

I am trying to set up an ingestion pipeline using Spark structured streaming to read from Kafka and write to a Delta Lake table. I currently have a basic POC that I am trying to get running, no transformations yet. When working in the spark-shell, everything seems to run fine:
spark-shell --master spark://HOST:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0
Starting and writing the stream:
val source = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "http://HOST:9092").option("subscribe", "spark-kafka-test").option("startingOffsets", "earliest").load().writeStream.format("delta").option("checkpointLocation", "/tmp/delta/checkpoint").start("/tmp/delta/delta-test")
However, once I pack this in to a Scala application and spark-submit the class with the required packages in a sbt assembly jar to the standalone spark instance, the stream seems to stop immediately and does not process any messages in the topic. I simply get the following logs:
INFO SparkContext: Invoking stop() from shutdown hook
...
INFO SparkContext: Successfully stopped SparkContext
INFO MicroBatchExecution: Resuming at batch 0 with committed offsets {} and available offsets {KafkaV2[Subscribe[spark-kafka-test]]: {"spark-kafka-test":{"0":6}}}
INFO MicroBatchExecution: Stream started from {}
Process finished with exit code 0
Here is my Scala class:
import org.apache.spark.sql.SparkSession
object Consumer extends App {
val spark = SparkSession
.builder()
.appName("Spark Kafka Consumer")
.master("spark://HOST:7077")
//.master("local")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.executor.memory", "1g")
.config("spark.executor.cores", "2")
.config("spark.cores.max", "2")
.getOrCreate()
val source = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "http://HOST:9092")
.option("subscribe", "spark-kafka-test")
.option("startingOffsets", "earliest")
.load()
.writeStream
.format("delta")
.option("checkpointLocation", "/tmp/delta/checkpoint")
.start("/tmp/delta/delta-test")
}
Here is my spark-submitcommand:
spark-submit --master spark://HOST:7077 --deploy-mode client --class Consumer --name Kafka-Delta-Consumer --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0 <PATH-TO-JAR>/assembly.jar
Does anybody have an idea why the stream is closed and the program terminates? I am assuming memory is not a problem, as the whole Kafka topic is only a few bytes.
EDIT:
From some further investigations, I found the following behavior: On my confluent hub interface, I see that starting the stream via the spark-shell registers a consumer and active consumption is visible in monitoring.
On contrast, the spark-submit job is seemingly not able to register the consumer. On the driver logs, I found the following error:
WARN org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer - Error in attempt 1 getting Kafka offsets:
java.lang.NullPointerException
at org.apache.spark.kafka010.KafkaConfigUpdater.setAuthenticationConfigIfNeeded(KafkaConfigUpdater.scala:60)
In my case, I am working with one master and one worker on the same machine. There shouldn't be any networking differences between spark-shell and spark-submit executions, am I right?

pyspark dataframe error due to java.lang.ClassNotFoundException: org.postgresql.Driver

I want to read data from Postgresql using JDBC and store it in pyspark dataframe. When I want to preview the data in dataframe with methods like df.show(), df.take(), they return an error saying caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printschema() would return info of the DB table perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab in a separated docker container from the Standalone Spark cluster.
You are not using the proper option.
When reading the doc, you see this :
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. This is the reason why the acquisition of the schema works, it is an action done on the driver side. But when you run a spark command, this command is executed by the workers (or executors). They need also to have the .jar to access postgres.
If your postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the worker using spark.jars - I know mysql does not require depencies for example but I never tried postgres. If it needs dependencies, then it is better to call directly the package from maven using spark.jars.packages option. (see the link of the doc for help)
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.

receiving sparkcontext error while initializing sparksession

I am receiving sparkcontext error while starting sparksession on EMR 5.3.1 in scala. Below is the version of spark I am using and the error.
This works fine on windows machine but errors out on EMR. Also is this the right way to create sparksession?
spark_version: 2.1.0
val spark = SparkSession
.builder
.master("local[*]")
.appName("vierweship_test")
.config("spark.sql.warehouse.dir", "target/spark-warehouse")
//.enableHiveSupport()
.getOrCreate()
Error:
17/08/01 13:08:14 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
I get below error if i do not use warehouse directory properties.
17/08/01 13:00:02 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
at java.io.File.<init>(File.java:277)
at org.apache.spark.deploy.yarn.Client.addDistributedUri$1(Client.scala:438)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:476)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:600)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
Command I am using :
spark-submit --verbose --class xxx --master yarn --jars="s3-dist-cp.jar:common-0.1.jar" --deploy-mode client --packages "Xxx:xxx:XXx" myjar-0.1.jar

Spark Streaming gets stopped without errors after ~1 minute

When I make spark-submit for the Spart Streaming job, I can see that it's running during approximatelly 1 minute, and then it's stopped with the final status SUCCEEDED:
16/11/16 18:58:16 INFO yarn.Client: Application report for application_XXXX_XXX (state: RUNNING)
16/11/16 18:58:17 INFO yarn.Client: Application report for application_XXXX_XXX (state: FINISHED)
I don't understand why it gets stopped, while I expect it to run for an undefined time and be triggered by messages received from the Kafka queue. In logs I can see all the println outputs, and there are no errors.
This is a short extract from the code:
val conf = new SparkConf().setAppName("MYTEST")
val sc = new SparkContext(conf)
sc.setCheckpointDir("~/checkpointDir")
val ssc = new StreamingContext(sc, Seconds(batch_interval_seconds))
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
println("Dividing the topic into partitions.")
val inputKafkaTopicMap = inputKafkaTopic.split(",").map((_, kafkaNumThreads)).toMap
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, inputKafkaTopicMap).map(_._2)
messages.foreachRDD(msg => {
msg.foreach(s => {
if (s != null) {
//val result = ... processing goes here
//println(result)
}
})
})
// Start the streaming context in the background.
ssc.start()
This is my spark-submit command:
/usr/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --num-executors 2 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:+AlwaysPreTouch" --class org.test.StreamingRunner test.jar param1 param2
When I open Resource Manager, I see that no job is RUNNING and the spark streaming job gets marked as FINISHED.
Your code is missing a call to ssc.awaitTermination to block the driver thread.
Unfortunately there's no easy way to see the printouts from inside your map function on the console, since those function calls are happening inside YARN executors. Cloudera Manager provides a decent look at the logs though, and if you really need them collected on the driver you can write to a location in HDFS and then scrape the various logs from there yourself. If the information that you want to track is purely numeric you might also consider using an Accumulator.

spark 1.5.2 - Programmatically launching spark on yarn-client mode

we were using spark 1.3.1 and launching our spark jobs on yarn-client mode programmatically via creating a sparkConf and sparkContext object manually. It was inspired from spark self-contained application example here:
https://spark.apache.org/docs/1.5.2/quick-start.html#self-contained-applications\
Only additional configuration we would provide would be all related to yarn like executor instance, cores etc.
However after upgrading to spark 1.5.2 above application breaks on a line val sparkContext = new SparkContext(sparkConf)
It throws following in driver application:
16/01/28 17:38:35 ERROR util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
So is this approach still supposed to work? Or do I must use SparkLauncher class with spark 1.5.2 to launch spark job programmatically on yarn-client mode?