receiving sparkcontext error while initializing sparksession - scala

I am receiving sparkcontext error while starting sparksession on EMR 5.3.1 in scala. Below is the version of spark I am using and the error.
This works fine on windows machine but errors out on EMR. Also is this the right way to create sparksession?
spark_version: 2.1.0
val spark = SparkSession
.builder
.master("local[*]")
.appName("vierweship_test")
.config("spark.sql.warehouse.dir", "target/spark-warehouse")
//.enableHiveSupport()
.getOrCreate()
Error:
17/08/01 13:08:14 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
I get below error if i do not use warehouse directory properties.
17/08/01 13:00:02 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
at java.io.File.<init>(File.java:277)
at org.apache.spark.deploy.yarn.Client.addDistributedUri$1(Client.scala:438)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:476)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:600)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
Command I am using :
spark-submit --verbose --class xxx --master yarn --jars="s3-dist-cp.jar:common-0.1.jar" --deploy-mode client --packages "Xxx:xxx:XXx" myjar-0.1.jar

Related

Spark Structured Streaming terminates immediately with spark-submit

I am trying to set up an ingestion pipeline using Spark structured streaming to read from Kafka and write to a Delta Lake table. I currently have a basic POC that I am trying to get running, no transformations yet. When working in the spark-shell, everything seems to run fine:
spark-shell --master spark://HOST:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0
Starting and writing the stream:
val source = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "http://HOST:9092").option("subscribe", "spark-kafka-test").option("startingOffsets", "earliest").load().writeStream.format("delta").option("checkpointLocation", "/tmp/delta/checkpoint").start("/tmp/delta/delta-test")
However, once I pack this in to a Scala application and spark-submit the class with the required packages in a sbt assembly jar to the standalone spark instance, the stream seems to stop immediately and does not process any messages in the topic. I simply get the following logs:
INFO SparkContext: Invoking stop() from shutdown hook
...
INFO SparkContext: Successfully stopped SparkContext
INFO MicroBatchExecution: Resuming at batch 0 with committed offsets {} and available offsets {KafkaV2[Subscribe[spark-kafka-test]]: {"spark-kafka-test":{"0":6}}}
INFO MicroBatchExecution: Stream started from {}
Process finished with exit code 0
Here is my Scala class:
import org.apache.spark.sql.SparkSession
object Consumer extends App {
val spark = SparkSession
.builder()
.appName("Spark Kafka Consumer")
.master("spark://HOST:7077")
//.master("local")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.executor.memory", "1g")
.config("spark.executor.cores", "2")
.config("spark.cores.max", "2")
.getOrCreate()
val source = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "http://HOST:9092")
.option("subscribe", "spark-kafka-test")
.option("startingOffsets", "earliest")
.load()
.writeStream
.format("delta")
.option("checkpointLocation", "/tmp/delta/checkpoint")
.start("/tmp/delta/delta-test")
}
Here is my spark-submitcommand:
spark-submit --master spark://HOST:7077 --deploy-mode client --class Consumer --name Kafka-Delta-Consumer --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0 <PATH-TO-JAR>/assembly.jar
Does anybody have an idea why the stream is closed and the program terminates? I am assuming memory is not a problem, as the whole Kafka topic is only a few bytes.
EDIT:
From some further investigations, I found the following behavior: On my confluent hub interface, I see that starting the stream via the spark-shell registers a consumer and active consumption is visible in monitoring.
On contrast, the spark-submit job is seemingly not able to register the consumer. On the driver logs, I found the following error:
WARN org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer - Error in attempt 1 getting Kafka offsets:
java.lang.NullPointerException
at org.apache.spark.kafka010.KafkaConfigUpdater.setAuthenticationConfigIfNeeded(KafkaConfigUpdater.scala:60)
In my case, I am working with one master and one worker on the same machine. There shouldn't be any networking differences between spark-shell and spark-submit executions, am I right?

pyspark dataframe error due to java.lang.ClassNotFoundException: org.postgresql.Driver

I want to read data from Postgresql using JDBC and store it in pyspark dataframe. When I want to preview the data in dataframe with methods like df.show(), df.take(), they return an error saying caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printschema() would return info of the DB table perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab in a separated docker container from the Standalone Spark cluster.
You are not using the proper option.
When reading the doc, you see this :
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. This is the reason why the acquisition of the schema works, it is an action done on the driver side. But when you run a spark command, this command is executed by the workers (or executors). They need also to have the .jar to access postgres.
If your postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the worker using spark.jars - I know mysql does not require depencies for example but I never tried postgres. If it needs dependencies, then it is better to call directly the package from maven using spark.jars.packages option. (see the link of the doc for help)
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.

Hadoop, Spark: java.lang.NoSuchFieldError: TOKEN_KIND

I want to share an interesting error I've caught up recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:166)
at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at org.apache.spark.deploy.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSDelegationTokeProvider.scala:119)
I was trying to spark2-submit a job to a remote driver host on Cloudera cluster like this:
spark = SparkSession.builder
.master("yarn")
.config("cluster")
.config("spark.driver.host", "remote_driver_host")
.config("spark.yarn.keytab", "path_to_pricnipar.keytab")
.config("spark.yarn.principal", "principal.name") \
.config("spark.driver.bindAddress", "0.0.0.0") \
.getOrCreate()
The Apache spark and Hadoop versions on Cloudera cluster are: 2.3.0 and 2.6.0 accordingly.
So the cause of issue was quite trivial, it is spark local binaries vs remote spark driver version mismatch.
Locally I had installed spark 2.4.5 and on Cloudera it was 2.3.0, after aligning the versions to 2.3.0, the issue resolved and the spark job completed successfully.

ERROR: java.lang.IllegalStateException: User did not initialize spark context

Scala version: 2.11.12
Spark version: 2.4.0
emr-5.23.0
Get the following when running the below command to create an Amazon EMR cluster
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar
Exception
ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
How I create my spark session (where sparkMaster = yarn)
lazy val spark: SparkSession = {
val logger: Logger = Logger.getLogger("etl");
val sparkAppName = EnvConfig.ETL_NAME
val sparkMaster = EnvConfig.ETL_SPARK_MASTER
val sparkInstance = SparkSession
.builder()
.appName(sparkAppName)
.master(sparkMaster)
.getOrCreate()
val hadoopConf = sparkInstance.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
hadoopConf.set("fs.s3a.secret.key", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)
logger.info("Created My SparkSession")
logger.info(s"Spark Application Name: $sparkAppName")
logger.info(s"Spark Master: $sparkMaster")
sparkInstance
}
UPDATE:
I determined that due to the application logic, in certain cases, we did not initialize the spark session. Because of this, it seems that when the cluster terminates, it also tries to do something with the session (perhaps close it) and is thus failing. Now that I have figured out this issue, the application runs but never actually completes. Currently, it seems to be hanging in a particular part involving spark when running in cluster mode:
val data: DataFrame = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(s"s3://$csvPath/$fileKey")
.toDF()
20/03/16 18:38:35 INFO Client: Application report for application_1584324418613_0031 (state: RUNNING)
AFAIK EnvConfig.ETL_AWS_ACCESS_KEY_ID and ETL_AWS_SECRET_ACCESS_KEY are not getting populated due to which sparksession cant be instanciated with null or empty values . try to print and debug the values.
also reading the properties from --conf spark.xxx
should be like this example. I hope you are following this...
spark.sparkContext.getConf.getOption("spark. ETL_AWS_ACCESS_KEY_ID")
once you check that, this example way should work...
/**
* Hadoop-AWS Configuration
*/
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem
another thing is, use
--master yarn or --master local[*] you can use instead of
-conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn
UPDATE :
--conf spark.driver.port=20002 may solve this issue. where 20002 is orbitary port.. seems like its waiting for the particular port for some time and its retrying for some time and its failing with the exception you got.
I got this idea by walking through the Sparks application master code from here
and comment This a bit hacky, but we need to wait until the spark.driver.port property has been set by the Thread executing the user class.
you can try this and let me know.
Further reading : Apache Spark : How to change the port the Spark driver listens to
In my case (after resolving the application issues), I needed to include core AND task node types when deploying in cluster mode.

Error instantiating 'org.apache.spark.sql.hive.HiveSessionState': on Linux server

I have a Scala Spark application that I'm trying to run on a Linux server using a shell script. I am getting the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error
while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
However, I don't understand what is wrong. I am doing this to instantiate Spark:
val sparkConf = new SparkConf().setAppName("HDFStoES").setMaster("local")
val spark: SparkSession = SparkSession.builder.enableHiveSupport().config(sparkConf).getOrCreate()
Am I doing this correctly, if so what could be the error?
sparkSession = SparkSession.builder().appName("Test App").master("local[*])
.config("hive.metastore.warehouse.dir", hiveWareHouseDir)
.config("spark.sql.warehouse.dir", hiveWareHouseDir).enableHiveSupport().getOrCreate();
Use above, you need to specify the "hive.metastore.warehouse.dir" directory to enable hive support in spark session.