I have:
- Hadoop
- Spark JobServer
- SQL Database
I have created a Spark job (a Scala file) that accesses my SQL database from a local instance of the Spark JobServer. To do this, I first have to load the JDBC driver with Class.forName("com.mysql.jdbc.Driver"). However, when I try to execute the job on the Spark JobServer, I get a ClassNotFoundException:
"message": "com.mysql.jdbc.Driver",
"errorClass": "java.lang.ClassNotFoundException",
I have read that in order to load the JDBC driver, you have to change some configuration in either the application.conf file of the Spark JobServer or its server_start.sh file. I have done this as follows. In server_start.sh I changed the cmd value that is passed to spark-submit:
cmd='$SPARK_HOME/bin/spark-submit --class $MAIN --driver-memory $JOBSERVER_MEMORY
--conf "spark.executor.extraJavaOptions=$LOGGING_OPTS spark.executor.extraClassPath = hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES"
--driver-class-path "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
--jars "hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"
$# $appdir/spark-job-server.jar $conffile'
I also changed some lines of the application.conf file of the Spark JobServer which is used when starting the instance:
# JDBC driver, full classpath
jdbc-driver = com.mysql.jdbc.Driver
# dependent-jar-uris = ["hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar"]
But the error that the JDBC class cannot be found still comes back.
I have already checked for the following possible causes:
ERROR1:
In case somebody thinks that I just have the wrong file path (which could very well be the case, as far as I know), I have checked for the file on HDFS with hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/ and the file was there:
-rw-r--r-- 1 cloudera cloudera 983914 2016-01-26 02:23 hdfs://quickstart.cloudera:8020/user/cloudera/mysql-connector-java-5.1.38-bin.jar
ERROR2:
I have the necessary dependency in my build.sbt file, libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.+", and the import statement import java.sql._ in my Scala file.
How can I solve this ClassNotFound error?
Are there any good alternatives to JDBC to connect to SQL?
We have something like this in local.conf
# JDBC driver, full classpath
jdbc-driver = org.postgresql.Driver
# Directory where default H2 driver stores its data. Only needed for H2.
rootdir = "/var/spark-jobserver/sqldao/data"
jdbc {
  url = "jdbc:postgresql://dbserver/spark_jobserver"
  user = "****"
  password = "****"
}
dbcp {
  maxactive = 20
  maxidle = 10
  initialsize = 10
}
And in the start script I have
EXTRA_JARS="/opt/spark-jobserver/lib/*"
CLASSPATH="$appdir:$appdir/spark-job-server.jar:$EXTRA_JARS:$(dse spark-classpath)"
All dependent files used by Spark Jobserver are put in /opt/spark-jobserver/lib.
I have not used HDFS to load jars for the job-server.
But if you need the MySQL driver to be loaded on the Spark worker nodes, then you should do it via dependent-jar-uris. I think that is what you are doing now.
I have packaged the project using sbt assembly and it finally works; I am happy.
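For reference, this is a minimal sketch of the sbt-assembly setup I mean (the plugin and library versions are illustrative; adjust them to your environment):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "spark-jobserver-mysql-job"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // The MySQL driver gets bundled into the fat jar, so no extra classpath settings are needed
  "mysql" % "mysql-connector-java" % "5.1.38",
  // Spark itself is provided by the JobServer / cluster at runtime
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
)

Running sbt assembly then produces a single fat jar that you upload to the Spark JobServer.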
But having HDFS files in your dependent-jar-uris does not actually work, so don't use HDFS links as your dependent-jar-uris.
Also, read this link in case you are curious: https://github.com/spark-jobserver/spark-jobserver/issues/372
Related
I get org.apache.spark.SparkException / java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation exists) when I try to do field validations on a Spark DataFrame. Here is my code.
All classes and objects used are serializable. It fails as an AWS EMR Spark job (it works fine on my local machine).
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
.add("fieldName" , StringType)
.add("value" , StringType)
.add("message" , StringType)))
//Validators is a Sequence of validations on columns in a Row.
// Validator method signature
// def checkForErrors(row: Row): (fieldName, value, message) ={
// logic to validate the field in a row }
val validateRow: Row => Row = (row: Row) => {
  val errorList = validators.map(validator => validator.checkForErrors(row))
  Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions: Spark 2.4.7 and Scala 2.11.8
Any ideas why this might happen, or has anyone run into the same issue?
I faced a very similar problem with EMR release 6.8.0. In particular, the spark.jars configuration was not respected for me on EMR (I pointed it at the location of a JAR in S3), even though it seems to be a normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definition lives), I set up an EMR step that is executed immediately after cluster creation to rewrite spark.driver.extraClassPath and spark.executor.extraClassPath so that they also contain the location of my additional JAR (in my case the JAR physically comes in a Docker image, but you could also set up a bootstrap action to copy it onto the cluster from S3), as per the code in the article under "For Amazon EMR release version 6.0.0 and later". The reason you have to do this rewriting is that EMR already populates spark.*.extraClassPath with a bunch of its own JAR locations, e.g. for the JARs that contain the S3 drivers, so you effectively have to append your own JAR location rather than simply setting spark.*.extraClassPath to your location. If you do the latter (I tried it), you lose a lot of EMR functionality, such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0
I'm trying to connect to an Aurora DB using a JAR file downloaded locally. This is my code:
spark = SparkSession.builder.master("local").appName("PySpark_app") \
    .config("spark.driver.memory", "16g") \
    .config("spark.jars", "D:/spark/jars/postgresql-42.2.5.jar") \
    .getOrCreate()

spark_df = spark.read.format("jdbc") \
    .option("url", "postgresql://aws-som3l1nk.us-west-2.rds.amazonaws.com") \
    .option("driver", "org.postgresql.Driver") \
    .option("user", "username") \
    .option("password", "password") \
    .option("query", query) \
    .load()
After trying to read the data I get this error:
An error occurred while calling o401.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(Unknown Source)...
I'm not sure if I'm missing something, or whether the JAR file is being picked up at all.
I'm working on a local machine, and trying to install packages with pyspark --packages fails as well!
I'm trying to use Spark from IntelliJ to connect to the Hive warehouse directory, which is located at the following path:
hdfs://localhost:9000/user/hive/warehouse
In order to do this, I'm using the following code :
import org.apache.spark.sql.SparkSession
// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"
val spark = SparkSession
.builder()
.appName("Spark Hive Local Connector")
.config("spark.sql.warehouse.dir", warehouseLocation)
.config("spark.master", "local")
.enableHiveSupport()
.getOrCreate()
spark.catalog.listDatabases().show(false)
spark.catalog.listTables().show(false)
spark.conf.getAll.mkString("\n")
import spark.implicits._
import spark.sql
sql("USE test")
sql("SELECT * FROM test.employee").show()
As you can see, I have created a database 'test' and a table 'employee' in this database using the Hive console. I want to get the result of the last query.
The spark.catalog and spark.conf calls are used to print the warehouse path and the database settings.
spark.catalog.listDatabases().show(false) gives me :
name : default
description : Default Hive database
locationUri : hdfs://localhost:9000/user/hive/warehouse
spark.catalog.listTables.show(false) gives me an empty result. So something is wrong at this step.
At the end of the job execution, I get the following error:
> Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'test' not found;
I have also configured the hive-site.xml file for the Hive warehouse location :
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
I have already created the database 'test' using the Hive console.
Below are the versions of my components:
Spark : 2.2.0
Hive : 1.1.0
Hadoop : 2.7.3
Any ideas?
Create a resources directory under src in your IntelliJ project, copy the conf files (including hive-site.xml) into this folder, and build the project. Make sure hive.metastore.uris is defined correctly; refer to your hive-site.xml. If the log shows INFO metastore: Connected to metastore, then you are good to go.
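For example, something along these lines (the metastore URI below is illustrative; take the real value from the hive-site.xml you copied into src/main/resources):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Local Connector")
  .master("local[*]")
  // Illustrative values; they should match your hive-site.xml
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// If the metastore connection works, the 'test' database should show up here
spark.sql("SHOW DATABASES").show(false)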
Note that connecting from IntelliJ and running the job will be slow compared to packaging the JAR and running it on your Hadoop cluster.
I am using MapR 5.2 with Spark version 2.1.0, and I am running my Spark app JAR in YARN cluster mode.
I have tried all the available options that I found, but I have been unable to succeed.
This is our production environment, but I need my particular Spark job to pick up the log4j-Driver.properties file that is present in my src/main/resources folder (I also confirmed, by opening the JAR, that my properties file is present).
1) Content of my log4j properties file (log4j-Driver.properties):
log4j.rootCategory=DEBUG, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/users/myuser/logs/Myapp.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2) My spark-submit script:
propertyFile=application.properties
spark-submit --class MyAppDriver \
--conf "spark.driver.extraJavaOptions -Dlog4j.configuration=file:/users/myuser/log4j-Driver.properties" \
--master yarn --deploy-mode cluster \
--files /users/myuser/log4j-Driver.properties,/opt/mapr/spark/spark-2.1.0/conf/hive-site.xml,/users/myuser/application.properties \
/users/myuser/myapp_2.11-1.0.jar $propertyFile
All I need for now is to get my driver logs written to the directory mentioned in my properties file (shown above). If I am successful with this, then I will try the same for the executor logs. But first I need to make the driver log write to my local file system (and it is an edge node of our cluster).
/users/myuser/log4j-Driver.properties seems to be the path to the file on your local computer, so you were right to use it for the --files argument.
The problem is that there is no such file on the driver and/or executors, so when you use file:/users/myuser/log4j-Driver.properties as the argument to -Dlog4j.configuration, Log4j will fail to find it.
Since you run on YARN, files listed as arguments to --files will be submitted to HDFS. Each application submission will have its own base directory in HDFS where all the files will be put by spark-submit.
To refer to these files, use relative paths. In your case, --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-Driver.properties" should work.
I'm trying to get a basic Spark example running using the MongoDB Hadoop connector. I'm using Hadoop version 2.6.0 and version 1.3.1 of mongo-hadoop. I'm not sure where exactly to place the jars for this Hadoop version. Here are the locations I've tried:
$HADOOP_HOME/libexec/share/hadoop/mapreduce
$HADOOP_HOME/libexec/share/hadoop/mapreduce/lib
$HADOOP_HOME/libexec/share/hadoop/hdfs
$HADOOP_HOME/libexec/share/hadoop/hdfs/lib
Here is the snippet of code I'm using to load a collection into Hadoop:
Configuration bsonConfig = new Configuration();
bsonConfig.set("mongo.job.input.format", "MongoInputFormat.class");
JavaPairRDD<Object,BSONObject> zipData = sc.newAPIHadoopFile("mongodb://127.0.0.1:27017/zipsdb.zips", MongoInputFormat.class, Object.class, BSONObject.class, bsonConfig);
I get the following error no matter where the jar is placed:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:505)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:471)
I don't see any other errors in the Hadoop logs. I suspect I'm missing something in my configuration, or that Hadoop 2.6.0 is not compatible with this connector. Any help is much appreciated.
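For what it's worth, the mongo-hadoop Spark examples I have found pass the collection through the mongo.input.uri configuration key and use newAPIHadoopRDD rather than newAPIHadoopFile; in Scala that looks roughly like the sketch below (I have not confirmed that this fixes my setup):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

val sc = new SparkContext(new SparkConf().setAppName("mongo-hadoop-example"))

val mongoConfig = new Configuration()
// The collection is passed via mongo.input.uri, not as a file path
mongoConfig.set("mongo.input.uri", "mongodb://127.0.0.1:27017/zipsdb.zips")

val zipData = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

println(zipData.count())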