Spark cannot find the postgres jdbc driver - postgresql

EDIT: See the edit at the end
First of all, I am using Spark 1.5.2 on Amazon EMR and using Amazon RDS for my postgres database. Second is that I am a complete newbie in this world of Spark and Hadoop and MapReduce.
Essentially my problem is the same as for this guy:
java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL
So the dataframe is loaded, but when I try to evaluate it (doing df.show(), where df is the dataframe) gives me the error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://mypostgres.cvglvlp29krt.eu-west-1.rds.amazonaws.com:5432/mydb
I should note that I start spark like this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar
The solutions suggest delivering the jar onto the worker nodes and setting the classpath on them somehow, which I don't really understand how to do. But then they say that apparently the issue was fixed in Spark 1.4, and I'm using 1.5.2, and still having this issue, so what is going on?
EDIT: Looks like I resolved the issue, however I still don't quite understand why this works and the thing above doesn't, so I guess my questions is now why does doing this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar --conf spark.driver.extraClassPath=/home/hadoop/postgresql-9.4.1207.jre7.jar --jars /home/hadoop/postgresql-9.4.1207.jre7.jar
solve the problem? I just added the path as a parameter into some more of the flags it seems.

spark-shell --driver-class-path .... --jars ... works because all jar files listed in --jars are automatically distributed over the cluster.
Alternatively you could use
spark-shell --packages org.postgresql:postgresql:9.4.1207.jre7
and specify driver class as an option for DataFrameReader / DataFrameWriter
val df = sqlContext.read.format("jdbc").options(Map(
"url" -> url, "dbtable" -> table, "driver" -> "org.postgresql.Driver"
)).load()
or even manually copy required jars to the workers and place these somewhere on the CLASSPATH.

Related

How do you get the driver and executors to load and recognize the postgres driver in EMR with spark-submit?

BACKGROUND
I am trying to run a spark-submit command that streams from Kafka and performs a JDBC sink into a postgres DB in AWS EMR (version 5.23.0) and using scala (version 2.11.12). The errors I see are
INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on <master-public-dns-name>, executor 1: java.sql.SQLException (No suitable driver found for jdbc:postgres://...
ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#44dd5258 is aborting.
19/06/20 06:11:26 ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#44dd5258 aborted.
HYPOTHESIS PROBLEM
I think the error is telling me that the jdbc postgres driver cannot be found on the executors, which is why it cannot sink to postgres.
PREVIOUS ATTEMPTS
I have already done the following:
Identified my driver in my structured streaming job as Class.forName("org.postgresql.Driver")
added --jars postgresql-42.1.4.jar \ to my spark-submit job in order to send the jars to the driver and executors. In this attempt, this postgres driver jar exists in my local /home/user_name/ directory
Also tried --jars /usr/lib/spark/jars/postgresql-42.1.4.jar \ to my spark-submit job, which is the location that spark in emr finds all the jars for execution
started my spark-submit job with spark-submit --driver-class-path /usr/lib/spark/jars/postgresql-42.1.4.jar:....
added the /usr/lib/spark/jars/postgresql-42.1.4.jar to the spark.driver.extraClassPath, spark.executor.extraClassPath, spark.yarn.dist.jars, spark.driver.extraLibraryPath, spark.yarn.secondary.jars, java.library.path, and to the System Classpath in general
My jdbc connection, while working in Zeppelin, does not work in spark-submit. It is jdbc:postgres://master-public-dns-name:5432/DBNAME"
EXPECTED RESULT:
I expect my executors to recognize the postgres driver and sink the data to the postgres DB.
PREVIOUS ATTEMPTS:
I've already used the following suggestions to no avail:
Adding JDBC driver to Spark on EMR
No Suitable Driver found Postgres JDBC
No suitable driver found for jdbc:postgresql://192.168.1.8:5432/NexentaSearch
use -- packages org.postgresql:postgresql:<VERSION>

Using Postgresql JDBC source with Apache Spark on EMR

I have existing EMR cluster running and wish to create DF from Postgresql DB source.
To do this, it seems you need to modify the spark-defaults.conf with the updated spark.driver.extraClassPath and point to the relevant PostgreSQL JAR that has been already downloaded on master & slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use existing Jupyter notebook to wrangle the data, and not really looking to relaunch cluster, what is the most efficient way to resolve this?
I tried the following:
Create new directory (/usr/lib/postgresql/ on master and slaves and copied PostgreSQL jar to it. (postgresql-9.41207.jre6.jar)
Edited spark-default.conf to include wildcard location
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create dataframe in Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy postgres jar in slaves as the driver programme and cluster manager take care everything. I've created dataframe from Postgres external source by the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
atrribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
.format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
'database' : <db>,
'dbtable' : <select * from table>}
df=spark.read.format('jdbc').options(**attribute).load()
Submit to spark job:
Add the the downloaded jar to driver class path while submitting the spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the github repo of the Driver. The class path seems to be something like this org.postgresql.Driver. Try using the same.

H2o Package not found Scala Sparkling Water

I am trying to run Sparkling Water on my Local instance of Spark 2.1.0.
I followed documentation on H2o for Sparling Water. But when I try to execute
sparkling-shell.cmd
I am getting following error :
The filename, directory name, or volume label syntax is incorrect.
I look into the batch file and I am getting this error when the following command is executed:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\../assembly/build/libs/sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
When I remove --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m", Spark starts but I am unable to import the packages of H2o.
import org.apache.spark.h2o._
error: object h2o is not a member of package org.apache.spark
I tried everything I could but unable to solve this issue. Could someone help me in this? Thanks
Please try to correct your path:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\..\assembly\build\libs\sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
There is also doc page about RSparkling at Windows, which can contain different troubleshooting tips...
https://github.com/h2oai/sales-engineering/tree/master/megan/RSparklingAndWindows
Problem is with spark-shell command while submitting jars. Workaround is to modify spark-defaults.conf
Adding spark.driver.extraClassPath and spark.executor.extraClassPath parameters to spark-defaults.conf file as follows:
spark.driver.extraClassPath \path\to\jar\sparkling-water-assembly_version>-all.jar
spark.executor.extraClassPath \path\to\jar\sparkling-water-assembly_version>-all.jar
And Remove --jars \path\to\jar\sparkling-water-assembly_version>-all.jar from sparkling-shell2.cmd

Working with jdbc jar in pyspark

I need to read from a postgres sql database in pyspark.
I know this has been asked before such as here, here and many other places, however, the solutions there either use a jar in the local running directory or copy it to all workers manually.
I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:
pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
Inside pyspark I did:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()
However, while using --jars and --driver-class-path worked fine for jars I created, it failed for jdbc and I got an exception from the workers:
java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver
If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). The documentation btw suggests using SPARK_CLASSPATH which is deprecated actually adds these two switches (but has the side effect of preventing adding OTHER jars with the --jars option which I need to do)
So my question is: what is special about the jdbc driver which makes it not work and how can I add it without having to manually copy it to all workers.
Update:
I did some more looking and found this in the documentation:
"The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.".
The problem is I can't seem to find computer_classpath.sh nor do I understand what the primordial class loader means.
I did find this which basically explains that this needs to be done locally.
I also found this which basically says there is a fix but it is not yet available in version 1.6.1.
I found a solution which works (Don't know if it is the best one so feel free to continue commenting).
Apparently, If I add the option: driver="org.postgresql.Driver", this works properly. i.e. My full line (inside pyspark) is:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
Another thing: If you are already using a fat jar of your own (I am in my full application) then all you need to do is add the jdbc driver to your pom file as such:
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.4.1208</version>
</dependency>
and then you don't have to add the driver as a separate jar, just use the jar with dependencies.
What version of the documentation are you looking at ?
It seems like compute-classpath.sh was deprecated a while back - as of Spark 1.3.1:
$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
6592 2015-04-11 00:04 spark-1.3.1/bin/compute-classpath.sh
$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa
produces nothing.
I think you should be using load-spark-env.sh to set your classpath:
$/opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh
and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).
I think that this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix 3 weeks ago and there has been no movement on it. Maybe if others also express the fact that they have been affected by it, it may help?

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Given below is the version of Spark & Hive I have installed in my system
Spark : spark-1.4.0-bin-hadoop2.6
Hive : apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as Metastore. The goal is to access the MySQL Metastore & execute HiveQL queries inside spark-shell(using HiveContext)
So far I am able to execute the HiveQL queries by accessing the Derby Metastore(As described here, believe Spark-1.4 comes bundled with Hive 0.13.1 which in turn uses the internal Derby database as Metastore)
Then I tried to point spark-shell to my external Metastore(MySQL in this case) by setting the property(as suggested here) given below in $SPARK_HOME/conf/spark-defaults.conf,
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf. But I am getting the following exception when I start the spark-shell
mountain#mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something (or) not setting the property spark.sql.hive.metastore.jars correctly?
Note: In Linux Mint verified.
If you are setting properties in spark-defaults.conf, spark will take those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal run your job say wordcount.py
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE then you should use config() method. Here we will set Kafka jar packages
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
Corrupted version of hive-site.xml will cause this... please copy the correct hive-site.xml