connecting to auroradb with postgres jar driver local mode in pyspark - postgresql

Im trying to connect to an aurora db using a the jar file dowwnloaded locally this is my code :
spark = SparkSession.builder.master("local").appName("PySpark_app").config("spark.driver.memory", "16g").config("spark.jars", "D:/spark/jars/postgresql-42.2.5.jar")\
.getOrCreate()
spark_df = spark.read.format("jdbc").option("url", "postgresql://aws-som3l1nk.us-west-2.rds.amazonaws.com") \
.option("driver", "org.postgresql.Driver").option("user", "username")\
.option("password", "password").option("query",query).load()
after trying to read the data i get this error :
An error occurred while calling o401.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(Unknown Source)...
not sure if im missing something or even if the jar file is considered or not.
im working on a local machine on this and trying to install packages using pyspark --packages fail as well!

Related

pyspark dataframe error due to java.lang.ClassNotFoundException: org.postgresql.Driver

I want to read data from Postgresql using JDBC and store it in pyspark dataframe. When I want to preview the data in dataframe with methods like df.show(), df.take(), they return an error saying caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printschema() would return info of the DB table perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab in a separated docker container from the Standalone Spark cluster.
You are not using the proper option.
When reading the doc, you see this :
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. This is the reason why the acquisition of the schema works, it is an action done on the driver side. But when you run a spark command, this command is executed by the workers (or executors). They need also to have the .jar to access postgres.
If your postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the worker using spark.jars - I know mysql does not require depencies for example but I never tried postgres. If it needs dependencies, then it is better to call directly the package from maven using spark.jars.packages option. (see the link of the doc for help)
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.

Spark cannot find SQL server jdbc driver even if both the mssql.jar and the .dll are present in the path

I am trying to connect Spark to a SQL server using this:
#Myscript.py
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example").config("spark.driver.extraClassPath","/home/mssql-jdbc-9.2.1.jre15.jar:/home/sqljdbc_auth.dll")\
.getOrCreate()
sqldb = spark.read \
.format("jdbc") \
.option("url", "jdbc:sqlserver://server:5150;databaseName=testdb;integratedSecurity=true") \
.option("dbtable", "test_tbl") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
sqldb.select('coldate').show()
I have made sure that both the .dll and the .jar is under /home folder. I call it like so:
spark-submit --jars /home/sqljdbc41.jar MyScript.py
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:3462d79d-c165-4607-9790-67a2c786a9cf
Seems like it cannot find the .dll file? I ahve verified it exists under /home.
This error was resolved when I placed the sqljdbc_auth.dll file in "C:\Windows\System32" folder.
For those who want to know where to find this dll file, you may:
Download the JDBC Driver for SQL Server (sqljdbc_6.0.8112.200_enu.exe) from the Microsoft website below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Unzip it and navigate as follows:
\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\auth\x64

Unable to access to Hive warehouse directory with Spark

I'm trying to connect to the Hive warehouse directory by using Spark on IntelliJ which is located at the following path :
hdfs://localhost:9000/user/hive/warehouse
In order to do this, I'm using the following code :
import org.apache.spark.sql.SparkSession
// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"
val spark = SparkSession
.builder()
.appName("Spark Hive Local Connector")
.config("spark.sql.warehouse.dir", warehouseLocation)
.config("spark.master", "local")
.enableHiveSupport()
.getOrCreate()
spark.catalog.listDatabases().show(false)
spark.catalog.listTables().show(false)
spark.conf.getAll.mkString("\n")
import spark.implicits._
import spark.sql
sql("USE test")
sql("SELECT * FROM test.employee").show()
As one can see, I have created a database 'test' and create a table 'employee' into this database using the hive console. I want to get the result of the latest request.
The 'spark.catalog.' and 'spark.conf.' are used in order to print the properties of the warehouse path and database settings.
spark.catalog.listDatabases().show(false) gives me :
name : default
description : Default Hive database
locationUri : hdfs://localhost:9000/user/hive/warehouse
spark.catalog.listTables.show(false) gives me an empty result. So something is wrong at this step.
At the end of the execution of the job, i obtained the following error :
> Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'test' not found;
I have also configured the hive-site.xml file for the Hive warehouse location :
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
I have already created the database 'test' using the Hive console.
Below, the versions of my components :
Spark : 2.2.0
Hive : 1.1.0
Hadoop : 2.7.3
Any ideas ?
Create the resource directory under the src in your IntelliJ project copy the conf files under this folder. Build the project .. Ensure to define hive.metastore.warehouse.uris path correctly refer the hive-site.xml . In log if your are getting INFO metastore: Connected to metastore then you are good to go. Example.
Kindly note it making connection to intellij and running the job will be slow compare to package the jar and running on your hadoop cluster.

Spark 2.2.0 unable to connect to Phoenix 4.11.0 version in loading the table to DF

I'm using the below techstack and trying to connect Phoenix tables using PySpark code. I have downloaded the following jars from the url and tried executing the below code. In logs the connection to hbase is established but the console is stuck with out doing nothing. Please let me know if anybody encountered and fixed similar issue.
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/4.11.0-HBase-1.2
jars:
phoenix-spark-4.11.0-HBase-1.2.jar
phoenix-client.jar
Tech Stack all running in same host:
Apache Spark 2.2.0 Version
Hbase 1.2 Version
Phoenix 4.11.0 Version
Copied the hbase-site.xml in the folder path /spark/conf/hbase-site.xml.
Command executed ->
usr/local/spark> spark-submit phoenix.py --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
Phoenix.py:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark").option("table",
"schema.table1").option("zkUrl", "localhost:2181").load()
df.show()
Error log: Hbase Connection is established, however in the console it is stuck and timing out error is thrown
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Take a look at these answers :
phoenix jdbc doesn't work, no exceptions and stuck
HBase Java client - unknown host: localhost.localdomain
Both of the issues happened in Java (with JDBC), but it looks like it's a similar issue here.
Try to add ZooKeeper hostname (master01, as I see in the error message) to your /etc/hosts :
127.0.0.1 master01
if you are running all your stack locally.

PySpark: java.sql.SQLException: No suitable driver

I have spark code which connects to Netezza and reads a table.
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df=hc.load(source="jdbc",url=address dbname";username=;password=",dbtable="")
I do spark-submit and run the code in the following way..
spark-submit -jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong over here?? is the jar not suitable or is it not able to recgonize the jar?? please let me know the correct way if this is not and also can anyone provide the link to get the jar for connecting netezza from spark.
I am using the 1.6.0 version of spark.