I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded to the master and slave nodes, or pass the equivalent arguments to a spark-submit job.
Since I want to use the existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.41207.jre6.jar) into it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create a DataFrame in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
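A quick sanity check from the notebook, assuming the usual spark session object, is to ask the driver JVM directly whether it can see the driver class; a small sketch:

# Sketch: check whether the PostgreSQL driver class is visible to the driver JVM.
# org.postgresql.Driver is the class name shipped in the PostgreSQL JDBC JAR.
try:
    spark._jvm.java.lang.Class.forName("org.postgresql.Driver")
    print("org.postgresql.Driver is on the driver classpath")
except Exception as err:
    print("driver class not found:", err)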
Help appreciated.
I think you don't need to copy the Postgres JAR to the slaves, as the driver program and the cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}'
                     .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : '(select * from table) as tmp'}  # a table name, or a parenthesised subquery with an alias

df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path when submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class name seems to be org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that.
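For completeness, a minimal PySpark sketch of the corrected call from the question, assuming the PostgreSQL JAR is already on the driver classpath (the connection string and table name are the question's placeholders):

SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"

# Same call as in the question, but with the correct driver class name.
df = spark.read.jdbc(
    url=SQL_CONN,
    table="someTable",
    properties={"driver": "org.postgresql.Driver"},
)
df.printSchema()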
I wanted to understand whether there is a way to connect to a MySQL database over SSH, using a private/public key pair, from a Spark notebook in Scala.
I have been trying to modify this code, to no avail:
Connect to MySQL over SSH using Java
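One direction that seems plausible is to open the SSH tunnel with the key pair first and then point Spark's JDBC reader at the forwarded local port. Below is a rough PySpark sketch of that idea (the Scala flow would be analogous); the sshtunnel library, hosts, paths and credentials are illustrative assumptions, and the forwarded port is only reachable from the machine running the tunnel, so this only really works when the driver and executors sit on that machine (e.g. local mode).

from sshtunnel import SSHTunnelForwarder  # assumed tunnelling library, not part of Spark
from pyspark.sql import SparkSession

# Sketch: forward the remote MySQL port over SSH using a private key,
# then read through JDBC against the forwarded local port.
# All hosts, paths and credentials below are placeholders.
tunnel = SSHTunnelForwarder(
    ("ssh.gateway.example.com", 22),
    ssh_username="ssh_user",
    ssh_pkey="/path/to/private_key",
    remote_bind_address=("mysql.internal.example.com", 3306),
    local_bind_address=("127.0.0.1", 3307),
)
tunnel.start()

spark = SparkSession.builder.appName("mysql-over-ssh").getOrCreate()

# Requires the MySQL Connector/J JAR on the classpath.
df = (spark.read.format("jdbc")
      .options(
          url="jdbc:mysql://127.0.0.1:3307/mydb",
          driver="com.mysql.jdbc.Driver",
          dbtable="someTable",
          user="db_user",
          password="db_password")
      .load())
df.show(5)

tunnel.stop()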
I'm trying to read/write data to an Azure SQL Data Warehouse from an on-demand Spark HDInsight cluster.
I can do this from a normal HDInsight Spark cluster by using a script action to install the JDBC driver, but I don't think it's possible to run script actions on the on-demand clusters.
I've tried:
Copying the files from %user%.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8 up to blob storage, into a folder called jars next to where the built Spark code is.
Including the driver dependency in the built JAR file.
Both of these led to a java.lang.NoClassDefFoundError.
I'm not too familiar with Scala/Maven/the JVM, etc., so I'm not sure what else to try or include in this SO question.
The Scala code I'm trying to run is:
import org.apache.spark.sql.SparkSession

val sqlContext = SparkSession.builder().appName("GenerateEventsSql").getOrCreate()

val jdbcSqlConnStr = "jdbc:sqlserver://someserver.database.windows.net:1433;databaseName=myDW;user=admin;password=XXXX;"
val tableName = "dbo.SomeTable"

val allTableData = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConnStr,
    "dbtable" -> tableName))
  .load()
JARs in a folder on blob storage are not accessible on the classpath of an HDInsight Spark job. You need to copy the JAR files to the local host, for example to /tmp/jars/xyz.jar, and reference them in the spark-submit command.
For example:
nohup spark-submit --jars /tmp/jars/xyz.jar
I am trying to implement the following use case:
Spark reads files in Parquet format from HDFS with Kerberos
Spark writes these files in CSV format
If I write to HDFS, it works perfectly. If I try to write to the local filesystem, it doesn't work: "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To summarize, my code is:
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)
How can I connect to and load files from a remote BigInsights HDFS (with Kerberos authentication enabled) in a local PySpark program for processing?
df = sqlContext.read.parquet("hdfs://<<remote_hdfs_host>>:8020/testDirectory")
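A rough sketch of the local setup this seems to require, under the assumption that the remote cluster's core-site.xml and hdfs-site.xml are available locally via HADOOP_CONF_DIR and that a Kerberos ticket has already been obtained with kinit (the config path below is hypothetical):

import os
from pyspark.sql import SparkSession

# Assumption: the remote BigInsights cluster's core-site.xml / hdfs-site.xml have been
# copied locally, and `kinit` has been run for a principal allowed to read the path.
os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")  # hypothetical location

spark = (SparkSession.builder
         .appName("remote-kerberized-hdfs-read")
         .config("spark.hadoop.hadoop.security.authentication", "kerberos")
         .getOrCreate())

df = spark.read.parquet("hdfs://<<remote_hdfs_host>>:8020/testDirectory")
df.show(5)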
Help would be much appreciated.