I have done two things to install the postgresql driver for use by spark:
copied the driver jar into the $SPARK_HOME/jars directory:
$ll $SPARK_HOME/jars/post*
-rw-r--r-- 1 stephenboesch staff 1046274 Nov 10 14:44 /usr/local/Cellar/apache-spark/3.3.1/libexec/jars/postgresql-42.5.0.jar
added the jar to SparkConf:
spark = (SparkSession.builder
.appName(appName)
.config('spark.jars','/usr/local/Cellar/apache-spark/3.3.1/libexec/jars/postgresql-42.5.0.jar')
.master(master)
.enableHiveSupport()
.getOrCreate())
But when running the pyspark script the driver is not found?
Caused by: java.sql.SQLException: No suitable driver found for jdbc:postgresql://localhost:5432/hive
Apparently the enableHiveSupport() kills it .. ? By commenting it out now it all works
spark = (SparkSession.builder
.appName(appName)
.config('spark.jars','/usr/local/Cellar/apache-spark/3.3.1/libexec/jars/postgresql-42.5.0.jar')
.master("local")
# .enableHiveSupport() # Not sure why enabling this loses the driver..
.getOrCreate())
Related
I want to read data from Postgresql using JDBC and store it in pyspark dataframe. When I want to preview the data in dataframe with methods like df.show(), df.take(), they return an error saying caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printschema() would return info of the DB table perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab in a separated docker container from the Standalone Spark cluster.
You are not using the proper option.
When reading the doc, you see this :
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. This is the reason why the acquisition of the schema works, it is an action done on the driver side. But when you run a spark command, this command is executed by the workers (or executors). They need also to have the .jar to access postgres.
If your postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the worker using spark.jars - I know mysql does not require depencies for example but I never tried postgres. If it needs dependencies, then it is better to call directly the package from maven using spark.jars.packages option. (see the link of the doc for help)
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.
I want to share an interesting error I've caught up recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:166)
at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at org.apache.spark.deploy.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSDelegationTokeProvider.scala:119)
I was trying to spark2-submit a job to a remote driver host on Cloudera cluster like this:
spark = SparkSession.builder
.master("yarn")
.config("cluster")
.config("spark.driver.host", "remote_driver_host")
.config("spark.yarn.keytab", "path_to_pricnipar.keytab")
.config("spark.yarn.principal", "principal.name") \
.config("spark.driver.bindAddress", "0.0.0.0") \
.getOrCreate()
The Apache spark and Hadoop versions on Cloudera cluster are: 2.3.0 and 2.6.0 accordingly.
So the cause of issue was quite trivial, it is spark local binaries vs remote spark driver version mismatch.
Locally I had installed spark 2.4.5 and on Cloudera it was 2.3.0, after aligning the versions to 2.3.0, the issue resolved and the spark job completed successfully.
I am trying to register a udf in scala spark like this where registering the following udf works in hive create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'
val sparkSess = SparkSession.builder()
.appName("Opens")
.enableHiveSupport()
.config("set hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
sparkSess.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
I get an error saying
Exception in thread "main" java.net.MalformedURLException: unknown protocol: s3
Would like to know if I have to set something in config or anything else , I have just started learning.
Any help with this is appreciated.
Why not add this gdpr-hive-udfs-hadoop.jar as an external jar to your project and then do this to register the udf:
val sqlContext = sparkSess.sqlContext
val udf_parallax = sqlContext.udf .register("udf_parallax", com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash _)
Update:
1.If your hive is running on remote server:
val sparkSession= SparkSession.builder()
.appName("Opens")
.config("hive.metastore.uris", "thrift://METASTORE:9083")
.config("set hive.exec.dynamic.partition.mode", "nonstrict")
.enableHiveSupport()
.getOrCreate()
sparkSession.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
2.If hive is not running on remote server:
Copy the hive-site.xml from your /hive/conf/ directory to /spark/conf/ directory and create the SparkSession as you have mentioned in the question
I have a Scala Spark application that I'm trying to run on a Linux server using a shell script. I am getting the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error
while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
However, I don't understand what is wrong. I am doing this to instantiate Spark:
val sparkConf = new SparkConf().setAppName("HDFStoES").setMaster("local")
val spark: SparkSession = SparkSession.builder.enableHiveSupport().config(sparkConf).getOrCreate()
Am I doing this correctly, if so what could be the error?
sparkSession = SparkSession.builder().appName("Test App").master("local[*])
.config("hive.metastore.warehouse.dir", hiveWareHouseDir)
.config("spark.sql.warehouse.dir", hiveWareHouseDir).enableHiveSupport().getOrCreate();
Use above, you need to specify the "hive.metastore.warehouse.dir" directory to enable hive support in spark session.
I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting the basics seems to work:
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()
sc.stop()
This runs as expected. I now want to connect to a Postgres9.3 database running on the same server. I have downloaded the JDBC driver from here here and have put it in the folder D:\Analytics\Spark\spark_jars. I've then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:
spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'
I've ran the following code to test the connection
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
.load(source="jdbc",
url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
dbtable="pubs")
)
sc.stop()
But am getting the following error:
Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
How can I check whether I've downloaded the right .jar file or where else the error might come from?
I have tried SPARK_CLASSPATH environment variable but it doesn't work with Spark 1.6.
Other answers from posts like below suggested adding pyspark command arguments and it works.
Not able to connect to postgres using jdbc in pyspark shell
Apache Spark : JDBC connection not working
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Remove spark-defaults.conf and add the SPARK_CLASSPATH to the system environment in python like this:
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'
Another way to connect pyspark with your postrgresql db.
1) Install spark with pip: pip install pyspark
2) Download last version of jdbc postgresql connector in:
https://jdbc.postgresql.org/download.html
3) Complete this code with your db credentials:
from __future__ import print_function
from pyspark.sql import SparkSession
def jdbc_dataset_example(spark):
df = spark.read \
.jdbc("jdbc:postgresql://[your_db_host]:[your_db_port]/[your_db_name]",
"com_dim_city",
properties={"user": "[your_user]", "password": "[your_password]"})
df.createOrReplaceTempView("[your_table]")
sqlDF = spark.sql("SELECT * FROM [your_table] LIMIT 10")
sqlDF.show()
if __name__ == "__main__":
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.getOrCreate()
jdbc_dataset_example(spark)
spark.stop()
Finally run your aplication with:
spark-submit --driver-class-path /path/to/your_jdbc_jar/postgresql-42.2.6.jar --jars postgresql-42.2.6.jar /path/to/your_jdbc_jar/test_pyspark_to_postgresql.py