Pyspark in docker container postgresql database connection - postgresql

I am trying to connect to a Postgres database on localhost:5432 of my computer using PySpark inside a Docker container. For this I use VS Code, which automatically builds and runs the container. This is the code I have:
password = ...
user = ...
url = 'jdbc:postgresql://127.0.0.1:5432/postgres'

spark = SparkSession.builder \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
    .appName("PySpark_Postgres_test") \
    .getOrCreate()

df = connector.read.format("jbdc") \
    .option("url", url) \
    .option("dbtable", 'chicago_crime') \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "org.postgresql.Driver") \
    .load()
I keep getting the same error:
"An error occurred while calling o358.load.\n:
java.lang.ClassNotFoundException: \nFailed to find data source: jbdc. ...
Maybe the url is not correct?
url = 'jdbc:postgresql://127.0.0.1:5432/postgres'
The database is on port 5432 and has the name postgres. The database is on my localhost, but since I am working in a Docker container I assumed the correct way would be to enter the IP address of my laptop's localhost, 127.0.0.1; if you type localhost, it refers to the localhost of the Docker container itself. Or should I use the IPv4 address (Wireless LAN adapter, or the WSL one)?
Anyone knows what's wrong?
P.S. One of the commands in my Dockerfile is the following:
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar -P /opt/spark/jars

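There are two problems here. First, the format name is misspelled: "jbdc" instead of "jdbc", which is exactly what the ClassNotFoundException is complaining about. Second, from inside a Docker container, localhost and 127.0.0.1 refer to the container itself, not to your machine, so the host's Postgres has to be reached through the special hostname host.docker.internal. The corrected code: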
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host.docker.internal:5432/postgres") \
    .option("dbtable", "chicago_crime") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
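Note that host.docker.internal resolves out of the box on Docker Desktop (Windows/macOS); on Linux you typically have to add it yourself, e.g. by running the container with --add-host=host.docker.internal:host-gateway (Docker 20.10+).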

Related

Pyspark not able to read from bigquery table

I am running the code below to pull a BigQuery table using PySpark. The Spark session is initiated without any issue, but I am not able to connect to the table in the public dataset. Here is the error that I get from running the script.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
(error screenshot: https://i.stack.imgur.com/actAv.png)
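A likely culprit: spark.jars.packages expects Maven coordinates, while a path to a jar file belongs in spark.jars. A minimal sketch of that swap, keeping the jar URL from the question and assuming an environment (such as Dataproc) where gs:// paths are resolvable:

from pyspark.sql import SparkSession

# spark.jars takes file/URL paths to jars; spark.jars.packages takes Maven coordinates.
spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")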

Connect pySpark to aws aurora mysql with certificate

I am running PySpark on AWS Glue and I am trying to connect to an Aurora MySQL DB with a third-party JDBC driver (not the AWS one, but Connector/J). The problem I am facing is that I do not know how to pass the certificate (.pem) so that I can successfully connect to that DB.
spark = SparkSession.builder.enableHiveSupport() \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28.jar,mysql:mysql-connector-java:8.0.28") \
    .appName($JOB) \
    .getOrCreate()

url = "jdbc:mysql://host_name:3306/db_name"
crtf = "tls_ca_cert"  # location of the certificate

df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("table", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("ssl", True) \
    .option("sslmode", "require") \
    .option("SSLServerCertificate", crtf) \
    .load()
This is how I am trying to read the db. Obviously I am missing something. Any help would be appreciated!
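Connector/J does not take a .pem file directly; one common route (a sketch, not AWS-specific guidance) is to import the certificate into a Java truststore first, e.g. keytool -importcert -alias aurora-ca -file tls_ca_cert.pem -keystore truststore.jks -storepass changeit, and then point the driver at that truststore through its own connection properties. The truststore path and password below are assumptions; the other variables come from the question's code:

# A sketch: Spark passes unrecognized options through to the JDBC driver,
# so Connector/J's SSL properties can be set as .option(...) calls.
# "file:/tmp/truststore.jks" and "changeit" are hypothetical values.
df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("sslMode", "VERIFY_CA") \
    .option("trustCertificateKeyStoreUrl", "file:/tmp/truststore.jks") \
    .option("trustCertificateKeyStorePassword", "changeit") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .load()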

Spark cannot find SQL server jdbc driver even if both the mssql.jar and the .dll are present in the path

I am trying to connect Spark to a SQL server using this:
# Myscript.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "/home/mssql-jdbc-9.2.1.jre15.jar:/home/sqljdbc_auth.dll") \
    .getOrCreate()

sqldb = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://server:5150;databaseName=testdb;integratedSecurity=true") \
    .option("dbtable", "test_tbl") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

sqldb.select('coldate').show()
I have made sure that both the .dll and the .jar are under the /home folder. I call it like so:
spark-submit --jars /home/sqljdbc41.jar MyScript.py
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:3462d79d-c165-4607-9790-67a2c786a9cf
Seems like it cannot find the .dll file? I have verified it exists under /home.
This error was resolved when I placed the sqljdbc_auth.dll file in the "C:\Windows\System32" folder.
For those who want to know where to find this DLL file, you may:
Download the JDBC Driver for SQL Server (sqljdbc_6.0.8112.200_enu.exe) from the Microsoft website below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Unzip it and navigate as follows:
\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\auth\x64
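An alternative to copying the DLL into System32 is to tell the JVM where to look: the driver loads sqljdbc_auth.dll through java.library.path. A sketch, assuming a Windows host with the jar and DLL in a hypothetical C:\drivers folder (take the DLL from the auth\x64 folder above if your JVM is 64-bit):

spark-submit --driver-java-options "-Djava.library.path=C:\drivers" --jars C:\drivers\mssql-jdbc-9.2.1.jre15.jar MyScript.py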

Cannot write to postgres

I have a Postgres database on an EC2 machine. Using PySpark on a cluster setup, I am trying to write to the Postgres DB but am not able to.
The Postgres database has a DB my_db, containing a table events.
My PySpark code is:
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
    .option("dbtable", "events") \
    .option("user", "xxx") \
    .option("password", "xxx") \
    .option("driver", "org.postgresql.Driver") \
    .mode('append') \
    .save()
When executing I receive this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o69.save. : org.postgresql.util.PSQLException: ERROR: relation "events" already exists
It seems that it tries to create a new table every time I execute spark-submit. How do I solve this error?
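One thing worth checking (an assumption, not a confirmed diagnosis): before appending, Spark probes whether the table exists, and if that probe looks in a different schema than the one holding events, Spark falls back to CREATE TABLE, which then collides with the existing relation. Schema-qualifying the table name is a cheap experiment:

# A sketch: qualify the table with its schema (assumed "public" here) so the
# existence check hits the same relation that already exists.
# The URL is copied verbatim from the question, including its elided port.
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
    .option("dbtable", "public.events") \
    .option("user", "xxx") \
    .option("password", "xxx") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()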

Is it possible to connect to a postgresql DB and load a postgresql table to a sparkDataFrame from Databricks notebook

I need to connect from my Azure Databricks notebook (cloud subscription) to a PostgreSQL DB available in an on-prem subscription and load a PostgreSQL table into a Spark DataFrame. Please let me know if anybody has worked on this. I know I can run the PySpark code below to read the data from a table, but I need help on how to establish the connection from my Azure Databricks notebook to the on-prem PostgreSQL DB.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.printSchema()
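Reading the table is the easy part; the real prerequisite is network reachability from the Databricks workspace to the on-prem database (typically VNet injection plus a VPN or ExpressRoute link, which is an infrastructure task rather than a notebook one). Once a route exists, the read looks the same, just with the on-prem host instead of localhost. A sketch with a hypothetical hostname; note also that Databricks runtimes generally bundle a PostgreSQL JDBC driver, so the spark.jars line is usually unnecessary there:

# A sketch, assuming network connectivity from the workspace to on-prem is in place.
# "onprem-pg.corp.example.com" is a hypothetical hostname for the on-prem server.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://onprem-pg.corp.example.com:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()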