I am running PySpark on AWS Glue and I am trying to connect to an Aurora MySQL database with a third-party JDBC driver (MySQL Connector/J, not the AWS one). The problem I am facing is that I do not know how to pass the certificate (.pem) so that I can successfully connect to that DB.
# spark.jars.packages takes Maven coordinates (group:artifact:version);
# job_name is a placeholder variable holding the Glue job name
spark = SparkSession.builder.enableHiveSupport() \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28") \
    .appName(job_name) \
    .getOrCreate()
url = "jdbc:mysql://host_name:3306/db_name"
crtf = "tls_ca_cert" # location of the certificate
df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("ssl", True) \
    .option("sslmode", "require") \
    .option("SSLServerCertificate", crtf) \
    .load()
This is how I am trying to read from the database. Obviously I am missing something. Any help would be appreciated!
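For reference, here is a hedged sketch of one way this is usually done (untested on Glue, with placeholder paths and passwords): MySQL Connector/J does not accept a .pem file directly, and SSLServerCertificate does not appear to be one of its connection properties. The usual recipe is to import the CA certificate into a Java truststore with keytool and point the driver at it via trustCertificateKeyStoreUrl, passing the SSL settings as JDBC URL parameters so they actually reach the driver:

# Sketch, untested. First convert the PEM CA cert to a Java truststore:
#   keytool -importcert -alias aurora-ca -file tls_ca_cert.pem \
#           -keystore truststore.jks -storepass changeit -noprompt
# The truststore must be present on the Glue workers (e.g. shipped with the
# job via --extra-files); /tmp/truststore.jks and changeit are placeholders.
url = ("jdbc:mysql://host_name:3306/db_name"
       "?sslMode=VERIFY_CA"
       "&trustCertificateKeyStoreUrl=file:/tmp/truststore.jks"
       "&trustCertificateKeyStorePassword=changeit")

df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .load()

With sslMode=VERIFY_CA the driver validates the server certificate against the truststore; VERIFY_IDENTITY additionally checks the hostname.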
Related
I am trying to connect to a PostgreSQL database on localhost:5432 of my computer using PySpark inside a Docker container. For this I use VS Code, which automatically builds and runs the container. This is the code I have:
password = ...
user = ...
url = 'jdbc:postgresql://127.0.0.1:5432/postgres'

spark = SparkSession.builder.config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
    .appName("PySpark_Postgres_test").getOrCreate()
df = connector.read.format("jbdc") \
    .option("url", url) \
    .option("dbtable", 'chicago_crime') \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "org.postgresql.Driver") \
    .load()
I keep getting the same error:

An error occurred while calling o358.load.
java.lang.ClassNotFoundException: Failed to find data source: jbdc. ...
Maybe the URL is not correct?

url = 'jdbc:postgresql://127.0.0.1:5432/postgres'

The database is on port 5432 and is named postgres. It is on my localhost, but since I am working inside a Docker container I assumed the correct approach would be to enter my laptop's localhost IP address, 127.0.0.1; typing localhost would refer to the localhost of the Docker container itself. Or should I use the IPv4 address (Wireless LAN, or WSL)?
Does anyone know what's wrong?
PS: one of the commands in my Dockerfile is the following:
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar -P /opt/spark/jars
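A corrected version is shown below: the format name is jdbc (not jbdc), and host.docker.internal replaces 127.0.0.1, since from inside a Docker Desktop container 127.0.0.1 refers to the container itself while host.docker.internal resolves to the host machine: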
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
    .getOrCreate()
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host.docker.internal:5432/postgres") \
    .option("dbtable", "chicago_crime") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
I am running the code below to pull a BigQuery table using PySpark. The Spark session initializes without any issue, but I am not able to connect to the table in the public dataset. The error I get from running the script is in the screenshot linked below.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()
df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
Screenshot of the error: https://i.stack.imgur.com/actAv.png
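Since the error itself is only visible in the screenshot, this is an assumption, but one likely culprit is that spark.jars.packages expects Maven coordinates (group:artifact:version), not a path to a jar file. A sketch of two variants that may work: pass the gs:// jar through spark.jars instead, or let Spark resolve the connector from Maven:

from pyspark.sql import SparkSession

# Variant 1: pass the jar itself via spark.jars (works on clusters that can
# read gs:// paths, e.g. Dataproc with the GCS connector installed)
spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

# Variant 2: resolve the connector from Maven instead; coordinate shown for
# Scala 2.12 -- adjust the artifact and version to your Spark build
# spark = SparkSession.builder \
#     .appName('Optimize BigQuery Storage') \
#     .config('spark.jars.packages',
#             'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1') \
#     .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
df.show(5)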
I'm using PySpark v3.2.1 to write to a Kafka broker:
data_frame \
    .selectExpr('CAST(id AS STRING) AS key', "to_json(struct(*)) AS value") \
    .write \
    .format('kafka') \
    .option('topic', topic) \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .mode('append') \
    .save()
I'm facing a real engineering issue: how can I ensure an atomic write and an idempotent producer?
Any suggested solutions are appreciated, thanks.
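A hedged sketch of what is achievable: Spark's built-in Kafka sink provides at-least-once semantics and no transactional commit, so a failed-and-retried task can re-send records and atomicity for the batch as a whole is not available out of the box. What can be done on the producer side is to enable the idempotent producer, since options prefixed with kafka. are forwarded to the underlying Kafka producer as configs:

# Sketch: "kafka."-prefixed options are passed through to the Kafka producer,
# so standard producer configs apply. Idempotence prevents duplicates from
# producer-level retries within a session; it does not make the batch atomic,
# so consumers may still need to deduplicate by key after task retries.
data_frame \
    .selectExpr('CAST(id AS STRING) AS key', "to_json(struct(*)) AS value") \
    .write \
    .format('kafka') \
    .option('topic', topic) \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('kafka.enable.idempotence', 'true') \
    .option('kafka.acks', 'all') \
    .mode('append') \
    .save()

If exactly-once delivery is required end to end, the usual pattern is to write a stable key per record and deduplicate on the consumer side.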
I have a PostgreSQL database on an EC2 machine. Using PySpark on a cluster setup, I am trying to write to the Postgres DB but am not able to.
The Postgres instance has a database my_db containing a table events.
My PySpark code is:
df.write.format("jdbc") \
.option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
.option("dbtable", "events") \
.option("user", "xxx") \
.option("password", "xxx") \
.option("driver", "org.postgresql.Driver").mode('append').save()
When executing, I receive this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o69.save. : org.postgresql.util.PSQLException: ERROR: relation "events" already exists

It seems that Spark tries to create a new table when I execute spark-submit, even though the mode is append. How do I solve this error?
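One possible explanation (an assumption; the error alone does not prove it): in append mode Spark first checks whether the target table exists, and if that check does not find events (for example because the table lives in a schema that is not on the connecting user's search_path), Spark issues a CREATE TABLE that then collides with the existing relation. Schema-qualifying the table name is worth a try; public below is a placeholder for the actual schema:

# Sketch: schema-qualify the target table so Spark's existence check and the
# subsequent INSERTs resolve to the same relation. "public" is an assumption.
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
    .option("dbtable", "public.events") \
    .option("user", "xxx") \
    .option("password", "xxx") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()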
I need to connect from my Azure Databricks notebook (cloud subscription) to a PostgreSQL DB available in an on-prem subscription and load a PostgreSQL table into a Spark DataFrame. Please let me know if anybody has worked on this. I know I can run the PySpark code below to read the data from a table, but I need help establishing the connection from my Azure Databricks notebook to the PostgreSQL DB in the on-prem subscription.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()
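For what it's worth, the hard part here is usually the network path rather than the code: the Databricks workspace typically needs to be deployed into your own VNet (VNet injection) and connected to the on-prem network via a VPN gateway or ExpressRoute so the cluster nodes can reach the database host. Once that is in place, the read looks like the snippet above with the on-prem hostname; everything below is a placeholder sketch (Databricks runtimes generally bundle the PostgreSQL JDBC driver, so the spark.jars config may be unnecessary there):

# Sketch with placeholder host/credentials; assumes network connectivity from
# the Databricks workers to the on-prem host (VNet injection + VPN/ExpressRoute).
# In a Databricks notebook the global `spark` session already exists.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://onprem-db.internal.example.com:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.printSchema()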