Cannot write to postgres - postgresql

I have a Postgres database on an EC2 machine. Using PySpark on a cluster setup, I am trying to write to the Postgres DB but am not able to.
The Postgres instance has a database my_db, which contains a table events.
My PySpark code is:
df.write.format("jdbc") \
.option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
.option("dbtable", "events") \
.option("user", "xxx") \
.option("password", "xxx") \
.option("driver", "org.postgresql.Driver").mode('append').save()
When executing I receive this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o69.save. : org.postgresql.util.PSQLException: ERROR: relation "events" already exists
It seems that Spark tries to create a new table when I execute spark-submit. How can I solve this error?
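A hedged sketch of one common workaround, assuming (this is an assumption, not something stated in the question) that the events table lives in the default public schema: with mode('append') Spark only issues a CREATE TABLE when its table-existence check fails, so schema-qualifying dbtable makes the existence check and the insert resolve to the same relation.
# Hedged sketch; "public.events" is an assumption about the schema, and the
# url/user/password placeholders are copied from the question as-is.
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://ec2-xxxxx.compute-1.amazonaws.com:543x/my_db") \
    .option("dbtable", "public.events") \
    .option("user", "xxx") \
    .option("password", "xxx") \
    .option("driver", "org.postgresql.Driver") \
    .mode('append') \
    .save()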

Related

Pyspark in docker container postgresql database connection

I am trying to connect to a Postgres database on localhost:5432 of my computer using PySpark inside a Docker container. For this I use VS Code, which automatically builds and runs the container. This is the code I have:
password = ...
user = ...
url = 'jdbc:postgresql://127.0.0.1:5432/postgres'
spark = SparkSession.builder.config("spark.jars","/opt/spark/jars/postgresql-42.2.5.jar") \
.appName("PySpark_Postgres_test").getOrCreate()
df = connector.read.format("jbdc") \
.option("url", url) \
.option("dbtable", 'chicago_crime') \
.option("user", user) \
.option("password", password) \
.option("driver", "org.postgresql.Driver") \
.load()
I keep getting the same error:
"An error occurred while calling o358.load.\n:
java.lang.ClassNotFoundException: \nFailed to find data source: jbdc. ...
Maybe the url is not correct?
url = 'jdbc:postgresql://127.0.0.1:5432/postgres'
The database is on port 5432 and has the name postgres. The database is on my localhost, but since I am working in a Docker container I assumed the correct way would be to enter the IP address of your laptop's localhost, 127.0.0.1. If you type localhost it would refer to the localhost of your Docker container. Or should I use the IPv4 address (Wireless LAN .. or WSL)?
Does anyone know what's wrong?
ps, one of the commands in my dockerfile is the following:
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar -P /opt/spark/jars
The format name has to be "jdbc" (the error shows it was spelled "jbdc"), and from inside the container the host machine's Postgres is reachable via host.docker.internal rather than 127.0.0.1:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://host.docker.internal:5432/postgres") \
.option("dbtable", "chicago_crime") \
.option("user", "postgres") \
.option("password", "postgres") \
.option("driver", "org.postgresql.Driver") \
.load()
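As a side note, a hedged alternative to downloading the driver jar in the Dockerfile is to let Spark pull it from Maven Central at session start via spark.jars.packages (the coordinate org.postgresql:postgresql:42.2.5 is the same driver version); this assumes the container has internet access:
from pyspark.sql import SparkSession

# Pull the PostgreSQL JDBC driver as a Maven package instead of pointing
# spark.jars at a jar baked into the image (assumes access to Maven Central).
spark = SparkSession.builder \
    .appName("PySpark_Postgres_test") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5") \
    .getOrCreate()
Also note that host.docker.internal resolves automatically in Docker Desktop containers; on a Linux Docker engine the container typically needs to be started with --add-host=host.docker.internal:host-gateway for that name to resolve.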

Connect pySpark to aws aurora mysql with certificate

I am running PySpark on AWS Glue and I am trying to connect to an Aurora MySQL DB with a third-party JDBC driver (not the AWS one, but Connector/J). The problem I am facing is that I do not know how to pass the certificate (.pem) so I can successfully connect to that DB.
spark = SparkSession.builder.enableHiveSupport() \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28.jar,mysql:mysql-connector-java:8.0.28") \
.appName($JOB) \
.getOrCreate()
url = "jdbc:mysql://host_name:3306/db_name"
crtf = "tls_ca_cert" # location of the certificate
df = spark.read \
.format("jdbc") \
.option("url", url) \
.option("table", table) \
.option("user", user) \
.option("password", psw) \
.option("ssl", True) \
.option("sslmode", "require") \
.option("SSLServerCertificate", crtf) \
.load()
This is how I am trying to read the db. Obviously I am missing something. Any help would be appreciated!
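A hedged sketch of one way this is often handled, not necessarily what Glue expects: Connector/J does not read a .pem file directly, so the certificate is first imported into a Java truststore (for example with keytool) and the driver is then pointed at that truststore through JDBC URL properties. The truststore path and password below are hypothetical, and note that the Spark option is dbtable rather than table:
# Hypothetical truststore created beforehand from the .pem, e.g.:
#   keytool -importcert -alias aurora-ca -file tls_ca_cert.pem \
#           -keystore /tmp/rds-truststore.jks -storepass changeit -noprompt
url = (
    "jdbc:mysql://host_name:3306/db_name"
    "?sslMode=VERIFY_CA"
    "&trustCertificateKeyStoreUrl=file:///tmp/rds-truststore.jks"
    "&trustCertificateKeyStorePassword=changeit"
)

df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .load()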

Is it possible to connect to a postgresql DB and load a postgresql table to a sparkDataFrame from Databricks notebook

I need to connect to a PostgreSQL DB available in an on-prem subscription from my Azure Databricks notebook (cloud subscription) and load a PostgreSQL table to a Spark DataFrame. Please let me know if anybody has worked on this. I know I can run the below PySpark code to read the data from a table, but I need help on how to establish a connection from my Azure Databricks notebook to the PostgreSQL DB available in the on-prem subscription.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/databasename") \
.option("dbtable", "tablename") \
.option("user", "username") \
.option("password", "password") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()

Save a DataFrame in postgresql

I'm trying to save my data frame in *.orc format using JDBC in PostgreSQL. I have an intermediate table created on my localhost on the server I use, but the table is not saved in PostgreSQL.
I would like to find out which formats PostgreSQL works with (you may not be able to create a *.orc table in it), and whether it accepts a Dataset or an SQL query from the created table.
I'm using Spark.
Properties conProperties = new Properties();
conProperties.setProperty("driver", "org.postgresql.Driver");
conProperties.setProperty("user", srgs[2]);
conProperties.setProperty("password", args[3]);
finalTable.write()
.format("orc")
.mode(SaveMode.Overwrite)
.jdbc(args[1], "dataTable", conProperties);
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster?jdbc:postgresql://MyHostame:5432/nameUser&user=nameUser&password=passwordUser
You cannot store the .orc format in Postgres, and you wouldn't want to.
See the updated write below:
// url, user, and password come from the program arguments
finalTable.write()
    .format("jdbc")
    .option("url", args[1])
    .option("dbtable", "dataTable")
    .option("user", args[2])
    .option("password", args[3])
    .save();
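As a hedged aside on the spark-submit line in the question: the URL, user, and password are mangled into one argument there, and an unquoted & is also interpreted by the shell. Assuming the application reads them as args[1], args[2], and args[3] (the database name below is a hypothetical placeholder), passing them as separate, quoted arguments would look roughly like this:
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster "jdbc:postgresql://MyHostame:5432/name_db" nameUser passwordUser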

Write dataframe to Teradata in Spark

I have values in a dataframe, and I have created a table structure in Teradata. My requirement is to load the dataframe into Teradata, but I am getting an error.
I have tried the following code:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","organization.td.intranet")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I got an error:
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 48 elided
I changed the url option to make it look like a proper JDBC URL, and ran the following command:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","jdbc:teradata//organization.td.intranet,CHARSET=UTF8,TMODE=ANSI,user=G01159039")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I am still getting the error:
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 48 elided
I have included the following jars with the --jars option:
tdgssconfig-16.10.00.03.jar
terajdbc4-16.10.00.03.jar
teradata-connector-1.2.1.jar
Teradata version: 15
Spark version: 2
Change the url and dbtable options to the following:
.option("url","jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable","emp")
Also note that in Teradata there are no row locks, so the above will create a table lock; i.e., it will not be efficient, and parallel writes from Spark JDBC are not possible.
Teradata's native tools (FastLoad/BTEQ combinations) will work.
Another option, which requires a complicated setup, is Teradata QueryGrid; this is very fast and uses Presto behind the scenes.
I found the actual issue.
The JDBC URL should be in the following form:
val jdbcUrl = s"jdbc:teradata://${jdbcHostname}/database=${jdbcDatabase},user=${jdbcUsername},password=${jdbcPassword}"
It was causing the exception because I didn't supply the username and password.
Below is code useful for reading data from a Teradata table:
df = (spark.read.format("jdbc").option("driver", "com.teradata.jdbc.TeraDriver")
.option("url", "jdbc:teradata//organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "(select * from td_s_zm_brainsdb.emp) AS t")
.option("user", "userid")
.option("password", "password")
.load())
This will create a data frame in Spark.
For writing data back to the database, the statement below saves the data to a JDBC source:
jdbcDF.write \
.format("jdbc") \
.option("url", "jdbc:teradata//organization.td.intranet/Database=td_s_zm_brainsdb") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()