Save a DataFrame in PostgreSQL

I'm trying to save my DataFrame as *.orc via JDBC into PostgreSQL. An intermediate table gets created on the server I use (on localhost), but the table is never saved in PostgreSQL.
I would like to find out which formats PostgreSQL works with (it may not be possible to create an *.orc table in it), and whether it accepts a Dataset or an SQL query built from the created table.
I'm using Spark.
Properties conProperties = new Properties();
conProperties.setProperty("driver", "org.postgresql.Driver");
conProperties.setProperty("user", args[2]);
conProperties.setProperty("password", args[3]);
finalTable.write()
    .format("orc")
    .mode(SaveMode.Overwrite)
    .jdbc(args[1], "dataTable", conProperties);
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster?jdbc:postgresql://MyHostname:5432/nameUser&user=nameUser&password=passwordUser

You cannot keep the .orc format in Postgres; ORC is a file format, while Postgres stores rows in its own internal format. You wouldn't want that either.
See the updated write below:
finalTable.write()
    .format("jdbc")
    .option("url", args[1])
    .option("dbtable", "dataTable")
    .option("user", args[2])
    .option("password", args[3])
    .save();
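If you still want an ORC copy of the data, it belongs on a filesystem such as HDFS rather than in Postgres. A minimal PySpark sketch of that idea (the HDFS path is hypothetical):
# hypothetical path: keep an ORC copy on HDFS alongside the JDBC write to Postgres
finalTable.write.mode("overwrite").orc("hdfs:///data/dataTable_orc")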

Related

How to make 'First Row' the Header when saving data to SQL Server with Databricks

Can someone let me know how to make the first row the header when saving data to SQL Server with Databricks?
I am currently using the following code to upload / save to SQL in Azure:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
df = spark.read.csv("/mnt/lake/RAW/OptionsetMetadata.csv")
df.write.mode("overwrite") \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", 'UpdatedProducts')\
.save()
After saving, the table's column names come out as _c0, _c1, … and the real header row is stored as data.
The JDBC driver creates the table according to the DataFrame's schema. It looks like you're reading from the CSV file without specifying .option("header", "true"), so Spark treats the header row as data. Just add this option to your read operation.
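For completeness, a minimal sketch of the corrected read, assuming the same mount path as in the question:
# read the CSV with its first row treated as column names
df = spark.read.option("header", "true").csv("/mnt/lake/RAW/OptionsetMetadata.csv")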

How to connect to MongoDB Atlas from a Databricks cluster using PySpark

This is my simple code in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
.getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, in Advanced Options under the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note: the connection string should look like this (with your appropriate user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook and it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://user:password#cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority")\
.option("database", "my_database")\
.option("collection", "my_collection")\
.load()
You can use the following to write to MongoDB:
events_df.write\
.format('com.mongodb.spark.sql.DefaultSource')\
.mode("append")\
.option( "uri", "mongodb+srv://user:password#cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
.save()
Hope this works for you. Please let others know if it worked.

Unable to write Spark's DataFrame to hive using presto

I'm writing some code to save a DataFrame to a Hive database using Presto:
df.write.format("jdbc")
.option("url", "jdbc:presto://myurl/hive?user=user/default")
.option("driver","com.facebook.presto.jdbc.PrestoDriver")
.option("dbtable", "myhivetable")
.mode("overwrite")
.save()
This should work, but it actually raises an exception:
java.lang.IllegalArgumentException: Can't get JDBC type for array<string> 
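The error itself points at the cause: Spark's JDBC writer has no JDBC type mapping for array<string>, so complex columns have to be flattened or serialized before writing. One hedged workaround sketch (PySpark, assuming Spark 2.4+ and that storing the arrays as JSON text is acceptable) serializes every array column first:
from pyspark.sql.functions import col, to_json
from pyspark.sql.types import ArrayType

# serialize each array column to a JSON string so the JDBC writer can map it to a text type
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(field.name, to_json(col(field.name)))
# after this, the same df.write.format("jdbc")... call should no longer hit the array<string> mapping error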

Error when importing an MSSQL table named as an integer into a Spark DataFrame

I have a table in MSSQL named dbo.1table and I need to convert it into a DataFrame and later save it as an Avro file, but I can't even load it as a DataFrame.
I tested my code with tables named with characters a-z and it works; I also tried converting the table name toString(), and nothing has worked so far.
I expect to get a DataFrame and then save it as an Avro file. Instead I get the following error:
val DFDimAccountOperator = spark.read.format("jdbc")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", connection)
.option("dbtable", "dbo.1table")
.option("user", userId)
.option("password", pwd).load()
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near '.1'. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1621)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2935)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at com.aon.ukbi.TomsExample$.runJdbcDatasetExample(TomsExample.scala:27)
at com.aon.ukbi.TomsExample$.main(TomsExample.scala:16)
For making a connection between MSSQL and Spark, you need to add the sqljdbc jar to the $SPARK_HOME/jars location, restart your spark-shell, and paste these lines into the Spark shell:
scala> val DFDimAccountOperator = spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", "jdbc:sqlserver://xxxxxxxx.xxx:1433;database=xx;user=xxxxxx;password=xxxxxx")
    .option("dbtable", "xxxxxx")
    .load()
Restart and re-run the code (replace the xxxxxx placeholders with appropriate values).
After this, you can write your DataFrame in whichever format you want:
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Hope this helps; let me know if you have further queries related to this. If it solves your problem, accept the answer. Happy Hadooping!
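It is also worth noting that the "Incorrect syntax near '.1'" error comes from the unquoted identifier: in T-SQL, a name that starts with a digit must be bracket-delimited. A hedged sketch of the read with the escaped table name (PySpark, reusing the question's variable names):
# bracket-delimit the identifier so SQL Server can parse a table name that starts with a digit
df = (spark.read.format("jdbc")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("url", connection)
      .option("dbtable", "[dbo].[1table]")
      .option("user", userId)
      .option("password", pwd)
      .load())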

Write dataframe to Teradata in Spark

I have values in a DataFrame, and I have created a table structure in Teradata. My requirement is to load the DataFrame into Teradata, but I am getting an error.
I have tried the following code:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","organization.td.intranet")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I got an error:
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 48 elided
I changed the url option to make it look like a proper JDBC URL and ran the following command:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","jdbc:teradata//organization.td.intranet,CHARSET=UTF8,TMODE=ANSI,user=G01159039")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I am still getting the error:
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 48 elided
I have included the following jars with the --jars option:
tdgssconfig-16.10.00.03.jar
terajdbc4-16.10.00.03.jar
teradata-connector-1.2.1.jar
Teradata version: 15
Spark version: 2
Change the JDBC URL and dbtable to the following:
.option("url","jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable","emp")
Also note that Teradata has no row locks, so the above will take a table lock. In other words, it will not be efficient: parallel writes from Spark JDBC are not possible.
Teradata's native tools (FastLoad/BTEQ combinations) will work.
Another option, which requires a complicated setup, is Teradata QueryGrid; it is super fast and uses Presto behind the scenes.
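Given the locking constraint above, a hedged sketch that limits the write to a single JDBC connection so the table lock is not contended (PySpark, reusing the URL and credentials from this question):
# coalesce to one partition: a single writer connection avoids Teradata table-lock contention
(df.coalesce(1).write
    .format("jdbc")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
    .option("dbtable", "emp")
    .option("user", "userid")
    .option("password", "password")
    .mode("append")
    .save())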
I found the actual issue.
The JDBC URL should be in the following form:
val jdbcUrl = s"jdbc:teradata://${jdbcHostname}/database=${jdbcDatabase},user=${jdbcUsername},password=${jdbcPassword}"
It was causing the exception because I didn't supply the username and password.
Below is code that is useful when reading data from a Teradata table:
df = (spark.read.format("jdbc").option("driver", "com.teradata.jdbc.TeraDriver")
.option("url", "jdbc:teradata//organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "(select * from td_s_zm_brainsdb.emp) AS t")
.option("user", "userid")
.option("password", "password")
.load())
This will create a DataFrame in Spark.
For writing data back to the database, below is the statement (saving data to a JDBC source):
jdbcDF.write \
.format("jdbc") \
.option("url", "jdbc:teradata//organization.td.intranet/Database=td_s_zm_brainsdb") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
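One caveat on the write above: without an explicit .mode(...), Spark's JDBC writer fails if the target table already exists, so when loading into a pre-created Teradata table you most likely want append mode, e.g.:
# append into the existing table instead of failing because it already exists
jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .mode("append") \
    .save()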