Unable to write Spark's DataFrame to hive using presto - scala

I'm writing some code to save a DataFrame to a Hive database using Presto:
df.write.format("jdbc")
.option("url", "jdbc:presto://myurl/hive?user=user/default")
.option("driver","com.facebook.presto.jdbc.PrestoDriver")
.option("dbtable", "myhivetable")
.mode("overwrite")
.save()
This should work, but it raises an exception:
java.lang.IllegalArgumentException: Can't get JDBC type for array<string> 

How to make 'First Row' the Header when saving data to SQL Server with Databricks

Can someone let me know how to make the first row the header when saving to SQL Server with Databricks?
I am currently using the following code to upload/save to SQL in Azure:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
df = spark.read.csv("/mnt/lake/RAW/OptionsetMetadata.csv")
df.write.mode("overwrite") \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", 'UpdatedProducts')\
.save()
The table looks like the following after saving:
The JDBC writer creates the table according to the DataFrame's schema. It looks like you're reading from a CSV file without specifying .option("header", "true"), so the first row is treated as data rather than as column names. Just add this option to your read operation.
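To illustrate what the header option changes, here is a minimal plain-Python sketch (not Spark itself) that mimics the two behaviors. Without a header, the first row stays in the data and the columns get positional names such as _c0, which is how Spark names them by default. With a header, the first row supplies the column names. The sample contents and the function name read_rows are hypothetical.

```python
import csv
import io

def read_rows(text, header=False):
    """Mimic how a CSV reader treats the first row with and without a header."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        # First row becomes the column names and is removed from the data.
        return rows[0], rows[1:]
    # No header: positional names (Spark uses _c0, _c1, ...) and every row is data.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows

sample = "OptionSet,Label\n1,Active\n2,Inactive\n"  # hypothetical file contents
print(read_rows(sample))               # first row is kept as data
print(read_rows(sample, header=True))  # first row becomes the column names
```

In the PySpark code from the question, the corresponding change would be spark.read.option("header", "true").csv("/mnt/lake/RAW/OptionsetMetadata.csv").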

How can I fix an error of spark application?

I'm loading data from HDFS into ClickHouse with a Spark job and get the error DB::Exception: Too many parts (306). Merges are processed much slower than inserts. (TOO_MANY_PARTS) (version 22.3.44 (official build))
The data is in Parquet, 34 GB in total.
Path to one Parquet file: "hdfs://host:8020/user/stat/year=2022/month=1/day=1/101.parq".
My settings are as follows:
val df = spark.read.parquet("hdfs://host:8020/user/stat/year=2022/")
df.write
.format("jdbc")
.mode("append")
.option("driver","cc.blynk.clickhouse.ClickHouseDriver")
.option("url", "jdbc:clickhouse://host:8123/default")
.option("user", "login")
.option("password", "pass")
.option("dbtable", "table")
.save()
I'm new to Scala and Spark, so thanks for any advice.
Use async_insert:
https://clickhouse.com/docs/en/operations/settings/settings/#async-insert
.option("async_insert", "1")
Or, for older ClickHouse versions, use Engine=Buffer tables:
https://clickhouse.com/docs/en/engines/table-engines/special/buffer/
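As background on why the error appears: each INSERT into a MergeTree table creates at least one new part, and Spark's JDBC writer sends rows in batches of batchsize rows (default 1000) from every partition. A rough arithmetic sketch, assuming a hypothetical row count and partition count and an even spread of rows, shows how the batch size drives the number of INSERT statements:

```python
import math

def estimate_inserts(total_rows, num_partitions, batch_size=1000):
    """Estimate how many JDBC batch INSERTs Spark issues for a write.

    Assumes rows are spread evenly across partitions; each partition
    sends ceil(rows_in_partition / batch_size) batches.
    """
    rows_per_partition = math.ceil(total_rows / num_partitions)
    return num_partitions * math.ceil(rows_per_partition / batch_size)

# Hypothetical numbers, not taken from the question.
rows, partitions = 100_000_000, 200
print(estimate_inserts(rows, partitions))                      # default batchsize of 1000
print(estimate_inserts(rows, partitions, batch_size=100_000))  # larger batches, fewer inserts
```

In Spark, the batch size is set with .option("batchsize", ...) on the JDBC writer, and df.coalesce(n) reduces the number of concurrent writers. Both may help alongside the settings above.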

Save a DataFrame in postgresql

I'm trying to save my DataFrame in *.orc format to PostgreSQL using JDBC. I have an intermediate table created on the server I use (localhost), but the table is not saved in PostgreSQL.
I would like to find out which formats PostgreSQL works with (it may not be possible to create an *.orc table in it), and what it accepts: a Dataset, or an SQL query against the created table.
I'm using Spark.
Properties conProperties = new Properties();
conProperties.setProperty("driver", "org.postgresql.Driver");
conProperties.setProperty("user", args[2]);
conProperties.setProperty("password", args[3]);
finalTable.write()
.format("orc")
.mode(SaveMode.Overwrite)
.jdbc(args[1], "dataTable", conProperties);
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster?jdbc:postgresql://MyHostame:5432/nameUser&user=nameUser&password=passwordUser
You cannot keep the .orc format in Postgres, and you wouldn't want to either.
See the updated write below:
finalTable.write()
.format("jdbc")
.option("url", args[1])
.option("dbtable", "dataTable")
.option("user", args[2])
.option("password", args[3])
.save();

Error when importing into a spark dataframe a mssql table named as an integer

I have a table in MSSQL named dbo.1table. I need to load it into a DataFrame and later save it as an Avro file, but I can't even load it as a DataFrame.
I tested my code with tables whose names contain only the characters a-z and it works. I also tried calling toString() on the table name, but nothing has worked so far.
I expect to get a DataFrame and then save it as an Avro file.
Instead, the code below fails with the error that follows it:
val DFDimAccountOperator = spark.read.format("jdbc")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", connection)
.option("dbtable", "dbo.1table")
.option("user", userId)
.option("password", pwd).load()
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near '.1'.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1621)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2935)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at com.aon.ukbi.TomsExample$.runJdbcDatasetExample(TomsExample.scala:27)
at com.aon.ukbi.TomsExample$.main(TomsExample.scala:16)
To connect Spark to MSSQL, add the sqljdbc jar to the $SPARK_HOME/jars directory, restart spark-shell, and paste this line into the Spark shell:
scala> val DFDimAccountOperator = spark.read.format("jdbc").option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") .option("url", "jdbc:sqlserver://xxxxxxxx.xxx:1433;database=xx;user=xxxxxx;password=xxxxxx") .option("dbtable", "xxxxxx").load()
Restart and re-run the code (replace the xxxx placeholders with appropriate values).
After this, you can write your DataFrame in whichever format you want:
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Hope this helps. Let me know if you have further questions about this, and if it solves your problem, please accept the answer. Happy Hadoop!
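The error message "Incorrect syntax near '.1'" points at the table name itself: in SQL Server, an identifier that starts with a digit has to be delimited with square brackets or double quotes. A minimal sketch of a helper that bracket-quotes each part of a name (the function name quote_mssql_name is hypothetical):

```python
def quote_mssql_name(*parts):
    """Bracket-quote each part of a SQL Server object name.

    A closing bracket inside a name is escaped by doubling it.
    """
    return ".".join("[" + part.replace("]", "]]") + "]" for part in parts)

dbtable = quote_mssql_name("dbo", "1table")
print(dbtable)  # [dbo].[1table]
```

The result would then be passed to the reader from the question as .option("dbtable", "[dbo].[1table]").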

Write dataframe to Teradata in Spark

I have values in a DataFrame, and I have created a table structure in Teradata. My requirement is to load the DataFrame into Teradata, but I am getting an error.
I have tried the following code:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","organization.td.intranet")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I got this error:
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 48 elided
I changed the url option to make it look like a JDBC URL and ran the following command:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","jdbc:teradata//organization.td.intranet,CHARSET=UTF8,TMODE=ANSI,user=G01159039")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I am still getting the error:
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 48 elided
I have included the following jars with the --jars option:
tdgssconfig-16.10.00.03.jar
terajdbc4-16.10.00.03.jar
teradata-connector-1.2.1.jar
Teradata version: 15
Spark version: 2
Change the JDBC URL and dbtable to the following:
.option("url","jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable","emp")
Also note that Teradata has no row-level locks, so the above will take a table lock. As a result it will not be efficient: parallel writes from Spark JDBC are not possible.
Teradata's native tools, such as a FastLoad/BTEQ combination, will work.
Another option, which requires a complicated setup, is Teradata QueryGrid. It is very fast and uses Presto behind the scenes.
I found the actual issue.
The JDBC URL should be in the following form:
val jdbcUrl = s"jdbc:teradata://${jdbcHostname}/database=${jdbcDatabase},user=${jdbcUsername},password=${jdbcPassword}"
It was causing the exception because I didn't supply the username and password.
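The URL form above, jdbc:teradata://host/NAME=value,NAME=value, can be assembled with a small helper. This is a minimal sketch: the function name teradata_jdbc_url is hypothetical, it does not escape special characters in values, and the host and database are the ones from the question.

```python
def teradata_jdbc_url(host, **params):
    """Build a Teradata JDBC URL of the form jdbc:teradata://host/NAME=value,NAME=value."""
    url = f"jdbc:teradata://{host}"
    if params:
        # Parameter names are upper-cased; values are used as given (no escaping).
        url += "/" + ",".join(f"{name.upper()}={value}" for name, value in params.items())
    return url

url = teradata_jdbc_url("organization.td.intranet", database="td_s_zm_brainsdb", charset="UTF8")
print(url)
```

Note the :// after the scheme. A URL that starts with jdbc:teradata// (without the colon) does not match this form.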
The code below is useful for reading data from a Teradata table:
df = (spark.read.format("jdbc").option("driver", "com.teradata.jdbc.TeraDriver")
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "(select * from td_s_zm_brainsdb.emp) AS t")
.option("user", "userid")
.option("password", "password")
.load())
This creates a DataFrame in Spark.
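The dbtable option in the read above is a parenthesized subquery with an alias, which Spark's JDBC reader accepts in place of a table name (when writing, dbtable has to name a table instead). A minimal sketch of a helper that produces that form; the function name as_dbtable is hypothetical:

```python
def as_dbtable(query, alias="t"):
    """Wrap a SELECT statement so it can serve as the JDBC dbtable option when reading."""
    # Drop surrounding whitespace and a trailing semicolon, then add parentheses and an alias.
    return f"({query.strip().rstrip(';')}) AS {alias}"

print(as_dbtable("select * from td_s_zm_brainsdb.emp"))
```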
To write data back to the database, use the statement below:
# Saving data to a JDBC source
jdbcDF.write \
.format("jdbc") \
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()