pyspark insert failed using spark.read method - pyspark

from pyspark.sql import SparkSession

def QueryDB(sqlQuery):
    jdbcUrl = mssparkutils.credentials.getSecret("param1", "DBJDBCConntring", "param3")
    spark = SparkSession.builder.appName("show results").getOrCreate()
    jdbcdf = (spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("query", sqlQuery)
        .load()
    )
    return jdbcdf

df = QueryDB("INSERT INTO schema.table1 (column1, column2) output inserted.column1 values('one', 'two')")
df.show()
The notebook runs without any error, but no rows are inserted. Any suggestion or sample code for inserting into the table?

spark.read.format("jdbc") is for reading from a JDBC source. If you want to insert data through JDBC, you'd want something like this:
jdbcDF.write \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
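Applied to the original question, a minimal sketch could look like the following. It reuses the jdbcUrl from mssparkutils, but the InsertRows helper name, the DataFrame construction and mode("append") are assumptions, not code from the question:

from pyspark.sql import SparkSession

def InsertRows(rows, columns):
    # hypothetical helper; mssparkutils is available in Synapse/Fabric notebooks
    jdbcUrl = mssparkutils.credentials.getSecret("param1", "DBJDBCConntring", "param3")
    spark = SparkSession.builder.appName("insert rows").getOrCreate()
    df = spark.createDataFrame(rows, columns)
    (df.write.format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", "schema.table1")
        .mode("append")  # append so existing rows are kept
        .save())

InsertRows([("one", "two")], ["column1", "column2"])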

Related

Can I "branch" stream into many and write them in parallel in pyspark?

I am receiving a Kafka stream in pyspark. Currently I am grouping it by one set of fields and writing updates to a database:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
.option("subscribe", topic)
...
df = df \
.groupBy("myfield1") \
.agg(
expr("count(*) as cnt"),
min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
) \
.select("cnt", "minData.*") \
.select(
col("...").alias("..."),
...
col("userId").alias("user_id")
query = df \
.writeStream \
.outputMode("update") \
.foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
.start()
query.awaitTermination()
Can I take the same chain in the middle and create another grouping like
df2 = df \
.groupBy("myfield2") \
.agg(
expr("count(*) as cnt"),
min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
) \
.select("cnt", "minData.*") \
.select(
col("...").alias("..."),
...
col("userId").alias("user_id")
and write its output to a different place in parallel?
Where should I call writeStream and awaitTermination?
Yes, you can branch a Kafka input stream into as many streaming queries as you like.
You need to consider the following:
query.awaitTermination is a blocking method, which means whatever code you are writing after this method will not be executed until this query gets terminated.
Each "branched" streaming query will run in parallel and is it important that you define a checkpoint location in each of their writeStream calls.
Overall, your code needs to have the following structure:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
.option("subscribe", topic) \
.[...]
# note that I changed the variable name to "df1"
df1 = df \
.groupBy("myfield1") \
.[...]
df2 = df \
.groupBy("myfield2") \
.[...]
query1 = df1 \
.writeStream \
.outputMode("update") \
.option("checkpointLocation", "/tmp/checkpointLoc1") \
.foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
.start()
query2 = df2 \
.writeStream \
.outputMode("update") \
.option("checkpointLocation", "/tmp/checkpointLoc2") \
.foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
.start()
spark.streams.awaitAnyTermination()
Just an additional remark: in the code you are showing, you are overwriting df, so deriving df2 from it might not give you the results you intended.
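As a side note, write_data_frame is not shown in the question. A minimal sketch of what such a helper could look like, writing each micro-batch through JDBC (the URL and credentials below are placeholders, not taken from the question):

def write_data_frame(table_name, batch_df, epoch_id):
    # persist one micro-batch to a JDBC sink; called once per epoch by foreachBatch
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql:dbserver")
        .option("dbtable", table_name)
        .option("user", "username")
        .option("password", "password")
        .mode("append")
        .save())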

How to execute SQL truncate table in pyspark

I am trying to truncate an Oracle table using pyspark with the code below:
truncatesql = """ truncate table mytable """
mape=spark.read \
.format("jdbc") \
.option("url", DB_URL) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable", truncatesql) \
.load()
but it keeps throwing java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended. How can I truncate a table using a direct SQL query?
Try wrapping your query with an alias.
Example:
truncatesql = """(truncate table mytable)e"""
mape=spark.read \
.format("jdbc") \
.option("url", DB_URL) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable", truncatesql) \
.load()
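If the wrapped-query trick still fails, another workaround sometimes used is to run the statement over a plain JDBC connection on the driver. This is only a sketch: it relies on the internal py4j gateway of SparkContext, assumes the Oracle driver jar is already on the driver classpath, and DB_USER and DB_PASSWORD are placeholders not present in the question:

# run the DDL directly on the database through a driver-side JDBC connection
driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(DB_URL, DB_USER, DB_PASSWORD)
stmt = conn.createStatement()
stmt.executeUpdate("truncate table mytable")
stmt.close()
conn.close()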

Writing SQL table directly to file in Scala

Team,
I'm working on Azure Databricks and I'm able to write a dataframe to a CSV file using the following option:
df2018JanAgg
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("dbfs:/FileStore/output/df2018janAgg.csv")
but I'm seeking an option to write data directly from a SQL table to CSV in Scala.
Can someone please let me know if such an option exists?
Thanks,
Srini
Yes, data can be loaded directly from a SQL table into a DataFrame and vice versa. Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
// JDBC -> DataFrame -> CSV
spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("dbfs:/FileStore/output/df2018janAgg.csv")
// DataFrame -> JDBC
df.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()

Pyspark Dataframe Insert with overwrite and having more than one partition

I have a dataframe with 2 partitions and am inserting it into a Postgres table with the overwrite method.
df.write \
.format("jdbc") \
.option("driver", POSTGRESQL_DRIVER) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", "test_table") \
.mode("overwrite") \
.save()
Partitions Vector : (0, 1)
Partition 0 inserts first, followed by partition 1; the partition 0 records get overwritten in the table, and only the partition 1 records remain.
How can I insert or save both partitions without overwriting the previously written partition?
I can see two possible workarounds for this problem.
1) As part of the write, provide one more option to truncate the table and then append, so the old data is truncated and the new data frame is appended. This way you will only have the new dataset each time.
df.write \
.format("jdbc") \
.option("driver", POSTGRESQL_DRIVER) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", "test_table") \
.option("truncate", True) \
.mode("append") \
.save()
2) As of Spark 2.3 there is a new option that lets you overwrite only specific partitions instead of all of them. If you are using a recent version of Spark, you can give this feature a try; see the sketch after the link below.
https://issues.apache.org/jira/browse/SPARK-20236
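A minimal sketch of option 2, assuming a partitioned data source table (e.g. a Parquet or Hive table) rather than a plain JDBC table, since that is what the dynamic-overwrite behaviour from SPARK-20236 targets; the table name is hypothetical:

# overwrite only the partitions present in the incoming DataFrame (Spark 2.3+)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write \
    .mode("overwrite") \
    .insertInto("my_partitioned_table")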
Hope this helps.

Create Index thru SPARK for JDBC

I am trying to create an index on a Postgres table through Spark, and the code is below:
val df3 = sqlContext.read.format("jdbc")
.option("url", "jdbc:postgresql://URL")
.option("user", "user")
.option("password", "password")
.option("dbtable", "(ALTER TABLE abc.test1 ADD PRIMARY KEY (test))as t")
.option("driver", "org.postgresql.Driver")
.option("lowerBound", 1L)
.option("upperBound", 10000000L)
.option("numPartitions", 100)
.option("fetchSize", "1000000")
.load()
The error is
Exception in thread "main" org.postgresql.util.PSQLException: ERROR: syntax error at or near "TABLE"
Just wondering, can we do that, or is the above data frame wrong? Appreciate your help.