How to execute a SQL TRUNCATE TABLE in pyspark - pyspark

I am trying to truncate an Oracle table from PySpark using the code below:
truncatesql = """ truncate table mytable """
mape = spark.read \
    .format("jdbc") \
    .option("url", DB_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", truncatesql) \
    .load()
but it keeps throwing java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended. How can I truncate a table using a direct SQL query?

Try wrapping your query in an alias.
Example:
truncatesql = """(truncate table mytable)e"""
mape = spark.read \
    .format("jdbc") \
    .option("url", DB_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", truncatesql) \
    .load()
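If wrapping the statement in an alias does not work for your Oracle setup, another common workaround is to open a plain JDBC connection through the Spark JVM (via py4j) and execute the DDL directly. This is only a minimal sketch, assuming DB_USER and DB_PASSWORD hold your Oracle credentials:
# Run the TRUNCATE through a raw JDBC connection obtained from the Spark JVM.
# DB_USER and DB_PASSWORD are assumed placeholders for your credentials.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(DB_URL, DB_USER, DB_PASSWORD)
stmt = conn.createStatement()
try:
    stmt.executeUpdate("truncate table mytable")
finally:
    stmt.close()
    conn.close()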

Related

Spark Shuffle Read and Shuffle Write Increasing in Structured Streaming

I have been running Spark Structured Streaming with Kafka for the last 23 hours. I can see Shuffle Read and Shuffle Write increasing drastically, and finally the driver stopped due to an "out of memory" error.
Data is pushed to Kafka at 3 JSON messages per second, and the Spark streaming query uses processingTime='30 seconds'.
spark = SparkSession \
    .builder \
    .master("spark://spark-master:7077") \
    .appName("demo") \
    .config("spark.executor.cores", 1) \
    .config("spark.cores.max", "4") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "hdfs://172.30.7.36:9000/user/hive/warehouse") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.executor.memory", '1g') \
    .config("spark.scheduler.mode", "FAIR") \
    .config("spark.driver.memory", '2g') \
    .config("spark.sql.caseSensitive", "true") \
    .config("spark.sql.shuffle.partitions", 8) \
    .enableHiveSupport() \
    .getOrCreate()
CustDf \
    .writeStream \
    .queryName("customerdatatest") \
    .format("delta") \
    .outputMode("append") \
    .trigger(processingTime='30 seconds') \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/checkpoint/bronze_customer/") \
    .toTable("bronze.customer")
I expect this streaming job to run continuously for at least a month.
Spark is transforming the JSON (flattening it) and inserting it into the Delta table.
Please help me with this. Have I missed any configuration?

pyspark insert failed using spark.read method

def QueryDB(sqlQuery):
    jdbcUrl = mssparkutils.credentials.getSecret("param1","DBJDBCConntring","param3")
    spark = SparkSession.builder.appName("show results").getOrCreate()
    jdbcdf = (spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("query", sqlQuery)
        .load()
    )
    return jdbcdf
df= QueryDB("INSERT INTO schema.table1 (column1, column2) output inserted.column1 values('one', 'two')")
df.show()
The notebook runs without any error, but no rows are inserted. Any suggestion or sample code for inserting into the table?
spark.read.format("jdbc") is for reading over JDBC. If you want to insert data through JDBC, you want something like this:
jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
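Applied to this question, a minimal sketch (assuming jdbcUrl is the same secret-based URL built in QueryDB) would put the rows into a DataFrame and append them, rather than issuing an INSERT through spark.read:
# Build a one-row DataFrame and append it to the target table.
# jdbcUrl is assumed to come from the same secret lookup used in QueryDB.
rows_df = spark.createDataFrame([("one", "two")], ["column1", "column2"])
rows_df.write \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", "schema.table1") \
    .mode("append") \
    .save()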

databricks: truncate table in redshift

I want to truncate a Redshift table from within a Databricks notebook. I can query the table, but when I try to truncate it, I get an error message.
truncate_traffic_sql = "TRUNCATE TABLE my_table"
spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", f"jdbc:postgresql://fw-rs-qa.xxxxx.us-east-1.redshift.amazonaws.com:5439/mydb?user={credentials['user']}&password={credentials['password']}") \
    .option("query", truncate_traffic_sql) \
    .option("tempdir", "s3a://my-bucket/Fish/tmp") \
    .option("forward_spark_s3_credentials", "true") \
    .option("load", "true") \
    .load()
Caused by: java.sql.SQLException: Exception thrown in awaitResult:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "TABLE"
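A hedged workaround, not a verified fix: the connector's query option only accepts SELECT-style statements (it wraps them in a subquery, which is what produces the syntax error), so DDL cannot be run through spark.read. If the goal is to replace the table's contents with new data, the connector's preactions option can run the TRUNCATE immediately before a write; new_data_df and redshift_jdbc_url below are placeholders:
# Sketch: run the TRUNCATE as a preaction of a normal write; names are placeholders.
new_data_df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_jdbc_url) \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://my-bucket/Fish/tmp") \
    .option("forward_spark_s3_credentials", "true") \
    .option("preactions", "TRUNCATE TABLE my_table") \
    .mode("append") \
    .save()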

Pivot a streaming dataframe pyspark

I have a streaming DataFrame from Kafka and I need to pivot two columns. This is the code I'm currently using:
streaming_df = streaming_df.groupBy('Id', 'Date') \
    .pivot('Var') \
    .agg(first('Val'))

query = streaming_df.limit(5) \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("stream") \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
I receive the following error:
pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
pyspark version: 3.1.1
Any ideas on how to implement pivot with a streaming DataFrame?
The pivot transformation is not supported by Spark when applied to streaming data.
What you can do is use foreachBatch with a user-defined function, like this:
from pyspark.sql.functions import first

def apply_pivot(stream_df, batch_id):
    # Pivot each micro-batch here, then append it to a table that can be queried later
    stream_df \
        .groupBy('Id', 'Date') \
        .pivot('Var') \
        .agg(first('Val')) \
        .write \
        .mode('append') \
        .saveAsTable('stream')

# limit() is not supported on streaming DataFrames, so the stream is written as-is
query = streaming_df \
    .writeStream \
    .foreachBatch(apply_pivot) \
    .start()
time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
Let me know if it helped you!

Pyspark Dataframe insert with overwrite and more than one partition

I have a DataFrame with 2 partitions and I am inserting it into a Postgres table with the overwrite method.
df.write \
    .format("jdbc") \
    .option("driver", POSTGRESQL_DRIVER) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "test_table") \
    .mode("overwrite") \
    .save()
Partitions vector: (0, 1)
Partition 0 is inserted first, followed by partition 1; partition 0's records end up overwritten in the table and only partition 1's records remain.
How can I insert or save both partitions without overwriting the previously written one?
I can see the two possible workarounds below for this problem.
1) As part of the write, provide the truncate option together with overwrite mode, so that the existing table is truncated (instead of being dropped and recreated) and the new DataFrame is written into it. This way the table holds only the new dataset each time.
df.write \
    .format("jdbc") \
    .option("driver", POSTGRESQL_DRIVER) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "test_table") \
    .option("truncate", True) \
    .mode("overwrite") \
    .save()
2) Spark 2.3 added a way to overwrite only specific partitions instead of all of them (dynamic partition overwrite). If you are using a recent version of Spark, you can give this feature a try; see the sketch after the link below.
https://issues.apache.org/jira/browse/SPARK-20236
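A minimal sketch of that feature (note it applies to partitioned file-based data sources, not the JDBC writer; the partition column and output path here are hypothetical):
# Only the partitions present in df are overwritten; others are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write \
    .mode("overwrite") \
    .partitionBy("load_date") \
    .parquet("/data/test_table")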
Hope this helps.