PySpark delete/insert in transaction with JDBC - pyspark

I want to delete the rows of an MSSQL table whose IDs appear in a PySpark DataFrame, and then insert another DataFrame into the MSSQL table. I have two questions:
Is it possible to write SQL against a DataFrame? Something like DELETE mssql_table WHERE ID IN (SELECT ID FROM some_dataframe)?
Is it possible to run the DELETE and INSERT in a single transaction?
Thanks

Related

pyspark - insert generated primary key in dataframe

I have a DataFrame, and for each row I want to insert the row into a Postgres database and get the generated primary key back into the DataFrame. I can't find a good way to do this.
I tried with an RDD, but it doesn't work (pg8000 get inserted id into dataframe)
I think it is possible with this process:
loop over dataframe.collect() to run the SQL inserts
run a SQL select to build a second dataframe
join the first dataframe with the second
But I don't think this is efficient.
Do you have any ideas?
I'm using PySpark in an AWS Glue job. Thanks.
The only things you can optimize are the data insertion and the connectivity.
As you mentioned, there are two operations: inserting the data, and collecting the data that was inserted. Based on my understanding, neither Spark JDBC nor a plain insert through a Python connector such as psycopg2 returns the primary keys of the rows you inserted, so you need to fetch them separately.
Back to your question:
You don't need a for loop to do the inserts, or .collect() to convert the rows back to Python objects. You can write the DataFrame directly through Spark's JDBC data source with the PostgreSQL driver:
df\
.write.mode('append').format('jdbc')\
.option('driver', 'org.postgresql.Driver')\
.option('url', url)\
.option('dbtable', table_name)\
.option('user', user)\
.option('password', password)\
.save()
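Since the write above does not hand back the generated keys, one way to fetch them separately is to read the table back over the same JDBC connection and join on a column that already identifies each row. The following is only a sketch under assumptions: natural_key is a hypothetical column that is unique per row in both the DataFrame and the table, and id is an assumed name for the generated primary key column.

```python
def jdbc_options(url, table_name, user, password):
    """Connection options shared by the write and the read-back."""
    return {
        "driver": "org.postgresql.Driver",
        "url": url,
        "dbtable": table_name,
        "user": user,
        "password": password,
    }


def insert_and_fetch_keys(spark, df, options, natural_key, pk_column="id"):
    """Append df to the table, then attach the generated keys to df.

    natural_key and pk_column are hypothetical column names; adjust them
    to whatever uniquely identifies a row in your own schema.
    """
    # 1) Bulk insert through the JDBC data source (no Python-side loop).
    df.write.format("jdbc").options(**options).mode("append").save()

    # 2) Read back only the columns needed to match rows to their keys.
    keys = (
        spark.read.format("jdbc")
        .options(**options)
        .load()
        .select(natural_key, pk_column)
    )

    # 3) Join the generated keys onto the original rows.
    return df.join(keys, on=natural_key, how="left")
```

A call would look like insert_and_fetch_keys(spark, df, jdbc_options(url, table_name, user, password), "natural_key"). Without a column that is unique per row, the join cannot tell inserted rows apart, so this approach does not apply.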

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I searched but still haven't been able to figure out how to do this from a Spark app, or how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ...AS SELECT...; see the Spark SQL syntax.
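To illustrate the clause order, here is a minimal sketch that builds such a statement with PARTITIONED BY placed before AS SELECT. It makes some assumptions beyond the original statement: it creates a managed data source table (USING parquet) rather than an EXTERNAL one, and it lists only the partition column name, because as far as I know Spark takes the column types of a CREATE TABLE ... AS SELECT from the query. The helper name build_partitioned_ctas is made up for this example.

```python
def build_partitioned_ctas(table, source_view, partition_col, fmt="parquet"):
    """Return a CTAS statement with PARTITIONED BY ahead of AS SELECT."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING {fmt} "
        f"PARTITIONED BY ({partition_col}) "
        f"AS SELECT * FROM {source_view}"
    )


sql = build_partitioned_ctas("mydb.result", "spark_tmp_view", "DATE_KEY")
print(sql)

# Inside a Spark application with an active SparkSession named spark:
# spark.sql(sql)
```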

Hive Partition Table with Date Datatype via Spark

I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of "date" datatype in the source db.
I am using saveAsTable with the partitionBy option, and I am able to properly create folders as per the partition column. The Hive table is also getting created.
The issue I am facing is that the partition column is of "date" data type, which is not supported in Hive for partitions. Because of this I am unable to read the data via Hive or Impala queries; they report that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, as I have to do a select * from tablename, and not select a, b, cast(c as varchar) from table.

How to upsert/Delete the DB2 source table data using Pyspark/SQL/DataFrames SPARK RDD's?

I'm trying to upsert/delete some of the values in a DB2 source table, which is an existing table in DB2. Is this possible using PySpark/Spark SQL/DataFrames?
There is no direct way to update/delete in a relational database from a PySpark job, but there are workarounds.
(1) You can create an identical empty table (secondary table) in the relational database, insert data into the secondary table using the PySpark job, and write a DML trigger that performs the desired DML operation on your primary table.
(2) You can create a dataframe (e.g. a) in Spark that is a copy of your existing relational table, merge it with the current dataframe (e.g. b), and create a new dataframe (e.g. c) that holds the latest changes. Then truncate the relational database table and reload it from the latest-changes dataframe (c).
These are just workarounds and not an optimal solution for huge amounts of data.
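Workaround (1) can be tried out locally with a small sketch. It uses an in-memory SQLite database purely as a stand-in for DB2, so the trigger syntax here is SQLite's and would need to be rewritten for DB2. The table names main_table and staging_table and the trigger name staging_upsert are made up for the example, and the plain INSERT into the staging table plays the role of the PySpark job's append.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE staging_table (id INTEGER, val TEXT);

    -- Every row appended to the staging table is upserted into the main table.
    CREATE TRIGGER staging_upsert AFTER INSERT ON staging_table
    BEGIN
        INSERT OR REPLACE INTO main_table (id, val) VALUES (NEW.id, NEW.val);
    END;

    INSERT INTO main_table VALUES (1, 'old'), (2, 'keep');
""")

# The Spark job would only ever append rows to the staging table;
# the trigger applies them to the main table.
conn.executemany(
    "INSERT INTO staging_table VALUES (?, ?)",
    [(1, "new"), (3, "added")],
)
conn.commit()

rows = conn.execute("SELECT id, val FROM main_table ORDER BY id").fetchall()
print(rows)  # [(1, 'new'), (2, 'keep'), (3, 'added')]
```

Row 1 is updated, row 2 is left alone, and row 3 is inserted, which is the upsert behaviour the trigger is meant to provide. A delete could be handled the same way with a second staging table and a trigger that deletes the matching rows.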

How do I insert a pandas DataFrame to an existing PostgreSQL table?

I have a pandas DataFrame df and a PostgreSQL table my_table. I wish to truncate my_table and insert df (which has columns in the same order) into my_table, without affecting the schema of my_table. How do I do this?
In a rather naive attempt, I tried dropping my_table and then using pandas.DataFrame.to_sql, but this creates a new table with a different schema.
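I would manually truncate the table and then simply let pandas do its job:
from sqlalchemy import text  # assumes con is a SQLAlchemy connection
con.execute(text('TRUNCATE my_table RESTART IDENTITY;'))  # empties the table, keeps its schema
df.to_sql('my_table', con, if_exists='append', index=False)  # index=False: don't write df's index as a column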