PySpark delete row from PostgreSQL

How can PySpark remove rows from PostgreSQL by executing a query such as DELETE FROM my_table WHERE day = 3?
Spark SQL only provides an API for inserting/overwriting records, so a library like psycopg2 could do the job, but it would need to be explicitly compiled on the remote machines, which is not doable for me. Any other suggestions?

Dataframes in Apache Spark are immutable. You can filter out the rows you don't want.
See the documentation.
A simple example could be:
# Read the PostgreSQL table over JDBC (placeholder URL and credentials)
df = spark.read.jdbc("jdbc:postgresql://<host>:5432/<db>", "mytable",
                     properties={"user": "<user>", "password": "<password>"})
df.createOrReplaceTempView("mytable")
# Keep only the rows that should remain
df2 = spark.sql("SELECT * FROM mytable WHERE day != 3")
df2.collect()
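If the goal is to persist the filtered result, one option (a hedged sketch, not part of the original answer, reusing the same placeholder connection settings) is to write the filtered dataframe back over JDBC:
# Materialize the filtered rows first, since df2 is computed lazily from the
# very table that is about to be overwritten.
df2.persist()
df2.count()
(df2.write
    .mode("overwrite")
    .option("truncate", "true")   # keep the table definition, just replace the rows
    .jdbc("jdbc:postgresql://<host>:5432/<db>", "mytable",
          properties={"user": "<user>", "password": "<password>"}))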

The only solution that works so far is to install psycopg2 on the Spark master node and issue the queries there, as a regular Python script would. Adding the library via --py-files did not work for me.
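For reference, a minimal sketch of that driver-side approach, assuming psycopg2 is available on the node running the driver and using placeholder connection settings:
import psycopg2

# Runs on the driver only; PostgreSQL itself executes the DELETE
conn = psycopg2.connect(host="<host>", dbname="<db>",
                        user="<user>", password="<password>")
try:
    with conn, conn.cursor() as cur:
        cur.execute("DELETE FROM my_table WHERE day = %s", (3,))
finally:
    conn.close()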

Related

Spark SQL gives different result when I run self join

I have a table like the one below stored as Parquet; I read it as a DataFrame and process it using Spark SQL. I run this as a Spark job on an EMR cluster.
Employee table
EmployeeName | employeeId | JoiningDate | ResignedDate | Salary
Kkkk         | 32         | 12/24/2021  | 10/03/2022   | 1000
bbbb         | 33         | 11/23/2002  | 10/21/2003   | 2000
aaaa         | 45         | 10/25/2003  | 07/24/2013   | 3000
assd         | 42         | 03/09/2006  | 11/28/2016   | 4000
I have a self join inside Spark SQL like the one below (please don't look at the logic inside the SQL):
val df = spark.sql("""SELECT e.employee FROM employeeTable e JOIN employeeTable f
  ON (to_date(e.JoiningDate,'MM/dd/yyyy') < to_date(f.ResignedDate,'MM/dd/yyyy'))
  AND (to_date(e.ResignedDate,'MM/dd/yyyy') > to_date(f.ResignedDate,'MM/dd/yyyy'))
  AND e.salary > 4000""")
I run this Spark job via an EMR cluster with spark.sql.analyzer.failAmbiguousSelfJoin set to false, but the query does not work properly: it returns the wrong output. When I set spark.sql.analyzer.failAmbiguousSelfJoin to true, it sometimes returns the correct result.
However, the query works fine in spark-shell every time and returns the expected result. Has anyone faced this kind of issue?
Is it advisable to write self-join queries in Spark SQL, or is it better to write them with the Spark DataFrame API? Please help me resolve this issue.
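For comparison, a minimal PySpark sketch of the DataFrame-API form of such a self join, purely for illustration (assuming the Parquet data is loaded into a DataFrame named employeeDf; column names follow the table above). Explicit aliases keep the two sides of the join unambiguous:
from pyspark.sql import functions as F

# Alias the same DataFrame twice so column references are unambiguous
e = employeeDf.alias("e")
f = employeeDf.alias("f")

joined = e.join(
    f,
    (F.to_date("e.JoiningDate", "MM/dd/yyyy") < F.to_date("f.ResignedDate", "MM/dd/yyyy"))
    & (F.to_date("e.ResignedDate", "MM/dd/yyyy") > F.to_date("f.ResignedDate", "MM/dd/yyyy"))
    & (F.col("e.Salary") > 4000),
).select("e.employeeId")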

Pyspark - Looking to apply SQL queries to pyspark dataframes

Disclaimer: I'm very new to pyspark and this question might not be appropriate.
I've seen the following code online:
# Get the id, age where age = 22 in SQL
spark.sql("select id, age from swimmers where age = 22").show()
Now, I've tried to pivot using pyspark with the following code:
complete_dataset.createOrReplaceTempView("df")
temp = spark.sql("SELECT core_id from df")
This is the error I'm getting:
'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I figured this would be straightforward, but I can't seem to find the solution. Is this possible to do in PySpark?
NOTE: I am on an EMR Cluster using a Pyspark notebook.
In PySpark you can read a MySQL table (assuming that you are using MySQL) and create a dataframe.
# Placeholder user/password/host/db values; note the '@' separating the user info from the host
jdbc_url = 'jdbc:mysql://{}:{}@{}/{}?zeroDateTimeBehavior=CONVERT_TO_NULL'.format(
    'username',
    'password',
    'host',
    'db',
)
table_df = sql_ctx.read.jdbc(url=jdbc_url, table='table_name').select("column_name1", "column_name2")
Here table_df is the dataframe. Then you can perform the required operations on it, such as filter:
table_df.filter(table_df.column1 == 'abc').show()
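Since the original question was about running SQL against a dataframe, here is a minimal follow-up sketch reusing the temp-view approach from the question (the column name column_name1 comes from the dataframe above):
# Register the JDBC dataframe as a temp view and query it with Spark SQL
table_df.createOrReplaceTempView("df")
temp = sql_ctx.sql("SELECT column_name1 FROM df")
temp.show()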

Spark DataFrame turns empty after writing to table

I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
// join / filter / new-column logic elided
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always found empty, so the resulting Hive table is always empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand if it's the first df that turns empty after writing to a table, or if the issue is because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. In a second script, that table was queried into a new variable df; then table_to_be_updated was also queried and stored into a variable, say old_df2. The two were then joined and computed upon in a new variable df3, which was then inserted with overwrite into table_to_be_updated.

Update Table Hive Using Spark Scala

I need to update a Hive table like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
using Scala Spark RDDs.
How can I do this?
I want to split this question into two questions to explain it simply.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a dataframe using rdd.toDF(), then register the dataframe as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
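A minimal PySpark sketch of that flow, purely for illustration (the question is about Scala, and the RDD contents, column names, and target table my_table here are hypothetical):
# Hypothetical RDD of (col1, col2) rows
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

# RDD -> DataFrame -> temp view -> insert into an existing Hive table
df = rdd.toDF(["col1", "col2"])
df.createOrReplaceTempView("temp_table")
spark.sql("INSERT INTO TABLE my_table SELECT * FROM temp_table")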
Second question: How to update a Hive table from Spark?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID, and one primary limitation is that only the ORC format supports transactional Hive tables. You can find some information on it at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to How to update an ORC Hive table from Spark using Scala for this.
A few methods might have been deprecated with Spark 2.x; you can check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest approach I can think of that works.

Spark SQL - pyspark api vs sql queries

All,
I have a question regarding writing a Spark SQL program: is there a performance difference between writing
SQLContext.sql("select count(*) from (select distinct col1, col2 from table)")
and using the PySpark API: df.select("col1", "col2").distinct().count()?
I would like to hear suggestions on the correct way to convert very large queries (around 1000 lines, joining 10+ tables) into a PySpark program.
I come from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.
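For what it's worth, a minimal sketch of the two forms side by side (assuming a DataFrame df with columns col1 and col2, registered under the hypothetical view name tbl); both go through the same Catalyst optimizer, and the resulting plans can be compared with explain():
# SQL form
df.createOrReplaceTempView("tbl")
sql_count = spark.sql(
    "SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM tbl) t"
).collect()[0][0]

# DataFrame API form
api_count = df.select("col1", "col2").distinct().count()

# Compare the physical plans that actually run
spark.sql("SELECT DISTINCT col1, col2 FROM tbl").explain()
df.select("col1", "col2").distinct().explain()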