Spark SQL gives different results when I run a self join - scala

I have a table like the one below stored as Parquet; I read it as a DataFrame and process it using Spark SQL. I am running this as a Spark job on an EMR cluster.
Employee table

EmployeeName | employeeId | JoiningDate | ResignedDate | Salary
Kkkk         | 32         | 12/24/2021  | 10/03/2022   | 1000
bbbb         | 33         | 11/23/2002  | 10/21/2003   | 2000
aaaa         | 45         | 10/25/2003  | 07/24/2013   | 3000
assd         | 42         | 03/09/2006  | 11/28/2016   | 4000
I have a self join inside Spark SQL like the one below (please don't focus on the logic inside the SQL):
val df = spark.sql("""
  SELECT e.employee FROM employeeTable e JOIN employeeTable f
    ON  (to_date(e.JoiningDate, 'MM/dd/yyyy') < to_date(f.ResignedDate, 'MM/dd/yyyy'))
    AND (to_date(e.ResignedDate, 'MM/dd/yyyy') > to_date(f.ResignedDate, 'MM/dd/yyyy'))
    AND e.salary > 4000
""")
I am running this Spark job on an EMR cluster with spark.sql.analyzer.failAmbiguousSelfJoin set to false. The query does not work properly: it returns the wrong output when run as a Spark job. When I set spark.sql.analyzer.failAmbiguousSelfJoin to true, it sometimes returns the correct result.
However, the query works fine every time in spark-shell and returns the expected result. Has anyone faced this kind of issue?
Is it advisable to write self join queries in Spark SQL, or is it better to express them with the Spark DataFrame API? Please help me resolve this issue.
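For reference, a minimal sketch of the same self join written with the DataFrame API, giving each side of the join an explicit alias so the two references to the same table cannot be confused; the Parquet path and the selected column are assumptions, the rest is taken from the question:

import org.apache.spark.sql.functions.{col, to_date}

// Read the table once, then alias each side of the self join explicitly
val employees = spark.read.parquet("/path/to/employeeTable")   // hypothetical path
val e = employees.alias("e")
val f = employees.alias("f")

val df = e.join(f,
    to_date(col("e.JoiningDate"), "MM/dd/yyyy") < to_date(col("f.ResignedDate"), "MM/dd/yyyy") &&
    to_date(col("e.ResignedDate"), "MM/dd/yyyy") > to_date(col("f.ResignedDate"), "MM/dd/yyyy") &&
    col("e.Salary") > 4000)
  .select(col("e.EmployeeName"))

Explicit aliases (both here and inside the SQL string) are usually enough to keep the two sides distinct without touching spark.sql.analyzer.failAmbiguousSelfJoin.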

Related

Pyspark - Looking to apply SQL queries to pyspark dataframes

Disclaimer: I'm very new to pyspark and this question might not be appropriate.
I've seen the following code online:
# Get the id, age where age = 22 in SQL
spark.sql("select id, age from swimmers where age = 22").show()
Now, I've tried to pivot using pyspark with the following code:
complete_dataset.createOrReplaceTempView("df")
temp = spark.sql("SELECT core_id from df")
This is the error I'm getting:
'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I figured this would be straightforward, but I can't seem to find the solution. Is this possible to do in pyspark?
NOTE: I am on an EMR Cluster using a Pyspark notebook.
In PySpark you can read a MySQL table (assuming that you are using MySQL) and create a DataFrame.
# Placeholders for the credentials, host and database go into the JDBC URL
jdbc_url = 'jdbc:mysql://{}:{}@{}/{}?zeroDateTimeBehavior=CONVERT_TO_NULL'.format(
    'username',
    'password',
    'host',
    'db',
)
# sql_ctx is your SQLContext / SparkSession
table_df = sql_ctx.read.jdbc(url=jdbc_url, table='table_name').select("column_name1", "column_name2")
where table_df is the DataFrame. Then you can do the required operations on the DataFrame, such as filter:
table_df.filter(table_df.column1 == 'abc').show()

Pyspark delete row from PostgreSQL

How can PySpark remove rows in PostgreSQL by executing a query such as DELETE FROM my_table WHERE day = 3?
Spark SQL provides an API only for inserting/overwriting records, so using a library like psycopg2 could do the job, but it needs to be explicitly compiled on the remote machine, which is not doable for me. Any other suggestions?
Dataframes in Apache Spark are immutable. You can filter out the rows you don't want.
See the documentation.
A simple example could be:
df = spark.read.jdbc("conn-url", "mytable")
df.createOrReplaceTempView("mytable")
df2 = spark.sql("SELECT * FROM mytable WHERE day != 3")
df2.collect()
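If the goal is to persist the filtered result back to PostgreSQL, the step the answer leaves implicit is a JDBC write. A minimal sketch (Scala shown; the same DataFrameWriter API exists in PySpark), where the URL, table name, and credentials are placeholders, and the filtered data is materialized before the source table is overwritten:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "myuser")            // placeholder credentials
props.setProperty("password", "mypassword")

val url = "jdbc:postgresql://host:5432/mydb"   // placeholder connection URL

val df = spark.read.jdbc(url, "my_table", props)
val kept = df.filter("day != 3").cache()
kept.count()   // force materialization before overwriting the table we read from

// Overwriting replaces the table contents with the filtered rows,
// which has the same effect as DELETE FROM my_table WHERE day = 3
kept.write.mode("overwrite").jdbc(url, "my_table", props)

Writing to a staging table and swapping it in afterwards is the safer variant if relying on the cache is a concern.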
The only solution that works so far is to install psycopg2 on the Spark master node and call queries the way regular Python would. Adding that library via --py-files did not work for me.

Spark SQL - pyspark api vs sql queries

All,
I have a question regarding writing a Spark SQL program: is there a performance difference between writing
sqlContext.sql("select count(*) from (select distinct col1, col2 from table) t")
and using the PySpark API: df.select("col1", "col2").distinct().count()?
I would like to hear suggestions on the correct way to convert very large queries (around 1000 lines, joining 10+ tables) into a PySpark program.
I am from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.
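Both forms are compiled by the same Catalyst optimizer, so a quick way to check this yourself is to compare the physical plans with explain(). A minimal Scala sketch (the same calls exist in PySpark), assuming an existing DataFrame df with columns col1 and col2 and a placeholder view name my_table:

// Register the DataFrame so it can be queried by name
df.createOrReplaceTempView("my_table")

// SQL-string version
val viaSql = spark.sql("select count(*) from (select distinct col1, col2 from my_table) t")

// DataFrame API version of the same query (groupBy().count() keeps it a DataFrame)
val viaApi = df.select("col1", "col2").distinct().groupBy().count()

// The physical plans printed here should be essentially identical
viaSql.explain()
viaApi.explain()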

Join Multiple Data frames in Spark

I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering them as temp tables. I have a few questions about this...
1. Several joins need to be implemented for the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the DataFrames effectively based on these conditions?
2. Is it correct to convert the tables to DataFrames and query on top of them, or is there a better way to approach this kind of joining and querying in Spark?
I had a similar problem and I ended up using:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
val df_list = ListBuffer[DataFrame]()   // append each DataFrame (df1 ... df10) to this buffer first
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could write a free-form SQL statement in Sqoop and join everything there, or use Spark JDBC to do the same job.
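For completeness, a minimal sketch of the three-way join from the question written directly with the DataFrame API; accounts, priority, and bills are assumed to already be DataFrames (for example, read from the Sqoop-imported files), and the join columns are taken from the example query:

import org.apache.spark.sql.functions.col

// Alias each DataFrame so the columns stay unambiguous
val a = accounts.alias("a")
val b = priority.alias("b")
val c = bills.alias("c")

val result = a
  .join(b, col("a.id") === col("b.id"))
  .join(c, col("c.name") === col("a.name"))
  .select(col("a.id"), col("b.name"), col("c.AccountName"))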

Avoid cartesian join in sparkSQL

DataFrame A (millions of records); its columns include create_date and modified_date.
DataFrame B (500 records) has start_date and end_date.
Current approach:
select a.*, b.* from a join b on a.create_date between start_date and end_date
The above query performs a Cartesian product join in Spark SQL and takes forever to complete.
Can I achieve the same functionality by some other means?
I tried broadcasting the smaller RDD.
EDIT:
Spark version: 1.4.1
No. of executors: 2
Memory/executor: 5g
No. of cores: 5
You cannot avoid the Cartesian product, as Spark SQL does not support non-equi joins.
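For what it is worth, on later Spark versions a common mitigation is to broadcast the small side explicitly, so the range condition is evaluated as a broadcast nested loop join instead of a shuffled Cartesian product. A minimal Scala sketch, assuming dfA and dfB are the two DataFrames from the question:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the 500-row DataFrame lets Spark plan a broadcast nested loop join
// for the non-equi (range) condition instead of a full Cartesian product
val joined = dfA.join(
  broadcast(dfB),
  dfA("create_date").between(dfB("start_date"), dfB("end_date")))

joined.explain()   // should show BroadcastNestedLoopJoin rather than CartesianProduct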