Spark join operation for two data frames - PySpark

When df1 and df2 have the same rows and
neither df1 nor df2 has duplicate values,
what is the complexity of the join operation df1.join(df2)?
My guess is that it takes O(n^2).
Also, is it possible to sort both data frames to get better performance?
If not, what is the way to make a join faster in PySpark?

Even if df1 and df2 have the same set of rows, if they are not partitioned, Spark has to partition both data frames on the join key in order to join them. From Spark 2.3 onwards, the sort-merge join is the default join workhorse; it requires both data frames to be partitioned and sorted by the join key before the join is performed, so the cost is dominated by shuffling and sorting both sides rather than an O(n^2) pairwise comparison. Both data frames also have to be co-located for a sort-merge join.
Is it possible to sort both data frames to get better performance? If not, what is the way to make a join faster in PySpark?
Yes. If you see that a particular data frame is used again and again in joins on the same join key, you can repartition that data frame on the join key and cache it for further use. Please refer to the link below for more details:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
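A minimal PySpark sketch of that repartition-and-cache pattern (the data frame names and the join column "id" are illustrative, not from the thread):

# Repartition both data frames on the join key; cache the one that is reused,
# so repeated joins on "id" do not re-shuffle it every time.
df1_keyed = df1.repartition("id").cache()
df2_keyed = df2.repartition("id")
joined = df1_keyed.join(df2_keyed, on="id")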

Related

Join data frame with huge table in Oracle

How do I join the data frame with an Oracle table over JDBC?
The schema of the data frame is acct_n0, stmt_st_dt, stmt_end_dt, posn_as_of_dt.
We have to take posn_as_of_dt from the data frame, join it with a combination of a dimension table and a fact table in Oracle, and pull the balances from the fact table. This combination gives around 7M records, whereas the data frame has fewer than 50 records. The output count should be the same as the data frame count. I tried to create the data frame by using Spark's read.jdbc with dbtable set to "select dim.acct_key, fact.balances, fact.posn_as_of_dt from dim_table dim, fact_table fact where dim.acct_no = fact.acct_no", but this gets stuck while joining with the data frame. Any other thoughts on how to speed up this join?
Basically, what I am after is: is there any way I can take this data frame, join it directly with Oracle, and pull only the matching records out?
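One possible sketch of such a pushdown in PySpark (not from the thread; the variable df for the 50-row data frame, the SparkSession spark, the JDBC URL, the credentials, the DATE-literal formatting, and joining only on posn_as_of_dt are all assumptions): collect the handful of posn_as_of_dt values from the small data frame and embed them in the JDBC subquery, so Oracle itself returns only the matching rows.

# Collect the ~50 posn_as_of_dt values and push them into the Oracle query as an IN list.
dates = [row["posn_as_of_dt"] for row in df.select("posn_as_of_dt").distinct().collect()]
in_list = ", ".join("DATE '{}'".format(d) for d in dates)  # assumes posn_as_of_dt is a DATE

pushdown = """(select dim.acct_key, fact.balances, fact.posn_as_of_dt
               from dim_table dim, fact_table fact
               where dim.acct_no = fact.acct_no
                 and fact.posn_as_of_dt in ({})) q""".format(in_list)

oracle_df = (spark.read.format("jdbc")
                  .option("url", "jdbc:oracle:thin:@//host:1521/service")  # placeholder URL
                  .option("dbtable", pushdown)
                  .option("user", "user").option("password", "password")   # placeholders
                  .load())

result = df.join(oracle_df, on="posn_as_of_dt", how="inner")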

Why is my number of rows exploding after a Broadcast Hash Join?

Somehow the number of output rows is exploding after a Broadcast Hash Join. The right-side table has unique rows on the join column. I am not sure what is causing the exploding rows. I am attaching the SQL plan here.

Join relatively small table with large table in Spark 2.1

I am currently working on updating a table based on its existence in another table:
Ex:
Dataset A (relatively small, 300k rows): DepartmentId, EmployeeId, Salary, Error
Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary
The logic is:
1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's salary with B's salary
2. Otherwise, write a message to A's error field
The solution I have now does a left outer join of A with B. Are there any better practices for this type of problem?
Thank you in advance!
For better performance, you can use a broadcast hash join, as mentioned here by @Ram Ghadiyaram:
The broadcast data frame will be distributed to all the partitions, which improves join performance.
DataFrame join optimization - Broadcast Hash Join
Hope this helps!
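A PySpark sketch of the pattern (only the column names come from the question; dfA, dfB, the error message, and the two-step shape are illustrative assumptions): broadcast the small key set of A into B to pull out just the matching salaries, then join that small result back to A to fill either Salary or Error.

from pyspark.sql import functions as F

# Step 1: broadcast A's keys (A is the small table) into the big table B, keeping
# only the (DepartmentId, EmployeeId) pairs that exist in B, together with B's salary.
matches = (dfB.join(F.broadcast(dfA.select("DepartmentId", "EmployeeId")),
                    on=["DepartmentId", "EmployeeId"], how="inner")
              .select("DepartmentId", "EmployeeId", F.col("Salary").alias("b_salary")))

# Step 2: join the (now small) matches back to A: take B's salary where a match exists,
# otherwise write a message into A's Error column.
result = (dfA.join(F.broadcast(matches), on=["DepartmentId", "EmployeeId"], how="left_outer")
             .withColumn("Salary", F.coalesce("b_salary", "Salary"))
             .withColumn("Error", F.when(F.col("b_salary").isNull(),
                                         F.lit("no matching (DepartmentId, EmployeeId) in B"))
                                   .otherwise(F.col("Error")))
             .drop("b_salary"))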

Join Multiple Data frames in Spark

I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a data frame by inferring the schema and registering it as a temp table. I have a few questions about doing this:
1. Several joins need to be implemented for the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the data frames effectively based on these conditions?
2. Is it the correct way to convert the tables to data frames and query on top of them, or is there a better way to approach this kind of joining and querying in Spark?
I had a similar problem and I ended up using:
val df_list = ListBuffer[DataFrame]()  // append each data frame to be joined, e.g. df_list += df1
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also write a free-form SQL statement in Sqoop and join everything there, or use Spark JDBC to do the same job.
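For the specific three-table example above, the join can also be expressed with the DataFrame API instead of a SQL string; a PySpark sketch, assuming the tables have been loaded as data frames named accounts, priority, and bills:

# Chain the joins and qualify each column through its source data frame.
joined = (accounts
          .join(priority, accounts["id"] == priority["id"])
          .join(bills, bills["name"] == accounts["name"])
          .select(accounts["id"], priority["name"], bills["AccountName"]))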

Solving data skew in SparkSQL

I have Spark SQL code that joins a fact table and a dimension table. The join condition leads to data skew, as one of the result combinations has huge data compared to the others. In Scala, I think this can be solved with
partitionBy(new org.apache.spark.HashPartitioner(160))
But this works only on an RDD and not on a SchemaRDD.
Is there an equivalent to this?
Here is what my code looks like:
sqlContext.sql("select product_category,shipment_item_id,shipment_amount from shipments_fact f left outer join product_category pc on f.category_code = pc.category_code")
Any help would be appreciated.
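For reference, a minimal PySpark sketch of the DataFrame-level counterpart of that hash partitioning (a sketch only, assuming a Spark version whose DataFrame API supports repartitioning by column; note that repartitioning alone does not remove the skew itself, which may additionally call for techniques such as salting the join key):

# Hash-partition the fact table on the join column into 160 partitions before the join,
# the DataFrame-level analogue of HashPartitioner(160) on an RDD.
shipments = sqlContext.table("shipments_fact").repartition(160, "category_code")
categories = sqlContext.table("product_category")
result = (shipments.join(categories, on="category_code", how="left_outer")
                   .select("product_category", "shipment_item_id", "shipment_amount"))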