PySpark cartesian join: renaming duplicate columns

I have a PySpark dataframe and I want to perform a cartesian join on itself.
I used the function below in PySpark:
# Cross Join
m_f_1 = (m_f_0.withColumnRenamed('value', 'value_x')
              .crossJoin(m_f_0.withColumnRenamed('value', 'value_y'))
              .drop(m_f_0.area)
              .drop(m_f_0.id))
The problem I am facing is that there is one more column, named type, and I want it renamed as well while performing the cross join. How can I do that?
The m_f_0 dataframe is:
id value area type
1: 100003 66007 Unknown mo
2: 100011 81716 Unknown mo
3: 100011 68028 Unknown mo
4: 100018 48358 Unknown mo
The output I expect after the cross join is in the format below:
id value_x value_y type_x type_y
1:
2:
3:

Try something like this:
import pyspark.sql.functions as F
# Suffix every column with _x / _y before the cross join so no names collide
m_f_x = m_f_0.select([F.col(c).alias('%s_x' % c) for c in m_f_0.columns])
m_f_y = m_f_0.select([F.col(c).alias('%s_y' % c) for c in m_f_0.columns])
m_f_1 = (m_f_x.crossJoin(m_f_y)
              .drop(m_f_x.area_x)
              .drop(m_f_x.id_x))
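If you also want the result to match the expected layout exactly (a single id column and no area columns), a small follow-up along these lines should work; it only reuses the column names already shown in the question:
# Drop the remaining duplicate column and rename id_y back to id
m_f_1 = (m_f_1.drop('area_y')
              .withColumnRenamed('id_y', 'id')
              .select('id', 'value_x', 'value_y', 'type_x', 'type_y'))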

Related

Join column to a new dataframe, but only when customerID exists

I want to give test a new column named customerCategory. If a customer from test already exists in train, I want to give them the same value as in train. Otherwise I want to fill the column with WHITE.
What I have already tried (it throws an error):
import pyspark.sql.functions as F
test = test.withColumn("customerCategory",
F.when(F.col("customerID")!=train["customerID"], "WHITE")\
.otherwise(train["customerCategory"]))
This is more of a join-and-coalesce requirement in SQL terms. An example would be like below:
test = test.join(train, test.customerID == train.customerID, "left_outer").drop(train.customerID)
test = test.withColumn('customerCategory', F.coalesce(F.col("customerCategory"), F.lit("WHITE")))
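For reference, here is a minimal, self-contained sketch of the same join-and-coalesce pattern; the tiny train/test frames and their values are made up purely for illustration:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data, only to demonstrate the pattern
train = spark.createDataFrame([(1, "GOLD"), (2, "SILVER")], ["customerID", "customerCategory"])
test = spark.createDataFrame([(1,), (3,)], ["customerID"])
test = test.join(train, test.customerID == train.customerID, "left_outer").drop(train.customerID)
test = test.withColumn("customerCategory", F.coalesce(F.col("customerCategory"), F.lit("WHITE")))
# customerID 1 -> GOLD (found in train), customerID 3 -> WHITE (no match)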

Trying to join tables and getting "Resolved attribute(s) columnName#17 missing from ..."

I'm trying to join two tables and getting a frustrating series of errors:
If I try this:
pop_table = mtrips.join(trips, (mtrips["DOLocationID"] == trips["PULocationID"]))
Then I get this error:
Resolved attribute(s) PULocationID#17 missing from PULocationID#2508,
If I try this:
pop_table = mtrips.join(trips, (col("DOLocationID") == col("PULocationID")))
I get this error:
"Reference 'DOLocationID' is ambiguous, could be: DOLocationID, DOLocationID.;"
If I try this:
pop_table = mtrips.join(trips, col("mtrips.DOLocationID") == col("trips.PULocationID"))
I get this error:
"cannot resolve '`mtrips.DOLocationID`' given input columns: [DOLocationID]
When I search on SO for these errors it seems like every post is telling me to try something that I've already tried and isn't working.
I don't know where to go from here. Help appreciated!
It looks like the usual column-ambiguity problem: there is some ambiguity in the names.
Are you deriving one of the dataframes from the other? In that case, use withColumnRenamed() to rename the 'join' columns in the second dataframe before you do the join operation, for example:
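A minimal sketch of that approach, assuming trips was indeed derived from mtrips; the _r suffix is just an arbitrary example name:
# Rename the join column in the second dataframe so the two sides no longer
# share the same underlying attribute, then join through the renamed frame
trips_renamed = trips.withColumnRenamed("PULocationID", "PULocationID_r")
pop_table = mtrips.join(trips_renamed, mtrips["DOLocationID"] == trips_renamed["PULocationID_r"])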
It is pretty evident that the issue is with the column names being the same in both dataframes.
1. When all the columns in the two dataframes are different, except for the join key column which has the same name in both, use this:
df = df.join(df_right, 'join_col_which_is_same_in_both_df', 'left')
2. When the join columns have different names in the two dataframes, this join keeps both columns (i.e. col1 and col2) in the joined dataframe:
df = df.join(df_right, df.col1 == df_right.col2, 'left')

Join two dataframes using Spark Scala

I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
(col("d1.origin_latitude")===col("d2.origin_latitude")&&
col("d1.origin_longitude")===col("d2.origin_longitude")),"left").
filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
(col("d3.origin_latitude")===col("d4.origin_latitude") &&
col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea?
Thank you.
You are aliasing the DataFrame, not the columns; the alias is only a handle for accessing/referring to columns of that DataFrame.
So the first join results in another DataFrame that has the same column names twice (origin_latitude as well as origin_longitude). As soon as you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
So you need to make sure that the DataFrame contains each column only once.
You can rewrite the first join as below. Since the "left join, then keep only rows where the right side is null" pattern is exactly an anti join, a left_anti join expresses it directly and returns only p_value's columns, so no name is duplicated:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")

Dynamic boolean join in pyspark

I have two PySpark dataframes with the same schema, as below:
df_source:
id, name, age
df_target:
id, name, age
"id" is primary column in both the tables and rest are attribute columns
i am accepting primary and attribute column list from user as below-
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the above two dataframes dynamically, as below:
df_update = df_source.join(
    df_target,
    (df_source["id"] == df_target["id"]) &
    ((df_source["name"] != df_target["name"]) | (df_source["age"] != df_target["age"])),
    how="inner"
).select([df_source[col] for col in df_source.columns])
How can I build this join condition dynamically in PySpark, since the number of attribute and primary key columns can change depending on the user input? Please help.
IIUC, you can achieve the desired output using an inner join on primary_columns and a where clause that loops over attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col
df_update = (df_source.alias("s")
    .join(df_target.alias("t"), on=primary_columns, how="inner")
    .where(
        reduce(
            lambda a, b: a | b,
            [col("s." + c) != col("t." + c) for c in attribute_columns]
        )
    )
    .select("s.*"))
Use reduce to apply the bitwise OR operation to the inequality conditions built from attribute_columns.
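For the example columns in the question (name and age), the reduce call builds a condition equivalent to writing it out by hand:
# What reduce builds when attribute_columns = ["name", "age"]
condition = (col("s.name") != col("t.name")) | (col("s.age") != col("t.age"))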

Minus logic implementation not working with spark/scala

Minus Logic in Hive:
The Hive query below returns only the records available in the left-side table (Full_Table ft), not the records present in both tables.
Select ft.* from Full_Table ft left join Stage_Table stg on ft.primary_key1 = stg.primary_key1 and ft.primary_key2 = stg.primary_key2 where stg.primary_key1 IS null and stg.primary_key2 IS null
I tried to implement the same in Spark/Scala using the following method (to support both a single primary key and a composite key), but the joined result set does not have the columns from the right table, and because of that I am not able to apply the stg.primary_key2 IS null condition on the joined result set.
ft.join(stg, usingColumns, "left_outer") // used Seq to support a composite key column join
Please suggest how to implement the minus logic in Spark/Scala.
Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/
If your tables have the same columns, you can use the except method of Dataset:
fullTable.except(stageTable)
If they don't, but you are interested only in a subset of columns that exists in both tables, you can first select those columns using the select transformation and then use except:
val fullTableSelectedColumns = fullTable.select("c1", "c2", "c3")
val stageTableSelectedColumns = stageTable.select("c1", "c2", "c3")
fullTableSelectedColumns.except(stageTableSelectedColumns)
Otherwise, you can use the join and filter transformations:
fullTable
  .join(stageTable,
    fullTable("primary_key1") === stageTable("primary_key1") &&
      fullTable("primary_key2") === stageTable("primary_key2"),
    "left")
  .filter(stageTable("primary_key1").isNull)