Cannot join on multiple conditions between two dataframes - PySpark

I am trying to perform a join between two dataframes.
df_temp_5 = df_temp_4 \
    .join(df_position_g, cond, "left")
Where
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1 & cond2]
and
df_temp_4 =df_key_hub_instrument\
.join(df_lnk_position,["instrument_pk"], "outer") \
.join(df_key_hub_portefeuille,["portefeuille_pk"], "outer") \
.join(df_lnk_tiers_instrument,["instrument_pk"], "outer") \
.join(df_keys_position_hors_bilan,["position_pk"], "outer") \
.join(df_keys_portefeuille_sigma,["portefeuille_pk"], "outer") \
.join(df_keys_instrument_sigma,["instrument_pk"], "outer") \
.join(df_keys_cotation_sigma,["instrument_pk"], "outer")
Note that df_temp_4 is fine and there is no problem there.
But I have an issue with the join that builds df_temp_5. The error is:
Py4JJavaError: An error occurred while calling o466.join. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
max_dt#238 missing from valeur_actuelle#88,montant_coupon_couru_acha
Any help, please? Thanks.

You are trying to join df_temp_4 with df_position_g, but the join condition cond references df_keys_position_g and df_position_g. This looks like a mistake.
Also, cond1 and cond2 must have parentheses around them when they are combined with &, because & has higher operator precedence than ==.
# This will work as expected
cond1=(df_position_g.position_pk==df_keys_position_g.position_pk)
cond2=(df_position_g.dt_deb_val==df_keys_position_g.max_dt)
cond = [cond1 & cond2]
Alternatively, the parentheses can be omitted if & is not used explicitly and the conditions are passed as a list instead.
# This will also work as expected
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1, cond2]
Finally, based on the error message shared, it looks like there is more than one column called max_dt involved, so Spark cannot identify the right one. It could also be due to this Spark bug. So try renaming the columns used in the join condition, just to be safe:
df_keys_position_g = df_keys_position_g.withColumnRenamed("position_pk", "keys_position_pk")\
.withColumnRenamed("max_dt", "keys_position_max_dt")

The condition should be
cond = cond1 & cond2
Instead of
cond = [cond1 & cond2]
This is because wrapping the expression in brackets makes cond a list rather than a single Column expression.
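A minimal sketch of the corrected call, with cond1 and cond2 as defined in the question:
cond = cond1 & cond2   # a single Column expression, not a one-element list
df_temp_5 = df_temp_4.join(df_position_g, cond, "left")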

Looking at this example from the source code of DataFrame.join:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

Related

Chaining joins in Pyspark

I am trying to join multiple dataframes in PySpark in one chained operation. The join key column name is the same in all of them. The code snippet:
columns_summed = [i for i in df_summed.columns if i != "buildingBlock_id"]
columns_concat = [i for i in df_concat.columns if i != "buildingBlock_id"]
columns_indicator = [i for i in df_indicator_fields.columns if i != "buildingBlock_id"]
columns_takeone = [i for i in df_takeone.columns if i != "buildingBlock_id"]
columns_minmax = [i for i in df_minmax.columns if i != "buildingBlock_id"]
df_all_joined = (df_summed.alias("df1").join(df_concat,df_summed.buildingBlock_id == df_concat.buildingBlock_id, "left")
.join(df_indicator_fields,df_summed.buildingBlock_id == df_indicator_fields.buildingBlock_id, "left")
.join(df_takeone,df_summed.buildingBlock_id == df_takeone.buildingBlock_id, "left")
.join(df_minmax,df_summed.buildingBlock_id == df_minmax.buildingBlock_id, "left")
.select("df1.buildingBlock_id", *columns_summed
, *columns_concat
, *columns_indicator
, *columns_takeone
, *columns_minmax
)
)
Now, when I am trying to display the joined dataframe using:
display(df_all_joined)
I'm getting the following error:
AnalysisException: Reference 'df1.buildingBlock_id' is ambiguous, could be: df1.buildingBlock_id, df1.buildingBlock_id.
Why am I getting this error even though I specified where the key column should come from?
You should specify the join columns as a list of column names:
.join(df_concat, ['buildingBlock_id'], "left")
When the columns you are joining on have the same name in both dataframes, this keeps only one copy of them in the result; in the case of a left join, the copy from df_concat is dropped.
If you don't do that, you end up with both columns in the joined dataframe, which causes the "ambiguous" exception.
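Applied to the chained join from the question, this could look roughly like the following (a sketch; it assumes the key column is literally named buildingBlock_id in every dataframe, as the question states):
df_all_joined = (
    df_summed
    .join(df_concat, ["buildingBlock_id"], "left")
    .join(df_indicator_fields, ["buildingBlock_id"], "left")
    .join(df_takeone, ["buildingBlock_id"], "left")
    .join(df_minmax, ["buildingBlock_id"], "left")
    # the key column now appears only once, so it can be selected unqualified
    .select("buildingBlock_id", *columns_summed, *columns_concat,
            *columns_indicator, *columns_takeone, *columns_minmax)
)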

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join on more than one column with aliased tables. Thanks!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try passing the join columns as a list of column names, as below -
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=["id", "cpid"], how="inner")
Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
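If you want to keep the table aliases and spell out each equality yourself (for example because the key columns are named differently on each side), the conditions have to be Column expressions rather than strings. A sketch:
from pyspark.sql import functions as F

df_initial_sample = df_crm.alias('crm').join(
    df_cngpt.alias('cng'),
    # each comparison is parenthesised because & binds tighter than ==
    on=(F.col('crm.id') == F.col('cng.id')) & (F.col('crm.cpid') == F.col('cng.cpid')),
    how='inner'
)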

Spark / Scala / SparkSQL dataframes filter issue "data type mismatch"

My problem is that I have code that receives the filter column and values in a list as parameters.
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser : List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList : List[String] = ListParser(0).split(",").map(_.trim).toList
if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
} else {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
For all the fields except the one that contains a date, the function works flawlessly, but for date fields it throws a "data type mismatch" error.
Note: I'm working with Parquet files.
When I write the filter manually I get the same error. Looking at how the query is sent to Spark SQL, the first one (where there is revenue) works but the second one doesn't. And when I just filter with dates, without the value of "vars" which contains the other columns, it works.
Well, my issue was that I was mixing SQL and Spark. When I concatenated the SQL query in my variable "vars" with df.filter(), especially when I used the between operator, it produced an output format unrecognised by Spark SQL:
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might seem correct, but after looking in the SQL documentation I saw it was missing parentheses (around vars); it needed to be:
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
The solution is to concatenate these correctly. To do that, I had to wrap the variable vars in expr(), which produces the desired syntax:
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))

agg condition : keyword can't be an expression with Pyspark

I am using PySpark to create a dataframe that calculates the sum of "montant" when the value of the column "isFraud" == 1.
But I get this error:
File "", line 5
when(col("isFraud") =1, sum("montant"))
^ SyntaxError: keyword can't be an expression
Here is the code:
CNP_df_fraude= (tx_wd_df
#.filter("isFraude =='1'").filter("POS_Card_Presence =='CardNotPresent'")
.groupBy("POS_Cardholder_Presence")
.agg(
when(col("isFraud") =1, sum("montant"))
)
)
Any idea please?
Thanks
Just put when() inside sum():
CNP_df_fraude= (tx_wd_df
.groupBy("POS_Cardholder_Presence")
.agg(
sum(when(col("isFraud")==1, col("montant")).otherwise(0))
)
)
A bare when() cannot be used inside the .agg() function; it has to be wrapped in an aggregate such as sum(), as shown above.
Alternatively, you could filter first and then aggregate:
CNP_df_fraude = (tx_wd_df
    .filter(F.col("isFraud") == 1)
    .groupBy("POS_Cardholder_Presence")
    .sum("montant")
)

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but when I do a SQL join I get an error that the column name cannot be resolved given the input columns. I have simplified the merge just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the dataframe mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the dataframes. I modified your query just a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now, if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' dataframe definition above, it has two fields with the same name (device_id). To avoid this, I recommend changing the field name in one of the dataframes:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use the DataFrame API or the sqlContext:
// using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,
  $"device" === $"device_id" && $"mnvr_bgn" < $"idx_trip_id", "inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
The queries above will work for your problem. If your data set is large, I would recommend not using '>' or '<' operators in the join condition, as they can cause a cross join, which is a costly operation on a large data set. Instead, use them in the WHERE clause.
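For instance, the last SQL query could keep only the equality in the ON clause and move the range predicate to WHERE; for an inner join the result is the same:
test = sqlContext.sql("""
    SELECT *
    FROM raw_data_filtered_temp a
    INNER JOIN mnvr_temp_idx_prev_temp b
        ON a.device_id = b.device
    WHERE a.idx_trip_id < b.mnvr_bgn
""")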