agg condition: "keyword can't be an expression" with PySpark - pyspark

I am using PySpark to create a dataframe which calculates the sum of "montant" when the value of the column "isFraud" == 1.
But I get this error:
File "", line 5
    when(col("isFraud") =1, sum("montant"))
                        ^
SyntaxError: keyword can't be an expression
Here is the code:
CNP_df_fraude = (tx_wd_df
    #.filter("isFraude =='1'").filter("POS_Card_Presence =='CardNotPresent'")
    .groupBy("POS_Cardholder_Presence")
    .agg(
        when(col("isFraud") =1, sum("montant"))
    )
)
Any idea please?
Thanks

Just put when() inside sum():
from pyspark.sql.functions import col, sum, when

CNP_df_fraude = (tx_wd_df
    .groupBy("POS_Cardholder_Presence")
    .agg(
        sum(when(col("isFraud") == 1, col("montant")).otherwise(0))
    )
)
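If you also want the aggregated column to have a readable name, you can add an alias; a minimal variant of the same pattern, where montant_fraude is just an illustrative name:
from pyspark.sql.functions import col, sum, when

CNP_df_fraude = (tx_wd_df
    .groupBy("POS_Cardholder_Presence")
    .agg(
        sum(when(col("isFraud") == 1, col("montant")).otherwise(0)).alias("montant_fraude")
    )
)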

when() on its own is not an aggregate expression, so you cannot use it directly inside the .agg() function.
You could however try:
from pyspark.sql import functions as F

CNP_df_fraude = tx_wd_df.filter(F.col("isFraud") == 1) \
    .groupBy("POS_Cardholder_Presence") \
    .sum("montant")
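Note that the two approaches behave differently when a group contains no fraud rows: sum(when(...).otherwise(0)) keeps that group with a total of 0, while filtering first drops it entirely. A minimal sketch of the difference with made-up data, assuming an active SparkSession named spark:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("Present", 0, 100.0), ("NotPresent", 1, 50.0)],
    ["POS_Cardholder_Presence", "isFraud", "montant"])

# Keeps both groups ("Present" gets a total of 0)
df.groupBy("POS_Cardholder_Presence") \
  .agg(F.sum(F.when(F.col("isFraud") == 1, F.col("montant")).otherwise(0)).alias("montant_fraude")) \
  .show()

# Keeps only the "NotPresent" group
df.filter(F.col("isFraud") == 1) \
  .groupBy("POS_Cardholder_Presence") \
  .sum("montant") \
  .show()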

Related

Cannot join on multiple conditions between two dataframes

I am trying to apply a join between two dataframes.
df_temp_5=df_temp_4 \
.join(df_position_g, cond, "left")
Where
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1 & cond2]
and
df_temp_4 = df_key_hub_instrument \
    .join(df_lnk_position, ["instrument_pk"], "outer") \
    .join(df_key_hub_portefeuille, ["portefeuille_pk"], "outer") \
    .join(df_lnk_tiers_instrument, ["instrument_pk"], "outer") \
    .join(df_keys_position_hors_bilan, ["position_pk"], "outer") \
    .join(df_keys_portefeuille_sigma, ["portefeuille_pk"], "outer") \
    .join(df_keys_instrument_sigma, ["instrument_pk"], "outer") \
    .join(df_keys_cotation_sigma, ["instrument_pk"], "outer")
Note that df_temp_4 is fine and there is no problem there.
But I have an issue with the join when trying to build df_temp_5. I get this error:
Py4JJavaError: An error occurred while calling o466.join. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
max_dt#238 missing from valeur_actuelle#88,montant_coupon_couru_acha
Any help, please? Thanks
You are trying to join df_temp_4 with df_position_g, but the join condition cond mentions both df_keys_position_g and df_position_g. This seems to be a mistake.
Also, cond1 and cond2 must be wrapped in parentheses, because the operator precedence of & is higher than that of ==.
# This will work as expected
cond1=(df_position_g.position_pk==df_keys_position_g.position_pk)
cond2=(df_position_g.dt_deb_val==df_keys_position_g.max_dt)
cond = [cond1 & cond2]
Alternatively, you can avoid the parentheses by passing the conditions as a list instead of combining them with & explicitly.
# This will also work as expected
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1, cond2]
Finally, based on the error message shared, it looks like there is more than one column called max_dt in df_position_g, so Spark cannot identify the right column. It could also be due to this Spark bug. Try renaming the columns used in the join condition, just to be safe.
df_keys_position_g = df_keys_position_g.withColumnRenamed("position_pk", "keys_position_pk")\
.withColumnRenamed("max_dt", "keys_position_max_dt")
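With the columns renamed, the join condition can then reference names that resolve unambiguously. A sketch of the rename-then-join pattern, using hypothetical dataframes df_left and df_right (adapt the names to the real ones):
df_right_renamed = (df_right
    .withColumnRenamed("position_pk", "keys_position_pk")
    .withColumnRenamed("max_dt", "keys_position_max_dt"))

cond = [
    df_left.position_pk == df_right_renamed.keys_position_pk,
    df_left.dt_deb_val == df_right_renamed.keys_position_max_dt,
]
joined = df_left.join(df_right_renamed, cond, "left")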
The condition should be
cond = cond1 & cond2
Instead of
cond = [cond1 & cond2]
This is because the brackets turn cond into a list containing a single combined condition.
Looking at the example in the source code of DataFrame.join:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

Spark / Scala / SparkSQL dataframes filter issue "data type mismatch"

My problem is that I have code that takes the filter column and its values in a list as parameters:
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser: List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList: List[String] = ListParser(0).split(",").map(_.trim).toList

if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
} else {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
For all the fields except the ones that contain dates, the function works flawlessly, but for date fields it throws an error.
Note: I'm working with Parquet files.
Here is the error. When I try to write the filter manually I get the same one. Here is how the query is sent to Spark SQL: the first one, where there is revenue, works, but the second one doesn't. And when I just filter on dates, without the value of "vars" (which contains other columns), it works.
Well, my issue was that I was mixing SQL and Spark. When I concatenated the SQL query in my variable "vars" with df.filter(), and especially when I used the between operator, it produced an output format unrecognised by Spark SQL:
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might seem correct, but after looking in the SQL documentation I found that parentheses were missing (around vars); it needed to be:
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
The solution is to combine the two parts correctly, by wrapping the variable vars in expr(), which produces the desired syntax:
import org.apache.spark.sql.functions.expr
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))

How to convert the Int column into a string in Pyspark?

Since I am a beginner with PySpark, can anyone help with converting an integer column into a string?
Here is my code in AWS Athena, and I need to convert it into a PySpark dataframe expression.
case when A.[HHs Reach] = 0 or A.[HHs Reach] is null then '0'
when A.[HHs Reach] = 1000000000 then '*'
else cast(A.[HHs Reach] as varchar) end as [HHs Reach]
Assuming df is your dataframe, something like this:
from pyspark.sql import functions as F

df.withColumn(
    "HHs Reach",
    F.when(F.col("HHs Reach").isNull(), '0')
    .when(F.col("HHs Reach") == 1000000000, '*')
    .otherwise(F.col("HHs Reach").cast("string"))
)
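Note that the A.[HHs Reach] = 0 branch of the Athena CASE is covered implicitly, since casting 0 to a string also yields '0'. A quick sanity check with a toy dataframe, assuming an active SparkSession named spark:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StructField, StructType

schema = StructType([StructField("HHs Reach", IntegerType(), True)])
df = spark.createDataFrame([(0,), (None,), (1000000000,), (42,)], schema)

df.withColumn(
    "HHs Reach",
    F.when(F.col("HHs Reach").isNull(), '0')
    .when(F.col("HHs Reach") == 1000000000, '*')
    .otherwise(F.col("HHs Reach").cast("string"))
).show()
# Expected "HHs Reach" values: 0, 0, *, 42 (all as strings)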

replace column and get ltrim of the column value

I want to replace a column value in a dataframe and need the Scala syntax for this.
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Needs to be written as: HIDENE
i.e. remove the Controlling_Area value present in Hierarchy_Name.
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
I expect to get the ltrim and replace code in Scala.
You can use withColumnRenamed to achieve that:
import org.apache.spark.sql.functions._

val dfPC = ReadLatest("/Full", "parquet")
  .withColumnRenamed("Hierarchy_Name", "Controlling_Area")
  .withColumn("Controlling_Area", ltrim(col("Controlling_Area")))

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
       case when sd.numberOfShares = 0 and
                 isnull(sd.derivatives, 0) = 0 and
                 sd.holdingTypeId not in (3, 10)
            then 8
            else holdingTypeId
       end as holdingTypeId
from sd;
First read the table as a dataframe:
val table = sqlContext.table("sd")
Then select with an expression; adjust the syntax according to your database:
val result = table.selectExpr("standardizationId", "case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid using the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3, 10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)