I created a field called v1 in a previous query, and now I am trying to derive a new field from it.
One method works, the other doesn't. I don't understand why; I expected them to be equivalent.
This works:
df = df.withColumn("outcome",expr("case when v1 = 0 then 1 when v1 > 0 then 2 else 0 end"))
This fails:
df = df.withColumn("outcome", F.when(F.col("v1") == 0, 1)
.F.when(F.col("v1") >0, 2)
.otherwise(0))
with error:
Py4JJavaError: An error occurred while calling o520.when.
: java.lang.IllegalArgumentException: when() can only be applied on a Column previously generated by when() function
You called when from pyspark.sql.functions (F) for the first condition, but when chaining further conditions you call .when on the Column the first call returned (as in F.when().when().otherwise()); you don't call it from F again.
Just change your code to :
df = df.withColumn("outcome", F.when(F.col("v1") == 0, 1)
.when(F.col("v1") >0, 2)
.otherwise(0))
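For completeness, a minimal self-contained sketch (the sample dataframe is made up) showing the two forms agree once the chain is fixed:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(-1,), (0,), (5,)], ["v1"])

# The same CASE logic written both ways
via_expr = df.withColumn("outcome", expr("case when v1 = 0 then 1 when v1 > 0 then 2 else 0 end"))
via_when = df.withColumn("outcome", F.when(F.col("v1") == 0, 1).when(F.col("v1") > 0, 2).otherwise(0))

assert sorted(via_expr.collect()) == sorted(via_when.collect())  # outcomes 0, 1, 2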
Related
I want to select all rows from a pyspark df except the rows where the array column contains a certain value. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it like this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(cond)
I get the error below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass the condition in through a variable, it needs to be written as a Spark SQL expression string (the syntax expr understands), not as Python lambda syntax. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'
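A minimal end-to-end sketch (the column name words and the sample rows are made up; assumes Spark 3.0+, where exists is available as a SQL higher-order function):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["hello", "a"],), (["b", "c"],)], ["words"])

# In the SQL expression string, ! replaces ~ and x -> ... replaces the Python lambda
cond = '!exists(words, x -> x == "hello")'
df.filter(cond).show()  # keeps only the ["b", "c"] row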
I need to return boolean false if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value, i.e., if my input dataframe has duplicate columns with the same name it must return false.
@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
        df_cols[i] = df_cols[i] + '_duplicated'
    df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove

duplicate_cols = udf(get_duplicates_cols, BooleanType())
You don't need any UDF; you simply need a Python function. The check runs in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate on its own:
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
Assuming that you pass the data frame to the function:
@udf(returnType=BooleanType())
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
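A quick sanity check of the plain (non-UDF) function above (the sample dataframe is made up; note that a select can produce duplicate column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

print(checkDuplicate(df))                   # True: all column names are unique
print(checkDuplicate(df.select("a", "a")))  # False: "a" appears twice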
I wanted to return a List from a custom function but I'm getting an error:
def myFunc(credit: Column) = { for (i <- 0 to col("credit")) yield i }
Calling the custom function:
.withColumn("History", explode(myFunc("credit")))
Error message: "Expected column but found Seq"
I want to explode it to split the values into multiple rows.
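For reference, Spark's built-in sequence plus explode expresses this without a custom function. A minimal PySpark sketch of the idea (the question is in Scala, but the same functions exist in both APIs from Spark 2.4 on; the sample dataframe is made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, lit, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3,), (1,)], ["credit"])

# One output row per value in 0..credit, computed per row on the executors
df.withColumn("History", explode(sequence(lit(0), col("credit")))).show()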
from pyspark.sql.functions import *
z = k.withColumn('date', when(k.date > 29, 1).otherwise(0)).collect()
Now I want to add a suffix to the dataframe's column names:
z1 = k.add_suffix(19)
but I am getting the error:
AttributeError: 'DataFrame' object has no attribute 'add_suffix'
Thanks
If you would like to add a suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_suffix(sdf, suffix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(c, suffix))
    return sdf
You can amend sdf.columns as you see fit.
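For the dataframe in the question, usage would look like this (the underscore in the suffix is just one choice of separator):

z1 = add_suffix(k, '_19')  # date -> date_19, and likewise for every other column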
Try the .withColumnRenamed function instead of add_suffix:
z1 = k.withColumnRenamed('date', 'date_19')
(or)
You can create a lambda function that adds the suffix to all column names in the data frame, as sketched below.
Reference: How to add suffix and prefix to all columns in python/pyspark dataframe
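A minimal sketch of that lambda approach, using toDF to rebuild the dataframe under new column names (equivalent to the loop above):

add_suffix = lambda sdf, suffix: sdf.toDF(*['{}{}'.format(c, suffix) for c in sdf.columns])
z1 = add_suffix(k, '_19')  # renames every column at once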
Here is what I am trying to do. I have two tables that have exactly the same column names. The tables look somewhat like this:
-----------
A B C D
-----------
1 2 3 4
5 6 3 4
7 8 3 4
The logic I need is: compare the A, B, C, D columns in Table1 with Table2. If A and B match each other, return a new column with value 0, else return 1. If C from Table1 is 3, return 0, else return 1. Only one value should be returned for each row, with priority C > D > A=B.
I joined the two tables (dataFrames) into combinedDf. This is how I join them: Table1.join(Table2, table1($"A") === table2($"A"))
So here is what I did:
def func(A: mutable.WrappedArray[String], B: mutable.WrappedArray[String], C: String, D: String) = {
  if (C == "3") "0"
  else if (D == "4") "1"
  else if ((0 to A.length - 1).exists(i => A(i) == B(i))) "0"
  else "1"
}
For this function I want to put the A and B columns from Table1 into one array and the A and B columns from Table2 into another array, then run a loop to check equality. (I need the arrays because in the real case I have a variable number of columns to compare.)
And here is how I call the function.
combinedDf.withColumn("returnVal",func(array(col("table1.A"),col("table1.B")),
array(col("table2.A"),col("table2.B")),col("table1.C"),col("table1.D")))
But it just doesn't work; even though I put the columns inside an array using the array function, it still tells me type mismatch.
Error message: <console>:67: error: type mismatch; found: org.apache.spark.Column, required: String
Thanks in advance!
You can try this. However, help me understand one thing: why do you need to combine the dataframes, and what do you mean by "if A and B match" (my assumption is per row, am I right)? If your A, B, C, D columns are String, then change Integer to String.
def func(A: Integer, B: Integer, C: Integer, D: Integer) = {
  if (C == 3) "0"
  else if (D == 4) "1"
  else if (A == B) "0"
  else "1"
}
val udfFunc = udf(func _)
combinedDf.withColumn("returnVal",
udfFunc(col("table1.A"), col("table1.B"),
col("table1.C"),col("table1.D")
)
)