I created a field called v1 in a previous query, and now I am trying to derive a new field from it.
One method works, the other doesn't. I don't understand why; I expected them to be equivalent.
This works:
df = df.withColumn("outcome",expr("case when v1 = 0 then 1 when v1 > 0 then 2 else 0 end"))
This fails:
df = df.withColumn("outcome", F.when(F.col("v1") == 0, 1)
.F.when(F.col("v1") >0, 2)
.otherwise(0))
with error:
Py4JJavaError: An error occurred while calling o520.when.
: java.lang.IllegalArgumentException: when() can only be applied on a Column previously generated by when() function
You called when from pyspark.sql.functions (F) for the first condition, but when chaining further conditions you call .when on the Column the first call returned (as in F.when().when().otherwise()); you don't call it from F again.
Just change your code to :
df = df.withColumn("outcome", F.when(F.col("v1") == 0, 1)
.when(F.col("v1") >0, 2)
.otherwise(0))
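For completeness, a minimal self-contained sketch (the sample dataframe is made up) showing the two forms agree once the chain is fixed:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(-1,), (0,), (5,)], ["v1"])

# The same CASE logic written both ways
via_expr = df.withColumn("outcome", expr("case when v1 = 0 then 1 when v1 > 0 then 2 else 0 end"))
via_when = df.withColumn("outcome", F.when(F.col("v1") == 0, 1).when(F.col("v1") > 0, 2).otherwise(0))

assert sorted(via_expr.collect()) == sorted(via_when.collect())  # outcomes 0, 1, 2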
Related
I want to select all rows from a pyspark df except the rows where the array column contains a certain value. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it like this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(cond)
I get the error below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass the condition in through a variable, it needs to be written as a Spark SQL expression string (the syntax expr understands), not as Python lambda syntax. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'
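A minimal end-to-end sketch (the column name words and the sample rows are made up; assumes Spark 3.0+, where exists is available as a SQL higher-order function):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["hello", "a"],), (["b", "c"],)], ["words"])

# In the SQL expression string, ! replaces ~ and x -> ... replaces the Python lambda
cond = '!exists(words, x -> x == "hello")'
df.filter(cond).show()  # keeps only the ["b", "c"] row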
I need to return boolean false if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value, i.e., if my input dataframe has duplicate columns with the same name it must return false.
@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
        df_cols[i] = df_cols[i] + '_duplicated'
    df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove

duplicate_cols = udf(get_duplicates_cols, BooleanType())
You don't need any UDF; you simply need a Python function. The check runs in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate on its own:
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
Assuming that you pass the data frame to the function:
@udf(returnType=BooleanType())
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
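A quick sanity check of the plain (non-UDF) function above (the sample dataframe is made up; note that a select can produce duplicate column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

print(checkDuplicate(df))                   # True: all column names are unique
print(checkDuplicate(df.select("a", "a")))  # False: "a" appears twice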
I wanted to return a List from a custom function but I'm getting an error:
def myFunc(credit: Column) = { for (i <- 0 to col("credit")) yield i }
Calling the custom function:
.withColumn("History", explode(myFunc("credit")))
Error message: "Expected column but found Seq"
I want to explode it to split the values into multiple rows.
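For reference, Spark's built-in sequence plus explode expresses this without a custom function. A minimal PySpark sketch of the idea (the question is in Scala, but the same functions exist in both APIs from Spark 2.4 on; the sample dataframe is made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, lit, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3,), (1,)], ["credit"])

# One output row per value in 0..credit, computed per row on the executors
df.withColumn("History", explode(sequence(lit(0), col("credit")))).show()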
from pyspark.sql.functions import *
z = k.withColumn('date', when(k.date > 29, 1).otherwise(0)).collect()
Now I want to add a suffix to the dataframe's column names:
z1 = k.add_suffix(19)
but I am getting the error:
AttributeError: 'DataFrame' object has no attribute 'add_suffix'
Thanks
If you would like to add a suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_suffix(sdf, suffix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(c, suffix))
    return sdf
You can amend sdf.columns as you see fit.
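For the dataframe in the question, usage would look like this (the underscore in the suffix is just one choice of separator):

z1 = add_suffix(k, '_19')  # date -> date_19, and likewise for every other column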
Try the .withColumnRenamed function instead of add_suffix:
z1 = k.withColumnRenamed('date', 'date_19')
(or)
You can create a lambda function that adds the suffix to all column names in the data frame, as sketched below.
Reference: How to add suffix and prefix to all columns in python/pyspark dataframe
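A minimal sketch of that lambda approach, using toDF to rebuild the dataframe under new column names (equivalent to the loop above):

add_suffix = lambda sdf, suffix: sdf.toDF(*['{}{}'.format(c, suffix) for c in sdf.columns])
z1 = add_suffix(k, '_19')  # renames every column at once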
Here is what I am trying to do. I have two tables that have exactly the same column names. The tables look somewhat like this:
-----------
A B C D
-----------
1 2 3 4
5 6 3 4
7 8 3 4
The logic I need is: compare the A, B, C, D columns in Table1 with Table2. If A and B match each other, return a new column with value 0, else return 1. If C from Table1 is 3, return 0, else return 1. Only one value should be returned for each row, with priority C > D > A=B.
I joined the two tables (dataFrames) into combinedDf. This is how I join them: Table1.join(Table2, table1($"A") === table2($"A"))
So here is what I did:
def func(A: mutable.WrappedArray[String], B: mutable.WrappedArray[String], C: String, D: String) = {
  if (C == "3") "0"
  else if (D == "4") "1"
  else if ((0 to A.length - 1).exists(i => A(i) == B(i))) "0"
  else "1"
}
For this function I want to put the A and B columns from Table1 into one array and the A and B columns from Table2 into another array, then run a loop to check equality. (I need the arrays because in the real case I have a variable number of columns to compare.)
And here is how I call the function.
combinedDf.withColumn("returnVal",func(array(col("table1.A"),col("table1.B")),
array(col("table2.A"),col("table2.B")),col("table1.C"),col("table1.D")))
But it just doesn't work; even though I put the columns inside an array using the array function, it still tells me type mismatch.
Error message: <console>:67: error: type mismatch; found: org.apache.spark.Column, required: String
Thanks in advance!
You can try this. However, help me understand one thing: why do you need to combine the dataframes, and what do you mean by "if A and B match" (my assumption is per row, am I right)? If your A, B, C, D columns are String, then change Integer to String.
def func(A: Integer, B: Integer, C: Integer, D: Integer) = {
  if (C == 3) "0"
  else if (D == 4) "1"
  else if (A == B) "0"
  else "1"
}
val udfFunc = udf(func _)
combinedDf.withColumn("returnVal",
udfFunc(col("table1.A"), col("table1.B"),
col("table1.C"),col("table1.D")
)
)