Create new column based on equality between existing columns - scala

While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a dataframe; its value is determined by comparing two existing String columns of the dataframe, eCol1 and eCol2.
something like:
df(nCol) = {
  if df(eCol1) == df(eCol2) then 1
  else 0
}
I believe it could be done with the help of user-defined functions (UDFs). But isn't there a tidier way for such a trivial task?

You need to work with the DataFrame DSL's when/otherwise; to test equality, use ===:
import org.apache.spark.sql.functions.when
df.withColumn("nCol", when(df(eCol1) === df(eCol2), 1).otherwise(0))

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I could not pass the column in, and expr failed with an error like:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
from datetime import datetime
from pyspark.sql.functions import expr, lit

df = df.withColumn("zero", lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset", expr("date_add(zero, offset)")).drop("zero")
Since I cannot use lit and datetime.strptime inside expr, I have to use this approach, which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
import pyspark.sql.functions as F
df.withColumn("date_offset", F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column called type; based on it, the code below can be used:
df.withColumn("date_offset", F.expr("case when type = 'month' then add_months(cast('2000-01-01' as date), offset) else date_add(cast('2000-01-01' as date), cast(offset as int)) end")).show()

Spark-Scala: Get Dataframe Variable by concatenating two String Variables

I have a scenario where I need to form a dataframe name from two string variables, which is easy enough to do by concatenating them.
Example: "df_" + "part1324"
The above code returns a String. I want it to be a DataFrame variable through which I can perform further operations on the data frame.
A Map can be used to assign names to DataFrames:
import spark.implicits._
val df = List("df_value").toDF()
val stringVariable = "part1324"
// assign a name to the dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// get the dataframe by name
namedDataFrames("df_part1324").show(false)
Your question is confusing. What do you mean by a dataframe variable? Concatenating two strings will always return a String. To create a dataframe, you need to use one of the methods available for creating dataframes.
A val df: DataFrame cannot be equal to df_part1234 (a String) as in your example, but to get a dataframe you need to do something like
val df_part1234 = spark.range(1000).toDF("number")
where spark is your SparkSession variable.
If you need to generate this variable dynamically, place the dataframe-creation statement inside the variable-generation logic, such as a loop, as sketched below.
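For example, a minimal sketch (with hypothetical name suffixes) that creates the dataframes inside the generation loop and stores them in a Map:
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Hypothetical list of name suffixes driving the dynamic creation.
val suffixes = Seq("part1324", "part1325")
val namedDataFrames: Map[String, DataFrame] =
  suffixes.map(s => ("df_" + s) -> spark.range(1000).toDF("number")).toMap
// Retrieve a dataframe by its generated name.
namedDataFrames("df_part1324").show(5, false)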
Please rewrite your question if you are trying to achieve something else (along with a code snippet to reproduce the issue), or accept the answer if the issue is now clear.

change a dataframe row value with dynamic number of columns spark scala

I have a dataframe (containing 10 columns) for which I want to change the value of a row (for the last column only). I have written the following code for this:
import org.apache.spark.sql.Row

val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r => {
  Row(r.get(0), r.get(1), r.get(2), r.get(3), r.get(4),
      r.get(5), r.get(6), r.get(7), r.get(8), decrementCounter(r))
}), WRADF.schema)
I want to change the value of a row for the 10th column only (for which I wrote the decrementCounter() function). But the above code only works for dataframes with exactly 10 columns. I don't know how to adapt this code so that it can run for dataframes with a different number of columns. Any help will be appreciated.
Don't do it like this. Define a udf instead:
import org.apache.spark.sql.functions.udf
val decrementCounter = udf((x: T) => ...) // adjust the type and body to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))
I think a UDF will be the better choice because it can be applied using the column name itself.
For more on udfs you can take a look here: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code, just use this:
import org.apache.spark.sql.functions.udf
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
What it does is apply the decrementCounter function to each and every value of the column columnName.
I hope this helps, cheers!
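Since the question asks for code that works regardless of the column count, one option (a sketch, not part of the original answers) is to pick the last column by position:
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical decrement logic; adjust the type and body to your data.
val decrementCounterUDF = udf((x: Int) => x - 1)
// Address the last column by position, so this works for any column count.
val lastCol = WRADF.columns.last
val newDF = WRADF.withColumn(lastCol, decrementCounterUDF(col(lastCol)))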

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace empty strings in all of these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map("" -> null))
and
val ReadDf = rawDF.withColumn("columnA", if ($"columnA" == "") lit(null) else $"columnA")
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once, on the driver (and not per record); you want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
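Since the question mentions n columns, a hedged generalization (not part of the original answer) folds the same when expression over every column, assuming all of them are String-typed:
import org.apache.spark.sql.functions.{col, lit, when}
// Replace empty strings with null in every column of the frame.
val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}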

Apache Spark Dataframe - Issue with setting up a not-equal join

I have 2 dataframes that I'm joining on multiple columns. The first pair of columns is compared for equality and the second pair for inequality. The code looks like this:
val arule_1w = itemLHS
  .join(itemRHS, itemLHS("CUST_ID") === itemRHS("CUST_ID") && itemLHS("LHS") != itemRHS("RHS"))
The resulting data still has rows where itemLHS("LHS") equals itemRHS("RHS"), which it shouldn't with the not-equal join. It may be user error as well, but all my research tells me that format is correct. All datatypes are string values.
Thanks for your help!
The correct method is =!=, not !=. Scala's != compares the two Column objects on the driver and returns a plain Boolean (always true here), so your inequality check silently drops out of the join condition.
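Applied to the join from the question, the corrected condition looks like this:
val arule_1w = itemLHS
  .join(itemRHS, itemLHS("CUST_ID") === itemRHS("CUST_ID") && itemLHS("LHS") =!= itemRHS("RHS"))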
Use the syntax below (!== also works, though it has been deprecated in favor of =!= since Spark 2.0):
itemLHS("LHS") !== itemRHS("RHS")