If-If statement Scala Spark

I have a dataframe for which I have to create a new column based on values in the already existing columns. The catch is that I can't write a CASE statement, because a CASE checks the first WHEN condition and only falls through to the next WHEN if it is not satisfied. E.g. consider this dataframe:
+-+-----+-+
|A|B |C|
+-+-----+-+
|1|true |1|-----> Condition 1 and 2 is satisfied Here
|1|true |0|-----> Condition 1 is satisfied here
|1|false|1|
|2|true |1|
|2|true |0|
+-+-----+-+
Consider this CASE statement:
CASE WHEN A = 1 and B = 'true' then 'A'
WHEN A = 1 and B = 'true' and C=1 then 'B'
END
It gives me no row with the value B, because the first WHEN already matches those rows and the more specific condition is never reached.
Expected output:
+-+-----+-+----+
|A|B |C|D |
+-+-----+-+----+
|1|true |1|A |
|1|true |1|B |
|1|true |0|A |
|1|false|1|null|
|2|true |1|null|
|2|true |0|null|
+-+-----+-+----+
I know I can derive this with 2 separate dataframes and then union them, but I am looking for a more efficient solution.
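For reference, a when chain (like SQL CASE) evaluates its branches in order and returns only the first match, so even putting the more specific condition first would give each row a single label, never both. A minimal sketch of that single-label column (not the solution, just the limitation):

import org.apache.spark.sql.functions.when

// Reordered branches: rows matching both conditions get 'B', but never 'A' and 'B' together.
val singleLabel = when($"A" === 1 && $"B" === true && $"C" === 1, "B")
  .when($"A" === 1 && $"B" === true, "A")

The answer below works around this by collecting all matching labels into an array and exploding it.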

Creating the dataframe:
val df1 = Seq((1, true, 1), (1, true, 0), (1, false, 1), (2, true, 1), (2, true, 0)).toDF("A", "B", "C")
df1.show()
// +---+-----+---+
// | A| B| C|
// +---+-----+---+
// | 1| true| 1|
// | 1| true| 0|
// | 1|false| 1|
// | 2| true| 1|
// | 2| true| 0|
// +---+-----+---+
The code:
val condition1 = ($"A" === 1) && ($"B" === true)
val condition2 = condition1 && ($"C" === 1)
val arr1 = array(when(condition1, "A"), when(condition2, "B"))
val arr2 = when(element_at(arr1, 2).isNull, slice(arr1, 1, 1)).otherwise(arr1)
val df2 = df.withColumn("D", explode(arr2))
df2.show()
// +---+-----+---+----+
// | A| B| C| D|
// +---+-----+---+----+
// | 1| true| 1| A|
// | 1| true| 1| B|
// | 1| true| 0| A|
// | 1|false| 1|null|
// | 2| true| 1|null|
// | 2| true| 0|null|
// +---+-----+---+----+
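On Spark 3.0+ the same idea can be written without the element_at/slice branching, reusing df1, condition1, condition2 and the functions import from above: the filter higher-order function drops the null labels, and explode_outer keeps rows whose label array ends up empty (they get D = null). A sketch under that version assumption:

// Keep only non-null labels, then explode_outer so non-matching rows survive with D = null.
val labels = filter(array(when(condition1, "A"), when(condition2, "B")), _.isNotNull)
val df3 = df1.withColumn("D", explode_outer(labels))
df3.show()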

Related

How to create a dataframe based on the first appearing date and on additional columns, for each id

I am trying to create a dataframe with the following condition:
I have multiple IDs, multiple columns with defaults (0 or 1) and a startdate column. I would like to get a dataframe with the defaults that appear on the first startdate (default_date) for each id.
The original df looks like this:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2019-01-31|
| 02| 1| 1| 0| 2018-12-31|
| 03| 1| 1| 1| 2018-10-31|
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 0| 2018-07-31|
| 03| 1| 1| 1| 2019-05-31|
+----+-----+-----+-----+-----------+
This is how I would like to have it:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 1| 2018-07-31|
+----+-----+-----+-----+-----------+
I tried the following code:
val w = Window.partitionBy($"id").orderBy($"date".asc)
val result = join3.withColumn("rn", row_number.over(w)).where($"def_a" === 1 || $"def_b" === 1 || $"def_c" === 1).filter($"rn" >= 1).drop("rn")
result.show
I would be grateful for any help.
This should work for you: first compute the minimum date per id, then join the new df2 back to the original df.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, 0, 1, "2019-01-31"),
  (2, 1, 1, 0, "2018-12-31"),
  (3, 1, 1, 1, "2018-10-31"),
  (1, 1, 0, 1, "2018-09-30"),
  (2, 1, 1, 0, "2018-08-31"),
  (3, 1, 1, 0, "2018-07-31"),
  (3, 1, 1, 1, "2019-05-31"))
  .toDF("id", "def_a", "def_b", "deb_c", "date")

val w = Window.partitionBy($"id").orderBy($"date".asc)

// One (id, min_date) row per id.
val df2 = df.withColumn("date", $"date".cast("date"))
  .withColumn("min_date", min($"date").over(w))
  .select("id", "min_date")
  .distinct()

// Keep only the original rows whose date equals the per-id minimum.
df.join(df2, df("id") === df2("id") && df("date") === df2("min_date"))
  .select(df("*"))
  .show
And the output should be:
+---+-----+-----+-----+----------+
| id|def_a|def_b|deb_c| date|
+---+-----+-----+-----+----------+
| 1| 1| 0| 1|2018-09-30|
| 2| 1| 1| 0|2018-08-31|
| 3| 1| 1| 0|2018-07-31|
+---+-----+-----+-----+----------+
By the way, I believe you had a little mistake in your expected results: it should be (3, 1, 1, 0, 2018-07-31), not (3, 1, 1, 1, 2018-07-31).
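A variant sketch that avoids the self-join, reusing the df and imports from the answer above: take the earliest row per id directly with row_number (essentially the asker's own attempt with the filter corrected to rn === 1). Note that if two rows share the minimum date, this keeps only one of them, whereas the join keeps both.

val w2 = Window.partitionBy($"id").orderBy($"date".asc)
df.withColumn("rn", row_number().over(w2))
  .filter($"rn" === 1)
  .drop("rn")
  .show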

Scala/Spark: How to select columns to read ONLY when list of columns > 0

I'm passing in a parameter fieldsToLoad: List[String] and I want to load ALL columns if this list is empty, and only the columns specified in the list if it has one or more entries. I have this now, which reads the columns passed in the list:
val parquetDf = sparkSession.read.parquet(inputPath:_*).select(fieldsToLoad.head, fieldsToLoad.tail:_*)
But how do I add a condition to load * (all columns) when the list is empty?
@Andy Hayden's answer is correct, but I want to show how the selectExpr function can simplify the selection:
scala> val df = Range(1, 4).toList.map(x => (x, x + 1, x + 2)).toDF("c1", "c2", "c3")
df: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
scala> df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
scala> val fieldsToLoad = List("c2", "c3")
fieldsToLoad: List[String] = List(c2, c3)
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+
| c2| c3|
+---+---+
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
scala> val fieldsToLoad = List()
fieldsToLoad: List[Nothing] = List()
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
You could use an if statement first to replace the empty list with just "*":
val cols = if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")
sparkSession.read.parquet(inputPath:_*).select(cols.head, cols.tail:_*)
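A minimal end-to-end sketch of the same idea (assuming sparkSession, inputPath: Seq[String] and fieldsToLoad: List[String] as in the question), using map(col) so there is no need to split the list into head and tail:

import org.apache.spark.sql.functions.col

// Read everything, then narrow to the requested columns only when the list is non-empty.
val base = sparkSession.read.parquet(inputPath: _*)
val parquetDf =
  if (fieldsToLoad.nonEmpty) base.select(fieldsToLoad.map(col): _*)
  else base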

Joining data in spark data frames using Scala

I have a Spark dataframe in Scala as below -
val df = Seq(
(0,0,0,0.0,0),
(1,0,0,0.1,1),
(0,1,0,0.11,1),
(0,0,1,0.12,1),
(1,1,0,0.24,2),
(1,0,1,0.27,2),
(0,1,1,0.3,2),
(1,1,1,0.4,3)
).toDF("A","B","C","rate","total")
This is what it looks like:
scala> df.show
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 0| 0| 0| 0.0| 0|
| 1| 0| 0| 0.1| 1|
| 0| 1| 0|0.11| 1|
| 0| 0| 1|0.12| 1|
| 1| 1| 0|0.24| 2|
| 1| 0| 1|0.27| 2|
| 0| 1| 1| 0.3| 2|
| 1| 1| 1| 0.4| 3|
+---+---+---+----+-----+
A, B and C are channels in this case; 0 and 1 represent absence and presence of a channel respectively. All 2^3 = 8 combinations appear in the dataframe, with the column 'total' giving the row-wise sum of these 3 channels.
The individual probabilities of each channel occurring on its own are given by:
scala> val oneChannelCase = df.filter($"total" === 1).toDF()
scala> oneChannelCase.show()
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 1| 0| 0| 0.1| 1|
| 0| 1| 0|0.11| 1|
| 0| 0| 1|0.12| 1|
+---+---+---+----+-----+
However, I am interested only in the pair-wise probabilities of these channels, which are given by:
scala> val probs = df.filter($"total" === 2).toDF()
scala> probs.show()
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 1| 1| 0|0.24| 2|
| 1| 0| 1|0.27| 2|
| 0| 1| 1| 0.3| 2|
+---+---+---+----+-----+
What I would like to do is append 3 new columns to this "probs" dataframe that show the individual probabilities. Below is the output that I am looking for:
A B C rate prob_A prob_B prob_C
1 1 0 0.24 0.1 0.11 0
1 0 1 0.27 0.1 0 0.12
0 1 1 0.3 0 0.11 0.12
To make things clearer, the first row of the output shows A=1, B=1, C=0, so the individual probabilities A=0.1, B=0.11 and C=0 are appended to the probs dataframe. Similarly, the second row has A=1, B=0, C=1, so the individual probabilities A=0.1, B=0 and C=0.12 are appended.
Here is what I have tried:
scala> val channels = df.columns.filter(v => !(v.contains("rate") | v.contains("total")))
// channels: Array[String] = Array(A, B, C)
scala> val pivotedProb = channels.map(v => f"case when $v = 1 then rate else 0 end as prob_${v}")
scala> val param = pivotedProb.mkString(",")
scala> df.createOrReplaceTempView("df") // register df so spark.sql can reference it by name
scala> val probs = spark.sql(f"select *, $param from df")
scala> probs.show()
+---+---+---+----+-----+------+------+------+
| A| B| C|rate|total|prob_A|prob_B|prob_C|
+---+---+---+----+-----+------+------+------+
| 0| 0| 0| 0.0| 0| 0.0| 0.0| 0.0|
| 1| 0| 0| 0.1| 1| 0.1| 0.0| 0.0|
| 0| 1| 0|0.11| 1| 0.0| 0.11| 0.0|
| 0| 0| 1|0.12| 1| 0.0| 0.0| 0.12|
| 1| 1| 0|0.24| 2| 0.24| 0.24| 0.0|
| 1| 0| 1|0.27| 2| 0.27| 0.0| 0.27|
| 0| 1| 1| 0.3| 2| 0.0| 0.3| 0.3|
| 1| 1| 1| 0.4| 3| 0.4| 0.4| 0.4|
+---+---+---+----+-----+------+------+------+
This gives me the wrong output: the prob columns just copy the row's own rate instead of the single-channel rate.
Kindly help.
If I understand your requirement correctly, you can use foldLeft to traverse the channel columns to 1) generate a ratesMap from the one-channel dataframe, and 2) add columns to the two-channel dataframe whose values equal the product of the channel value and the corresponding ratesMap entry:
import org.apache.spark.sql.functions._

val df = Seq(
  (0, 0, 0, 0.0, 0),
  (1, 0, 0, 0.1, 1),
  (0, 1, 0, 0.11, 1),
  (0, 0, 1, 0.12, 1),
  (1, 1, 0, 0.24, 2),
  (1, 0, 1, 0.27, 2),
  (0, 1, 1, 0.3, 2),
  (1, 1, 1, 0.4, 3)
).toDF("A", "B", "C", "rate", "total")

val oneChannelDF = df.filter($"total" === 1)
val twoChannelDF = df.filter($"total" === 2)

val channels = df.columns.filter(v => !(v.contains("rate") || v.contains("total")))
// channels: Array[String] = Array(A, B, C)

// Rate of each channel occurring on its own, collected to the driver.
val ratesMap = channels.foldLeft( Map[String, Double]() ){ (acc, c) =>
  acc + (c -> oneChannelDF.select("rate").where(col(c) === 1).head.getDouble(0))
}
// ratesMap: scala.collection.immutable.Map[String,Double] = Map(A -> 0.1, B -> 0.11, C -> 0.12)

// prob_X = X * rate(X): 0 when the channel is absent, its single-channel rate when present.
val probsDF = channels.foldLeft( twoChannelDF ){ (acc, c) =>
  acc.withColumn( "prob_" + c, col(c) * ratesMap.getOrElse(c, 0.0) )
}
probsDF.show
// +---+---+---+----+-----+------+------+------+
// | A| B| C|rate|total|prob_A|prob_B|prob_C|
// +---+---+---+----+-----+------+------+------+
// | 1| 1| 0|0.24| 2| 0.1| 0.11| 0.0|
// | 1| 0| 1|0.27| 2| 0.1| 0.0| 0.12|
// | 0| 1| 1| 0.3| 2| 0.0| 0.11| 0.12|
// +---+---+---+----+-----+------+------+------+
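For comparison, a hedged sketch of what the asker's SQL approach would need to look like, reusing channels, ratesMap and twoChannelDF from the answer above: the case expression must return the looked-up single-channel rate, not the row's own rate column:

// Register the two-channel dataframe and build one case expression per channel.
twoChannelDF.createOrReplaceTempView("two_channel")
val probExprs = channels.map(c => s"case when $c = 1 then ${ratesMap(c)} else 0.0 end as prob_$c")
spark.sql(s"select *, ${probExprs.mkString(", ")} from two_channel").show()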

How to use Sum on groupBy result in Spark DataFrames?

Based on the following dataframe:
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
| 1| A| 10|
| 1| A| 5|
| 2| A| 56|
| 2| B| 13|
+---+-----+----+
I would like to obtain the sum of the column Amnt grouped by ID and Categ:
+---+-----+-----+
| ID|Categ|Count|
+---+-----+-----+
| 1| A| 15 |
| 2| A| 56 |
| 2| B| 13 |
+---+-----+-----+
In SQL I would be doing something like
SELECT ID,
       Categ,
       SUM (Count)
FROM Table
GROUP BY ID, Categ;
But how to do this in Scala?
I tried
DF.groupBy($"ID", $"Categ").sum("Count")
But this just changed the Count column name into sum(count) instead of actually giving me the sum of the counts.
Maybe you were summing the wrong column, but your groupBy/sum statement looks syntactically correct to me:
val df = Seq(
  (1, "A", 10),
  (1, "A", 5),
  (2, "A", 56),
  (2, "B", 13)
).toDF("ID", "Categ", "Amnt")

df.groupBy("ID", "Categ").sum("Amnt").show
// +---+-----+---------+
// | ID|Categ|sum(Amnt)|
// +---+-----+---------+
// | 1| A| 15|
// | 2| A| 56|
// | 2| B| 13|
// +---+-----+---------+
EDIT:
To alias the sum(Amnt) column (or, for multiple aggregations), wrap the aggregation expression(s) with agg. For example:
import org.apache.spark.sql.functions.{count, sum}

// Rename `sum(Amnt)` as `Sum`
df.groupBy("ID", "Categ").agg(sum("Amnt").as("Sum"))
// Aggregate `sum(Amnt)` and `count(Categ)`
df.groupBy("ID", "Categ").agg(sum("Amnt"), count("Categ"))

Aggregation on a derived column in Spark

DF.groupBy("id")
.agg(
sum((when(upper($"col_name") === "text", 1)
.otherwise(0)))
.alias("df_count")
.when($"df_count"> 1, 1)
.otherwise(0)
)
Can I do aggregation on the column that was given an alias, i.e. if the sum is greater than one then return 1, else 0?
Thanks in advance.
I think you could wrap another when.otherwise around the sum result:
val df = Seq((1, "a"), (1, "a"), (2, "b"), (3, "a")).toDF("id", "col_name")
df.show
+---+--------+
| id|col_name|
+---+--------+
| 1| a|
| 1| a|
| 2| b|
| 3| a|
+---+--------+
df.groupBy("id").agg(
sum(when(upper($"col_name") === "A", 1).otherwise(0)).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 2|
| 3| 1|
| 2| 0|
+---+--------+
df.groupBy("id").agg(
when(sum(when(upper($"col_name")==="A", 1).otherwise(0)) > 1, 1).otherwise(0).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 1|
| 3| 0|
| 2| 0|
+---+--------+
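If you would rather reference the aliased sum by name, as the question asks, an equivalent two-step sketch using the same df: aggregate first, then derive the 0/1 flag with withColumn:

df.groupBy("id")
  .agg(sum(when(upper($"col_name") === "A", 1).otherwise(0)).alias("cnt"))
  .withColumn("df_count", when($"cnt" > 1, 1).otherwise(0))
  .drop("cnt")
  .show()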