Perform Arithmetic Operations on multiple columns in Spark dataframe - scala

I have an input spark-dataframe named df as
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here, Total_Count is the sum of P1, P2 and P3, where P1, P2 and P3 are product names. I need to find the frequency of each product by dividing each product value by its row's Total_Count. The result should be a new spark-dataframe named frequencyTable as follows,
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this using Scala as,
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
  if (index != "Main_CustomerID" && index != "Total_Count") {
    frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
  }
}
But I don't prefer this for loop because my df is of larger size. What is the optimized solution?

If you have dataframe as
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
You can simply use foldLeft over the columns other than Main_CustomerID and Total_Count, i.e. over P1, P2 and P3:
val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList
df_columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))
}.show(false)
which should give you
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
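If you prefer to build the whole projection in a single select instead of chaining withColumn calls (which can help on very wide dataframes), a minimal sketch along the same lines, assuming the same column names, would be:
import org.apache.spark.sql.functions.col

// Divide every product column by Total_Count in one projection,
// passing the id and total columns through unchanged.
val passThrough = Set("Main_CustomerID", "Total_Count")
val projected = df.columns.map { c =>
  if (passThrough(c)) col(c) else (col(c) / col("Total_Count")).as(c)
}
val frequencyTable = df.select(projected: _*)
frequencyTable.show(false)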
I hope the answer is helpful

Related

groupBy and get count of records for multiple columns in scala

As part of a bigger task, I am facing issues when I try to find the count of records in each column grouped by another column. I am not very experienced with dataframe columns.
I am having a spark dataframe as below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |0 |
|050|2021-01-15 |1 |3 |0 |
|050|2021-02-02 |1 |3 |0 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |1 |3 |0 |
|051|2021-02-02 |1 |3 |0 |
|051|2021-02-03 |1 |3 |0 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |1 |3 |0 |
|052|2021-03-06 |1 |3 |0 |
|052|2021-03-16 |1 |3 |0 |
+---+------------+--------+--------+--------+
I am working in Scala to make use of this dataframe and trying to get the result shown below.
+---+--------+--------+--------+
|id |signal01|signal02|signal03|
+---+--------+--------+--------+
|050|3 |3 |3 |
|051|4 |4 |4 |
|052|4 |4 |4 |
+---+--------+--------+--------+
For each id, the count of each signal should be the output.
Also, is there any way to pass a condition to the count, such as counting only signals with value > 0?
I have tried the following, which gives the total count but is not grouped by id, which is not what I expected.
val signalColumns = ((Temp01DF.columns.toBuffer) -= ("id","date"))
val Temp02DF = Temp01DF.select(signalColumns.map(c => count(col(c)).alias(c)): _*).show()
+--------+--------+--------+
|signal01|signal02|signal03|
+--------+--------+--------+
|51 |51 |51 |
+--------+--------+--------+
Is there any way to achieve this in Scala?
You are probably looking for groupBy, agg and count.
You can do something like this:
// define some data
val df = Seq(
("050", 1, 3, 0),
("050", 1, 3, 0),
("050", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0)
).toDF("id", "signal01", "signal02", "signal03")
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count("*").as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 4|
|051| 4| 4| 4|
|050| 3| 3| 3|
+---+--------+--------+--------+
*/
Instead of counting "*", you can have a predicate:
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count(when(col(c) > 0, 1)).as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 0|
|051| 4| 4| 0|
|050| 3| 3| 0|
+---+--------+--------+--------+
*/
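Equivalently, if a sum-based formulation reads more naturally to you, the conditional count can be expressed as a sum of 0/1 flags; a sketch assuming the same df as above:
import org.apache.spark.sql.functions.{col, sum, when}

// 1 for each row where the signal is positive, 0 otherwise, summed per id.
val positiveCounts = Seq("signal01", "signal02", "signal03")
  .map(c => sum(when(col(c) > 0, 1).otherwise(0)).as(c))
df.groupBy("id").agg(positiveCounts.head, positiveCounts.tail: _*).show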
A PySpark Solution
from pyspark.sql import functions as F

df = spark.createDataFrame([(50, 1, 3, 0), (50, 1, 3, 0), (50, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0)], ["col1", "col2", "col3", "col4"])
df.show()
df_grp = df.groupBy("col1").agg(F.count("col2").alias("col2"), F.count("col3").alias("col3"), F.count("col4").alias("col4"))
df_grp.show()
Output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 50| 3| 3| 3|
| 51| 4| 4| 4|
| 52| 4| 4| 4|
+----+----+----+----+
For the first part, I found that the required result can be achieved this way:
val signalCount = df.groupBy("id")
  .agg(count("signal01"), count("signal02"), count("signal03"))
Make sure you have the spark functions imported:
import org.apache.spark.sql.functions._

Find min value for every 5 hour interval

My df
val df = Seq(
("1", 1),
("1", 1),
("1", 2),
("1", 4),
("1", 5),
("1", 6),
("1", 8),
("1", 12),
("1", 12),
("1", 13),
("1", 14),
("1", 15),
("1", 16)
).toDF("id", "time")
In this case the first interval starts at hour 1, so every row with time up to 6 (1 + 5) is part of it.
But 8 - 1 > 5, so the second interval starts at 8 and goes up to 13.
Then 14 - 8 > 5, so the third interval starts, and so on.
The desired result
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |8 |
|1 |12 |8 |
|1 |13 |8 |
|1 |14 |14 |
|1 |15 |14 |
|1 |16 |14 |
+---+----+--------+
I'm trying to do it using the min function, but I don't know how to account for this condition.
val window = Window.partitionBy($"id").orderBy($"time")
df
.select($"id", $"time")
.withColumn("min_time", when(($"time" - min($"time").over(window)) <= 5, min($"time").over(window)).otherwise($"time"))
.show(false)
what I get
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |12 |
|1 |12 |12 |
|1 |13 |13 |
|1 |14 |14 |
|1 |15 |15 |
|1 |16 |16 |
+---+----+--------+
You can go with your first idea of using an aggregation function over a window. But instead of combining Spark's built-in functions, you can define your own user-defined aggregate function (UDAF).
Analysis
As you correctly supposed, we should use a kind of min function on a window. On the rows of this window, we want to implement the following rule:
Given rows sorted by time, if the difference between the current row's time and the previous row's min_time is greater than 5, then the current row's min_time should be the current row's time; otherwise it should be the previous row's min_time.
However, with the aggregate functions provided by Spark, we can't access the previous row's min_time. There is a lag function, but it only gives access to values that already exist in previous rows. As the previous row's min_time is not already present, we can't access it.
Thus we have to define our own aggregate function.
Solution
Defining a tailor-made aggregate function
To define our aggregate function, we need to create an object (or class) that extends the Aggregator abstract class. Below is the complete implementation:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}
object MinByInterval extends Aggregator[Integer, Integer, Integer] {
  def zero: Integer = null
  def reduce(buffer: Integer, time: Integer): Integer = {
    if (buffer == null || time - buffer > 5) time else buffer
  }
  def merge(b1: Integer, b2: Integer): Integer = {
    throw new NotImplementedError("should not use as general aggregation")
  }
  def finish(reduction: Integer): Integer = reduction
  def bufferEncoder: Encoder[Integer] = Encoders.INT
  def outputEncoder: Encoder[Integer] = Encoders.INT
}
We use Integer for the input, buffer and output types. We chose Integer because it is a nullable Int. We could have used Option[Int], but the Spark documentation advises against recreating objects in aggregator methods for performance reasons, which is what would happen if we used a complex type like Option.
We implement the rule defined in Analysis section in reduce method:
def reduce(buffer: Integer, time: Integer): Integer = {
  if (buffer == null || time - buffer > 5) time else buffer
}
Here time is the value in the time column of the current row, and buffer is the previously computed value, i.e. the min_time of the previous row. As the rows in our window are sorted by time, time is always greater than or equal to buffer. The null buffer case only happens when processing the first row.
The merge method is not used when the aggregate function is applied over a window, so we don't implement it.
The finish method is the identity, as we don't need any final calculation on the aggregated value, and both the buffer and output encoders are Encoders.INT.
Calling user defined aggregate function
Now we can call our user defined aggregate function with the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val minTime = udaf(MinByInterval)
val window = Window.partitionBy("id").orderBy("time")
df.withColumn("min_time", minTime(col("time")).over(window))
Run
Given the input dataframe in the question, we get:
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |8 |
|1 |12 |8 |
|1 |13 |8 |
|1 |14 |14 |
|1 |15 |14 |
|1 |16 |14 |
+---+----+--------+
Input data
val df = Seq(
("1", 1),
("1", 1),
("1", 2),
("1", 4),
("1", 5),
("1", 6),
("1", 8),
("1", 12),
("1", 12),
("1", 13),
("1", 14),
("1", 15),
("1", 16),
("2", 4),
("2", 8),
("2", 10),
("2", 11),
("2", 11),
("2", 12),
("2", 13),
("2", 20)
).toDF("id", "time")
The data must be sorted, otherwise the result will be incorrect.
val window = Window.partitionBy($"id").orderBy($"time")
df
.withColumn("min", row_number().over(window))
.as[Row]
.map(_.getMin)
.show(40)
Then I create a case class; the var min in its companion object holds the current minimum value and is only updated when the conditions are met.
case class Row(id: String, time: Int, min: Int) {
  def getMin: Row = {
    if (time - Row.min > 5 || Row.min == -99 || min == 1) {
      Row.min = time
    }
    Row(id, time, Row.min)
  }
}

object Row {
  var min: Int = -99
}
Result
+---+----+---+
| id|time|min|
+---+----+---+
| 1| 1| 1|
| 1| 1| 1|
| 1| 2| 1|
| 1| 4| 1|
| 1| 5| 1|
| 1| 6| 1|
| 1| 8| 8|
| 1| 12| 8|
| 1| 12| 8|
| 1| 13| 8|
| 1| 14| 14|
| 1| 15| 14|
| 1| 16| 14|
| 2| 4| 4|
| 2| 8| 4|
| 2| 10| 10|
| 2| 11| 10|
| 2| 11| 10|
| 2| 12| 10|
| 2| 13| 10|
| 2| 20| 20|
+---+----+---+

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in Spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
(1, Seq(0, 2, 5), Seq(1, 2, 9)),
(2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")
df.
select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
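If you are on a version earlier than 2.4 where flatten is not available, a union of two separate explodes is a reasonable fallback that also preserves duplicates (a minimal sketch using the same df; row ordering may differ):
import org.apache.spark.sql.functions.explode

// Explode each array column on its own, then stack the two results.
val exploded = df.select($"userId", explode($"varA").as("bothVars"))
  .union(df.select($"userId", explode($"varB").as("bothVars")))
exploded.show()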
Use array_union and then the explode function.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+. Note that array_union drops duplicate values (which is why 2 appears only once for userId 1 above), so prefer the flatten(array(...)) approach if duplicates must be preserved.

transform a feature of a spark groupedBy DataFrame

I'm searching for a Scala analogue of Python's .transform().
Namely, I need to create a new feature: the group mean of the corresponding class.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the Python implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
df.groupBy("class").agg(mean("val").as("class_mean"))
If you want to keep all the columns, you can use a window function:
val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
.show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+
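For completeness, the window approach needs the Window and mean imports; a minimal self-contained sketch, assuming the same df as above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean

// The group mean is broadcast back onto every row of its group, like pandas' transform.
val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w)).show(false)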

How to get a Tuple for the grouped by result on a Spark Dataframe?

I am trying to group entities based on the id. Running the following code, I have this dataframe:
val pet_type_count = pet_list.groupBy("id","pets_type").count()
pet_type_count.sort("id").limit(20).show
+----------+---------------------+-----+
| id| pets_type|count|
+----------+---------------------+-----+
| 0| 0| 2|
| 1| 0| 3|
| 1| 3| 3|
| 10| 0| 4|
| 10| 1| 1|
| 13| 0| 3|
| 16| 1| 3|
| 17| 1| 1|
| 18| 1| 2|
| 18| 0| 1|
| 19| 1| 7|
+----------+---------------------+-----+
I want to group the results of the groupBy on id so that it returns a list of tuples per id, so that I can apply the following udf per id:
val agg_udf = udf { (v1: List[Tuple2[String, String]]) =>
  var feature_vector = Array.fill(5)(0)
  for (row <- v1) {
    val index = (5 - row._1.toInt)
    vector(index) = row._2.toInt
  }
  vector
}
val pet_vector_included = pet_type_count.groupBy("id").agg(agg_udf(col("pets_type_count")).alias("pet_count_vector"))
For which I need to get the following:
+----------+---------------------+-----+
| id| pets_type_count|
+----------+---------------------+-----+
| 0| (0,2)|
| 1| (0,3)|
| | (3,3)|
| 10| (0,4)|
| | (1,1)|
| 13| (0,3)|
| 16| (1,3)|
| 17| (1,1)|
| 18| (1,2)|
| | (0,1)|
| 19| (1,7)|
+----------+---------------------+-----+
I am unable to figure out the how to get tuples after the groupby on id. Any help would be appreciated!
You can simply use the struct built-in function to combine the pets_type and count columns into one column, and the collect_list built-in function to collect the newly formed column when grouping by id. Then you can orderBy to sort the dataframe by the id column.
import org.apache.spark.sql.functions._
val pet_type_count = df.withColumn("struct", struct("pets_type", "count"))
  .groupBy("id").agg(collect_list(col("struct")).as("pets_type_count"))
  .orderBy("id")
this should give you your desired result as
+---+---------------+
|id |pets_type_count|
+---+---------------+
|0 |[[0,2]] |
|1 |[[0,3], [3,3]] |
|10 |[[0,4], [1,1]] |
|13 |[[0,3]] |
|16 |[[1,3]] |
|17 |[[1,1]] |
|18 |[[1,2], [0,1]] |
|19 |[[1,7]] |
+---+---------------+
Then you can apply the udf function you defined (which needs some modifications too), as below:
val agg_udf = udf { (v1: Seq[Row]) =>
  var feature_vector = Array.fill(5)(0)
  for (row <- v1) {
    val index = (4 - row.getAs[Int](0))
    feature_vector(index) = row.getAs[Int](1)
  }
  feature_vector
}
val pet_vector_included = pet_type_count.withColumn("pet_count_vector", agg_udf(col("pets_type_count")))
pet_vector_included.show(false)
which should give you
+---+---------------+----------------+
|id |pets_type_count|pet_count_vector|
+---+---------------+----------------+
|0 |[[0,2]] |[0, 0, 0, 0, 2] |
|1 |[[0,3], [3,3]] |[0, 3, 0, 0, 3] |
|10 |[[0,4], [1,1]] |[0, 0, 0, 1, 4] |
|13 |[[0,3]] |[0, 0, 0, 0, 3] |
|16 |[[1,3]] |[0, 0, 0, 3, 0] |
|17 |[[1,1]] |[0, 0, 0, 1, 0] |
|18 |[[1,2], [0,1]] |[0, 0, 0, 2, 1] |
|19 |[[1,7]] |[0, 0, 0, 7, 0] |
+---+---------------+----------------+
I hope the answer is helpful