Calculate residual amount in dataframe column - scala

I have a "capacity" dataframe:
scala> sql("create table capacity (id String, capacity Int)");
scala> sql("insert into capacity values ('A', 50), ('B', 100)");
scala> sql("select * from capacity").show(false)
+---+--------+
|id |capacity|
+---+--------+
|A  |50      |
|B  |100     |
+---+--------+
I have another "used" dataframe with the following information:
scala> sql ("create table used (id String, capacityId String, used Int)");
scala> sql ("insert into used values ('item1', 'A', 10), ('item2', 'A', 20), ('item3', 'A', 10), ('item4', 'B', 30), ('item5', 'B', 40), ('item6', 'B', 40)")
scala> sql("select * from used order by capacityId").show(false)
+-----+----------+----+
|id   |capacityId|used|
+-----+----------+----+
|item1|A         |10  |
|item3|A         |10  |
|item2|A         |20  |
|item6|B         |40  |
|item4|B         |30  |
|item5|B         |40  |
+-----+----------+----+
Column "capacityId" of the "used" dataframe is foreign key to column "id" of the "capacity" dataframe.
I want to calculate the "capacityLeft" column which is residual amount at that point of time.
+-----+----------+----+--------------+
|id   |capacityId|used|capacityLeft  |
+-----+----------+----+--------------+
|item1|A         |10  |40            | <- 50 (capacity of 'A') - 10
|item3|A         |10  |30            | <- 40 - 10
|item2|A         |20  |10            | <- 30 - 20
|item6|B         |40  |60            | <- 100 (capacity of 'B') - 40
|item4|B         |30  |30            | <- 60 - 30
|item5|B         |40  |-10           | <- 30 - 40
+-----+----------+----+--------------+
In the real scenario, a "createdDate" column is used to order the rows of the "used" dataframe.
Spark version: 2.2

This can be solved by using window functions in Spark. Note that for this to work there needs to be a column that keeps track of the row order within each capacityId.
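As an aside, if no such ordering column existed, one option (an assumption on my part, not something this question needs, since it has createdDate) would be to tag rows with monotonically_increasing_id before the window step. The generated ids are only guaranteed to be increasing in the order the data is read, not to reflect a true event order, so a real timestamp is always preferable when available.
import org.apache.spark.sql.functions.monotonically_increasing_id
// Hypothetical: derive a row-order column when no createdDate-like column exists.
// The ids are increasing but not consecutive, and depend on how the data was read.
val usedWithOrder = used.withColumn("rowOrder", monotonically_increasing_id())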
Start by joining the two dataframes together:
val df = used.join(capacity.withColumnRenamed("id", "capacityId"), Seq("capacityId"), "inner")
Here the id column of the capacity dataframe is renamed to capacityId so that it matches the join key in the used dataframe and does not clash with the used dataframe's own id column.
Now create a window and calculate the cumulative sum of the used column. Subtract the cumulative sum from the capacity to get the remaining amount:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val w = Window.partitionBy("capacityId").orderBy("createdDate")
val df2 = df.withColumn("capacityLeft", $"capacity" - sum($"used").over(w))
Resulting dataframe, with an example createdDate column:
+----------+-----+----+-----------+--------+------------+
|capacityId| id|used|createdDate|capacity|capacityLeft|
+----------+-----+----+-----------+--------+------------+
| B|item6| 40| 1| 100| 60|
| B|item4| 30| 2| 100| 30|
| B|item5| 40| 3| 100| -10|
| A|item1| 10| 1| 50| 40|
| A|item3| 10| 2| 50| 30|
| A|item2| 20| 3| 50| 10|
+----------+-----+----+-----------+--------+------------+
Any unwanted columns can now be removed with drop.
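For example, to keep only the columns from the expected output above (a sketch; the column names are taken from the example):
// Drop the helper columns once capacityLeft has been computed.
val result = df2.drop("capacity", "createdDate")
result.show(false)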

Related

how to group by week number and show 3 weeks data in a tuple?

Hi guys, I want to change this PySpark dataframe:
|player_id|stat_week |moves|week_num|
|1        |2022-06-13|1    |24      |
|1        |2022-06-06|20   |23      |
|1        |2022-06-20|0    |25      |
|2        |2022-06-06|20   |23      |
|2        |2022-06-13|0    |24      |
|2        |2022-06-20|0    |25      |
|1        |2022-05-30|10   |22      |
|1        |2022-05-23|20   |21      |
|1        |2022-05-16|20   |20      |
into
|player_id|moves     |week_num  |group by|
|1        |(20,20,10)|(20,21,22)|1       |
|1        |(20,1,0)  |(23,24,25)|2       |
|2        |(20,0,0)  |(23,24,25)|1       |
How can I group every 3 weeks of data and aggregate them as a tuple?
any help appreciated~~
First of all, there is no Python tuple data structure (immutable, holding multiple data types) in PySpark; you would need to write your own data type for that. If you just want to collect the elements into an array-like data type, you can use ArrayType in PySpark.
You can create a week_num reference dataframe and then join back to your dataframe to do the grouping:
week_lst = [int(row['week_num']) for row in df.select('week_num').distinct().orderBy('week_num').collect()]
group_lst = [j//3 for j in range(len(week_lst))]
week_reference = spark.createDataFrame(
    data=[(week, group) for week, group in zip(week_lst, group_lst)],
    schema=['week_num', 'group']
)
df = df.join(week_reference, on='week_num', how='left')
Then you can do your grouping by:
from pyspark.sql.functions import collect_list

df.groupby('player_id', 'group')\
  .agg(collect_list('moves').alias('moves'),
       collect_list('week_num').alias('week_num'))
If you want to follow the order in the array, you can check this post: collect_list by preserving order based on another variable

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has values like:
+------+-----------------------------------------+
|UserId|col                                      |
+------+-----------------------------------------+
|1     |firstname=abc                            |
|2     |lastname=xyz                             |
|3     |firstname=pqr;lastname=zzz               |
|4     |firstname=aaa;middlename=xxx;lastname=bbb|
+------+-----------------------------------------+
and what I want is something like this:
+------+---------+--------+----------+
|UserId|firstname|lastname|middlename|
+------+---------+--------+----------+
|1     |abc      |null    |null      |
|2     |null     |xyz     |null      |
|3     |pqr      |zzz     |null      |
|4     |aaa      |bbb     |xxx       |
+------+---------+--------+----------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
  (0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the maximum number of key=value pairs in col,
but as you may have guessed I get something like this as the output:
+------+----+----+----+
|UserId|col0|col1|col2|
+------+----+----+----+
|1     |abc |null|null|
|2     |xyz |null|null|
|3     |pqr |zzz |null|
|4     |aaa |xxx |bbb |
+------+----+----+----+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India, and so on for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve this result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
  (1, "firstname=joe;age=33"),
  (2, "lastname=smith;country=usa"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")

df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key").agg(first("value").as("value")).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
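A small optional refinement, not part of the original answer: if the set of keys is known up front (here assumed to be the four keys in the example data above), it can be passed to pivot so Spark skips the extra pass it otherwise runs to collect the distinct key values:
df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  // Supplying the key list avoids a separate job to discover pivot values.
  groupBy("user_id").pivot("key", Seq("age", "country", "firstname", "lastname")).
  agg(first("value").as("value")).
  orderBy("user_id").
  show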
Since your data is split by ; and your key/value pairs are split by =, you may consider using str_to_map as follows:
First, create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
Then run the following on your Spark session:
val result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
  SELECT
    UserId,
    str_to_map(col, ';', '=') AS full_name
  FROM
    my_table
)
SELECT
  UserId,
  full_name['firstname']  AS firstname,
  full_name['lastname']   AS lastname,
  full_name['middlename'] AS middlename
FROM
  split_data
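An equivalent sketch without the temporary view, using the DataFrame API directly (the hard-coded key names are an assumption taken from the question's example, just like in the SQL above):
// Same str_to_map lookup expressed through selectExpr.
val result = df.selectExpr(
  "UserId",
  "str_to_map(col, ';', '=')['firstname'] as firstname",
  "str_to_map(col, ';', '=')['lastname'] as lastname",
  "str_to_map(col, ';', '=')['middlename'] as middlename"
)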
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
Existence of duplicate keys in column key_values
Only duplicate key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
  (1, "firstname=joe;age=33;moviegenre=comedy"),
  (2, "lastname=smith;country=usa;moviegenre=drama"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")

val mainCols = df.columns diff Seq("key_values")

val dfNew = df.
  withColumn("kv_arr", split($"key_values", ";")).
  withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
  groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))

val dupeKeys = Seq("moviegenre")  // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)

dfNew.select(
    mainCols.map(col) ++
    dupeKeys.map(col) ++
    nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
  ).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that the higher-order function transform is used to handle the key/value split, because SQL function str_to_map (used in the original solution) can't handle duplicate keys.

Finding Percentile in Spark-Scala per a group

I am trying to do a percentile over a column using a Window function as below. I have referred here to use the ApproxQuantile definition over a group.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = Seq(
  (1, 10.0), (1, 20.0), (1, 40.6), (1, 15.6), (1, 17.6), (1, 25.6),
  (1, 39.6), (2, 20.5), (2, 70.3), (2, 69.4), (2, 74.4), (2, 45.4),
  (3, 60.6), (3, 80.6), (4, 30.6), (4, 90.6)
).toDF("ID", "Count")

val idBucketMapping = Seq((1, 4), (2, 3), (3, 2), (4, 2))
  .toDF("ID", "Bucket")

object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column,
                        accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }
  def percentile_approx(col: Column, percentage: Column): Column =
    percentile_approx(col, percentage,
      lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
}
import PercentileApprox._

def doBucketing(bucket_size: Int) = (1 until bucket_size)
  .scanLeft(0d)((a, _) => a + (1 / bucket_size.toDouble))

var res = df1
  .withColumn("percentile",
    percentile_approx(col("count"), typedLit(doBucketing(2)))
      .over(Window.partitionBy("ID"))
  )
scala> df1.show
+---+-----+
| ID|Count|
+---+-----+
| 1| 10.0|
| 1| 20.0|
| 1| 40.6|
| 1| 15.6|
| 1| 17.6|
| 1| 25.6|
| 1| 39.6|
| 2| 20.5|
| 2| 70.3|
| 2| 69.4|
| 2| 74.4|
| 2| 45.4|
| 3| 60.6|
| 3| 80.6|
| 4| 30.6|
| 4| 90.6|
+---+-----+
scala> idBucketMapping.show
+---+------+
| ID|Bucket|
+---+------+
| 1| 4|
| 2| 3|
| 3| 2|
| 4| 2|
+---+------+
scala> res.show
+---+-----+------------------+
| ID|Count| percentile|
+---+-----+------------------+
| 1| 10.0|[10.0, 20.0, 40.6]|
| 1| 20.0|[10.0, 20.0, 40.6]|
| 1| 40.6|[10.0, 20.0, 40.6]|
| 1| 15.6|[10.0, 20.0, 40.6]|
| 1| 17.6|[10.0, 20.0, 40.6]|
| 1| 25.6|[10.0, 20.0, 40.6]|
| 1| 39.6|[10.0, 20.0, 40.6]|
| 3| 60.6|[60.6, 60.6, 80.6]|
| 3| 80.6|[60.6, 60.6, 80.6]|
| 4| 30.6|[30.6, 30.6, 90.6]|
| 4| 90.6|[30.6, 30.6, 90.6]|
| 2| 20.5|[20.5, 69.4, 74.4]|
| 2| 70.3|[20.5, 69.4, 74.4]|
| 2| 69.4|[20.5, 69.4, 74.4]|
| 2| 74.4|[20.5, 69.4, 74.4]|
| 2| 45.4|[20.5, 69.4, 74.4]|
+---+-----+------------------+
Up to here it is all well and good and the logic is simple. But I need the results in a dynamic fashion: the bucket size passed to doBucketing should be taken from idBucketMapping based on the ID value.
This seems a little bit tricky to me. Is this possible by any means?
Expected Output --
The percentile bucket count is based on the idBucketMapping dataframe:
+---+-----+------------------------+
|ID |Count|percentile              |
+---+-----+------------------------+
|1  |10.0 |[10.0, 15.6, 20.0, 39.6]|
|1  |20.0 |[10.0, 15.6, 20.0, 39.6]|
|1  |40.6 |[10.0, 15.6, 20.0, 39.6]|
|1  |15.6 |[10.0, 15.6, 20.0, 39.6]|
|1  |17.6 |[10.0, 15.6, 20.0, 39.6]|
|1  |25.6 |[10.0, 15.6, 20.0, 39.6]|
|1  |39.6 |[10.0, 15.6, 20.0, 39.6]|
|3  |60.6 |[60.6, 60.6]            |
|3  |80.6 |[60.6, 60.6]            |
|4  |30.6 |[30.6, 30.6]            |
|4  |90.6 |[30.6, 30.6]            |
|2  |20.5 |[20.5, 45.4, 70.3]      |
|2  |70.3 |[20.5, 45.4, 70.3]      |
|2  |69.4 |[20.5, 45.4, 70.3]      |
|2  |74.4 |[20.5, 45.4, 70.3]      |
|2  |45.4 |[20.5, 45.4, 70.3]      |
+---+-----+------------------------+
I have a solution for you that is extremely inelegant and works only if you have a limited number of possible bucket sizes.
My first version is very ugly.
// For the sake of clarity, let's define a function that generates the
// window aggregation.
def per(x: Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
  .over(Window.partitionBy("ID"))

// Then, we simply try to match the Bucket column with a possible value.
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile",
    when('Bucket === 2, per(2))
      .otherwise(when('Bucket === 3, per(3))
      .otherwise(per(4)))
  )
That's nasty, but it works in your case.
Slightly less ugly, with the very same logic: you can define a set of possible numbers of buckets and use it to do the same thing as above.
val possible_number_of_buckets = 2 to 5

val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile",
    possible_number_of_buckets
      .tail
      .foldLeft(per(possible_number_of_buckets.head)) { (column, size) =>
        when('Bucket === size, per(size)).otherwise(column)
      }
  )
percentile_approx takes percentage and accuracy. It seems they both must be constant literals, so we can't compute percentile_approx at runtime with a dynamically calculated percentage or accuracy.
Reference: the percentile_approx source in the Apache Spark Git repository.

Fill null values in dataframe column with next value

I have to fill the leading null values of a column with the first non-null value that follows them in the same column of a dataframe. This logic applies only to the first run of consecutive null values of the column.
I have a dataframe similar to the one below:
// I replaced null with 0 in the value column
val df = Seq((0, "exA", 30), (0, "exB", 22), (0, "exC", 19), (16, "exD", 13),
             (5, "exE", 28), (6, "exF", 26), (0, "exG", 12), (13, "exH", 53))
  .toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0    |exA |30  |
|0    |exB |22  |
|0    |exC |19  |
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  |
|13   |exH |53  |
+-----+----+----+
From this dataframe I am expecting the result below:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16   |exA |30  | // Change the value 0 to 16 in the value column
|16   |exB |22  | // Change the value 0 to 16 in the value column
|16   |exC |19  | // Change the value 0 to 16 in the value column
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  | // The value should not be changed here
|13   |exH |53  |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last, when}

val df = Seq((0, "exA", 30), (0, "exB", 22), (0, "exC", 19), (16, "exD", 13),
             (5, "exE", 28), (6, "exF", 26), (0, "exG", 12), (13, "exH", 53))
  .toDF("value", "col2", "col3")

val w = Window.orderBy($"col2".desc)

df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)
This will result in:
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The expression df.orderBy($"col2") is needed only to show the final results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you should use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)

df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
  .orderBy($"col2")
  .show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
I think you need to take the first non-null or non-zero value based on col2's order. Please find the script below. I have registered a temporary table in Spark's memory to write the SQL.
val df = Seq((0, "exA", 30), (0, "exB", 22), (0, "exC", 19), (16, "exD", 13),
             (5, "exE", 28), (6, "exF", 26), (0, "exG", 12), (13, "exH", 53))
  .toDF("value", "col2", "col3")
df.registerTempTable("table_df")
spark.sql("""
  with cte as (select *, row_number() over (order by col2) rno from table_df)
  select case when value = 0 and rno < (select min(rno) from cte where value != 0)
              then (select value from cte where rno = (select min(rno) from cte where value != 0))
              else value
         end value, col2, col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._

val df_1 = Seq((0, "exA", 30),
               (0, "exB", 22),
               (0, "exC", 19),
               (16, "exD", 13),
               (5, "exE", 28),
               (6, "exF", 26),
               (0, "exG", 12),
               (13, "exH", 53))
  .toDF("value", "col2", "col3")
  .withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable "limit" to bound the first N rows that we need, and a variable "nVal" holding the first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF with a column of the same name ("value"), replaced conditionally:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
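As a quick sanity check (assuming the small example lands in a single partition, so monotonically_increasing_id assigns 0..7 in input order), df_4 should now match the expected output from the question:
// Expected value column per the question: 16, 16, 16, 16, 5, 6, 0, 13
df_4.select("value", "col2", "col3").show(false)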

How to perform division operation in dataFrame Spark using Scala?

I have a dataframe something like the one below.
+---+---+-----+
|uId| Id| sum |
+---+---+-----+
|  3|  1|  1.0|
|  7|  1|  1.0|
|  1|  2|  3.0|
|  1|  1|  1.0|
|  6|  5|  1.0|
+---+---+-----+
Using the above dataframe, I want to generate the new dataframe mentioned below.
The sum column should be computed as follows.
For example:
For uid=3 and id=1, my sum column value should be (old sum value * 1 / count of Id 1), i.e.
1.0 * 1 / 3 = 0.333
For uid=7 and id=1, my sum column value should be (old sum value * 1 / count of Id 1), i.e.
1.0 * 1 / 3 = 0.333
For uid=1 and id=2, my sum column value should be (old sum value * 1 / count of Id 2), i.e.
3.0 * 1 / 1 = 3.0
For uid=6 and id=5, my sum column value should be (old sum value * 1 / count of Id 5), i.e.
1.0 * 1 / 1 = 1.0
My final output should be:
+---+---+---------+
|uId| Id|      sum|
+---+---+---------+
|  3|  1|  0.33333|
|  7|  1|  0.33333|
|  1|  2|      3.0|
|  1|  1|   0.3333|
|  6|  5|      1.0|
+---+---+---------+
You can use a Window function to get the count of each group of the Id column, and then use that count to divide the original sum:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("id")
import org.apache.spark.sql.functions._
df.withColumn("sum", $"sum"/count("id").over(windowSpec))
You should get the final dataframe as:
+---+---+------------------+
|uId|Id |sum               |
+---+---+------------------+
|3  |1  |0.3333333333333333|
|7  |1  |0.3333333333333333|
|1  |1  |0.3333333333333333|
|6  |5  |1.0               |
|1  |2  |3.0               |
+---+---+------------------+
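For what it's worth, a hedged alternative sketch (not part of the answer above): the same division can be done with a plain aggregation joined back onto the dataframe, which avoids a window entirely:
import org.apache.spark.sql.functions.count
// Count rows per Id once, join the counts back, then divide.
val counts = df.groupBy("Id").agg(count("Id").as("idCount"))
val result = df.join(counts, Seq("Id"))
  .withColumn("sum", $"sum" / $"idCount")
  .drop("idCount")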