My Spark dataframe is:
Client Date Due_Day
A 2017-01-01 Null
A 2017-02-01 Null
A 2017-03-01 Null
A 2017-04-01 Null
A 2017-05-01 Null
A 2017-06-01 35
A 2017-07-01 Null
A 2017-08-01 Null
A 2017-09-01 Null
A 2017-10-01 Null
A 2017-11-01 Null
A 2017-12-01 Null
B 2017-01-01 Null
B 2017-02-01 Null
B 2017-03-01 Null
B 2017-04-01 Null
B 2017-05-01 Null
B 2017-06-01 Null
B 2017-07-01 Null
B 2017-08-01 Null
B 2017-09-01 Null
B 2017-10-01 78
B 2017-11-01 Null
B 2017-12-01 Null
There is exactly one non-null Due_Day per Client in the dataframe.
Desired output is:
Client Date Due_Day Result
A 2017-01-01 Null -115
A 2017-02-01 Null -85
A 2017-03-01 Null -55 -> -25 - 30 = -55
A 2017-04-01 Null -25 -> 5 - 30 = -25
A 2017-05-01 Null 5 -> 35 - 30 = 5
A 2017-06-01 35 35
A 2017-07-01 Null Null -> Still Same value (null)
A 2017-08-01 Null Null -> Still Same value (null)
A 2017-09-01 Null Null
A 2017-10-01 Null Null
A 2017-11-01 Null Null
A 2017-12-01 Null Null
B 2017-01-01 Null -192
B 2017-02-01 Null -162
B 2017-03-01 Null -132
B 2017-04-01 Null -102
B 2017-05-01 Null -72
B 2017-06-01 Null -42
B 2017-07-01 Null -12
B 2017-08-01 Null 18 -> 48 - 30 = 18
B 2017-09-01 Null 48 -> 78 - 30 = 48
B 2017-10-01 78 78
B 2017-11-01 Null Null -> Still Same value (null)
B 2017-12-01 Null Null -> Still Same value (null)
For each client, the Result value should decrease by 30 for every month before the month of the non-null Due_Day, going back to the beginning of the year.
Could you please help me with the PySpark code?
This can be solved by finding, for each row, the non-null Due_Day that occurs at or after it for the same client, together with the row number it sits on. The result is then that Due_Day minus 30 for every row between the current row and the row holding the non-null value. For example, for client A on 2017-01-01 the non-null Due_Day (35) sits 5 rows ahead, so the result is 35 - 5 * 30 = -115.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col as c
data = [("A", "2017-01-01", None,),
("A", "2017-02-01", None,),
("A", "2017-03-01", None,),
("A", "2017-04-01", None,),
("A", "2017-05-01", None,),
("A", "2017-06-01", 35,),
("A", "2017-07-01", None,),
("A", "2017-08-01", None,),
("A", "2017-09-01", None,),
("A", "2017-10-01", None,),
("A", "2017-11-01", None,),
("A", "2017-12-01", None,),
("B", "2017-01-01", None,),
("B", "2017-02-01", None,),
("B", "2017-03-01", None,),
("B", "2017-04-01", None,),
("B", "2017-05-01", None,),
("B", "2017-06-01", None,),
("B", "2017-07-01", None,),
("B", "2017-08-01", None,),
("B", "2017-09-01", None,),
("B", "2017-10-01", 78,),
("B", "2017-11-01", None,),
("B", "2017-12-01", None,), ]
df = spark.createDataFrame(data, ("Client", "Date", "Due_Day",)).withColumn("Date", F.to_date(F.col("Date"), "yyyy-MM-dd"))
window_spec = W.partitionBy("Client").orderBy("Date")
(df
    .withColumn("rn", F.row_number().over(window_spec))  # position of each row within its client
    .withColumn("nonNullRn", F.when(c("Due_Day").isNull(), F.lit(None)).otherwise(c("rn")))  # row number only where Due_Day is present
    # carry the non-null Due_Day (at most one per client) and its row number back to the earlier rows
    .withColumn("nonNullDue_Day", F.last("Due_Day", ignorenulls=True).over(window_spec.rowsBetween(W.currentRow, W.unboundedFollowing)))
    .withColumn("nonNullRn", F.last("nonNullRn", ignorenulls=True).over(window_spec.rowsBetween(W.currentRow, W.unboundedFollowing)))
    # subtract 30 for every row between the current row and the row with the non-null Due_Day
    .withColumn("Result", c("nonNullDue_Day") - (F.lit(30) * (c("nonNullRn") - c("rn"))))
    .select("Client", "Date", "Due_Day", "Result")
    .show(200))
Output
+------+----------+-------+------+
|Client| Date|Due_Day|Result|
+------+----------+-------+------+
| A|2017-01-01| null| -115|
| A|2017-02-01| null| -85|
| A|2017-03-01| null| -55|
| A|2017-04-01| null| -25|
| A|2017-05-01| null| 5|
| A|2017-06-01| 35| 35|
| A|2017-07-01| null| null|
| A|2017-08-01| null| null|
| A|2017-09-01| null| null|
| A|2017-10-01| null| null|
| A|2017-11-01| null| null|
| A|2017-12-01| null| null|
| B|2017-01-01| null| -192|
| B|2017-02-01| null| -162|
| B|2017-03-01| null| -132|
| B|2017-04-01| null| -102|
| B|2017-05-01| null| -72|
| B|2017-06-01| null| -42|
| B|2017-07-01| null| -12|
| B|2017-08-01| null| 18|
| B|2017-09-01| null| 48|
| B|2017-10-01| 78| 78|
| B|2017-11-01| null| null|
| B|2017-12-01| null| null|
+------+----------+-------+------+
I have the following table loaded as a dataframe:
Id Name customCount Custom1 Custom1value custom2 custom2Value custom3 custom3Value
1 qwerty 2 Height 171 Age 76 Null Null
2 asdfg 2 Weight 78 Height 166 Null Null
3 zxcvb 3 Age 28 SkinColor white Height 67
4 tyuio 1 Height 177 Null Null Null Null
5 asdfgh 2 SkinColor brown Age 34 Null Null
I need to change this table into the below format:
Id Name customCount Height Weight Age SkinColor
1 qwerty 2 171 Null 76 Null
2 asdfg 2 166 78 Null Null
3 zxcvb 3 67 Null 28 white
4 tyuio 1 177 Null Null Null
5 asdfgh 2 Null Null 34 brown
I tried this for two custom-field columns:
val rawDf= spark.read.option("Header",false).options(Map("sep"->"|")).csv("/sample/data.csv")
rawDf.createOrReplaceTempView("Table")
val dataframe=spark.sql("select distinct * from (select `_c3` from Table union select `_c5` from Table)")
val dfWithDistinctColumns=dataframe.toDF("colNames")
val list=dfWithDistinctColumns.select("colNames").map(x=>x.getString(0)).collect().toList
val rawDfWithSchema = rawDf.toDF("Id", "Name", "customCount", "h1", "v1", "h2", "v2")
val expectedDf = list.foldLeft(rawDfWithSchema)((df1, c) =>
  df1.withColumn(c, when(col("h1") === c, col("v1")).when(col("h2") === c, col("v2")).otherwise(null))
).drop("h1", "h2", "v1", "v2")
But I am not able to do the union across multiple columns when I try it with 3 custom fields.
Can you please give me an idea/solution for this?
You can do a pivot, but you also need to clean up the format of the dataframe first:
val df2 = df.select(
$"Id", $"Name", $"customCount",
explode(array(
array($"Custom1", $"Custom1value"),
array($"custom2", $"custom2Value"),
array($"custom3", $"custom3Value")
)).alias("custom")
).select(
$"Id", $"Name", $"customCount",
$"custom"(0).alias("key"),
$"custom"(1).alias("value")
).groupBy(
"Id", "Name", "customCount"
).pivot("key").agg(first("value")).drop("null").orderBy("Id")
df2.show
+---+------+-----------+----+------+---------+------+
| Id| Name|customCount| Age|Height|SkinColor|Weight|
+---+------+-----------+----+------+---------+------+
| 1|qwerty| 2| 76| 171| null| null|
| 2| asdfg| 2|null| 166| null| 78|
| 3| zxcvb| 3| 28| 67| white| null|
| 4| tyuio| 1|null| 177| null| null|
| 5|asdfgh| 2| 34| null| brown| null|
+---+------+-----------+----+------+---------+------+
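As an optional refinement: since the custom keys are known up front in your sample (Height, Weight, Age, SkinColor), you can pass them to pivot explicitly. Spark then skips the extra pass it makes to collect the distinct keys, the column order is fixed, and no "null" column appears, so the drop is not needed. A sketch of the same reshaping with that change (the key list below is just the one visible in the sample data):
// same reshaping as df2, but with explicit pivot values (taken from the sample data)
val df3 = df.select(
  $"Id", $"Name", $"customCount",
  explode(array(
    array($"Custom1", $"Custom1value"),
    array($"custom2", $"custom2Value"),
    array($"custom3", $"custom3Value")
  )).alias("custom")
).select(
  $"Id", $"Name", $"customCount",
  $"custom"(0).alias("key"),
  $"custom"(1).alias("value")
).groupBy(
  "Id", "Name", "customCount"
).pivot("key", Seq("Height", "Weight", "Age", "SkinColor")).agg(first("value")).orderBy("Id")

df3.show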
Find the top N games watched by every id, based on total time. Here is my input dataframe:
InputDF:
id | Game | Time
1 A 10
2 B 100
1 A 100
2 C 105
1 N 103
2 B 102
1 N 90
2 C 110
And this is the output that I am expecting:
OutputDF:
id | Game | Time|
1 N 193
1 A 110
2 C 215
2 B 202
Here is what I have tried, but it is not working as expected:
val windowDF = Window.partitionBy($"id").orderBy($"Time".desc)
InputDF.withColumn("rank", row_number().over(windowDF))
.filter("rank<=10")
Your top-N ranking is applied to the individual time values rather than to the total time per game. A groupBy/sum to compute the total time, followed by ranking on that total, will do:
val df = Seq(
(1, "A", 10),
(2, "B", 100),
(1, "A", 100),
(2, "C", 105),
(1, "N", 103),
(2, "B", 102),
(1, "N", 90),
(2, "C", 110)
).toDF("id", "game", "time")
import org.apache.spark.sql.expressions.Window
val win = Window.partitionBy($"id").orderBy($"total_time".desc)
df.
groupBy("id", "game").agg(sum("time").as("total_time")).
withColumn("rank", row_number.over(win)).
where($"rank" <= 10).
show
// +---+----+----------+----+
// | id|game|total_time|rank|
// +---+----+----------+----+
// | 1| N| 193| 1|
// | 1| A| 110| 2|
// | 2| C| 215| 1|
// | 2| B| 202| 2|
// +---+----+----------+----+
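A side note for the general top-N case: row_number breaks ties arbitrarily, so if two games ended up with the same total time only one of them would survive the filter. Should tied totals matter, dense_rank can be swapped in; the sketch below (reusing df and win from above, and taking the top 2 per id as in the expected output) is just that variation:
// dense_rank lets games tied on total_time share a rank; adjust the threshold for your N
df.
  groupBy("id", "game").agg(sum("time").as("total_time")).
  withColumn("rank", dense_rank().over(win)).
  where($"rank" <= 2).
  show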
I have a dataframe with the below structure:
+----------+------+------+----------------+--------+------+
| date|market|metric|aggregator_Value|type |rank |
+----------+------+------+----------------+--------+------+
|2018-08-05| m1| 16 | m1|median | 1 |
|2018-08-03| m1| 5 | m1|median | 2 |
|2018-08-01| m1| 10 | m1|mean | 3 |
|2018-08-05| m2| 35 | m2|mean | 1 |
|2018-08-03| m2| 25 | m2|mean | 2 |
|2018-08-01| m2| 5 | m2|mean | 3 |
+----------+------+------+----------------+--------+------+
In this dataframe, the rank column is calculated by ordering on date within each grouping of the market column, like this:
val w_rank = Window.partitionBy("market").orderBy(desc("date"))
val outputDF2=outputDF1.withColumn("rank",rank().over(w_rank))
I want to extract the concatenated values of the metric column in the output dataframe for the rank = 1 row, with the condition that if type = "median" in the rank = 1 row, then all the metric values for that market are concatenated; otherwise, if type = "mean" in the rank = 1 row, only the previous 2 metric values are concatenated. Like this
+----------+------+------+----------------+--------+---------+
| date|market|metric|aggregator_Value|type |result |
+----------+------+------+----------------+--------+---------+
|2018-08-05| m1| 16 | m1|median |10|5|16 |
|2018-08-05| m2| 35 | m2|mean |25|35 |
+----------+------+------+----------------+--------+---------+
How can I achieve this?
You could nullify the metric column according to the specified condition and apply collect_list followed by concat_ws to get the wanted result, as shown below:
val df = Seq(
("2018-08-05", "m1", 16, "m1", "median", 1),
("2018-08-03", "m1", 5, "m1", "median", 2),
("2018-08-01", "m1", 10, "m1", "mean", 3),
("2018-08-05", "m2", 35, "m2", "mean", 1),
("2018-08-03", "m2", 25, "m2", "mean", 2),
("2018-08-01", "m2", 5, "m2", "mean", 3)
).toDF("date", "market", "metric", "aggregator_value", "type", "rank")
import org.apache.spark.sql.expressions.Window

val win_desc = Window.partitionBy("market").orderBy(desc("date"))
val win_asc = Window.partitionBy("market").orderBy(asc("date"))
df.
withColumn("rank1_type", first($"type").over(win_desc.rowsBetween(Window.unboundedPreceding, 0))).
withColumn("cond_metric", when($"rank1_type" === "mean" && $"rank" > 2, null).otherwise($"metric")).
withColumn("result", concat_ws("|", collect_list("cond_metric").over(win_asc))).
where($"rank" === 1).
show
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// | date|market|metric|aggregator_value| type|rank|rank1_type|cond_metric| result|
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// |2018-08-05| m1| 16| m1|median| 1| median| 16|10|5|16|
// |2018-08-05| m2| 35| m2| mean| 1| mean| 35| 25|35|
// +----------+------+------+----------------+------+----+----------+-----------+-------+
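If you want only the columns from your desired result, finish with a select over the chained result; resultDF below is just an illustrative name for the chain above assigned to a val:
// resultDF = the chained transformation above, assigned to a val (name is illustrative)
resultDF.
  select("date", "market", "metric", "aggregator_value", "type", "result").
  show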
I have a dataframe like this:
person_id ar_id new_value
101 5 Y
102 6 N
103 7 Full Time
104 8 Training
When I am executing:
val ar_id = Seq("5","6","7","8")
df.groupBy("person_id").pivot("ar_id",ar_id).agg(expr("coalesce(first(new_value), \"null\")"))
The output I am getting is:
person_id 5 6 7 8
101 Y null null null
102 null N null null
103 null null Time null
104 null null null Trainer
But my requirement is to have a different column name for each value, say 5 is status, 6 is manager, 7 is availability and 8 is role, like below:
person_id status manager availability role
101 Y null null null
102 null N null null
103 null null Time null
104 null null null Trainer
Please help. Thanks
If you want to rename columns 5,6,7,8 to status, manager, availability, role, you can do the following:
val renames = Map("5"->"status", "6"->"manager", "7"->"availability", "8"->"role")
val newDF = renames.foldLeft(df){ case (d, (key, value)) => d.withColumnRenamed(key, value) }
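Since the pivot in your question passes the values explicitly (Seq("5","6","7","8")), the column order of the result is known, so another option is a plain toDF with the full list of names. Just an alternative sketch, where pivotedDF stands for the output of your groupBy/pivot/agg:
// pivotedDF = result of df.groupBy("person_id").pivot("ar_id", ar_id).agg(...) from the question
val renamedDF = pivotedDF.toDF("person_id", "status", "manager", "availability", "role")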
Spark 2.4.3
scala> var df= spark.createDataFrame(Seq((101,5,"Y"),(102,6,"N"),(103,7,"Full Time"),(104,8,"Training"))).toDF("person_id", "ar_id" ,"new_value")
scala> var df_v1 = df.groupBy("person_id").pivot($"ar_id").agg(expr("coalesce(first(new_value), \"null\")"))
scala> df_v1.show
+---------+----+----+---------+--------+
|person_id| 5| 6| 7| 8|
+---------+----+----+---------+--------+
| 101| Y|null| null| null|
| 103|null|null|Full Time| null|
| 102|null| N| null| null|
| 104|null|null| null|Training|
+---------+----+----+---------+--------+
1. Create a Map of the columns to be renamed
scala> val lookup = Map("5" -> "status", "6" -> "manager","7" -> "availability","8" -> "role")
2. Then use the map function to rename the columns:
scala> df_v1.select(df_v1.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*).show()
+---------+------+-------+------------+--------+
|person_id|status|manager|availability| role|
+---------+------+-------+------------+--------+
| 101| Y| null| null| null|
| 103| null| null| Full Time| null|
| 102| null| N| null| null|
| 104| null| null| null|Training|
+---------+------+-------+------------+--------+
Hope this helps you retrieve your desired output; let me know if you have any questions related to the same.
**DF1**
120 D
120 E
125 F

**DF2**
A
B
C
D
E
F
G
H

**output_DF**
120 null A
120 null B
120 null C
120 D D
120 E E
120 null F
120 null G
120 null H
125 null A
125 null B
125 null C
125 null D
125 null E
125 F F
125 null G
125 null H
From dataframes 1 and 2, I need to get the final output dataframe in spark-shell,
where A, B, C, D, E, F are in date format (yyyy-MM-dd) and 120, 125 are values of the ticket_id column, of which there are thousands.
I have just extracted a couple of them here.
To get the expected result you can use df.join() and df.na.fill() (as mentioned in comments), like this:
For Spark 2.0+
val resultDF = df1.select("col1").distinct.collect.map(_.getInt(0))
  .map(id => df1.filter(s"col1 = $id").join(df2, df1("col2") === df2("value"), "right").na.fill(id))
  .reduce(_ union _)
For Spark 1.6
val resultDF = df1.select("col1").distinct.collect.map(_.getInt(0))
  .map(id => df1.filter(s"col1 = $id").join(df2, df1("col2") === df2("value"), "right").na.fill(id))
  .reduce(_ unionAll _)
It will give you the following result:
+---+----+-----+
|120|null| A|
|120|null| B|
|120|null| C|
|120| D| D|
|120| E| E|
|120|null| F|
|120|null| G|
|120|null| H|
|125|null| A|
|125|null| B|
|125|null| C|
|125|null| D|
|125|null| E|
|125| F| F|
|125|null| G|
|125|null| H|
+---+----+-----+
I hope it helps!
Cross join the unique ids with all possible values, then left join with the original dataframe:
import hiveContext.implicits._
val df1Data = List((120, "D"), (120, "E"), (125, "F"))
val df2Data = List("A", "B", "C", "D", "E", "F", "G", "H")
val df1 = sparkContext.parallelize(df1Data).toDF("id", "date")
val df2 = sparkContext.parallelize(df2Data).toDF("date")
// get unique ID: 120, 125
val uniqueIDDF = df1.select(col("id")).distinct()
val fullJoin = uniqueIDDF.join(df2)
val result = fullJoin.as("full").join(df1.as("df1"), col("full.id") === col("df1.id") && col("full.date") === col("df1.date"), "left_outer")
val sorted = result.select(col("full.id"), col("df1.date"), col("full.date")).sort(col("full.id"), col("full.date"))
sorted.show(false)
output:
+---+----+----+
|id |date|date|
+---+----+----+
|120|null|A |
|120|null|B |
|120|null|C |
|120|D |D |
|120|E |E |
|120|null|F |
|120|null|G |
|120|null|H |
|125|null|A |
|125|null|B |
|125|null|C |
|125|null|D |
|125|null|E |
|125|F |F |
|125|null|G |
|125|null|H |
+---+----+----+
The sorting here is just to show the result in the same order; it can be skipped.
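One small caveat with the select above: the result ends up with two columns both named date, which is awkward to work with afterwards. Aliasing them in the select avoids that; the alias names below are only illustrative:
// alias the two date columns so the result has unique column names
val labelled = result.select(
  col("full.id"),
  col("df1.date").alias("matched_date"), // date present in DF1 for this id, null otherwise
  col("full.date").alias("all_date")     // every date from DF2
).sort(col("id"), col("all_date"))

labelled.show(false)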