I would like to create a new column containing the value from the previous date (the largest date less than the current date) within each group of ids, for the dataframe below.
+---+----------+-----+
| id| date|value|
+---+----------+-----+
| a|2015-04-11| 300|
| a|2015-04-12| 400|
| a|2015-04-12| 200|
| a|2015-04-12| 100|
| a|2015-04-11| 700|
| b|2015-04-02| 100|
| b|2015-04-12| 100|
| c|2015-04-12| 400|
+---+----------+-----+
I have tried the lead window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = Seq(("a","2015-04-11",300),("a","2015-04-12",400),("a","2015-04-12",200),("a","2015-04-12",100),("a","2015-04-11",700),("b","2015-04-02",100),("b","2015-04-12",100),("c","2015-04-12",400)).toDF("id","date","value")
val w1 = Window.partitionBy("id").orderBy(col("date").desc)
val leadc1 = lead(df1("value"), 1).over(w1)
val df2 = df1.withColumn("nvalue", leadc1)
+---+----------+-----+------+
| id| date|value|nvalue|
+---+----------+-----+------+
| a|2015-04-12| 400| 200|
| a|2015-04-12| 200| 100|
| a|2015-04-12| 100| 300|
| a|2015-04-11| 300| 700|
| a|2015-04-11| 700| null|
| b|2015-04-12| 100| 100|
| b|2015-04-02| 100| null|
| c|2015-04-12| 400| null|
+---+----------+-----+------+
But as you can see, when id "a" has duplicate dates I get the wrong result. The result should be:
+---+----------+-----+------+
| id| date|value|nvalue|
+---+----------+-----+------+
| a|2015-04-12| 400| 300|
| a|2015-04-12| 200| 300|
| a|2015-04-12| 100| 300|
| a|2015-04-11| 300| null|
| a|2015-04-11| 700| null|
| b|2015-04-12| 100| 100|
| b|2015-04-02| 100| null|
| c|2015-04-12| 400| null|
+---+----------+-----+------+
I already have a solution using a join, but I am looking for a solution using window functions.
Thanks
The issue is that you have multiple rows with the same date. lead takes the value from the next row in the result set, not from the next date, so when you sort the rows by date in descending order the next row can have the same date as the current one.
How do you identify the correct value to use for a particular date? For example, why do you take 300 from (id=a, date=2015-04-11) and not 700?
To do this with window functions you may need multiple passes: the second pass would take the last nvalue and apply it to all rows in the same id/date grouping. But I'm not sure how your rows are initially ordered.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = Seq(("a","2015-04-11",300),("a","2015-04-12",400),("a","2015-04-12",200),("a","2015-04-12",100),("a","2015-04-11",700),("b","2015-04-02",100),("b","2015-04-12",100),("c","2015-04-12",400)).toDF("id","date","value")
val w1 = Window.partitionBy("id").orderBy(col("date").desc)
val leadc1 = lead(df1("value"), 1).over(w1)
val df2 = df1.withColumn("nvalue", leadc1)
val w2 = Window.partitionBy("id", "date").orderBy("??? some way to distinguish row ordering").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val df3 = df2.withColumn("nvalue2", last("nvalue").over(w2))
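For what it's worth, here is one way the two-pass idea could be completed, as a rough sketch rather than a definitive answer. It assumes ties within a date are broken by value in ascending order, which is what makes 300 (rather than 700) the value carried over from 2015-04-11 in the expected output; if your data has a real ordering column, use that instead.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Pass 1: lead over each id, ordered by date descending; value is only an assumed tie-breaker
val wLead = Window.partitionBy("id").orderBy(col("date").desc, col("value"))

// Pass 2: spread the last lead value of each (id, date) group to every row in that group
val wGroup = Window.partitionBy("id", "date").orderBy(col("value"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = df1
  .withColumn("nvalue_raw", lead(col("value"), 1).over(wLead))
  .withColumn("nvalue", last("nvalue_raw").over(wGroup))
  .drop("nvalue_raw")

result.orderBy(col("id"), col("date").desc).show()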
I need some help. I have two dataframes, one has a few dates and the other has my significant data, catalogued by date.
It goes something like this:
First df, with the relevant data
+------+----------+---------------+
| id| test_date| score|
+------+----------+---------------+
| 1|2021-03-31| 94|
| 1|2021-01-31| 93|
| 1|2020-12-31| 100|
| 1|2020-06-30| 95|
| 1|2019-10-31| 58|
| 1|2017-10-31| 78|
| 2|2020-01-31| 79|
| 2|2018-03-31| 66|
| 2|2016-05-31| 77|
| 3|2021-05-31| 97|
| 3|2020-07-31| 100|
| 3|2019-07-31| 99|
| 3|2019-06-30| 98|
| 3|2018-07-31| 91|
| 3|2018-02-28| 86|
| 3|2017-11-30| 82|
+------+----------+---------------+
Second df, with the dates
+--------------+--------------+--------------+
| eval_date_1| eval_date_2| eval_date_3|
+--------------+--------------+--------------+
| 2021-01-31| 2020-10-31| 2019-06-30|
+--------------+--------------+--------------+
Needed DF
+------+--------------+---------+--------------+---------+--------------+---------+
| id| eval_date_1| score_1 | eval_date_2| score_2 | eval_date_3| score_3 |
+------+--------------+---------+--------------+---------+--------------+---------+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
+------+--------------+---------+--------------+---------+--------------+---------+
So, for instance, for the first id, the needed df takes the scores from the second, fourth and sixth rows of the first df. Those are the most recent test dates that are less than or equal to the corresponding eval_date in the second df.
Assuming df is your main dataframe and df_date is the one which contains only dates.
from functools import reduce
from pyspark.sql import functions as F, Window as W

df_final = reduce(
    lambda a, b: a.join(b, on="id"),
    (
        df.join(
            F.broadcast(df_date.select(f"eval_date_{i}")),
            on=F.col(f"eval_date_{i}") >= F.col("test_date"),
        )
        .withColumn(
            "rnk",
            F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
        )
        .where("rnk = 1")
        .select("id", f"eval_date_{i}", "score")
        for i in range(1, 4)
    ),
)

df_final.show()
+---+-----------+-----+-----------+-----+-----------+-----+
| id|eval_date_1|score|eval_date_2|score|eval_date_3|score|
+---+-----------+-----+-----------+-----+-----------+-----+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
+---+-----------+-----+-----------+-----+-----------+-----+
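Note that the result above ends up with three columns all literally named score. If you want them named score_1, score_2 and score_3 as in the needed DF, a small variation of the same code (a sketch, with only the final select changed to alias the score column) would be:

from functools import reduce
from pyspark.sql import functions as F, Window as W

df_final = reduce(
    lambda a, b: a.join(b, on="id"),
    (
        df.join(
            F.broadcast(df_date.select(f"eval_date_{i}")),
            on=F.col(f"eval_date_{i}") >= F.col("test_date"),
        )
        .withColumn(
            "rnk",
            F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
        )
        .where("rnk = 1")
        # only change: give each score column a distinct, numbered name
        .select("id", f"eval_date_{i}", F.col("score").alias(f"score_{i}"))
        for i in range(1, 4)
    ),
)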
I have a dataset that has userid and index columns.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a second data frame to it and assign an index to the newly added rows.
The userid is unique, and the existing data frame will not contain the Dataframe 2 user ids.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with the newly added userids and indexes:
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by taking the max index value from the first Dataframe and starting the index for the second Dataframe from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df_merge = df1.select('userid').union(df2.select('userid'))
w = Window.orderBy('userid')
df_result = df_merge.withColumn('indexid', F.row_number().over(w))
EDIT: after discussion in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)], schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)], schema=['userid','dummy'])

#%% Merge the two dataframes, with a null column as the index for the new rows
df1 = df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))

#%% Define a window that places the newly added rows last and orders them by userid.
#   The userids, even though random strings, can be ordered.
#   If possible add a partition column here; otherwise all the data lands in one partition (consider salting).
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))

#%% For the newly added rows, set the index to the last non-null index plus a running row count.
#   If the number of rows in the main dataframe is huge, add an offset in the line below.
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w))
)
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
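To answer the question as literally asked (pass the max index of the existing frame and continue from there), a minimal sketch could look like the following. It assumes the existing dataframe, here called df_main, has userid and index, and the new one, df_new, has only userid; these names are illustrative.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Current maximum index in the existing dataframe
max_index = df_main.agg(F.max("index")).collect()[0][0]

# Number the new userids and shift them past the existing maximum.
# A global orderBy uses a single partition, which is fine while df_new is small.
w = Window.orderBy("userid")
df_new_indexed = df_new.withColumn("index", F.row_number().over(w) + F.lit(max_index))

df_result = df_main.unionByName(df_new_indexed)
df_result.show()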
I have a Dataframe and am trying to create a new column from existing columns based on the following conditions:
1. Group the data by the column named event_type.
2. Filter only those rows where the column source has the value train, and call them X.
3. The values for the new column are X.sum / X.length.
Here is the input Dataframe:
+-----+-------------+----------+--------------+------+
| id| event_type| location|fault_severity|source|
+-----+-------------+----------+--------------+------+
| 6597|event_type 11|location 1| -1| test|
| 8011|event_type 15|location 1| 0| train|
| 2597|event_type 15|location 1| -1| test|
| 5022|event_type 15|location 1| -1| test|
| 5022|event_type 11|location 1| -1| test|
| 6852|event_type 11|location 1| -1| test|
| 6852|event_type 15|location 1| -1| test|
| 5611|event_type 15|location 1| -1| test|
|14838|event_type 15|location 1| -1| test|
|14838|event_type 11|location 1| -1| test|
| 2588|event_type 15|location 1| 0| train|
| 2588|event_type 11|location 1| 0| train|
+-----+-------------+----------+--------------+------+
and I want the following output:
+--------------+------------+-----------+
| | event_type | PercTrain |
+--------------+------------+-----------+
|event_type 11 | 7888 | 0.388945 |
|event_type 35 | 6615 | 0.407105 |
|event_type 34 | 5927 | 0.406783 |
|event_type 15 | 4395 | 0.392264 |
|event_type 20 | 1458 | 0.382030 |
+--------------+------------+-----------+
I have tried this code, but it throws an error:
EventSet.withColumn("z" , when($"source" === "train" , sum($"source") / length($"source"))).groupBy("fault_severity").count().show()
Here EventSet is the input dataframe.
Python code that gives the desired output is:
event_type_unq['PercTrain'] = event_type.pivot_table(values='source',index='event_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
I guess you want to obtain the percentage of train values. So, here is my code:
val df2 = df.select($"event_type", $"source")
  .groupBy($"event_type")
  .pivot($"source")
  .agg(count($"source"))
  .withColumn("PercTrain", $"train" / ($"train" + $"test"))

df2.show
and gives the result as follows:
+-------------+----+-----+------------------+
| event_type|test|train| PercTrain|
+-------------+----+-----+------------------+
|event_type 11| 4| 1| 0.2|
|event_type 15| 5| 2|0.2857142857142857|
+-------------+----+-----+------------------+
Hope this is helpful.
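As a side note, the same percentage can also be computed without a pivot, closer to the pandas pivot_table one-liner. This is only a sketch, assuming the same EventSet dataframe:

import org.apache.spark.sql.functions._

// Fraction of rows per event_type whose source is "train"
val percTrain = EventSet
  .groupBy("event_type")
  .agg(avg(when(col("source") === "train", 1.0).otherwise(0.0)).alias("PercTrain"))

percTrain.show()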
I have a simple dataset, as shown below.
+---+--------+---------+---------+
| id|    name|  country|languages|
+---+--------+---------+---------+
|  1|     Bob|      USA|  Spanish|
|  2|Angelina|   France|     null|
|  3|    Carl|   Brazil|     null|
|  4|    John|Australia|  English|
|  5|    Anne|    Nepal|     null|
+---+--------+---------+---------+
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing happens. The column that is supposed to have its null values filled, temp_filled_spark, remains unchanged, i.e. a copy of the original languages column.
import sys

from pyspark.sql import Window
from pyspark.sql.functions import last

window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?
We can create a window that treats the entire dataframe as one partition, as follows:
>>> import sys
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
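A small caveat: ordering the window by a literal leaves the row order within the single partition up to Spark, so the fill above is not guaranteed to be deterministic. If a column such as id reflects the intended row order (an assumption), ordering by it is safer:

import sys
from pyspark.sql import functions as F
from pyspark.sql import Window

# Forward-fill languages in id order over a single, whole-frame partition
w = Window.orderBy('id').rowsBetween(-sys.maxsize, 0)
df1.select("*", F.last('languages', ignorenulls=True).over(w).alias('newcol')).show()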
Hope this helps!
I need to perform the operation below on dataframes using the windowing functions lag and lead.
For each key, I need to perform the following inserts and updates in the final output.
Insert condition:
1. By default, the record with LAYER_NO=0 needs to be written to the output.
2. If there is any change in the value of COL1, COL2 or COL3 with respect to its previous record, then that record needs to be written to the output.
Example: for key_1 with layer_no=2, the value of COL3 changes from 400 to 600.
Update condition:
1. If there were no changes in the values of COL1, COL2 and COL3 with respect to the previous record, but there is a change in the DEPART column, then this value needs to be updated in the output.
Example: for key_1 with layer_no=3, there were no changes in COL1, COL2, COL3, but the DEPART column changes to "xyz", so this needs to be updated in the output.
2. The LAYER_NO should also be renumbered sequentially after inserting the record with layer_no=0.
val inputDF = values.toDF("KEY","LAYER_NO","COl1","COl2","COl3","DEPART")
inputDF.show()
+-----+--------+----+----+----+------+
| KEY|LAYER_NO|COL1|COL2|COL3|DEPART|
+-----+--------+----+----+----+------+
|key_1| 0| 200| 300| 400| abc|->default write
|key_1| 1| 200| 300| 400| abc|
|key_1| 2| 200| 300| 600| uil|--->change in col3,so write
|key_1| 2| 200| 300| 600| uil|
|key_1| 3| 200| 300| 600| xyz|--->change in col4,so update
|key_2| 0| 500| 700| 900| prq|->default write
|key_2| 1| 888| 555| 900| tep|--->change in col1 & col 2,so write
|key_3| 0| 111| 222| 333| lgh|->default write
|key_3| 1| 084| 222| 333| lgh|--->change in col1,so write
|key_3| 2| 084| 222| 333| rrr|--->change in col4,so update
+-----+--------+----+----+----+------+
Expected Output:
outputDF.show()
+-----+--------+----+----+----+------+
| KEY|LAYER_NO|COl1|COl2|COl3|DEPART|
+-----+--------+----+----+----+------+
|key_1| 0| 200| 300| 400| abc|
|key_1| 1| 200| 300| 600| xyz|
|key_2| 0| 500| 700| 900| prq|
|key_2| 1| 888| 555| 900| tep|
|key_3| 0| 111| 222| 333| lgh|
|key_3| 1| 084| 222| 333| rrr|
+-----+--------+----+----+----+------+
We need to define two windows to arrive at your expected output: one for checking the change in the DEPART column, and a second for checking the difference in the sum of COL1 to COL3.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val w_col = Window.partitionBy("KEY", "COL1", "COL2", "COL3").orderBy("LAYER_NO")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val w_key = Window.partitionBy("KEY").orderBy("LAYER_NO")
Then we simply replace the values in the DEPART column with the correct values, and filter the data to rows where the lagged sum differs from the current sum of the columns (plus the rows where LAYER_NO === 0). Lastly, we replace LAYER_NO with the rank.
inputDF
  .withColumn("DEPART", last("DEPART").over(w_col))
  .withColumn("row_sum", $"COL1" + $"COL2" + $"COL3")
  .withColumn("lag_sum", lag($"row_sum", 1).over(w_key))
  .filter($"LAYER_NO" === 0 || not($"row_sum" === $"lag_sum"))
  .withColumn("LAYER_NO", rank().over(w_key) - 1)
  .drop("row_sum", "lag_sum")
  .show()
+-----+--------+----+----+----+------+
| KEY|LAYER_NO|COl1|COl2|COl3|DEPART|
+-----+--------+----+----+----+------+
|key_1| 0| 200| 300| 400| abc|
|key_1| 1| 200| 300| 600| xyz|
|key_2| 0| 500| 700| 900| prq|
|key_2| 1| 888| 555| 900| tep|
|key_3| 0| 111| 222| 333| lgh|
|key_3| 1| 084| 222| 333| rrr|
+-----+--------+----+----+----+------+
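One caveat on the sum-based comparison, as a hedged side note: in principle it can miss a change where two columns move by offsetting amounts (e.g. +100 in COL1 and -100 in COL2). If that matters for your data, a variant of the same approach compares the columns individually with lag, reusing inputDF, w_key and w_col from above:

import org.apache.spark.sql.functions._

// Change in any of COL1..COL3 relative to the previous record within the key
// (null for the first record per key, which is kept by the LAYER_NO === 0 condition)
val changed = Seq("COL1", "COL2", "COL3")
  .map(c => lag(col(c), 1).over(w_key) =!= col(c))
  .reduce(_ || _)

inputDF
  .withColumn("DEPART", last("DEPART").over(w_col))
  .filter(col("LAYER_NO") === 0 || changed)
  .withColumn("LAYER_NO", rank().over(w_key) - 1)
  .show()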