Dataframe get first and last value of corresponding column - scala

Is it possible to get the first value of the corresponding column within a subgroup?
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
object tmp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1235, 1, 1101, 0),
      (1235, 2, 1102, 0),
      (1235, 3, 1103, 1),
      (1235, 4, 1104, 1),
      (1235, 5, 1105, 0),
      (1235, 6, 1106, 0),
      (1235, 7, 1107, 1),
      (1235, 8, 1108, 1),
      (1235, 9, 1109, 1),
      (1235, 10, 1110, 0),
      (1235, 11, 1111, 0)
    ).toDF("SERVICE_ID", "COUNTER", "EVENT_ID", "FLAG")

    lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")

    val firsts = input.withColumn("first_value",
      first("EVENT_ID", ignoreNulls = true).over(window.rangeBetween(Long.MinValue, Long.MaxValue)))
    firsts.orderBy("SERVICE_ID", "COUNTER").show()
  }
}
The output I want: the first (or previous) value of column EVENT_ID based on FLAG = 1, and the last (or next) value of column EVENT_ID based on FLAG = 1, partitioned by SERVICE_ID and sorted by COUNTER.
+----------+-------+--------+----+-----------+----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|last_value|
+----------+-------+--------+----+-----------+----------+
|      1235|      1|    1101|   0|          0|      1103|
|      1235|      2|    1102|   0|          0|      1103|
|      1235|      3|    1103|   1|          0|      1106|
|      1235|      4|    1104|   0|       1103|      1106|
|      1235|      5|    1105|   0|       1103|      1106|
|      1235|      6|    1106|   1|          0|      1108|
|      1235|      7|    1107|   0|       1106|      1108|
|      1235|      8|    1108|   1|          0|      1109|
|      1235|      9|    1109|   1|          0|      1110|
|      1235|     10|    1110|   1|          0|         0|
|      1235|     11|    1111|   0|       1110|         0|
|      1235|     12|    1112|   0|       1110|         0|
+----------+-------+--------+----+-----------+----------+

First, the dataframe needs to be split into groups. A new group starts each time the "FLAG" column equals 1. To do this, first add a column "ID" to the dataframe:
lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")

val df_flag = input.filter($"FLAG" === 1)
  .withColumn("ID", row_number().over(window))
val df_other = input.filter($"FLAG" =!= 1)
  .withColumn("ID", lit(0))

// Create a group for each flag event
val df = df_flag.union(df_other)
  .withColumn("ID", max("ID").over(window.rowsBetween(Long.MinValue, 0)))
  .cache()
df.show() gives:
+----------+-------+--------+----+---+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG| ID|
+----------+-------+--------+----+---+
| 1235| 1| 1111| 1| 1|
| 1235| 2| 1112| 0| 1|
| 1235| 3| 1114| 0| 1|
| 1235| 4| 2221| 1| 2|
| 1235| 5| 2225| 0| 2|
| 1235| 6| 2226| 0| 2|
| 1235| 7| 2227| 1| 3|
+----------+-------+--------+----+---+
Now that we have a column separating the events, we need to add the correct "EVENT_ID" (renamed "first_value") to each event. In addition to the "first_value", calculate and add a second column "last_value", which is the id of the next flagged event.
val df_event = df.filter($"FLAG" === 1)
  .select("EVENT_ID", "ID", "SERVICE_ID", "COUNTER")
  .withColumnRenamed("EVENT_ID", "first_value")
  .withColumn("last_value", lead($"first_value", 1, 0).over(window))
  .drop("COUNTER")

val df_final = df.join(df_event, Seq("ID", "SERVICE_ID"))
  .drop("ID")
  .withColumn("first_value", when($"FLAG" === 1, lit(0)).otherwise($"first_value"))
df_final.show() gives us:
+----------+-------+--------+----+-----------+----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|last_value|
+----------+-------+--------+----+-----------+----------+
| 1235| 1| 1111| 1| 0| 2221|
| 1235| 2| 1112| 0| 1111| 2221|
| 1235| 3| 1114| 0| 1111| 2221|
| 1235| 4| 2221| 1| 0| 2227|
| 1235| 5| 2225| 0| 2221| 2227|
| 1235| 6| 2226| 0| 2221| 2227|
| 1235| 7| 2227| 1| 0| 0|
+----------+-------+--------+----+-----------+----------+
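For comparison, the same output can be sketched with windows only, skipping the group-ID and join steps. This is a minimal, untested sketch that assumes the question's schema (SERVICE_ID, COUNTER, EVENT_ID, FLAG), the convention that 0 means "no flagged event before/after", and that import spark.implicits._ is in scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byCounter = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")
// Frames strictly before and strictly after the current row.
val prevW = byCounter.rowsBetween(Window.unboundedPreceding, -1)
val nextW = byCounter.rowsBetween(1, Window.unboundedFollowing)

val withBounds = input
  .withColumn("prev_flagged",
    last(when($"FLAG" === 1, $"EVENT_ID"), ignoreNulls = true).over(prevW))
  .withColumn("next_flagged",
    first(when($"FLAG" === 1, $"EVENT_ID"), ignoreNulls = true).over(nextW))
  .withColumn("first_value",
    when($"FLAG" === 1, lit(0)).otherwise(coalesce($"prev_flagged", lit(0))))
  .withColumn("last_value", coalesce($"next_flagged", lit(0)))
  .drop("prev_flagged", "next_flagged")
withBounds.orderBy("SERVICE_ID", "COUNTER").show()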

This can be solved in two steps:

1. Get the events with FLAG == 1 and the valid COUNTER range for each such event.
2. Join the result of step 1 with the input, by range.

Some column renaming is included for visibility and can be shortened:
val window = Window.partitionBy("SERVICE_ID").orderBy("COUNTER").rowsBetween(Window.currentRow, 1)

val eventRangeDF = input.where($"FLAG" === 1)
  .withColumn("RANGE_END", max($"COUNTER").over(window))
  .withColumnRenamed("COUNTER", "RANGE_START")
  .select("SERVICE_ID", "EVENT_ID", "RANGE_START", "RANGE_END")
eventRangeDF.show(false)

val result = input.where($"FLAG" === 0).as("i")
  .join(eventRangeDF.as("e"),
    expr("e.SERVICE_ID=i.SERVICE_ID And i.COUNTER>e.RANGE_START and i.COUNTER<e.RANGE_END"))
  .select($"i.SERVICE_ID", $"i.COUNTER", $"i.EVENT_ID", $"i.FLAG", $"e.EVENT_ID".alias("first_value"))
  // include FLAG=1
  .union(input.where($"FLAG" === 1)
    .select($"SERVICE_ID", $"COUNTER", $"EVENT_ID", $"FLAG", lit(0).alias("first_value")))
result.sort("COUNTER").show(false)
Output:
+----------+--------+-----------+---------+
|SERVICE_ID|EVENT_ID|RANGE_START|RANGE_END|
+----------+--------+-----------+---------+
|1235 |1111 |1 |4 |
|1235 |2221 |4 |7 |
|1235 |2227 |7 |7 |
+----------+--------+-----------+---------+
+----------+-------+--------+----+-----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|
+----------+-------+--------+----+-----------+
|1235 |1 |1111 |1 |0 |
|1235 |2 |1112 |0 |1111 |
|1235 |3 |1114 |0 |1111 |
|1235 |4 |2221 |1 |0 |
|1235 |5 |2225 |0 |2221 |
|1235 |6 |2226 |0 |2221 |
|1235 |7 |2227 |1 |0 |
+----------+-------+--------+----+-----------+

Related

Selecting rows by data corresponding to other rows of the same dataframe

I'm struggling to select the rows of my dataframe. The selection depends on the data inside the same dataframe.
My dataset looks something like this:
from pyspark.sql.session import SparkSession
sc = SparkSession.builder.getOrCreate()
columns = ['Id', 'ActorId', 'EventId', 'Time']
vals = [(3, 3, 'START', '2020-06-22'),
(4, 3, 'END', '2020-06-24'),
(5, 3, 'OTHER', '2019-01-15'),
(6, 3, 'OTHER', '2020-07-24'),
(7, 3, 'OTHER', '2020-06-23'),
(8, 4, 'START', '2018-01-15'),
(9, 4, 'END', '2019-01-14'),
(10, 4, 'OTHER', '2018-11-14')]
events = sc.createDataFrame(vals,columns)
events.show()
Which results in:
+---+-------+-------+----------+
| Id|ActorId|EventId| Time|
+---+-------+-------+----------+
| 3| 3| START|2020-06-22|
| 4| 3| END|2020-06-24|
| 5| 3| OTHER|2019-01-15|
| 6| 3| OTHER|2020-07-24|
| 7| 3| OTHER|2020-06-23|
| 8| 4| START|2018-01-15|
| 9| 4| END|2019-01-14|
| 10| 4| OTHER|2018-11-14|
+---+-------+-------+----------+
(Bear in mind that this is just an example, an extract of the data.)
I want to find all rows with EventId == OTHER where Time is not between the START and END events of the same ActorId.
The result should look like:
+---+-------+-------+----------+
| Id|ActorId|EventID| Time|
+---+-------+-------+----------+
| 5| 3| OTHER|2019-01-15|
| 6| 3| OTHER|2020-07-24|
+---+-------+-------+----------+
Thank you for your help!!!
This will solve your problem. There is only one assumption in the code below: that START and END in the EventId column always appear in the first and second line of each group.
from pyspark.sql import functions as F, types as T
from pyspark.sql.window import Window as W

_w = W.partitionBy('ActorId').orderBy('ActorId')
# START is assumed to be the first row and END the second row of each ActorId group.
events = events.withColumn('start_date', F.first('Time').over(_w))
events = events.withColumn('row_num', F.row_number().over(_w))
events = events.withColumn('end_date', F.when(F.col('row_num') == F.lit('2'), F.col('Time')))
events = events.withColumn('end_date', F.coalesce(
    F.when(F.col('row_num') == F.lit('2'), F.col('Time')), F.min('end_date').over(_w)))
# A row "passes" when its Time lies entirely outside the [start_date, end_date] interval.
events = events.withColumn('passed_col', F.when(
    ((F.col('Time').cast(T.TimestampType()) > F.col('start_date').cast(T.TimestampType())) &
     (F.col('Time').cast(T.TimestampType()) > F.col('end_date').cast(T.TimestampType()))) |
    ((F.col('Time').cast(T.TimestampType()) < F.col('start_date').cast(T.TimestampType())) &
     (F.col('Time').cast(T.TimestampType()) < F.col('end_date').cast(T.TimestampType()))),
    F.lit("Passed")))
events = events.select('Id', 'ActorId', 'EventId', 'Time', 'passed_col')
events.show()
+---+-------+-------+----------+----------+
| Id|ActorId|EventId| Time|passed_col|
+---+-------+-------+----------+----------+
| 3| 3| START|2020-06-22| null|
| 4| 3| END|2020-06-24| null|
| 5| 3| OTHER|2019-01-15| Passed|
| 6| 3| OTHER|2020-07-24| Passed|
| 7| 3| OTHER|2020-06-23| null|
| 8| 4| START|2018-01-15| null|
| 9| 4| END|2019-01-14| null|
| 10| 4| OTHER|2018-11-14| null|
+---+-------+-------+----------+----------+
Final answer, after filtering:
events = events.filter(F.col('passed_col') == F.lit('Passed')).select('Id', 'ActorId', 'EventId', 'Time', 'passed_col')
events.show()
+---+-------+-------+----------+----------+
| Id|ActorId|EventId| Time|passed_col|
+---+-------+-------+----------+----------+
| 5| 3| OTHER|2019-01-15| Passed|
| 6| 3| OTHER|2020-07-24| Passed|
+---+-------+-------+----------+----------+
// Assumes the sample data has been loaded into a Scala DataFrame named `vals`
// and that import spark.implicits._ is in scope for the 'symbol column syntax.
val res = vals
  .filter('EventId.equalTo("OTHER"))
  .filter('ActorId.equalTo(3))
  .filter(!'Time.between("2020-06-01", "2020-06-25"))
res.show(false)
// +---+-------+-------+----------+
// |Id |ActorId|EventId|Time |
// +---+-------+-------+----------+
// |5 |3 |OTHER |2019-01-15|
// |6 |3 |OTHER |2020-07-24|
// +---+-------+-------+----------+
or
val res = vals
.filter('EventId.equalTo("OTHER"))
.filter(!'Time.between("2018-01-01","2018-12-31"))
.filter(!'Time.between("2020-06-01","2020-06-25"))

Fill null or empty with next Row value with spark

Is there a way to replace null values in a Spark dataframe with the next non-null row value? An additional row_count column is added for window partitioning and ordering. More specifically, I'd like to achieve the following result:
+---------+----+      +---------+----+
|row_count|  id|      |row_count|  id|
+---------+----+      +---------+----+
|        1|null|      |        1| 109|
|        2| 109|      |        2| 109|
|        3|null|      |        3| 108|
|        4|null|      |        4| 108|
|        5| 108|  =>  |        5| 108|
|        6|null|      |        6| 110|
|        7| 110|      |        7| 110|
|        8|null|      |        8|null|
|        9|null|      |        9|null|
|       10|null|      |       10|null|
+---------+----+      +---------+----+
I tried the code below, but it is not giving the proper result.
val ss = dataframe.select($"*",
  sum(when(dataframe("id").isNull || dataframe("id") === "", 1).otherwise(0))
    .over(Window.orderBy($"row_count")) as "value")
val window1 = Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
val selectList = ss.withColumn("id_fill_from_below", last("id").over(window1))
  .drop($"row_count").drop($"value")
Here is an approach:

1. Filter the non-nulls (dfNonNulls).
2. Filter the nulls (dfNulls).
3. Find the right value for each null id, using a join and a Window function.
4. Fill the null dataframe (dfNullFills).
5. Union dfNonNulls and dfNullFills.
data.csv
row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,
var df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data.csv")
var dfNulls = df.filter(
$"id".isNull
).withColumnRenamed(
"row_count","row_count_nulls"
).withColumnRenamed(
"id","id_nulls"
)
val dfNonNulls = df.filter(
$"id".isNotNull
).withColumnRenamed(
"row_count","row_count_values"
).withColumnRenamed(
"id","id_values"
)
dfNulls = dfNulls.join(
dfNonNulls, $"row_count_nulls" lt $"row_count_values","left"
).select(
$"id_nulls",$"id_values",$"row_count_nulls",$"row_count_values"
)
val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")
val dfNullFills = dfNulls.withColumn(
"rn", row_number.over(window)
).where($"rn" === 1).drop("rn").select(
$"row_count_nulls".alias("row_count"),$"id_values".alias("id"))
dfNullFills.union(dfNonNulls).orderBy($"row_count").show()
which results in
+---------+----+
|row_count| id|
+---------+----+
| 1| 109|
| 2| 109|
| 3| 108|
| 4| 108|
| 5| 108|
| 6| 110|
| 7| 110|
| 8|null|
| 9|null|
| 10|null|
+---------+----+
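For reference, the same backfill can be sketched with a single window and first(..., ignoreNulls = true), assuming the row_count/id schema above. Note that a window without partitionBy pulls all rows into one partition, which is fine for small data but does not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Frame: the current row plus everything after it, ordered by row_count.
val following = Window
  .orderBy("row_count")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)

// Pick the next non-null id at or after each row; trailing nulls stay null.
val backfilled = df.withColumn("id", first("id", ignoreNulls = true).over(following))
backfilled.orderBy("row_count").show()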

Copy missed data from top/bottom row col values

I have a dataframe with index, category, and a few other columns. index and category are never empty/null, but the other columns can come in as null. When all other columns are null, we have to copy the values from the row above or below, within the same category.
val df = Seq(
  (1, 1, null, null, null),
  (2, 1, null, null, null),
  (3, 1, null, null, null),
  (4, 1, "123.12", "124.52", "95.98"),
  (5, 1, "452.12", "478.65", "1865.12"),
  (1, 2, "2014.21", "147", "265"),
  (2, 2, "1457", "12483.00", "215.21"),
  (3, 2, null, null, null),
  (4, 2, null, null, null)
).toDF("index", "category", "col1", "col2", "col3")
scala> df.show
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| null| null| null|
| 2| 1| null| null| null|
| 3| 1| null| null| null|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| null| null| null|
| 4| 2| null| null| null|
+-----+--------+-------+--------+-------+
The expected dataframe is as below:
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 2| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 3| 1| 123.12| 124.52| 95.98|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| 1457|12483.00| 215.21| // Copied from above for same category
| 4| 2| 1457|12483.00| 215.21| // Copied from above for same category
+-----+--------+-------+--------+-------+
Update: when several consecutive rows with nulls are possible, more advanced windows have to be used:
val cols = Seq("col1", "col2", "col3")
val beforeWindow = Window
.partitionBy("category")
.orderBy("index")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val afterWindow = Window
.partitionBy("category")
.orderBy("index")
.rangeBetween(Window.currentRow, Window.unboundedFollowing)
val result = cols.foldLeft(df)((updated, columnName) =>
updated.withColumn(columnName,
coalesce(col(columnName),
last(columnName, ignoreNulls = true).over(beforeWindow),
first(columnName, ignoreNulls = true).over(afterWindow)
))
)
The single-null case can be resolved with the window functions "lead" and "lag", combined with "coalesce":
val cols = Seq("col1", "col2", "col3")
val categoryWindow = Window.partitionBy("category").orderBy("index")
val result = cols.foldLeft(df)((updated, columnName) =>
updated.withColumn(columnName,
coalesce(col(columnName),
lag(col(columnName), 1).over(categoryWindow),
lead(col(columnName), 1).over(categoryWindow)
))
)
result.show(false)
Output:
+-----+--------+-------+--------+-------+
|index|category|col1 |col2 |col3 |
+-----+--------+-------+--------+-------+
|1 |1 |123.12 |124.52 |95.98 |
|2 |1 |123.12 |124.52 |95.98 |
|3 |1 |452.12 |478.65 |1865.12|
|1 |2 |2014.21|147 |265 |
|2 |2 |1457 |12483.00|215.21 |
|3 |2 |1.25 |3.45 |26.3 |
|4 |2 |1.25 |3.45 |26.3 |
+-----+--------+-------+--------+-------+
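One note on the frame choice (my reading, not part of the original answer): rangeBetween works here because index is numeric and ordered within each category. A rowsBetween frame expresses the same "everything before / everything after" idea in terms of physical rows and does not depend on the ordering values:
// Hypothetical row-based equivalents of beforeWindow and afterWindow above.
val beforeRows = Window.partitionBy("category").orderBy("index")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val afterRows = Window.partitionBy("category").orderBy("index")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)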

Scala/Spark drop duplicates based on other column specific value [duplicate]

This question already has answers here:
How to select the first row of each group?
I want to drop duplicate rows with the same id that do not have a specific value in another column (in this case, keep the rows that have the same id and value = 1).
Input df:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 6|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
Result I want:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 1| 3|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
This can be done by getting the rows where value is 1, and then left-joining them with the original data:
val df = List(
(3, 0, 2),
(3, 1, 3),
(4, 0, 6),
(4, 1, 5),
(5, 4, 6)
).toDF("id", "value", "sorted")
val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")
val result = joinedWithOriginal
.where($"one.value".isNull || $"one.value" === $"orig.value")
.select("orig.id", "orig.value", "orig.sorted")
result.show(false)
Output:
+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3 |1 |3 |
|4 |1 |5 |
|5 |4 |6 |
+---+-----+------+
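Since the question was closed as a duplicate of "How to select the first row of each group?", the window-based version is the other common route. A minimal sketch, assuming the same df and that a row with value = 1 should win whenever it exists for an id:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}

// Rows with value = 1 sort first within each id; any other row is a fallback.
val preferValueOne = Window
  .partitionBy("id")
  .orderBy(when($"value" === 1, 0).otherwise(1))

val result2 = df
  .withColumn("rn", row_number().over(preferValueOne))
  .where($"rn" === 1)
  .drop("rn")
result2.show(false)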

Pass Distinct value of one Dataframe into another Dataframe

I want to take the distinct values of a column from DataFrame A and pass them into DataFrame B's explode function to create repeated rows (DataFrame B) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct()
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
The function:
assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How do I pass the distinctSet values to assign_utilityId?
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take the unique values from DataFrame 1 and create a new column in DataFrame 2, like this:
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
|    0|   SUN|       0|      101|
|    0|   SUN|       1|      101|
|    0|   SUN|       0|      102|
|    0|   SUN|       1|      102|
+-----+------+--------+---------+
We don't need a UDF for this. I have tried it with some input; please check:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+
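The same idea in Scala, as a hedged sketch assuming the two DataFrames above have been created as Scala DataFrames with the same names (df with utilityId, df2 with index/status/timeSlot): in Spark 2.x an explicit crossJoin makes the Cartesian product intentional instead of relying on a condition-less join:
// Distinct utility ids, cross-joined onto every row of df2.
val utilityIds = df.select("utilityId").distinct()
val result = df2.crossJoin(utilityIds)
result.orderBy("timeSlot", "utilityId").show()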