Scala/Spark drop duplicates based on another column's specific value [duplicate] - scala

This question already has answers here:
How to select the first row of each group?
I want to drop duplicates with the same ID that do not have a specific value in another column (in this case, for IDs that appear more than once, keep only the rows with value = 1).
Input df:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 6|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
Result I want:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 1| 3|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+

This can be done by getting the rows where value is 1 and then left-joining them with the original data:
val df = List(
  (3, 0, 2),
  (3, 1, 3),
  (4, 0, 6),
  (4, 1, 5),
  (5, 4, 6)
).toDF("id", "value", "sorted")

val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")

val result = joinedWithOriginal
  .where($"one.value".isNull || $"one.value" === $"orig.value")
  .select("orig.id", "orig.value", "orig.sorted")

result.show(false)
Output:
+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3 |1 |3 |
|4 |1 |5 |
|5 |4 |6 |
+---+-----+------+
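For reference, the linked duplicate solves this kind of problem with a window function instead of a self-join. A minimal sketch of that approach, assuming it is acceptable to keep an arbitrary row when an id has several rows with value = 1 (the names w, rn and deduped are just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the rows of each id so that value = 1 sorts first, then keep the top row per id.
val w = Window.partitionBy("id").orderBy(when($"value" === 1, 0).otherwise(1))

val deduped = df
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")

deduped.show(false)

This avoids the join and keeps exactly one row per id: the value = 1 row when it exists, otherwise the only remaining row.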

Related

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
  (1, Seq(0, 2, 5), Seq(1, 2, 9)),
  (2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")

df.
  select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
  show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
Use array_union and then the explode function. Note that array_union removes duplicate elements across the two arrays, which is why 2 appears only once for userId 1 in the output below.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
  .select($"userId", explode(array_union($"varA", $"varB")).as("bothVars"))
  .show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+
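If the duplicate elements should be preserved instead (the desired output above keeps both 2s for userId 1), a sketch using concat on the two array columns (also supported for arrays on Spark 2.4+) could look like this:

// concat on array columns keeps duplicate elements, unlike array_union
df.select($"userId", explode(concat($"varA", $"varB")).as("bothVars"))
  .show(false)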

Selecting rows by data corresponding to other rows of the same dataframe

I'm struggling to select rows of my dataframe. The selection depends on data inside the same dataframe.
My dataset looks something like this:
from pyspark.sql.session import SparkSession

sc = SparkSession.builder.getOrCreate()

columns = ['Id', 'ActorId', 'EventId', 'Time']
vals = [(3, 3, 'START', '2020-06-22'),
        (4, 3, 'END', '2020-06-24'),
        (5, 3, 'OTHER', '2019-01-15'),
        (6, 3, 'OTHER', '2020-07-24'),
        (7, 3, 'OTHER', '2020-06-23'),
        (8, 4, 'START', '2018-01-15'),
        (9, 4, 'END', '2019-01-14'),
        (10, 4, 'OTHER', '2018-11-14')]

events = sc.createDataFrame(vals, columns)
events.show()
Which results in:
+---+-------+-------+----------+
| Id|ActorId|EventId| Time|
+---+-------+-------+----------+
| 3| 3| START|2020-06-22|
| 4| 3| END|2020-06-24|
| 5| 3| OTHER|2019-01-15|
| 6| 3| OTHER|2020-07-24|
| 7| 3| OTHER|2020-06-23|
| 8| 4| START|2018-01-15|
| 9| 4| END|2019-01-14|
| 10| 4| OTHER|2018-11-14|
+---+-------+-------+----------+
(Bear in mind that this is just an example, an extract of the data.)
I want to find all rows with EventId==OTHER, where time is not between the START and END Events of the same ActorId.
The result should look like:
+---+-------+-------+----------+
| Id|ActorId|EventID| Time|
+---+-------+-------+----------+
| 5| 3| OTHER|2019-01-15|
| 6| 3| OTHER|2020-07-24|
+---+-------+-------+----------+
Thank you for your help!!!
This will solve your problem. There is only one assumption in the code below: that START and END in the EventId column always appear in the 1st and 2nd line of each group.
from pyspark.sql import functions as F, types as T
from pyspark.sql.window import Window as W

_w = W.partitionBy('ActorId').orderBy('ActorId')

# start_date: first Time in the group (the START event, per the assumption above);
# end_date: Time of the 2nd row in the group (the END event)
events = events.withColumn('start_date', F.first('Time').over(_w))
events = events.withColumn('row_num', F.row_number().over(_w))
events = events.withColumn('end_date', F.when(F.col('row_num') == F.lit('2'), F.col('Time')))
events = events.withColumn('end_date', F.coalesce(
    F.when(F.col('row_num') == F.lit('2'), F.col('Time')),
    F.min('end_date').over(_w)))

# Mark rows whose Time lies entirely outside the [start_date, end_date] interval
events = events.withColumn('passed_col', F.when(
    (
        ((F.col('Time').cast(T.TimestampType()) > F.col('start_date').cast(T.TimestampType()))
         & (F.col('Time').cast(T.TimestampType()) > F.col('end_date').cast(T.TimestampType()))) |
        ((F.col('Time').cast(T.TimestampType()) < F.col('start_date').cast(T.TimestampType()))
         & (F.col('Time').cast(T.TimestampType()) < F.col('end_date').cast(T.TimestampType())))
    ), F.lit("Passed")))

events = events.select('Id', 'ActorId', 'EventId', 'Time', 'passed_col')
events.show()
+---+-------+-------+----------+----------+
| Id|ActorId|EventId| Time|passed_col|
+---+-------+-------+----------+----------+
| 3| 3| START|2020-06-22| null|
| 4| 3| END|2020-06-24| null|
| 5| 3| OTHER|2019-01-15| Passed|
| 6| 3| OTHER|2020-07-24| Passed|
| 7| 3| OTHER|2020-06-23| null|
| 8| 4| START|2018-01-15| null|
| 9| 4| END|2019-01-14| null|
| 10| 4| OTHER|2018-11-14| null|
+---+-------+-------+----------+----------+
Final answer, after filtering:
events = events.filter(F.col('passed_col') == F.lit('Passed')).select('Id', 'ActorId', 'EventId', 'Time', 'passed_col')
events.show()
+---+-------+-------+----------+----------+
| Id|ActorId|EventId| Time|passed_col|
+---+-------+-------+----------+----------+
| 5| 3| OTHER|2019-01-15| Passed|
| 6| 3| OTHER|2020-07-24| Passed|
+---+-------+-------+----------+----------+
val res = vals
  .filter('EventId.equalTo("OTHER"))
  .filter('ActorId.equalTo(3))
  .filter(!'Time.between("2020-06-01", "2020-06-25"))

res.show(false)
// +---+-------+-------+----------+
// |Id |ActorId|EventId|Time |
// +---+-------+-------+----------+
// |5 |3 |OTHER |2019-01-15|
// |6 |3 |OTHER |2020-07-24|
// +---+-------+-------+----------+
or
val res = vals
  .filter('EventId.equalTo("OTHER"))
  .filter(!'Time.between("2018-01-01", "2018-12-31"))
  .filter(!'Time.between("2020-06-01", "2020-06-25"))
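Both variants above hardcode the actor and the date ranges. A less hardcoded sketch (written in Scala like the snippet above, assuming an analogous events DataFrame and at most one START and one END event per ActorId; bounds, start_time and end_time are illustrative names) derives the bounds per actor and filters against them:

import org.apache.spark.sql.functions._

// Per-actor START/END times (null if an actor has no such event)
val bounds = events
  .filter($"EventId".isin("START", "END"))
  .groupBy("ActorId")
  .agg(
    min(when($"EventId" === "START", $"Time")).as("start_time"),
    max(when($"EventId" === "END", $"Time")).as("end_time")
  )

// Keep OTHER events whose Time falls outside [start_time, end_time];
// ISO date strings compare correctly as plain strings here.
val res = events.join(bounds, Seq("ActorId"))
  .filter($"EventId" === "OTHER" && !$"Time".between($"start_time", $"end_time"))
  .select("Id", "ActorId", "EventId", "Time")

res.show(false)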

Copy missed data from top/bottom row col values

I have a dataframe with index, category and a few other columns. index and category are never empty/null, but the other columns can come in as null. When all the other columns' data is null, we have to copy the values from the row above/below within the same category.
val df = Seq(
  (1, 1, null, null, null),
  (2, 1, null, null, null),
  (3, 1, null, null, null),
  (4, 1, "123.12", "124.52", "95.98"),
  (5, 1, "452.12", "478.65", "1865.12"),
  (1, 2, "2014.21", "147", "265"),
  (2, 2, "1457", "12483.00", "215.21"),
  (3, 2, null, null, null),
  (4, 2, null, null, null)
).toDF("index", "category", "col1", "col2", "col3")
scala> df.show
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| null| null| null|
| 2| 1| null| null| null|
| 3| 1| null| null| null|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| null| null| null|
| 4| 2| null| null| null|
+-----+--------+-------+--------+-------+
Expected dataframe as below:
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 2| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 3| 1| 123.12| 124.52| 95.98|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| 1457|12483.00| 215.21| // Copied from above for same category
| 4| 2| 1457|12483.00| 215.21| // Copied from above for same category
+-----+--------+-------+--------+-------+
Update: when several consecutive rows with nulls are possible, more advanced windows have to be used:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val cols = Seq("col1", "col2", "col3")

// Everything from the start of the partition up to the current row
val beforeWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// Everything from the current row to the end of the partition
val afterWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)

// For each column: keep the value if present, otherwise take the last
// non-null value before the row, otherwise the first non-null value after it.
val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      last(columnName, ignoreNulls = true).over(beforeWindow),
      first(columnName, ignoreNulls = true).over(afterWindow)
    ))
)
The single-null case can be resolved with the window functions lag and lead plus coalesce:
val cols = Seq("col1", "col2", "col3")
val categoryWindow = Window.partitionBy("category").orderBy("index")

val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      lag(col(columnName), 1).over(categoryWindow),
      lead(col(columnName), 1).over(categoryWindow)
    ))
)
result.show(false)
Output (produced from a different sample DataFrame than the one above):
+-----+--------+-------+--------+-------+
|index|category|col1 |col2 |col3 |
+-----+--------+-------+--------+-------+
|1 |1 |123.12 |124.52 |95.98 |
|2 |1 |123.12 |124.52 |95.98 |
|3 |1 |452.12 |478.65 |1865.12|
|1 |2 |2014.21|147 |265 |
|2 |2 |1457 |12483.00|215.21 |
|3 |2 |1.25 |3.45 |26.3 |
|4 |2 |1.25 |3.45 |26.3 |
+-----+--------+-------+--------+-------+

transform a feature of a spark groupedBy DataFrame

I'm searching for a Scala analogue of Python's .transform().
Namely, I need to create a new feature: the group mean of the corresponding class.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the Python implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
df.groupBy("class").agg(mean("val").as("class_mean"))
If you want to keep all the columns, you can use a window function:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
  .show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+

Dataframe get first and last value of corresponding column

Is it possible to get the first value of the corresponding column within a subgroup?
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}

object tmp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1235, 1, 1101, 0),
      (1235, 2, 1102, 0),
      (1235, 3, 1103, 1),
      (1235, 4, 1104, 1),
      (1235, 5, 1105, 0),
      (1235, 6, 1106, 0),
      (1235, 7, 1107, 1),
      (1235, 8, 1108, 1),
      (1235, 9, 1109, 1),
      (1235, 10, 1110, 0),
      (1235, 11, 1111, 0)
    ).toDF("SERVICE_ID", "COUNTER", "EVENT_ID", "FLAG")

    lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")

    val firsts = input.withColumn("first_value",
      first("EVENT_ID", ignoreNulls = true).over(window.rangeBetween(Long.MinValue, Long.MaxValue)))

    firsts.orderBy("SERVICE_ID", "COUNTER").show()
  }
}
Output I want:
the first (or previous) value of column EVENT_ID based on FLAG = 1, and
the last (or next) value of column EVENT_ID based on FLAG = 1,
partitioned by SERVICE_ID and sorted by COUNTER.
+----------+-------+--------+----+-----------+-----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|last_value|
+----------+-------+--------+----+-----------+-----------+
| 1235| 1| 1101| 0| 0| 1103|
| 1235| 2| 1102| 0| 0| 1103|
| 1235| 3| 1103| 1| 0| 1106|
| 1235| 4| 1104| 0| 1103| 1106|
| 1235| 5| 1105| 0| 1103| 1106|
| 1235| 6| 1106| 1| 0| 1108|
| 1235| 7| 1107| 0| 1106| 1108|
| 1235| 8| 1108| 1| 0| 1109|
| 1235| 9| 1109| 1| 0| 1110|
| 1235| 10| 1110| 1| 0| 0|
| 1235| 11| 1111| 0| 1110| 0|
| 1235| 12| 1112| 0| 1110| 0|
+----------+-------+--------+----+-----------+-----------+
First the dataframe needs to be formed into groups. A new group starts each time the "FLAG" column equals 1. To do this, first add a column "ID" to the dataframe:
lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")

val df_flag = input.filter($"FLAG" === 1)
  .withColumn("ID", row_number().over(window))
val df_other = input.filter($"FLAG" =!= 1)
  .withColumn("ID", lit(0))

// Create a group for each flag event
val df = df_flag.union(df_other)
  .withColumn("ID", max("ID").over(window.rowsBetween(Long.MinValue, 0)))
  .cache()
df.show() gives (using the answer's own sample input, which differs from the question's):
+----------+-------+--------+----+---+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG| ID|
+----------+-------+--------+----+---+
| 1235| 1| 1111| 1| 1|
| 1235| 2| 1112| 0| 1|
| 1235| 3| 1114| 0| 1|
| 1235| 4| 2221| 1| 2|
| 1235| 5| 2225| 0| 2|
| 1235| 6| 2226| 0| 2|
| 1235| 7| 2227| 1| 3|
+----------+-------+--------+----+---+
Now that we have a column separating the events, we need to add the correct "EVENT_ID" (renamed "first_value") to each event. In addition to the "first_value", calculate and add a second column "last_value", which is the id of the next flagged event.
val df_event = df.filter($"FLAG" === 1)
  .select("EVENT_ID", "ID", "SERVICE_ID", "COUNTER")
  .withColumnRenamed("EVENT_ID", "first_value")
  .withColumn("last_value", lead($"first_value", 1, 0).over(window))
  .drop("COUNTER")

val df_final = df.join(df_event, Seq("ID", "SERVICE_ID"))
  .drop("ID")
  .withColumn("first_value", when($"FLAG" === 1, lit(0)).otherwise($"first_value"))
df_final.show() gives us:
+----------+-------+--------+----+-----------+----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|last_value|
+----------+-------+--------+----+-----------+----------+
| 1235| 1| 1111| 1| 0| 2221|
| 1235| 2| 1112| 0| 1111| 2221|
| 1235| 3| 1114| 0| 1111| 2221|
| 1235| 4| 2221| 1| 0| 2227|
| 1235| 5| 2225| 0| 2221| 2227|
| 1235| 6| 2226| 0| 2221| 2227|
| 1235| 7| 2227| 1| 0| 0|
+----------+-------+--------+----+-----------+----------+
Can be solved in two steps:
1. get the events with "FLAG" == 1 and the valid range for each such event;
2. join 1. with the input, by range.
Some column renaming is included for visibility and can be shortened:
val window = Window.partitionBy("SERVICE_ID").orderBy("COUNTER").rowsBetween(Window.currentRow, 1)

val eventRangeDF = input.where($"FLAG" === 1)
  .withColumn("RANGE_END", max($"COUNTER").over(window))
  .withColumnRenamed("COUNTER", "RANGE_START")
  .select("SERVICE_ID", "EVENT_ID", "RANGE_START", "RANGE_END")
eventRangeDF.show(false)

val result = input.where($"FLAG" === 0).as("i").join(eventRangeDF.as("e"),
    expr("e.SERVICE_ID = i.SERVICE_ID AND i.COUNTER > e.RANGE_START AND i.COUNTER < e.RANGE_END"))
  .select($"i.SERVICE_ID", $"i.COUNTER", $"i.EVENT_ID", $"i.FLAG", $"e.EVENT_ID".alias("first_value"))
  // include FLAG=1
  .union(input.where($"FLAG" === 1).select($"SERVICE_ID", $"COUNTER", $"EVENT_ID", $"FLAG", lit(0).alias("first_value")))

result.sort("COUNTER").show(false)
Output:
+----------+--------+-----------+---------+
|SERVICE_ID|EVENT_ID|RANGE_START|RANGE_END|
+----------+--------+-----------+---------+
|1235 |1111 |1 |4 |
|1235 |2221 |4 |7 |
|1235 |2227 |7 |7 |
+----------+--------+-----------+---------+
+----------+-------+--------+----+-----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|
+----------+-------+--------+----+-----------+
|1235 |1 |1111 |1 |0 |
|1235 |2 |1112 |0 |1111 |
|1235 |3 |1114 |0 |1111 |
|1235 |4 |2221 |1 |0 |
|1235 |5 |2225 |0 |2221 |
|1235 |6 |2226 |0 |2221 |
|1235 |7 |2227 |1 |0 |
+----------+-------+--------+----+-----------+
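The range-join variant only produces first_value. A hedged sketch for also deriving last_value, reusing lead on the flagged events in the same spirit as the first answer (nextWindow, eventRangeWithNext and resultWithLast are illustrative names; the FLAG = 1 rows could be unioned back in with lit(0) defaults exactly as above):

// Attach the next flagged EVENT_ID to each range, so the same range join also yields last_value
val nextWindow = Window.partitionBy("SERVICE_ID").orderBy("RANGE_START")
val eventRangeWithNext = eventRangeDF
  .withColumn("last_value", lead($"EVENT_ID", 1, 0).over(nextWindow))

val resultWithLast = input.where($"FLAG" === 0).as("i")
  .join(eventRangeWithNext.as("e"),
    expr("e.SERVICE_ID = i.SERVICE_ID AND i.COUNTER > e.RANGE_START AND i.COUNTER < e.RANGE_END"))
  .select($"i.SERVICE_ID", $"i.COUNTER", $"i.EVENT_ID", $"i.FLAG",
    $"e.EVENT_ID".alias("first_value"), $"e.last_value")

resultWithLast.sort("COUNTER").show(false)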