How to split data into series based on conditions in Apache Spark - scala

I have data in the following format, sorted by timestamp, each row representing an event:
+----------+--------+---------+
|event_type| data |timestamp|
+----------+--------+---------+
| A | d1 | 1 |
| B | d2 | 2 |
| C | d3 | 3 |
| C | d4 | 4 |
| C | d5 | 5 |
| A | d6 | 6 |
| A | d7 | 7 |
| B | d8 | 8 |
| C | d9 | 9 |
| B | d10 | 12 |
| C | d11 | 20 |
+----------+--------+---------+
I need to collect these events into series like so:
1. Event of type C marks the end of the series
2. If there are multiple consecutive events of type C, they fall into the same series and the last one marks the end of that series
3. Each series can span 7 days at max, even if there is no C event to end it
Please also note that there can be multiple series in a single day. In reality, the timestamp column contains standard UNIX timestamps; here the numbers express days for simplicity.
So desired output would look like this:
+---------------------+--------------------------------------------------------------------+
|first_event_timestamp| events: List[(event_type, data, timestamp)] |
+---------------------+--------------------------------------------------------------------+
| 1 | List((A, d1, 1), (B, d2, 2), (C, d3, 3), (C, d4, 4), (C, d5, 5)) |
| 6 | List((A, d6, 6), (A, d7, 7), (B, d8, 8), (C, d9, 9)) |
| 12 | List((B, d10, 12)) |
| 20 | List((C, d11, 20)) |
+---------------------+--------------------------------------------------------------------+
I tried to solve this using Window functions, where I would add 2 columns like this:
1. A Seed column marking each event that directly succeeds an event of type C with some unique id
2. A SeriesId column filled with values from the Seed column using last(), to mark all events of one series with the same id
3. I would then group the events by the SeriesId
Unfortunately, this doesn't seem possible:
+----------+--------+---------+------+-----------+
|event_type| data |timestamp| seed | series_id |
+----------+--------+---------+------+-----------+
| A | d1 | 1 | null | null |
| B | d2 | 2 | null | null |
| C | d3 | 3 | null | null |
| C | d4 | 4 | 0 | 0 |
| C | d5 | 5 | 1 | 1 |
| A | d6 | 6 | 2 | 2 |
| A | d7 | 7 | null | 2 |
| B | d8 | 8 | null | 2 |
| C | d9 | 9 | null | 2 |
| B | d10 | 12 | 3 | 3 |
| C | d11 | 20 | null | 3 |
+----------+--------+---------+------+-----------+
I don't seem to be able to test the preceding row for equality using lag(), i.e. the following code:
df.withColumn(
  "seed",
  when(
    (lag($"eventType", 1) === EventType.Conversion).over(w),
    typedLit(DigestUtils.sha256Hex("some fields").substring(0, 32))
  )
)
throws
org.apache.spark.sql.AnalysisException: Expression '(lag(eventType#76, 1, null) = C)' not supported within a window function.
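The error itself can be worked around by windowing only the lag and doing the comparison outside of it; a minimal sketch of that rewrite (using the column names from the tables above, with the row's timestamp standing in for a unique seed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

// A plain window; a real job would normally add a partitionBy as well.
val w = Window.orderBy("timestamp")
val withSeed = df
  .withColumn("prev_event_type", lag($"event_type", 1).over(w))        // only the lag is windowed
  .withColumn("seed", when($"prev_event_type" === "C", $"timestamp"))  // the comparison happens outside .over()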
Even putting that error aside, as the table shows, the approach fails in the case where there are multiple consecutive C events, and it also wouldn't work for the first and last series.
I'm kinda stuck here, any help would be appreciated (using the DataFrame/Dataset API is preferred).

Here is the approach:
1. Identify the start of an event series, based on the conditions
2. Tag that record as a start event
3. Select the records of start events
4. Get each series' end time (if we order the start-event records by timestamp descending, the previous start time becomes the current series' end time)
5. Join the original data with the above dataset
Here is the UDF to tag a record as "start":
// tag the starting event, based on the conditions
def tagStartEvent: (String, String, Int, Int) => String =
  (prevEvent: String, currEvent: String, prevTimeStamp: Int, currTimeStamp: Int) => {
    // the very first event is tagged as "start" (the lag default below is "start")
    if (prevEvent == "start")
      "start"
    else if ((currTimeStamp - prevTimeStamp) > 7)
      "start"
    else {
      prevEvent match {
        case "C" =>
          if (currEvent == "A")
            "start"
          else if (currEvent == "B")
            "start"
          else // the current event is also C
            ""
        case _ => ""
      }
    }
  }
val tagStartEventUdf = udf(tagStartEvent)
data.csv
event_type,data,timestamp
A,d1,1
B,d2,2
C,d3,3
C,d4,4
C,d5,5
A,d6,6
A,d7,7
B,d8,8
C,d9,9
B,d10,12
C,d11,20
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data.csv")
val window = Window.partitionBy("all").orderBy("timestamp")

// tag the starting event
val dfStart = df.withColumn("all", lit(1))
  .withColumn("series_start",
    tagStartEventUdf(
      lag($"event_type", 1, "start").over(window), df("event_type"),
      lag($"timestamp", 1, 1).over(window), df("timestamp")))

val dfStartSeries = dfStart.filter($"series_start" === "start")
  .select($"timestamp".as("series_start_time"), $"all")

val window2 = Window.partitionBy("all").orderBy($"series_start_time".desc)

// get the series end times
val dfSeriesTimes = dfStartSeries
  .withColumn("series_end_time", lag($"series_start_time", 1, null).over(window2))
  .drop($"all")

val dfSeries = df.join(dfSeriesTimes)
  .withColumn("timestamp_series",
    // if series_end_time is null and timestamp >= series_start_time, then series_start_time
    when(col("series_end_time").isNull && col("timestamp") >= col("series_start_time"), col("series_start_time"))
      // if the record is >= series_start_time and < series_end_time, then series_start_time
      .otherwise(when(col("timestamp") >= col("series_start_time") && col("timestamp") < col("series_end_time"),
        col("series_start_time")).otherwise(null)))
  .filter($"timestamp_series".isNotNull)

dfSeries.show()
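To go from dfSeries to the exact shape asked for (one row per series with its list of events), one more aggregation can be appended; a possible final step, not part of the original answer:
import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

// group every event under its series start time; timestamp is placed first in the
// struct so that sort_array orders the events chronologically
val result = dfSeries
  .groupBy($"timestamp_series".as("first_event_timestamp"))
  .agg(sort_array(collect_list(struct($"timestamp", $"event_type", $"data"))).as("events"))
  .orderBy("first_event_timestamp")

result.show(false)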

Related

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session id. There can be multiple keywords in the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partly.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
A range frame (rangeBetween) does not allow multiple columns in the window's orderBy clause, so I didn't get the desired output; I got one output row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, lead}

// order by Timestamp as well, so that "the next row" is well defined
val windowSpec = Window.orderBy("session_id", "Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  // keep a row when the next value is shorter (a new word started) or there is no next row
  .filter((functions.length(col("lead")) < functions.length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show
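If rows from different sessions can interleave in the data, a safer variant (my own addition, not part of the answer above) partitions the window by session_id and orders by Timestamp, so the comparison never crosses a session boundary:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, length}

val perSession = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(perSession))
  // keep a row when the next keystroke is shorter (a new word started) or it is the last row of the session
  .filter(length(col("lead")) < length(col("value")) || col("lead").isNull)
  .drop("lead")
  .show()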

Identifying recurring values a column over a Window (Scala)

I have a data frame with two columns: "ID" and "Amount", each row representing a transaction of a particular ID and the transacted amount. My example uses the following DF:
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),
(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
I want to create a new column identifying whether said amount is a recurring value, i.e. occurs in any other transaction for the same ID, or not.
I have found a way to do this more generally, i.e. across the entire column "Amount", not taking into account the ID, using the following function:
def recurring_amounts(df: DataFrame, col: String): DataFrame = {
  var df_to_arr = df.select(col).rdd.map(r => r(0).asInstanceOf[Double]).collect()
  var arr_to_map = df_to_arr.groupBy(identity).mapValues(_.size)
  var map_to_df = arr_to_map.toSeq.toDF(col, "Count")
  var df_reformat = map_to_df.withColumn("Amount", $"Amount".cast(DoubleType))
  var df_out = df.join(df_reformat, Seq("Amount"))
  return df_out
}
val df_output = recurring_amounts(df, "Amount")
This returns:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 3 |
| 1 | 120 | 3 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 3 |
+---+------+-----+
which I can then use to create my desired binary variable to indicate whether the amount is recurring or not (yes if > 1, no otherwise).
However, my problem is illustrated in this example by the value 120, which is recurring for ID 1 but not for ID 2. My desired output therefore is:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 2 |
| 1 | 120 | 2 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 1 |
+---+------+-----+
I've been trying to think of a way to apply a function using
.over(Window.partitionBy("ID") but not sure how to go about it. Any hints would be much appreciated.
If you are good with SQL, you can write a SQL query for your DataFrame. The first thing that you need to do is register your DataFrame as a table in Spark's memory. After that you can write the SQL on top of that table. Note that spark is the Spark session variable.
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
df.registerTempTable("transactions")
spark.sql("select *,count(*) over(partition by ID,Amount) as Count from transactions").show()
Please let me know if you have any questions.
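For reference, a sketch of the same query with the DataFrame API (my own addition, not part of the answer above), using a window partitioned by both columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// count the occurrences of each (ID, Amount) pair
val w = Window.partitionBy("ID", "Amount")
df.withColumn("Count", count(lit(1)).over(w)).show()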

How filter one big dataframe many times(equal to small df‘s row count) by another small dataframe(row by row) ?

I have two Spark dataframes, dfA and dfB.
I want to filter dfA by each row of dfB, which means if dfB has 10000 rows, I need to filter dfA 10000 times with 10000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows,nearly one billons rows
so my expected result is
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My naive solution is
def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get the range array of this id
  // initialize a resultDF to store each query's results
  val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
  val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
  val s = "value1 between " + min_value1 + " and " + max_value1
  var resultDF = dfa.filter(s).groupBy("id")
    .agg(collect_list("value1").as("results"),
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))
  for (i <- 1 until rangArray.length) {
    val temp_min_value1 = rangArray(i).get(0).asInstanceOf[Int]
    val temp_max_value1 = rangArray(i).get(1).asInstanceOf[Int]
    val query = "value1 between " + temp_min_value1 + " and " + temp_max_value1
    val tempResultDF = dfa.filter(query).groupBy("id")
      .agg(collect_list("value1").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  resultDF
}
def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA)
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()
  // initial result
  var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int], dfA, dfB)
  // traverse all ids
  for (i <- 1 until idArrays.length) {
    val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int], dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  resultDF
}
Maybe you don't want to see my brute-force code. Its idea is:
finalResult = null
for each id in dfB:
    for each query condition of this id:
        tempResult = query dfA
        union tempResult into finalResult
I've tried my algorithm; it takes almost 50 hours.
Does anybody have a more efficient way? Thanks very much.
Assuming that your dfB is a small dataset, I am trying to give the below solution.
Try using a broadcast join like below:
import org.apache.spark.sql.functions.{broadcast, col, collect_list}
dfA.alias("a").join(broadcast(dfB.alias("b")),
    col("a.id") === col("b.id") && col("a.value1") >= col("b.min_value1") && col("a.value1") <= col("b.max_value1"))
  .groupBy(col("b.id"), col("b.min_value1"), col("b.max_value1")).agg(collect_list(col("a.value2")).as("results"))
A broadcast join is like a map-side join. It materializes the smaller dataset on every executor, which improves performance by omitting the sort-and-shuffle phase of a reduce-side join.
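To confirm that Spark actually picked the broadcast strategy, you can inspect the physical plan; a quick check, assuming the join result above is assigned to a val named joined (a name introduced here for illustration):
// The physical plan should contain a BroadcastHashJoin operator
joined.explain()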
Some points I would like you to avoid:
Never use collect(). When a collect operation is issued on an RDD, the dataset is copied to the driver.
If your data is too big you might get an out-of-memory exception.
Try using take() or takeSample() instead.
It is obvious that when two dataframes/datasets are involved in a calculation, a join should be performed. So a join is a must step for you. But when you should join is the important question.
I would suggest aggregating and reducing the rows in the dataframes as much as possible before joining, as that reduces shuffling.
In your case you can reduce only dfA, since you need dfB exactly as it is, with a column added that holds the values of dfA meeting the condition.
So you can group dfA by id and aggregate it so that you get one row per id, then perform the join, and then use a udf function for your calculation logic.
Comments are provided in the code for clarity and explanation.
import org.apache.spark.sql.functions._

// udf function to keep only the collected value2 whose value1 lies within the range of min_value1 and max_value1
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))

dfA.groupBy("id")                                                  // group by id
  .agg(collect_list(struct("value1", "value2")).as("collection")) // collect all the value1 and value2 as structs
  .join(dfB, Seq("id"), "right")                                  // join both dataframes on id
  .select(col("id"), col("min_value1"), col("max_value1"),
    selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) // call the udf defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful

how to get multiple rows from one row in spark scala [duplicate]

This question already has an answer here:
Flattening Rows in Spark
(1 answer)
Closed 5 years ago.
I have a dataframe in Spark like the one below, and I want to convert all the columns into separate rows with respect to the first column, id.
+----------------------------------+
| id code1 code2 code3 code4 code5 |
+----------------------------------+
| 1 A B C D E |
| 1 M N O P Q |
| 1 P Q R S T |
| 2 P A C D F |
| 2 S D F R G |
+----------------------------------+
I want the output like below format
+-------------+
| id code |
+-------------+
| 1 A |
| 1 B |
| 1 C |
| 1 D |
| 1 E |
| 1 M |
| 1 N |
| 1 O |
| 1 P |
| 1 Q |
| 1 P |
| 1 Q |
| 1 R |
| 1 S |
| 1 T |
| 2 P |
| 2 A |
| 2 C |
| 2 D |
| 2 F |
| 2 S |
| 2 D |
| 2 F |
| 2 R |
| 2 G |
+-------------+
Can anyone please help me with how to get the above output using Spark and Scala?
Using the array, explode and drop functions should give you the desired output, as in
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
  .drop("code1", "code2", "code3", "code4", "code5")
OR
as suggested by undefined_variable, you can just use select:
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
df.select(col("id"),explode(concat_ws(",",Seq(col(code1),col("code2"),col("code3"),col("code4"),col("code5")))))
Basically idea is first concat all required columns and then explode it

PostgreSQL Query?

DB
| ID| VALUE | Parent | Position | lft | rgt |
|---|:------:|:-------:|:--------------:|:--------:|:--------:|
| 1 | A | | | 1 | 12 |
| 2 | B | 1 | L | 2 | 9 |
| 3 | C | 1 | R | 10 | 11 |
| 4 | D | 2 | L | 3 | 6 |
| 5 | F | 2 | R | 7 | 8 |
| 6 | G | 4 | L | 4 | 5 |
Get all nodes directly under the current node on the left side:
SELECT "categories".* FROM "categories" WHERE ("categories"."position" = 'L') AND ("categories"."lft" >= 1 AND "categories"."lft" < 12) ORDER BY "categories"."lft"
output {B, D, G} is incorrect!
Question: how do I get the nodes directly under the current node on the left side and on the right side?
output-lft {B, D, F, G}
output-rgt {C}
It sounds like you're after something analogous to Oracle's CONNECT BY clause, which is used to connect hierarchical data stored in a flat table.
It just so happens there's a way to do this with Postgres, using a recursive CTE.
Here is the statement I came up with:
WITH RECURSIVE sub_categories AS
(
    -- non-recursive term
    SELECT * FROM categories WHERE position IS NOT NULL
    UNION ALL
    -- recursive term
    SELECT c.*
    FROM categories AS c
    JOIN sub_categories AS sc
      ON (c.parent = sc.id)
)
SELECT DISTINCT categories.value
FROM categories,
     sub_categories
WHERE (categories.parent = sub_categories.id
       AND sub_categories.position = 'L')
   OR (categories.parent = 1
       AND categories.position = 'L')
here is a SQL Fiddle with a working example.