Spark Streaming Scala: save to text file only when sliding window is finished

I'm developing a Spark Streaming app in Scala, and now that I've finished my first version I want to improve performance.
My Spark app currently writes some important alerts on every batch, which means new text files are generated every few seconds. I'd prefer the computations to run on every batch but the write to file to happen only when my sliding window expires, which is in the range of 10 minutes.
An example of the RDD of interest follows:
val s = events.flatMap(_.split("\n")) // split block into lines of single JSON events
  .map(toMyObject) // convert raw JSON to MyObject
  .filter(checkCondition) // filter events based on condition
  .map(x => (x._1, 1L)) // emit (area, 1) for each alert
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(window_length), Seconds(sliding_interval), 2) // count alerts per area
  .repartition(1)
  .saveAsTextFiles("alerts")

As we discussed in the comments: to implement a non-overlapping window, the slide duration should be the same as the window duration.
I.e. in the example above, with a window duration of 10 minutes, setting the slide duration to 10 minutes as well will produce a file once every 10 minutes, containing calculations over all data within those 10 minutes.
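A minimal sketch of why equal window and slide durations give non-overlapping output. It is a plain-Scala model, not Spark API: `WindowSketch`, `emittedWindows` and `nonOverlapping` are hypothetical names, and times are plain minutes. In the pipeline above, the equivalent change would be passing the same duration for both the window and slide arguments of `reduceByKeyAndWindow`.

```scala
object WindowSketch {
  /** Windows (start, end), in minutes, emitted up to `horizon`: one window per slide. */
  def emittedWindows(windowLen: Int, slide: Int, horizon: Int): Seq[(Int, Int)] = {
    require(windowLen > 0 && slide > 0, "durations must be positive")
    // A window is emitted at every multiple of `slide` and looks back `windowLen` minutes.
    (slide to horizon by slide).map(end => (math.max(0, end - windowLen), end))
  }

  /** True if each window starts no earlier than the previous one ended. */
  def nonOverlapping(windows: Seq[(Int, Int)]): Boolean =
    windows.zip(windows.drop(1)).forall {
      case ((_, prevEnd), (nextStart, _)) => nextStart >= prevEnd
    }
}
```

With a 10-minute window and a 10-minute slide, each minute of data falls into exactly one emitted window, so one output is produced per window. With a shorter slide the same data is reported several times.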

Related

Inconsistent count after window lead function, and filter

Edit 2:
I've reported this as an issue to the Spark developers; I will post the status here when I get a response.
I have a problem that has been bothering me for quite some time now.
Imagine you have a dataframe with several million records, with these columns:
df1:
start(timestamp)
user_id(int)
type(string)
I need to define the duration between two rows, and filter on that duration and type.
I used the window lead function to get the next event time (which defines the end of the current event), so every row now gets start and stop times.
If it is NULL (for the last row, for example), I add the next midnight as the stop time.
The data is stored in ORC files (I tried the Parquet format too, no difference).
This only happens on clusters with multiple executor nodes, for example an AWS EMR cluster or a local Docker cluster setup.
If I run it on a single instance (locally on a laptop), I get consistent results every time.
The Spark version is 3.0.1 in the AWS, local and Docker setups.
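import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // coalesce, lead, date_add, col
val w = Window.partitionBy("user_id").orderBy("start")
// end of an event = start of the same user's next event, or the following day if there is none
val ts_lead = coalesce(lead("start", 1).over(w), date_add(col("start"), 1))
val df2 = df1.
  withColumn("end", ts_lead).
  withColumn("duration", col("end").cast("long") - col("start").cast("long"))
df2.where("type='B' and duration>4").count()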
Every time I run this last count, I get a different result.
For example:
run 1: 19359949
run 2: 19359964
If I run each filter separately, everything is OK and I get consistent results.
But if I combine them, the results are inconsistent.
I tried filtering into separate dataframes, first on duration and then on type and vice versa, with no luck either.
I know that I can cache or checkpoint the dataframe, but it's a very large dataset and I run similar calculations multiple times, so I can't really spare the time and disk space for checkpoints and caching.
Is this a bug in Spark, or am I missing something?
Edit:
I have created sample code with dummy random data, so anyone can try to reproduce the problem.
Since this sample uses random numbers, it's necessary to write the dataset after generation and re-read it.
I use a for loop to generate the set because when I tried to generate 25,000,000 rows in one pass, I got an out-of-memory error.
I saved it to an S3 bucket; here it's masked as your-bucket.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // udf, col, coalesce, lead, lit
import spark.implicits._ // toDF (already in scope in spark-shell)
val getRandomUser = udf(() => {
  val users = Seq("John", "Eve", "Anna", "Martin", "Joe", "Steve", "Katy")
  users(scala.util.Random.nextInt(7))
})
val getRandomType = udf(() => {
  val types = Seq("TypeA", "TypeB", "TypeC", "TypeD", "TypeE")
  types(scala.util.Random.nextInt(5))
})
val getRandomStart = udf((x: Int) => {
  x + scala.util.Random.nextInt(47)
})
for (a <- 0 to 23) {
  // loop index a selects the next block of one million ids (24 blocks in total)
  val x = Range(a * 1000000, (a * 1000000) + 1000000).toDF("id").
    withColumn("start", getRandomStart(col("id"))).
    withColumn("user", getRandomUser()).
    withColumn("type", getRandomType()).
    drop("id")
  x.write.mode("append").orc("s3://your-bucket/random.orc")
}
val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))
val fox2 = spark.read.orc("s3://your-bucket/random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))
// repeated executions of this line return different results for count
fox2.where("type='TypeA' and duration>4").count()
My results for three consecutive runs of the last line were:
run 1: 2551259
run 2: 2550756
run 3: 2551279
Every run gives a different count.
I have reproduced your issue locally.
As far as I understand, the issue is that you are filtering by duration in this statement:
fox2.where("type='TypeA' and duration>4").count()
and duration is derived from randomly generated values. I understand that you are using a seed, but if the generation is parallelised, you do not know which random value will be added to each id.
For example, if 4 generated numbers were 21, 14, 5, 17, and the ids were 1, 2, 3, 4, the start column sometimes could be:
1 + 21
2 + 14
3 + 5
4 + 17
and sometimes could be:
1 + 21
4 + 14
3 + 5
2 + 17
This will lead to different start values, and hence different duration values, ultimately changing the result of the final filter and count, because row order in dataframes is not guaranteed when running in parallel.
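One way row order can change such a count, sketched in plain Scala rather than Spark. It assumes that several rows of the same user share the same start value, which is an assumption about the data and not something the answer states. `LeadTieSketch`, `Event` and `countLong` are hypothetical names; the function only mimics lead("start", 1) over a per-user ordering.

```scala
object LeadTieSketch {
  final case class Event(user: String, start: Int, kind: String)

  /** Mimics lead("start", 1) per user, then counts rows of `wanted` kind with duration > minDuration. */
  def countLong(rows: Seq[Event], wanted: String, minDuration: Int, lastEnd: Int): Int =
    rows.groupBy(_.user).values.map { group =>
      // sortBy is stable, so rows with equal start keep their arrival order
      val sorted = group.sortBy(_.start)
      // "end" of each row = start of the next row, or lastEnd for the final row
      val ends = sorted.drop(1).map(_.start) :+ lastEnd
      sorted.zip(ends).count { case (e, end) => e.kind == wanted && end - e.start > minDuration }
    }.sum
}
```

Feeding the same three rows in two arrival orders, for example a TypeA and a TypeB row both with start 10 followed by a row with start 20, gives a different count depending on which of the tied rows comes first, because only the second of them gets the later row as its end.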

Why does join use rows that were sent after watermark of 20 seconds?

I’m using watermark to join two streams as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that it will keep the rows of each stream for 20 seconds after they arrive. But when I send a row on one stream now and the matching row on the other stream after 20 seconds, they are still joined. It seems that Spark is holding the data in memory even after the watermark has passed. I even tried after 45 seconds, and that was joined too.
This is creating confusion in my mind regarding watermarks.
"But when I send a row on one stream now and the matching row on the other stream after 20 seconds, they are still joined."
That's possible because the time that counts is not the time at which events arrive, but the time inside the watermarked field, i.e. tstamp_trans. You have to make sure that the latest time in tstamp_trans is 20 seconds after that of the rows that will participate in the join.
Quoting the docs (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking):
"In other words, you will have to do the following additional steps in the join.
1. Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
2. Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
   - Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
   - Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow)."
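The effect of a time-range join condition can be illustrated without Spark. The sketch below is a plain-Scala model under stated assumptions: `TimeRangeJoinSketch`, `Order`, `Invoice` and `joinWithinRange` are hypothetical names, timestamps are event times in seconds, and the 20-second bound is only chosen to echo the question. In the actual query, the analogous step would be adding a predicate on the two event-time columns to the join expression, in the style of the BETWEEN example quoted above.

```scala
object TimeRangeJoinSketch {
  final case class Order(orderId: String, tsSec: Long)
  final case class Invoice(orderId: String, tsSec: Long)

  /** Inner join on orderId that also requires the invoice event time to fall
    * within [order time, order time + maxLagSec]. */
  def joinWithinRange(orders: Seq[Order], invoices: Seq[Invoice], maxLagSec: Long): Seq[(Order, Invoice)] =
    for {
      o <- orders
      i <- invoices
      if o.orderId == i.orderId                                // key equality, as in the question
      if i.tsSec >= o.tsSec && i.tsSec <= o.tsSec + maxLagSec  // event-time range constraint
    } yield (o, i)
}
```

With only the key equality, an invoice stamped 45 seconds after its order still matches. With the range constraint it does not, and the constraint is also what tells the engine when old rows can no longer match.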

KSQL Hopping Window : accessing only oldest subwindow

I am tracking the rolling sum of a particular field using a query that looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am only interested in getting the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
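To make "each record falls into multiple windows" concrete, here is a minimal plain-Scala sketch of hopping-window assignment. It assumes windows are aligned to the epoch and that a window covers [start, start + size), which is how I understand Kafka Streams to assign hopping windows. `HoppingWindowSketch` and `windowStartsFor` are hypothetical names, not part of any KSQL or Kafka API.

```scala
object HoppingWindowSketch {
  /** Start times (ms) of all hopping windows [start, start + sizeMs) that contain tsMs. */
  def windowStartsFor(tsMs: Long, sizeMs: Long, advanceMs: Long): Vector[Long] = {
    require(advanceMs > 0 && advanceMs <= sizeMs, "advance must be in (0, size]")
    // Earliest aligned window start that still covers tsMs (never before time 0).
    val first = (math.max(0L, tsMs - sizeMs + advanceMs) / advanceMs) * advanceMs
    Iterator.iterate(first)(_ + advanceMs).takeWhile(_ <= tsMs).toVector
  }
}
```

For SIZE 1 MINUTE and ADVANCE BY 10 SECONDS, a record belongs to size / advance = 6 windows once the stream is past its first minute, which matches the six values seen per input tick.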
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should make it possible to add sliding windows to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function to access only the window where collect_list(column).size() equals the window duration appears to be a promising track.
In the UDF, use the List type to receive the list of values of one of your aggregate's base columns. Then check whether the size of that list equals the number of periods in the hopping window, and return null otherwise.
From this, create a table that selects the data and transforms it with the UDF.
Then create a table from this latest table and filter out null values in the transformed column.

Bad Perf with Sliding Windows in Flink

I use this code to perform my test (Flink Quick Start):
val text = env.socketTextStream("localhost", port, '\n')
// parse the data, group it, window it, and aggregate the counts
val windowCounts = text
  .flatMap { w => w.split("\\s") }
  .map { w => WordWithCount(w, 1) }
  .keyBy("word")
  .timeWindow(Time.minutes(15))
  .sum("count")
With this code I get more than 65,000 inputs per second.
If I change
timeWindow(Time.minutes(15))
to
timeWindow(Time.minutes(15), Time.seconds(1))
I get fewer than 2,500 inputs per second.
Is there any way to get better performance with sliding windows?
With a 15-minute tumbling window, each incoming event is assigned to a single window, whereas with a 15-minute sliding window with a one second slide, each incoming event is copied into 15 * 60 = 900 windows. This obviously has a performance impact.
Depending on your application requirements, you might be able to compute what you need with less overhead by using ProcessFunction, or by implementing custom window logic. For example, you could pre-aggregate into 900 one second windows, and then have a second layer of windowing that incrementally adjusts the 15-minute results by subtracting the expiring second's contribution to the total and adds in the most recent second's worth.
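The two-layer idea can be sketched in plain Scala, independent of Flink. The sketch assumes the first layer has already produced one pre-aggregated count per one-second slot. `SlidingSumSketch` and `slidingTotals` are hypothetical names, and this is an illustration of the incremental update rather than Flink window code.

```scala
object SlidingSumSketch {
  /** Sliding totals over the last `windowSlots` pre-aggregated slots.
    * Each step adds the newest slot and subtracts the one that just expired,
    * so the cost per slide does not depend on the window length. */
  def slidingTotals(perSlot: Seq[Long], windowSlots: Int): Vector[Long] = {
    require(windowSlots > 0, "window must span at least one slot")
    val slots = perSlot.toVector
    var total = 0L
    val out = Vector.newBuilder[Long]
    for (i <- slots.indices) {
      total += slots(i)                                    // newest slot enters the window
      if (i >= windowSlots) total -= slots(i - windowSlots) // oldest slot leaves the window
      out += total
    }
    out.result()
  }
}
```

For the case above, `windowSlots` would be 15 * 60 = 900 and each incoming event would only touch its own one-second slot instead of being copied into 900 windows.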

Operation on a sliding window over streaming in Scala using reduceByKeyAndWindow()

I am writing a Spark Streaming application in Scala, where my goal is to read the Twitter feed every second and calculate the most retweeted statuses in a window of 60 seconds.
What I conceptually want is to get the number of retweets of a status at the end of the sliding window and subtract from it the equivalent number at the window's start, in order to find the number of retweets inside the window. The relevant line of code is:
val counts = tweets.filter(_.isRetweet).map { status =>
(status.getText(), status.getRetweetedStatus().getRetweetCount())
}.reduceByKeyAndWindow(*function*, Seconds(60), Seconds(1))
So my question is: what function should I use here to achieve the desired result, that is, to get the maximum value that getRetweetCount() returns inside the window and subtract the minimum value from it?
Correct me if I'm wrong or making wrong assumptions here, but you are essentially checking the number of retweets for a status within your window of Seconds(60). To do this, you already have your filter, which removes all non-retweeted tweets (filter(_.isRetweet)). Now all you need to do is aggregate the retweeted statuses to determine their frequencies.
This can be achieved with the following:
val counts = tweets.filter(_.isRetweet).map { status =>
(status.getText(), null)
}.countByValueAndWindow(Seconds(60), Seconds(1))
Perhaps after this, you can order by value and gather the most retweeted tweets within that window.
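If the maximum-minus-minimum difference from the question is what is wanted rather than occurrence counts, one possible sketch is to carry a (min, max) pair per status and merge pairs with an associative function. The names `RetweetRangeSketch`, `toMinMax`, `merge` and `growth` are hypothetical, and the code is plain Scala. In the streaming job, `merge` could presumably serve as the reduce function passed to reduceByKeyAndWindow after mapping each status to (text, toMinMax(count)), but that wiring is an assumption and is not tested here.

```scala
object RetweetRangeSketch {
  /** (smallest, largest) retweet count seen so far for one status. */
  type MinMax = (Long, Long)

  /** A single observation is both the minimum and the maximum. */
  def toMinMax(count: Long): MinMax = (count, count)

  /** Associative and commutative merge, suitable as a reduce function. */
  def merge(a: MinMax, b: MinMax): MinMax =
    (math.min(a._1, b._1), math.max(a._2, b._2))

  /** Retweets gained inside the window: max observed minus min observed. */
  def growth(mm: MinMax): Long = mm._2 - mm._1
}
```

For example, observations 120, 150 and 135 for one status merge to (120, 150), giving a growth of 30 within the window. No inverse function exists for min and max, so this only fits the reduceByKeyAndWindow variant without an inverse reduce function.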