Spark sliding window is not triggering when there is no data from the data stream - scala

I have a simple spark sql with time window of 5 minutes, trigger policy of every minute:
val withTime = eventStreams(0).selectExpr("*", "cast(cast(parsed.time as long)/1000 as timestamp) as event_time")
val momentumDataAggQuery = withTime
.selectExpr("parsed.symbol", "parsed.bid", "parsed.ask", "event_time")
.withWatermark("event_time", "1 minutes")
.groupBy(col("symbol"), window(col("event_time"), "5 minutes", "60 seconds")) // change to 60 minutes
.agg(first("bid", true).as("first_bid"), first("ask").as("first_ask"), last("bid").as("last_bid"), last("ask").as("last_ask"))
val momentumDataQuery = momentumDataAggQuery
.selectExpr("window.start", "window.end", "ln(((last_bid + last_ask)/2)/((first_bid + first_ask)/2)) as momentum", "symbol")
When there is data from the stream, it gets triggered every minute to calculate 'momentum' but stop when there is not data point. I expect it will continue using old data to update every minute even there is not enough data point.
Consider the example in the following table
In the 1st window, there is only one data point so log return is zero.
In the 2nd window, there is only two data points so it takes log(97.5625/97.4625), where 97.5625 was received at 11:53 and 97.4625 was received at 11:52:10, within the time window 12:19 <> 12:54...it went on to calculate the log return when there was sufficient data point.
However, when there was NO more data point after 15:56:12, say, for the window 12:54 <> 12:59, i expect it would take ln(97.8625/97.6625) where the input were generated at 11:56:12 and 11:54:11 respectively. However it's not the case, the red box were never generated.
Is there something wrong with my spark sql?

Related

kafka-streams sliding agg window discard out-of-order record belongs to window without grace

I have next error, actually i don't understand it.
o.a.k.s.k.i.KStreamSlidingWindowAggregate - Skipping record for expired window. topic=[...] partition=[0] offset=[16880] timestamp=[1662556875000] window=[1662542475000,1662556875000] expiration=[1662556942000] streamTime=[1662556942000]
streamTime=[1662556942000]
timestamp=[1662556875000]
streamTime-timestamp = 67s
window size is 4hour.
grace period is 0
Why was record skipped and i didn't get a output message? it belongs to window. Yes record out-of-order
Update:
After read more about kafka-streams, i understand that on each message it creates two window:
(message time - window) and this window include message.
(message time + window) and this window exclude message.
Window 1 is expired. Window 2 don't. that's why i dind't see out message.
But logically it's wrong, message belong to window but i havn't a out message.
Example
sliding window time diff = 10, grace = 0
stream time = 0
send message (time = 10, key = 2) -> message key = 2; stream time = 10
send message (time = 4, key = 1) -> no out message;
send message (time = 5, key = 1) -> no out message;
last message belongs to window (stream-time - window-time)
------ restart stream -------
stream time = 0
send message (time = 10, key = 2) -> message key = 2; stream time = 10
send message (time = 4, key = 2) -> 2 message
In Kafka Streams sliding windows are event based. A new window is created each time a new record enters or drops from the window. It is defined by the record timestamp and a fixed duration.
Each record creates a window [record.timestamp - duration, record.timestamp] and
Each dropped record creates a window [record.timestamp + 1ms, record.timestamp + 1ms + duration].
(Be aware that other stream processing frameworks use a totally different definition of 'sliding windows')
The record is not included in the window when
stream-time > window-end + grace-period
(https://kafka.apache.org/27/javadoc/org/apache/kafka/streams/kstream/SlidingWindows.html)
For your initial example, the grace-period is zero and your window ends(at record timestamp) after the stream-time; thus the record is not included in the window.
For the second example, I am not sure. My guess is the records with key=1 are expired because the stream-time(10) has exceeded the record times(4,5)(and grace period=0). For the records with key=2, one window is created for the record with timestamp=10 and an update of the same window is emitted, because the record satisfies the condition above. However, no additional windows are created for the out-of-order record, because the grace period is zero.

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My purpose to calculate success and fail message from source to destination per second and sum their results in daily bases.
I had two options to do that ;
stream events then group them time#source#destination
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream.selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST ).groupByKey().aggregate(
DO SOME Aggregation,
Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes));
After trying this approach above we noticed that state store is getting increase because of number of unique keys are increasing and if i am correct, because of state topics are only "compact" they are never expires.
NumberOfUniqueKeys = 86.400 seconds in a day X SOURCE X DESTINATION
Then we thought that if we do not put a time field in a KEY block, we can reduce state store size. We tried windowing operation as second approach.
using windowing operation with persistentWindowStore, CustomTimeStampExtractor, WindowBy, Suppress
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream.selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
.groupByKey() .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
.aggregate(
{
DO SOME Aggregation
}, Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())).toStream();`
After trying that second approach, we reduced state store size but now we had problem with late arrive events. Then we added grace period with 5 seconds with suppress operation and in addition using grace period and suppress operation did not guarantee to handle all late arrived events, another side effect of suppress operation is a latency because it emits result of aggregation after window grace period.
BTW
using windowing operation caused a getting WARNING message like
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason from source code and I found from here
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
* Safely construct a time window of the given size,
* taking care of bounding endMs to Long.MAX_VALUE if necessary
*/
static TimeWindow timeWindowForSize(final long startMs,
final long windowSize) {
long endMs = startMs + windowSize;
if (endMs < 0) {
LOG.warn("Warning: window end time was truncated to Long.MAX");
endMs = Long.MAX_VALUE;
}
return new TimeWindow(startMs, endMs);
}
BUT actually it does not make any sense to me that how endMs can be lower than 0...
Questions ?
What if we go through with approach 1, how can we reduce state store size ? In approach 1, It was guaranteed that all event will be processed and there will be no missing event because of latency.
What if we go through with approach 2, how should i tune my logic and catch late arrival data and reduce latency ?
Why do i get Warning message in approach 2 although all time fields are positive in my model ?
What can be other options that you can suggest other then these two approaches ?
I need some expert help :)
BR,
According to mail kafka mail group about warning message
WARNING message like "WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
It was written to me :
You can get this message "o.a.k.s.state.internals.WindowKeySchema :
Warning: window end time was truncated to Long.MAX"" when your
TimeWindowDeserializer is created without a windowSize. There are two
constructors for a TimeWindowDeserializer, are you using the one with
WindowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a Long.MAX_VALUE
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90

Apache Beam - sliding windows for kinesis stream

I am trying to do a sliding window of 1 hr(3600 secs TimeWindowSize) and 5 secs(TimeWindowSamplingFrequency) with kinesis stream processed events,
but I am receiving the processed events in every 5 secs and its not doing the sliding window of 1 hr to give me the one hour result of the events transform i want.
As per my understand , it should wait and process the 1 hour events coming in from kinesis stream and then give me an output after 1 hr.
following is the sample code i used
pipeline.apply(
KinesisIO.read()
.withStreamName(options.getEnrichedSnowplowEventsStreamName())
.withAWSClientsProvider(new DefaultAWSClientsProvider())
.withInitialPositionInStream(InitialPositionInStream.LATEST))
.apply(MapElements.into(TypeDescriptors.strings())
.via(record -> new String(record.getDataAsBytes())))
.apply(ParseSnowplowEvents.fromStrings())
.apply(a userdefined ParDo transform which gives an op of
PCollection<Class> objects )
.apply(Window
.into(SlidingWindows
.of(
Duration.standardSeconds(
3600))
.every(Duration.standardSeconds(
5))
)).apply(
a userdefined transform with ParDo which gives me the o/p of PCollection<KV<Integer, Double>>>)
.apply(PrintValue.andPassOn());
PrintValue.andPassOn() userdefined transform prints the data for me , but i am expecting the result PCollection<KV<Integer,Double>> at the end of one hour sliding window , instead it prints out at every 5 secs the KV pairs
2018-06-17T13:11:29.999Z - KV{101, 5.0}
2018-06-17T13:11:34.999Z - KV{102, 0.4}
2018-06-17T13:11:39.999Z
KV{104, 0.5}
It is printing as per your sampling frequency. Change it to one hour and it should work as expected.

In Spark structured streaming how do I output complete aggregations to an external source like a REST service

The task I am trying to perform is to aggregate the count of values from a dimension (field) in a DataFrame, perform some statistics like average, max, min, etc then output the aggregates to an external system by making an API call. I am using a watermark of say 30 seconds with a window size of 10 seconds. I made these sizes small to make it easier for me to test and debug the system.
The only method I have found for making API calls is to use a ForeachWriter. My problem is that the ForeachWriter executes at the partition level and only produces an aggregate per partition. So far I haven't found a way to get the rolled up aggregates other than to coalesce to 1 which is a way to slow for my streaming application.
I have found that if I use the file based sink such as the Parquet writer to HDFS that the code produces real aggregations. It also performs very well. What I really need is to achieve this same result but calling an API rather than writing to a file system.
Does anyone know how to do this?
I have tried this with Spark 2.2.2 and Spark 2.3 and get the same behavior.
Here is a simplified code fragment to illustrate what I am trying to do:
val valStream = streamingDF
.select(
$"event.name".alias("eventName"),
expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
$"asset.assetClass").alias("assetClass")
.where($"eventName" === 'MyEvent')
.withWatermark("eventTime", "30 seconds")
.groupBy(window($"eventTime", "10 seconds", $"assetClass", $"eventName")
.agg(count($"eventName").as("eventCount"))
.select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount").as[DimAggregateRecord]
.writeStream
.option("checkpointLocation", config.checkpointPath)
.outputMode(config.outputMode)
val session = (if(config.writeStreamType == AbacusStreamWriterFactory.S3) {
valStream.format(config.outputFormat)
.option("path", config.outputPath)
}
else {
valStream.foreach(--- this is my DimAggregateRecord ForEachWriter ---)
}).start()
I answered my own question. I found that repartitioning by the window start time did the trick. It shuffles the data so that all rows with the same group and windowStart time are on the same executor. The code below produces a file for each group window interval. It also performs quite well. I don't have exact numbers but it produces aggregates in less time than the window interval of 10 seconds.
val valStream = streamingDF
.select(
$"event.name".alias("eventName"),
expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
$"asset.assetClass").alias("assetClass")
.where($"eventName" === 'MyEvent')
.withWatermark("eventTime", "30 seconds")
.groupBy(window($"eventTime", "10 seconds", $"assetClass", $"eventName")
.agg(count($"eventName").as("eventCount"))
.select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount").as[DimAggregateRecord]
.repartition($"windowStart") // <-------- this line produces the desired result
.writeStream
.option("checkpointLocation", config.checkpointPath)
.outputMode(config.outputMode)
val session = (if(config.writeStreamType == AbacusStreamWriterFactory.S3) {
valStream.format(config.outputFormat)
.option("path", config.outputPath)
}
else {
valStream.foreach(--- this is my DimAggregateRecord ForEachWriter ---)
}).start()

What is the difference between Apache Spark compute and slice?

I trying to make unit test on DStream.
I put data in my stream with a mutable queue ssc.queueStream(red)
I set the ManualClock to 0
Start my streaming context.
Advance my ManualClock to batchDuration milis
When i'm doing a
stream.slice(Time(0), Time(clock.getTimeMillis())).map(_.collect().toList)
I got a result.
when I do
for (time <- 0L to stream.slideDuration.milliseconds + 10) {
println("time "+ time + " " +stream.compute(Time(time)).map(_.collect().toList))
}
None of them contain a result event the stream.compute(Time(clock.getTimeMillis()))
So what is the difference between this two functions without considerings the parameters differences?
Compute will return an RDD only if the provided time is a correct time in a sliding window i.e it's the zero time + a multiple of the slide duration.
Slice will align both the from and to times to slide duration and compute for each of them.
In slide you provide time interval and from time interval as long as it is valid we generate Seq[Time]
def to(that: Time, interval: Duration): Seq[Time] = {
(this.milliseconds) to (that.milliseconds) by (interval.milliseconds) map (new Time(_))
}
and then we "compute" for each instance of Seq[Time]
alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
if (time >= zeroTime) getOrCompute(time) else None
}
as oppose to compute we only compute for the instance of Time that we pass the compute method...