Apache Beam - sliding windows for Kinesis stream

I am trying to do a sliding window of 1 hour (3600 seconds TimeWindowSize) with a 5-second period (TimeWindowSamplingFrequency) over events processed from a Kinesis stream, but I am receiving the processed events every 5 seconds, and it is not applying the 1-hour sliding window to give me one hour's worth of results from the transform I want.
As I understand it, it should wait and process the 1 hour of events coming in from the Kinesis stream, and then give me an output after 1 hour.
The following is the sample code I used:
pipeline.apply(
        KinesisIO.read()
            .withStreamName(options.getEnrichedSnowplowEventsStreamName())
            .withAWSClientsProvider(new DefaultAWSClientsProvider())
            .withInitialPositionInStream(InitialPositionInStream.LATEST))
    .apply(MapElements.into(TypeDescriptors.strings())
        .via(record -> new String(record.getDataAsBytes())))
    .apply(ParseSnowplowEvents.fromStrings())
    .apply(/* a user-defined ParDo transform which gives an output of PCollection<Class> objects */)
    .apply(Window.into(
        SlidingWindows.of(Duration.standardSeconds(3600))
                      .every(Duration.standardSeconds(5))))
    .apply(/* a user-defined ParDo transform which gives me an output of PCollection<KV<Integer, Double>> */)
    .apply(PrintValue.andPassOn());
PrintValue.andPassOn() is a user-defined transform that prints the data for me, but I am expecting the result PCollection<KV<Integer, Double>> at the end of the one-hour sliding window; instead it prints out the KV pairs every 5 seconds:
2018-06-17T13:11:29.999Z - KV{101, 5.0}
2018-06-17T13:11:34.999Z - KV{102, 0.4}
2018-06-17T13:11:39.999Z - KV{104, 0.5}

It is printing according to your sampling frequency (the .every() period). Change it to one hour and it should work as expected.
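For reference, a minimal sketch of that change (MyEvent and the upstream PCollection named events are hypothetical placeholders, not from the original post). With the period equal to the window size, each element lands in exactly one window and results are emitted once per hour instead of every 5 seconds:

import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Sliding windows of 1 hour that also advance every 1 hour: one pane per hour.
PCollection<MyEvent> hourlyWindowed =
    events.apply(
        Window.<MyEvent>into(
            SlidingWindows.of(Duration.standardHours(1))
                          .every(Duration.standardHours(1))));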

Related

kafka-streams sliding aggregation window discards out-of-order record that belongs to the window when there is no grace period

I have the following error, and I don't actually understand it:
o.a.k.s.k.i.KStreamSlidingWindowAggregate - Skipping record for expired window. topic=[...] partition=[0] offset=[16880] timestamp=[1662556875000] window=[1662542475000,1662556875000] expiration=[1662556942000] streamTime=[1662556942000]
streamTime=[1662556942000]
timestamp=[1662556875000]
streamTime - timestamp = 67 s
The window size is 4 hours.
The grace period is 0.
Why was the record skipped and why didn't I get an output message? It belongs to the window. Yes, the record is out of order.
Update:
After reading more about Kafka Streams, I understand that for each message it creates two windows:
(message time - window), and this window includes the message;
(message time + window), and this window excludes the message.
Window 1 is expired; window 2 is not. That's why I didn't see an output message.
But logically it seems wrong: the message belongs to the window, yet I don't get an output message.
Example
sliding window time difference = 10, grace = 0
stream time = 0
send message (time = 10, key = 2) -> output message for key = 2; stream time = 10
send message (time = 4, key = 1) -> no output message
send message (time = 5, key = 1) -> no output message
The last message belongs to the window (stream-time - window-time).
------ restart stream -------
stream time = 0
send message (time = 10, key = 2) -> output message for key = 2; stream time = 10
send message (time = 4, key = 2) -> 2 output messages
In Kafka Streams, sliding windows are event based. A new window is created each time a record enters or drops out of a window. A window is defined by a record timestamp and a fixed duration:
Each record creates a window [record.timestamp - duration, record.timestamp], and
each dropped record creates a window [record.timestamp + 1ms, record.timestamp + 1ms + duration].
(Be aware that other stream processing frameworks use a totally different definition of 'sliding windows'.)
A record is not included in a window when
stream-time > window-end + grace-period
(https://kafka.apache.org/27/javadoc/org/apache/kafka/streams/kstream/SlidingWindows.html)
For your initial example, the grace period is zero and your window ends (at the record timestamp) before the stream-time; thus the record is not included in the window.
For the second example, I am not sure. My guess is that the records with key=1 are expired because the stream-time (10) has exceeded the record times (4, 5) and the grace period is 0. For the records with key=2, one window is created for the record with timestamp=10, and an update of that same window is emitted because the record satisfies the condition above. However, no additional windows are created for the out-of-order record, because the grace period is zero.
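As a hedged sketch (topic name, types, and serdes are illustrative, not from the question): giving the sliding window a non-zero grace period keeps windows open a little longer, so out-of-order records arriving within the grace period are still aggregated instead of being dropped as expired.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SlidingWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> input =
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()));

input.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
     .windowedBy(SlidingWindows.withTimeDifferenceAndGrace(
             Duration.ofHours(4),      // time difference (the 4-hour window from the question)
             Duration.ofMinutes(5)))   // non-zero grace period admits out-of-order records
     .count()
     .toStream()
     .foreach((windowedKey, count) -> System.out.println(windowedKey + " -> " + count));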

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My goal is to calculate success and fail messages from source to destination per second and sum their results on a daily basis.
I had two options to do that:
Approach 1: stream the events, then group them by time#source#destination
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream
    .selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .aggregate(
        /* DO SOME Aggregation */,
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes));
After trying this first approach we noticed that the state store keeps growing because the number of unique keys keeps increasing and, if I am correct, because the state topics are only "compact" they never expire.
NumberOfUniqueKeys = 86,400 seconds in a day x SOURCE x DESTINATION
Then we thought that if we did not put a time field in the key, we could reduce the state store size. We tried a windowing operation as the second approach.
Approach 2: use a windowing operation with persistentWindowStore, CustomTimeStampExtractor, windowedBy, and suppress
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream
    .selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
    .aggregate(
        /* DO SOME Aggregation */,
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
After trying this second approach we reduced the state store size, but now we had a problem with late-arriving events. We then added a 5-second grace period together with the suppress operation; however, using a grace period and suppress did not guarantee that all late-arriving events are handled, and another side effect of the suppress operation is latency, because it only emits the aggregation result after the window's grace period has passed.
By the way, using the windowing operation caused a warning message like
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason in the source code and found it here:
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
 * Safely construct a time window of the given size,
 * taking care of bounding endMs to Long.MAX_VALUE if necessary
 */
static TimeWindow timeWindowForSize(final long startMs,
                                    final long windowSize) {
    long endMs = startMs + windowSize;
    if (endMs < 0) {
        LOG.warn("Warning: window end time was truncated to Long.MAX");
        endMs = Long.MAX_VALUE;
    }
    return new TimeWindow(startMs, endMs);
}
But it does not actually make sense to me how endMs can be lower than 0...
Questions:
If we go with approach 1, how can we reduce the state store size? In approach 1 it was guaranteed that every event would be processed and no event would be missed because of latency.
If we go with approach 2, how should I tune my logic to catch late-arriving data and reduce latency?
Why do I get the warning message in approach 2, although all the time fields in my model are positive?
What other options can you suggest besides these two approaches?
I need some expert help :)
BR,
According to the Kafka mailing list, regarding the warning message
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
the answer written to me was:
You can get the message "o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX" when your TimeWindowedDeserializer is created without a windowSize. There are two constructors for a TimeWindowedDeserializer; are you using the one with windowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
Otherwise it calls WindowKeySchema with Long.MAX_VALUE as the window size:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90
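For illustration, a minimal sketch (the 1-second window size simply mirrors the TimeWindows size from approach 2; serdes are assumed): constructing the deserializer with an explicit window size avoids the Long.MAX_VALUE fallback. With that fallback, endMs = startMs + Long.MAX_VALUE overflows to a negative long, which is how endMs ends up below 0 and triggers the warning.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.TimeWindowedDeserializer;

// Window size in milliseconds, matching the 1-second windows used in approach 2.
long windowSizeMs = Duration.ofSeconds(1).toMillis();

// The two-argument constructor carries the window size, so the window end can be
// reconstructed as start + size instead of being truncated to Long.MAX_VALUE.
TimeWindowedDeserializer<String> keyDeserializer =
        new TimeWindowedDeserializer<>(Serdes.String().deserializer(), windowSizeMs);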

Kafka Streams Hopping window top N by dimension

I have a Kafka stream, and I need a processor which does the following:
It uses a 45-second hopping window with 5-second advances to compute the top-5 count based on one dimension of the domain object. For example, if the stream contained clickstream data, I would need the top 5 URLs viewed by domain name, also windowed in a hopping window.
I've seen examples of windowed counting, for example:
KStream<String, GenericRecord> pageViews = ...;

// Count page views per window, per user, with hopping windows of size 5 minutes that advance every 1 minute
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
    .groupByKey(Grouped.with(Serdes.String(), genericAvroSerde))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))
    .count();
And top-N aggregations from the MusicExample, for example:
songPlayCounts
    .groupBy((song, plays) ->
            KeyValue.pair(TOP_FIVE_KEY, new SongPlayCount(song.getId(), plays)),
        Grouped.with(Serdes.String(), songPlayCountSerde))
    .aggregate(TopFiveSongs::new,
        (aggKey, value, aggregate) -> {
            aggregate.add(value);
            return aggregate;
        },
        (aggKey, value, aggregate) -> {
            aggregate.remove(value);
            return aggregate;
        },
        Materialized.<String, TopFiveSongs, KeyValueStore<Bytes, byte[]>>as(TOP_FIVE_SONGS_STORE)
            .withKeySerde(Serdes.String())
            .withValueSerde(topFiveSerde)
    );
I just can't seem to combine the two so that I get both windowing and a top-N aggregation. Any thoughts?
In general, yes. However, for a non-windowed top-N aggregation the algorithm will always be an approximation (it's not possible to get an exact result, because that would require buffering everything, which is not possible for unbounded input). For a hopping window, on the other hand, you can do an exact computation.
For the windowed case, the actual aggregation step could just accumulate all input records per window (e.g., return a List<V> or some other collection). On the resulting KTable you apply a mapValues() function that gets the List<V> of input records per window (and key) and can compute the actual top-N result you are looking for.
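A minimal sketch of that two-step approach, assuming a KStream<String, String> keyed by domain with the viewed URL as the value (names and serdes are assumptions, and the Materialized configuration with a List<String> serde is omitted for brevity):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

KStream<String, String> pageViews = ...; // key = domain, value = url

// Step 1: collect all URLs per (domain, 45-second window advancing by 5 seconds).
// In practice, pass a Materialized with a List<String> serde to aggregate().
KTable<Windowed<String>, List<String>> urlsPerWindow = pageViews
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(45)).advanceBy(Duration.ofSeconds(5)))
    .aggregate(
        ArrayList::new,
        (domain, url, urls) -> { urls.add(url); return urls; });

// Step 2: per window and domain, count URLs and keep the top 5.
KTable<Windowed<String>, List<String>> top5PerWindow = urlsPerWindow
    .mapValues(urls -> urls.stream()
        .collect(Collectors.groupingBy(u -> u, Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(5)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList()));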

Spark sliding window is not triggering when there is no data from the data stream

I have a simple Spark SQL query with a time window of 5 minutes and a trigger policy of every minute:
val withTime = eventStreams(0).selectExpr("*", "cast(cast(parsed.time as long)/1000 as timestamp) as event_time")

val momentumDataAggQuery = withTime
  .selectExpr("parsed.symbol", "parsed.bid", "parsed.ask", "event_time")
  .withWatermark("event_time", "1 minutes")
  .groupBy(col("symbol"), window(col("event_time"), "5 minutes", "60 seconds")) // change to 60 minutes
  .agg(first("bid", true).as("first_bid"), first("ask").as("first_ask"), last("bid").as("last_bid"), last("ask").as("last_ask"))

val momentumDataQuery = momentumDataAggQuery
  .selectExpr("window.start", "window.end", "ln(((last_bid + last_ask)/2)/((first_bid + first_ask)/2)) as momentum", "symbol")
When there is data from the stream, the query is triggered every minute to calculate the 'momentum', but it stops when there are no new data points. I expected it to keep using the old data and update every minute even when there are not enough new data points.
Consider the example in the following table.
In the 1st window there is only one data point, so the log return is zero.
In the 2nd window there are only two data points, so it takes log(97.5625/97.4625), where 97.5625 was received at 11:53 and 97.4625 was received at 11:52:10, within the time window 12:19 <> 12:54... It went on calculating the log return as long as there were sufficient data points.
However, when there were no more data points after 11:56:12, say for the window 12:54 <> 12:59, I expected it to take ln(97.8625/97.6625), where the inputs were generated at 11:56:12 and 11:54:11 respectively. That is not the case, however; the values in the red box were never generated.
Is there something wrong with my Spark SQL?

In Spark structured streaming how do I output complete aggregations to an external source like a REST service

The task I am trying to perform is to aggregate the count of values from a dimension (field) in a DataFrame, perform some statistics like average, max, min, etc., and then output the aggregates to an external system by making an API call. I am using a watermark of, say, 30 seconds with a window size of 10 seconds. I made these sizes small to make it easier for me to test and debug the system.
The only method I have found for making API calls is to use a ForeachWriter. My problem is that the ForeachWriter executes at the partition level and only produces an aggregate per partition. So far I haven't found a way to get the rolled-up aggregates other than to coalesce to 1, which is way too slow for my streaming application.
I have found that if I use a file-based sink, such as the Parquet writer to HDFS, the code produces real aggregations. It also performs very well. What I really need is to achieve the same result but call an API rather than write to a file system.
Does anyone know how to do this?
I have tried this with Spark 2.2.2 and Spark 2.3 and get the same behavior.
Here is a simplified code fragment to illustrate what I am trying to do:
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount").as[DimAggregateRecord]
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(/* this is my DimAggregateRecord ForeachWriter */)
}).start()
I answered my own question. I found that repartitioning by the window start time did the trick. It shuffles the data so that all rows with the same group and windowStart time end up on the same executor. The code below produces a file for each group window interval. It also performs quite well: I don't have exact numbers, but it produces aggregates in less time than the window interval of 10 seconds.
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount").as[DimAggregateRecord]
  .repartition($"windowStart") // <-------- this line produces the desired result
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(/* this is my DimAggregateRecord ForeachWriter */)
}).start()