Kafka Streams Hopping window top N by dimension - apache-kafka

I have a Kafka stream, and I need a processor which does the following:
Uses a 45-second hopping window with 5-second advances to compute the top 5 count based on one dimension of the domain object. For example, if the stream contained clickstream data, I would need the top 5 URLs viewed per domain name, but also windowed in a hopping window.
I've seen examples to do window counting, for example:
KStream<String, GenericRecord> pageViews = ...;
// Count page views per window, per user, with hopping windows of size 5 minutes that advance every 1 minute
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
    .groupByKey(Grouped.with(Serdes.String(), genericAvroSerde))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))
    .count();
And top-N aggregations from the MusicExample, for example:
songPlayCounts.groupBy((song, plays) ->
        KeyValue.pair(TOP_FIVE_KEY, new SongPlayCount(song.getId(), plays)),
        Grouped.with(Serdes.String(), songPlayCountSerde))
    .aggregate(TopFiveSongs::new,
        (aggKey, value, aggregate) -> {
            aggregate.add(value);
            return aggregate;
        },
        (aggKey, value, aggregate) -> {
            aggregate.remove(value);
            return aggregate;
        },
        Materialized.<String, TopFiveSongs, KeyValueStore<Bytes, byte[]>>as(TOP_FIVE_SONGS_STORE)
            .withKeySerde(Serdes.String())
            .withValueSerde(topFiveSerde)
    );
I just can't seem to be able to combine the two so that I get both windowing and top-N aggregation. Any thoughts?

In general yes; however, for a non-windowed top-N aggregation the algorithm will always be an approximation (it's not possible to get an exact result, because that would require buffering everything, which is not possible for unbounded input). For a hopping window, however, you can do an exact computation.
For the windowed case, the actual aggregation step could just accumulate all input records per window (e.g., return a List<V> or some other collection). On the resulting KTable you apply a mapValues() function that gets the List<V> of input records per window (and key) and computes the actual top-N result you are looking for.
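A minimal sketch of that combination, assuming domain names as keys and URLs as plain string values (the topic name "page-views" and the list serde setup are illustrative, not from the original question):
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();

// Assumption: a serde for List<String>; Kafka 2.8+ ships Serdes.ListSerde, otherwise plug in your own.
Serde<List<String>> urlListSerde = Serdes.ListSerde(ArrayList.class, Serdes.String());

// Assumption: key = domain name, value = viewed URL.
KStream<String, String> views = builder.stream("page-views",
        Consumed.with(Serdes.String(), Serdes.String()));

// Step 1: collect all URLs per domain into a List for each 45-second hopping window (5-second advance).
KTable<Windowed<String>, List<String>> perWindow = views
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(45)).advanceBy(Duration.ofSeconds(5)))
    .aggregate(
        ArrayList::new,
        (domain, url, urls) -> { urls.add(url); return urls; },
        Materialized.with(Serdes.String(), urlListSerde));

// Step 2: per window (and domain), count URL occurrences and keep the 5 most frequent.
KTable<Windowed<String>, List<String>> topFive = perWindow.mapValues(urls ->
    urls.stream()
        .collect(Collectors.groupingBy(u -> u, Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(5)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList()));
Each windowed key then maps to the exact top-5 URL list for that 45-second window.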

Related

How to aggregate events in flink stream before merging with current state by reduce function?

My events are like: case class Event(user: User, stats: Map[StatType, Int])
Every event contains +1 or -1 values in it.
I have a current pipeline that works fine, but it produces a new event for every change of the statistics.
eventsStream
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
I'd like to aggregate these increments in a time window before merging them with the current state. So I want the same rolling reduce but with a time window.
Current simple rolling reduce:
500 – last reduced value
+1
-1
+1
Emitted events: 501, 500, 501
Rolling reduce with a window:
500 – last reduced value
v-- window
+1
-1
+1
^-- window
Emitted events: 501
I've tried a naive solution of putting a time window just before the reduce, but after reading the docs I see that reduce now has a different behavior.
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
It seems that I should make a keyed stream and reduce it again after reducing over my time window:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
Is this the right pipeline to solve the problem?
There are probably different options, but one would be to implement a WindowFunction and then run apply after the windowing:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.apply(new MyWindowFunction)
(WindowFunction takes type parameters for the type of the input value, the type of the output value, the type of the key, and the window type.)
There's an example of that here. Let me copy the relevant snippet:
/** User-defined WindowFunction to compute the average temperature of SensorReadings */
class TemperatureAverager extends WindowFunction[SensorReading, SensorReading, String, TimeWindow] {
  /** apply() is invoked once for each window */
  override def apply(
      sensorId: String,
      window: TimeWindow,
      vals: Iterable[SensorReading],
      out: Collector[SensorReading]): Unit = {
    // compute the average temperature
    val (cnt, sum) = vals.foldLeft((0, 0.0))((c, r) => (c._1 + 1, c._2 + r.temperature))
    val avgTemp = sum / cnt
    // emit a SensorReading with the average temperature
    out.collect(SensorReading(sensorId, window.getEnd, avgTemp))
  }
}
I don't know how your data looks so I can't attempt a full answer, but that should serve as inspiration.
Yes, your proposed pipeline will have the desired effect. The window will reduce together the 2-minute batches. The results of those batches will flow into the final reduce, which will produce an updated result after each of its inputs (which are the window results).

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My purpose is to calculate success and failure messages from source to destination per second and to sum their results on a daily basis.
I had two options to do that:
Stream the events, then group them by time#source#destination:
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream.selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .aggregate(
        DO SOME Aggregation,
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes));
After trying the approach above, we noticed that the state store keeps growing because the number of unique keys keeps increasing, and, if I am correct, because the state topics are only "compact" they never expire.
NumberOfUniqueKeys = 86,400 seconds in a day X SOURCE X DESTINATION
Then we thought that if we did not put a time field in the KEY block, we could reduce the state store size. We tried a windowing operation as a second approach:
Use a windowing operation with persistentWindowStore, CustomTimeStampExtractor, windowedBy, and suppress:
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name",
    Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream.selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
    .aggregate(
        {
            DO SOME Aggregation
        },
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
After trying the second approach, we reduced the state store size, but now we had a problem with late-arriving events. We then added a 5-second grace period together with the suppress operation, but even using a grace period and suppress did not guarantee handling all late-arriving events. Another side effect of the suppress operation is latency, because it emits the aggregation result only after the window's grace period has passed.
BTW, using the windowing operation caused a WARNING message like:
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason in the source code and found it here:
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
 * Safely construct a time window of the given size,
 * taking care of bounding endMs to Long.MAX_VALUE if necessary
 */
static TimeWindow timeWindowForSize(final long startMs,
                                    final long windowSize) {
    long endMs = startMs + windowSize;
    if (endMs < 0) {
        LOG.warn("Warning: window end time was truncated to Long.MAX");
        endMs = Long.MAX_VALUE;
    }
    return new TimeWindow(startMs, endMs);
}
But it actually does not make any sense to me how endMs can be lower than 0...
Questions:
If we go with approach 1, how can we reduce the state store size? In approach 1 it was guaranteed that all events would be processed and no events would be missed because of latency.
If we go with approach 2, how should I tune my logic to catch late-arriving data and reduce latency?
Why do I get the warning message in approach 2, although all time fields in my model are positive?
What other options can you suggest besides these two approaches?
I need some expert help :)
BR,
According to the Kafka mailing list, regarding the warning message
WARNING message like "WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I was told:
You can get this message "o.a.k.s.state.internals.WindowKeySchema :
Warning: window end time was truncated to Long.MAX" when your
TimeWindowedDeserializer is created without a windowSize. There are two
constructors for a TimeWindowedDeserializer, are you using the one with
windowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a Long.MAX_VALUE
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90
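For reference, a minimal sketch of what the mail points at, i.e. constructing the deserializer with an explicit window size (the 1-second size matches the windowedBy() above; the variable names are illustrative):
import java.time.Duration;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.TimeWindowedDeserializer;
import org.apache.kafka.streams.kstream.Windowed;

// Without the windowSize argument, the deserializer passes Long.MAX_VALUE to WindowKeySchema,
// which then logs the "window end time was truncated to Long.MAX" warning.
long windowSizeMs = Duration.ofSeconds(1).toMillis();

Deserializer<Windowed<String>> windowedKeyDeserializer =
        new TimeWindowedDeserializer<>(Serdes.String().deserializer(), windowSizeMs);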

Kafka streams dropping messages during windowing and on restart

I am working on a Kafka Streams application with the following topology:
private final Initializer<Set<String>> eventInitializer = () -> new HashSet<>();
final StreamsBuilder streamBuilder = new StreamsBuilder();
final KStream<String, AggQuantityByPrimeValue> eventStreams = streamBuilder.stream("testTopic",
    Consumed.with(Serdes.String(), **valueSerde**));
final KStream<String, Value> filteredStreams = eventStreams
    .filter((key, clientRecord) -> recordValidator.isAllowedByRules(clientRecord));
final KGroupedStream<Integer, Value> groupedStreams = filteredStreams.groupBy(
    (key, transactionEntry) -> transactionEntry.getNodeid(),
    Serialized.with(Serdes.Integer(), **valueSerde**));
/* Hopping window */
final TimeWindowedKStream<Integer, Value> windowedGroupStreams = groupedStreams
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(25))
        .grace(Duration.ofSeconds(0)));
/* Aggregating the events */
final KStream<Windowed<Integer>, Set<String>> suppressedStreams = windowedGroupStreams
    .aggregate(eventInitializer, countAggregator, Materialized.as("counts-aggregate"))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())
        .withName("suppress-window"))
    .toStream();
suppressedStreams.foreach((windowed, value) -> eventProcessor.publish(windowed.key(), value));
return new KafkaStreams(streamBuilder.build(), config.getKafkaConfigForStreams());
I am observing that, intermittently, a few events are getting dropped during/after windowing.
For example:
All records can be seen/printed in the isAllowedByRules() method; they are valid (allowed by the filters) and consumed by the stream.
But when printing the events in countAggregator, I can see that a few events are not coming through it.
Current configurations for streams:
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-app-id");
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, <bootstraps-server>);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 5);
streamsConfig.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
streamsConfig.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
streamsConfig.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 10485760);
streamsConfig.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10485760);
streamsConfig.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10485760);
/*For window buffering across all threads*/
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 52428800);
streamsConfig.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());
streamsConfig.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, **customSerdesForSet**);
Initially, I was using a tumbling window, but I found that mostly at the end of a window a few events were getting lost, so I changed to a hopping window (better to duplicate than to lose). Then the dropped events became zero. But today, after almost 4 days, I again saw a few dropped events, and there is one pattern among them: they are late by almost a minute compared to other events that were produced together. The expectation then is that these late events should come in one of the future windows, but that didn't happen. Correct me here if my understanding is not right.
Also, as mentioned in the title, on a (graceful) restart of the streams I could see a few events getting lost again at the aggregation step, even though they were processed by the isAllowedByRules() method.
I have searched a lot on Stack Overflow and other sites but couldn't find the root cause of this behaviour. Is it related to some configuration that I am missing or not setting correctly, or could it be due to some other reason?
From my understanding, you have an empty grace period:
/* Hopping window */
...
.grace(Duration.ofSeconds(0))
So your window is closed without permitting any late arrivals.
Then, regarding your sub-question:
But then expectation is that these late events should come in any of the future windows but that didn't happen. Correct me here if my understanding is not right.
Maybe you're mixing event time and processing time.
Your record will be categorized as 'late' if the timestamp of the record (added by the producer at produce time, or by the brokers when arriving in the cluster if not set by the producer) is outside your current window.
Here is an example with 2 records '*'.
Their event times (et1 and et2) fit in the window:
| window |
t1 t2
| * * |
et1 et2
But the processing time of et2 (pt2) is in fact as follows:
| window |
t1 t2
| * | *
pt1 pt2
Here the window is a slice of time between t1 and t2 (processing time).
et1 and et2 are, respectively, the event times of the 2 records '*'.
et1 and et2 are timestamps set in the records themselves.
In this example, et1 and et2 are between t1 and t2, but et2 was received after the window closed; as your grace period is 0, it will be skipped.
That might be an explanation.
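If those roughly minute-late records should still be counted, a minimal sketch of the relevant change (same 30s/25s hopping window as in the topology above; the one-minute grace period is an assumption based on the observed lateness):
import java.time.Duration;
import org.apache.kafka.streams.kstream.TimeWindows;

// Accept records that arrive up to one minute after the window end; note that
// suppress(untilWindowCloses) will then only emit after window end + grace.
TimeWindows hoppingWithGrace = TimeWindows.of(Duration.ofSeconds(30))
        .advanceBy(Duration.ofSeconds(25))
        .grace(Duration.ofMinutes(1));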

Aggregation (summation) of number of events from different Kafka topics

My application has three topics that receive some events belonging to users:
Event Type A -> Topic A
Event Type B -> Topic B
Event Type C -> Topic C
This would be an example of the flow of messages:
Message(user 1 - event A - 2020-01-03)
Message(user 2 - event A - 2020-01-03)
Message(user 1 - event C - 2020-01-20)
Message(user 1 - event B - 2020-01-22)
I want to be able to generate reports with the total number of events per user per month, aggregating all the events from the three topics, something like:
User 1 - 2020-01 -> 3 total events
User 2 - 2020-01 -> 1 total events
Having three KStreams (one per topic), how can I perform this addition per month to have the summation of all the events from three different topics? Can you show the code for this?
Because you are only interested in counting, the simplest way would be to just keep the user-id as the key and some dummy value for each KStream, merge all three streams, and do a windowed count afterwards (note that calendar-based windows are not supported out-of-the-box; you could use a 31-day window as an approximation or build your own customized windows):
// just map to a dummy empty string (note that `null` would not work)
KStream<UserId, String> streamA = builder.stream("topic-A").mapValues(v -> "");
KStream<UserId, String> streamB = builder.stream("topic-B").mapValues(v -> "");
KStream<UserId, String> streamC = builder.stream("topic-C").mapValues(v -> "");
streamA.merge(streamB).merge(streamC).groupByKey().windowedBy(...).count();
You might also be interested in the suppress() operator.
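A minimal sketch of how the elided windowing step could look with the 31-day approximation mentioned above (userIdSerde is a placeholder for whatever serde your UserId type uses):
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.*;

// 31-day tumbling windows as a rough stand-in for calendar months.
KTable<Windowed<UserId>, Long> eventsPerUserPerWindow = streamA
        .merge(streamB)
        .merge(streamC)
        .groupByKey(Grouped.with(userIdSerde, Serdes.String()))  // userIdSerde is a placeholder
        .windowedBy(TimeWindows.of(Duration.ofDays(31)))
        .count();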

How to make spark partition sticky, i.e. stay with node?

I am trying to use Spark Streaming 1.2.0. At some point, I grouped the streaming data by key and then applied some operations to it.
The following is a segment of the test code:
...
JavaPairDStream<Integer, Iterable<Integer>> grouped = mapped.groupByKey();
JavaPairDStream<Integer, Integer> results = grouped.mapToPair(
    new PairFunction<Tuple2<Integer, Iterable<Integer>>, Integer, Integer>() {
        @Override
        public Tuple2<Integer, Integer> call(Tuple2<Integer, Iterable<Integer>> tp) throws Exception {
            TaskContext tc = TaskContext.get();
            String ip = InetAddress.getLocalHost().getHostAddress();
            int key = tp._1();
            System.out.println(ip + ": Partition: " + tc.partitionId() + "\tKey: " + key);
            return new Tuple2<>(key, 1);
        }
    });
results.print();
mapped is a JavaPairDStream wrapping a dummy receiver that stores an array of integers every second.
I ran this app on a cluster with two slaves, each with 2 cores.
When I checked the printout, it seems that partitions were not assigned to nodes permanently (or in a "sticky" fashion). They moved between the two nodes frequently. This creates a problem for me.
In my real application, I need to load a fairly large amount of geo data per partition. This geo data will be used to process the data in the streams. I can only afford to load part of the geo data set per partition. If a partition moves between nodes, I have to move the geo data too, which can be very expensive.
Is there a way to make the partitions sticky, i.e. partition 0,1,2,3 stay with node 0, and partition 4,5,6,7 stay with node 1?
I have tried setting spark.locality.wait to a large number, say, 1000000, and it did not work.
Thanks.
I found a workaround.
I can make my auxiliary data an RDD, partition it, and cache it.
Later, I can cogroup it with other RDDs, and Spark will try to keep the cached RDD partitions where they are and not shuffle them. E.g.:
...
JavaPairRDD<Integer, GeoData> geoRDD =
geoRDD1.partitionBy(new HashPartitioner(num)).cache();
Later, do this:
JavaPairRDD<Integer, Integer> someOtherRDD = ...
JavaPairRDD<Integer, Tuple2<Iterable<GeoData>, Iterable<Integer>>> grp =
    geoRDD.cogroup(someOtherRDD);
Then, you can use foreach on the cogrouped RDD to process the input data with the geo data.
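A minimal sketch of that last step (the processing body is left as a comment; GeoData and grp are the type and variable from the snippet above):
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

grp.foreach(new VoidFunction<Tuple2<Integer, Tuple2<Iterable<GeoData>, Iterable<Integer>>>>() {
    @Override
    public void call(Tuple2<Integer, Tuple2<Iterable<GeoData>, Iterable<Integer>>> entry) throws Exception {
        Iterable<GeoData> geoForKey = entry._2()._1();   // the cached, co-located geo data
        Iterable<Integer> inputValues = entry._2()._2(); // the streamed values for this key
        for (Integer value : inputValues) {
            // process each input value against geoForKey here
        }
    }
});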