Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress - apache-kafka

My purpose to calculate success and fail message from source to destination per second and sum their results in daily bases.
I had two options to do that ;
stream events then group them time#source#destination
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream.selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST ).groupByKey().aggregate(
DO SOME Aggregation,
Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes));
After trying this approach above we noticed that state store is getting increase because of number of unique keys are increasing and if i am correct, because of state topics are only "compact" they are never expires.
NumberOfUniqueKeys = 86.400 seconds in a day X SOURCE X DESTINATION
Then we thought that if we do not put a time field in a KEY block, we can reduce state store size. We tried windowing operation as second approach.
using windowing operation with persistentWindowStore, CustomTimeStampExtractor, WindowBy, Suppress
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream.selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
.groupByKey() .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
.aggregate(
{
DO SOME Aggregation
}, Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())).toStream();`
After trying that second approach, we reduced state store size but now we had problem with late arrive events. Then we added grace period with 5 seconds with suppress operation and in addition using grace period and suppress operation did not guarantee to handle all late arrived events, another side effect of suppress operation is a latency because it emits result of aggregation after window grace period.
BTW
using windowing operation caused a getting WARNING message like
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason from source code and I found from here
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
* Safely construct a time window of the given size,
* taking care of bounding endMs to Long.MAX_VALUE if necessary
*/
static TimeWindow timeWindowForSize(final long startMs,
final long windowSize) {
long endMs = startMs + windowSize;
if (endMs < 0) {
LOG.warn("Warning: window end time was truncated to Long.MAX");
endMs = Long.MAX_VALUE;
}
return new TimeWindow(startMs, endMs);
}
BUT actually it does not make any sense to me that how endMs can be lower than 0...
Questions ?
What if we go through with approach 1, how can we reduce state store size ? In approach 1, It was guaranteed that all event will be processed and there will be no missing event because of latency.
What if we go through with approach 2, how should i tune my logic and catch late arrival data and reduce latency ?
Why do i get Warning message in approach 2 although all time fields are positive in my model ?
What can be other options that you can suggest other then these two approaches ?
I need some expert help :)
BR,

According to mail kafka mail group about warning message
WARNING message like "WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
It was written to me :
You can get this message "o.a.k.s.state.internals.WindowKeySchema :
Warning: window end time was truncated to Long.MAX"" when your
TimeWindowDeserializer is created without a windowSize. There are two
constructors for a TimeWindowDeserializer, are you using the one with
WindowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a Long.MAX_VALUE
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90

Related

kafka-streams sliding agg window discard out-of-order record belongs to window without grace

I have next error, actually i don't understand it.
o.a.k.s.k.i.KStreamSlidingWindowAggregate - Skipping record for expired window. topic=[...] partition=[0] offset=[16880] timestamp=[1662556875000] window=[1662542475000,1662556875000] expiration=[1662556942000] streamTime=[1662556942000]
streamTime=[1662556942000]
timestamp=[1662556875000]
streamTime-timestamp = 67s
window size is 4hour.
grace period is 0
Why was record skipped and i didn't get a output message? it belongs to window. Yes record out-of-order
Update:
After read more about kafka-streams, i understand that on each message it creates two window:
(message time - window) and this window include message.
(message time + window) and this window exclude message.
Window 1 is expired. Window 2 don't. that's why i dind't see out message.
But logically it's wrong, message belong to window but i havn't a out message.
Example
sliding window time diff = 10, grace = 0
stream time = 0
send message (time = 10, key = 2) -> message key = 2; stream time = 10
send message (time = 4, key = 1) -> no out message;
send message (time = 5, key = 1) -> no out message;
last message belongs to window (stream-time - window-time)
------ restart stream -------
stream time = 0
send message (time = 10, key = 2) -> message key = 2; stream time = 10
send message (time = 4, key = 2) -> 2 message
In Kafka Streams sliding windows are event based. A new window is created each time a new record enters or drops from the window. It is defined by the record timestamp and a fixed duration.
Each record creates a window [record.timestamp - duration, record.timestamp] and
Each dropped record creates a window [record.timestamp + 1ms, record.timestamp + 1ms + duration].
(Be aware that other stream processing frameworks use a totally different definition of 'sliding windows')
The record is not included in the window when
stream-time > window-end + grace-period
(https://kafka.apache.org/27/javadoc/org/apache/kafka/streams/kstream/SlidingWindows.html)
For your initial example, the grace-period is zero and your window ends(at record timestamp) after the stream-time; thus the record is not included in the window.
For the second example, I am not sure. My guess is the records with key=1 are expired because the stream-time(10) has exceeded the record times(4,5)(and grace period=0). For the records with key=2, one window is created for the record with timestamp=10 and an update of the same window is emitted, because the record satisfies the condition above. However, no additional windows are created for the out-of-order record, because the grace period is zero.

Kafka streams dropping messages during windowing and on restart

I am working on Kafka Streams application with following topology:
private final Initializer<Set<String>> eventInitializer = () -> new HashSet<>();
final StreamsBuilder streamBuilder = new StreamsBuilder();
final KStream<String, AggQuantityByPrimeValue> eventStreams = streamBuilder.stream("testTopic",
Consumed.with(Serdes.String(), **valueSerde**));
final KStream<String, Value> filteredStreams = eventStreams
.filter((key,clientRecord)->recordValidator.isAllowedByRules(clientRecord));
final KGroupedStream<Integer, Value> groupedStreams = filteredStreams.groupBy(
(key, transactionEntry) -> transactionEntry.getNodeid(),
Serialized.with(Serdes.Integer(), **valueSerde**));
/* Hopping window */
final TimeWindowedKStream<Integer, Value> windowedGroupStreams = groupedStreams
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(25))
.grace(Duration.ofSeconds(0)));
/* Aggregating the events */
final KStream<Windowed<Integer>, Set<String>> suppressedStreams = windowedGroupStreams
.aggregate(eventInitializer, countAggregator, Materialized.as("counts-aggregate")
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())
.withName("suppress-window")
.toStream();
suppressedStreams.foreach((windowed, value) -> eventProcessor.publish(windowed.key(), value));
return new KafkaStreams(streamBuilder.build(), config.getKafkaConfigForStreams());
I am observing that intermittently few events are getting dropped during/after windowing.
For example:
All records can be seen/printed in isAllowedByRules() method, which are valid(allowed by filters) and consumed by the stream.
But when printing the events in countAggregator, I can see few events are not coming through it.
Current configurations for streams:
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG,"kafka-app-id"
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, <bootstraps-server>);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 5);
streamsConfig.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
streamsConfig.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
streamsConfig.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 10485760);
streamsConfig.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10485760);
streamsConfig.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10485760);
/*For window buffering across all threads*/
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 52428800);
streamsConfig.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());
streamsConfig.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, **customSerdesForSet**);
Initially, I was using tumbling window but I found that mostly at the end of window few events were getting lost so I changed to hopping window (better to duplicate than lost). Then dropped events became zero. But today again after almost 4 days I saw few dropped events and there is one pattern among them, that they are late by almost a minute compared to other events which were produced together. But then expectation is that these late events should come in any of the future windows but that didn't happen. Correct me here if my understanding is not right.
Also as I have mentioned in the topic, on restart of streams (gracefully) I could see few events getting lost again at aggregation step though processed by isAllowedByRules() method.
I have searched a lot on stack overflow and other sites, but couldn't find the root cause of this behaviour. Is it something related to some configuration that I am missing/not correctly setting or could be due to some other reason?
From my understanding, you have a empty grace period :
/* Hopping window */
...
.grace(Duration.ofSeconds(0))
So your window is closed without permitting any late arrivals.
Then regarding your sub question :
But then expectation is that these late events should come in any of the future windows but that didn't happen. Correct me here if my understanding is not right.
Maybe you're mixing event time and processing time.
Your record will be categorized as 'late' if the timestamp of the record ( added by the producer at produce time, or by the brokers when arriving in the cluster if not set by producer) is outside your current window.
Here is a example with 2 records '*'.
Their event time (et1 and et2) fit in the window :
| window |
t1 t2
| * * |
et1 et2
But, processing time of et2 (pt2) is in fact as follows :
| window |
t1 t2
| * | *
pt1 pt2
Here the window is a slice of time between t1 and t2 (processing time)
et1 and et2 are respectively event time of the 2 records '*'.
et1 and et2 are timestamps set in the records themselves.
in this example , et1 and et2 are between t1 and t2, et2 have been received after the window closure, as your grace period is 0, it will be skipped.
Might be a explanation

Kafka - Consume until empty

I have a use-case where it is paramount to not continue until all consumer records in a KafkaConsumer have been fetched. In this use-case there will be nothing going into the pipeline. What is the proper way to assure that there absolutely positively is nothing at all left to fetch?
Kafka is designed to handle infinite streams of data, so the "consume all" means only that nobody sends any data for some period of time (1 minute), 1 hour, etc. - it's up to you.
You can use something like (pseudocode):
int emptyCount = 0;
while (true) {
records = Consumer.poll(500);
if (records.empty()) {
emptyCount++;
if (emptyCount >= 100) {
break;
}
continue;
}
emptyCount = 0;
...process records...
}
you can tune timeout in poll & number of empty cycles to reach necessary wait period.
If you are using kafka-console-consumer, you can specify timeout-ms argument to define how long it will wait until it is considered to be no more message coming.
--timeout-ms <Integer: timeout_ms> If specified, exit if no message is
available for consumption for the
specified interval.

Q: [Anylogic] Measuring production throughput rate

I would like to know how to measure the throughput rate of the production line on Anylogic.
Question: Are there any methods to measure the Time Between Departure of the agent at the sink block? >>(I will calculate the throughput rate by inverting the time between departure value.)
At the moment, I just simply calculated the throughput based on Little's law, which I use the average lead time and WIP level of the line. I am not sure that whether the throughput value based on this calculation will be equal to the inverted value of the time between departure or not?
I hope you guys could help me figure it out.
Thanks in advance!
There is a function "time()" that returns the current model time in model time units. Using this function, you may know the times when agent A and agent B left the system, and calculate the difference between these times. You can do this by writing the code like below in the "On exit" field of the "sink" block:
statistic.add(time() - TimeOfPreviousAgent);
TimeOfPreviousAgent = time();
"TimeOfPreviousAgent" is a variable of "double" type;
"statistic" is a "Statistic" element used to collect the measurements
This approach of measuring time in the process flow is described in the tutorial Bank Office.
As an alternative, you can store leaving time of each agent into a collection. Then, you will need to iterate over the samples stored in the collection to find the difference between each pair of samples.
Not sure if this will help but it stems off Tatiana's answer. In the agents state chart you can create variables TimeIn, TimeOut, and TimeInSystem. Then at the Statechart Entry Point have,
TimeIn = time();
And at the Final state have,
TimeOut = time();
TimeInSystem = TimeOut - TimeIn;
To observe these times for each individual agent you can use the following code,
System.out.println("I came in at " + TimeIn + " and exited at " TimeOut + " and spent " + TimeInSystem + " seconds in the system";
Then for statistical analysis you can calculate the min, avg, and max throughputs of all agents by creating in Main variables, TotalTime, TotalAgentsServiced, AvgServiceTime, MaxServiceTime, MinServiceTime and then add a function call it say TrackAvgTimeInSystem ... within the function add argument NextAgent with type double. In the function body have,
TotalTime += NextAgent;
TotalAgentsServiced += 1;
AverageServiceTime = TotalTime/TotalCarsServiced;
if(MinServiceTimeReported == 0)
{
MinServiceTime = NextAgent;
}
else if(NextAgent < MinServiceTime)
{
MinServiceTime = NextAgent;
}
if(NextAgent > MaxServiceTime)
{
MaxServiceTime = NextAgent;
}
Then within your agent's state charts, in the Final State call the function
get_Main().TrackAvgTimeInSystem(TimeInSystem);
This then calculates the min, max, and average throughput of all agents.

In Rx (or RxJava/RxScala), how to make an auto-resetting stateful latch map/filter for measuring in-stream elapsed time to touch a barrier?

Apologies if the question is poorly phrased, I'll do my best.
If I have a sequence of values with times as an Observable[(U,T)] where U is a value and T is a time-like type (or anything difference-able I suppose), how could I write an operator which is an auto-reset one-touch barrier, which is silent when abs(u_n - u_reset) < barrier, but spits out t_n - t_reset if the barrier is touched, at which point it also resets u_reset = u_n.
That is to say, the first value this operator receives becomes the baseline, and it emits nothing. Henceforth it monitors the values of the stream, and as soon as one of them is beyond the baseline value (above or below), it emits the elapsed time (measured by the timestamps of the events), and resets the baseline. These times then will be processed to form a high-frequency estimate of the volatility.
For reference, I am trying to write a volatility estimator outlined in http://www.amazon.com/Volatility-Trading-CD-ROM-Wiley/dp/0470181990 , where rather than measuring the standard deviation (deviations at regular homogeneous times), you repeatedly measure the time taken to breach a barrier for some fixed barrier amount.
Specifically, could this be written using existing operators? I'm a bit stuck on how the state would be reset, though maybe I need to make two nested operators, one which is one-shot and another which keeps creating that one-shot... I know it could be done by writing one by hand, but then I need to write my own publisher etc etc.
Thanks!
I don't fully understand the algorithm and your variables in the example, but you can use flatMap with some heap-state and return empty() or just() as needed:
int[] var1 = { 0 };
source.flatMap(v -> {
var1[0] += v;
if ((var1[0] & 1) == 0) {
return Observable.just(v);
}
return Observable.empty();
});
If you need a per-sequence state because of multiple consumers, you can defer the whole thing:
Observable.defer(() -> {
int[] var1 = { 0 };
return source.flatMap(v -> {
var1[0] += v;
if ((var1[0] & 1) == 0) {
return Observable.just(v);
}
return Observable.empty();
});
}).subscribe(...);