Difference between eventTimeTimeout and processingTimeTimeout in mapGroupsWithState - scala

What is the difference between eventTimeTimeout and processingTimeTimeout in mapGroupsWithState?
Also, is possible to make a state expire after every 10 min and if the data for that particular key arrives after 10 min the state should be maintained from the beginning?

In short:
processing-based timeouts rely on the time/clock of the machine your job is running. It is independent of any timestamps given in your data/events.
event-based timeouts rely on a timestamp column within your data that serves as the event time. In that case you need to declare this timestamp as a Watermark.
More details are available in the Scala Docs on the relevant class
GroupState:
With ProcessingTimeTimeout, the timeout duration can be set by calling GroupState.setTimeoutDuration. The timeout will occur when the clock has advanced by the set duration. Guarantees provided by this timeout with a duration of D ms are as follows:
Timeout will never be occur before the clock time has advanced by D ms
Timeout will occur eventually when there is a trigger in the query (i.e. after D ms). So there is a no strict upper bound on when the timeout would occur. For example, the trigger interval of the query will affect when the timeout actually occurs. If there is no data in the stream (for any group) for a while, then their will not be any trigger and timeout function call will not occur until there is data.
Since the processing time timeout is based on the clock time, it is affected by the variations in the system clock (i.e. time zone changes, clock skew, etc.).
With EventTimeTimeout, the user also has to specify the event time watermark in the query using Dataset.withWatermark(). With this setting, data that is older than the watermark are filtered out. The timeout can be set for a group by setting a timeout timestamp usingGroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp. You can control the timeout delay by two parameters - (i) watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering). Guarantees provided by this timeout are as follows:
Timeout will never be occur before watermark has exceeded the set timeout.
Similar to processing time timeouts, there is a no strict upper bound on the delay when the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
"Also, is possible to make a state expire after every 10 min and if the data for that particular key arrives after 10 min the state should be maintained from the beginning?"
This is happening automatically when using mapGroupsWithState. You just need to make sure to actually remove the state after the 10 minutes.

Related

Is there a way to specify infinite allowed lateness in Apache Beam?

I'm using fixed windows to batch data by event time in order to send it to an external API efficiently (batches of 60 seconds), accumulation mode is set to DISCARDING because it doesn't matter if late data is sent to the external API without the previous data.
Is it possible to specify an infinite allowed lateness, so late data is never discarded?
It is definitely possible, you can set allowed lateness to a very high Duration (for instance, Duration.standardDays(36500)). On the other hand , doing so would result in your state growing indefinitely, which might not be what you want. Every open window (every window ever seen) will have at least a timer called a GC timer - a timer set for the end of the window + allowed lateness. Every timer has to be kept in state and therefore, the size of your state will grow over time.
If you do not need batching based on event-time, it might be a better option to use GroupIntoBatches, which should not suffer from this problem (you don't need to set allowed lateness and the size of your state will not grow).

Does Apache Beam stateful Processing consider window lateness constraints (withAllowedLateness) for resetting state?

I'm trying to implement a valueState to filter records in my ParDo transformation. The high level flow is this:
Fixed-Window of 1hr size, with allowedLateness (10min)
The first message (for a given key) that is processed in the ParDo shall set the valueState(boolean) to true. Subsequent messages for the same key shall be dropped if corresponding valueState is set to true. (Allow only first message for a given key in every window).
The messages (that are not dropped in step 2) will be written out as output.
While testing this however, I see that, after the Fixed window time-period ends (1hr), the state is reset/lost. Ideally, the state should be available to process late records until allowedLateness period (10min is complete).
These parts are right:
Each 1 hour window expires when the watermark reaches the end of the hour plus 10 minutes.
For a given window, the state is cleaned up after the window expires.
Here are the parts that I have corrections
State is never reset.
Elements with timestamps in different windows are processed totally independently. Many windows may be receiving data at the same time. Each hour window happened after another, when the data was generated. But it is not processed after the other.
Allowed lateness will not cause elements from a later window to be processed using the state from the prior window. It will simply allow the state to stay longer and the elements to not be dropped.

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my kafka streams from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seem to be new logic in the KStreamWindowAggregate class that checks if a window has closed. If it has been closed the messages are skipped. Compared to 2.0.1 these messages where still processed.
Question
Is there a way to get the same behavior like before? I'm seeing lots of gaps in my data with this upgrade and not sure how to solve this, as previously these gaps where not seen.
The aggregate function that I'm using already deals with windowing and as a result with expired windows. How does this new logic relate to this expiring windows?
Update
While further exploring I indeed see it to be related to the graceperiod in ms. It seems that in my custom timestampextractor (that has the logic to use the timestamp from the payload instead of the normal timestamp), I'm able to see that the incoming timestamp for the expired window warnings indeed is bigger than the 24 hours compared to the event time from the payload.
I assume this is caused by consumer lags of over 24 hours.
The timestamp extractor extract method has a partition time which according to the docs:
partitionTime the highest extracted valid timestamp of the current record's partition˙ (could be -1 if unknown)
so is this the create time of the record on the topic? And is there a way to influence this in a way that my records are no longer skipped?
Compared to 2.0.1 these messages where still processed.
That is a little bit surprising (even if I would need to double check the code), at least for the default config. By default, store retention time is set to 24h, and thus in 2.0.1 older messages than 24h should also not be processed as the corresponding state got purged already. If you did change the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via TimeWindows#grace() method accordingly.
The aggregate function that I'm using already deals with windowing and as a result with expired windows. How does this new logic relate to this expiring windows?
Not sure what you mean by this or how you actually do this? The old and new logic are similar with regard to how a long a window is stored (retention time config). The new part is the grace period that you can increase to the same value as retention time if you wish).
About "partition time": it is computed base on whatever TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payload.

Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored?

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~ 20 megabytes or 100k records whatever comes first, but if message rate is low, process at least every minute (avoid small batches, but minimum MySinkTask.put() rate to be every minute).
This is what I set for consumer settings in an attempt to accomplish it:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I needs this fetch.min.bytes setting, or else MySinkTask.put() is called for multiple times per second despite the other settings...?
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times and several minutes pass by, until fetch.min.bytes is reached, and then I get them all at once.
I fail to understand so far:
Why fetch.max.wait.ms=60000 is not pushing downwards from the consumer to the put() call of my connector? Shouldn't that have precedence over fetch.min.bytes?
What setting controls the ~ 2x per second call to MySinkTask.put() if fetch.min.bytes=1 (default)? I don't understand why it does that, even the verbose output of the Connect runtime settings don't show any interval below multiples of seconds.
I've double-checked the log output, and the lines INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: as printed by the Connect Runtime are showing the expected values as I pass with the consumer. prefixed values.
The "process at least every interval" part seems not possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to dynamically adjust the ConsumerConfig while the Task is running. :-(
Work-around for now is batching in the Task manually; set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary. This is not very ideal as it infers some overhead for the Connector which I hoped to avoid.
The logic how Connect does a ~ 2x per second batching from its consumer's poll to SinkTask.put() remains a mystery to me, but it's better than being called for every message.

kafka Streams session windows

Hello I am working on kafka session window with inactive time 5 mins. I want some kind of feedback when inactive time is reached and session is drooped for the key.
lets assume I have
(A,1)
record where 'A' is the key. now if i don't get any 'A' key record in 5 mins the session is dropped.
I want to do some operation on end of session lets say (value)*2 for that session. is there any way I can achieve this using Kafka Stream API
Kafka Streams does not drop a session after the gap-time passed. Instead, if will create a new session if another record with the same key arrives after the gap-time passed and maintain both session in parallel. This allows to handle out-of-order data. It could even happen, that two session get merged if an out-of-order data falls into a gap and "connects" both sessions with each other.
Sessions are maintained for 1 day by default. You can change this via SessionWindows#until() method. If a session expires it will be dropped silently. There is no notification. You also need to consider config parameter window.store.change.log.additional.retention.ms:
The default retention setting is Windows#maintainMs() + 1 day. You can override this setting by specifying StreamsConfig.WINDOW_STORE_CHANGE_LOG_ADDITIONAL_RETENTION_MS_CONFIG in the StreamsConfig.
Thus, you want to do react if time passed, you should look into punctuations that allow you to register regular callbacks (some kind of timer) either based on "even time progress" or wall-clock time. This allows you to react if a session is not update for a certain period of time and you think it's "completed".