Does Apache Beam stateful processing consider window lateness constraints (withAllowedLateness) for resetting state?

I'm trying to implement a ValueState to filter records in my ParDo transformation. The high-level flow is this:
1. Fixed window of 1 hour, with allowed lateness of 10 minutes.
2. The first message (for a given key) processed in the ParDo sets the ValueState (a boolean) to true. Subsequent messages for the same key are dropped if the corresponding ValueState is true (allow only the first message for a given key in every window).
3. The messages that are not dropped in step 2 are written out as output.
While testing this, however, I see that after the fixed window's time period ends (1 hour), the state is reset/lost. Ideally, the state should remain available to process late records until the allowed lateness period (10 min) is complete.
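For reference, here is a minimal sketch of the setup described above, with assumed names and an assumed KV<String, String> element type (not the asker's actual code):

// Stateful filter: emit only the first element per key, per window.
class FirstPerKeyFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("seen")
  private final StateSpec<ValueState<Boolean>> seenSpec =
      StateSpecs.value(BooleanCoder.of());

  @ProcessElement
  public void process(ProcessContext c,
      @StateId("seen") ValueState<Boolean> seen) {
    if (!Boolean.TRUE.equals(seen.read())) {
      seen.write(true);       // mark this key as seen in this window
      c.output(c.element());  // only the first message passes through
    }
  }
}

PCollection<KV<String, String>> firstPerKey = input
    .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1)))
        .withAllowedLateness(Duration.standardMinutes(10)))
    .apply(ParDo.of(new FirstPerKeyFn()));

Note that Beam state declared this way is scoped per key and per window, which is what makes "only the first message per key in every window" work.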

These parts are right:
Each 1 hour window expires when the watermark reaches the end of the hour plus 10 minutes.
For a given window, the state is cleaned up after the window expires.
Here are the parts that need correction:
State is never reset.
Elements with timestamps in different windows are processed completely independently, and many windows may be receiving data at the same time. Each hour's window happened after the previous one when the data was generated, but the windows are not processed one after the other.
Allowed lateness will not cause elements from a later window to be processed using the state from the prior window. It will simply allow the state to stay longer and the elements to not be dropped.
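If you want to observe that cleanup point, Beam's stateful DoFns expose it via @OnWindowExpiration, which runs just before per-window state is garbage-collected, i.e., at the end of the window plus the allowed lateness. A sketch, as a method added to the hypothetical FirstPerKeyFn above:

@OnWindowExpiration
public void onExpiry(@StateId("seen") ValueState<Boolean> seen,
    OutputReceiver<KV<String, String>> out) {
  // Last chance to read the state (and emit something) before it is
  // cleaned up for this key and window.
}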

Related

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it only mentions, rather vaguely:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness, but what happens if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google PubSub, it already has a watermark which will get taken, but what if the source is something else? For example a Kafka topic, which I believe does not inherently have a watermark, so I don't see how something like this would apply.
Is it always 10 seconds, or just 0? Does it look at the last few minutes to determine the max lag, and if so, over how many (surely not since forever, as that would be distorted by the initial start of processing, which might see giant lag)? I could not find anything on the topic.
I also searched for Apache Beam documentation outside the context of Google Dataflow, but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending-timestamps watermark generator will result in perfect overall watermarks. Note that no TimestampAssigner is provided in the example; the timestamps of the Kafka records themselves will be used instead.
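As a sketch of what that looks like in Flink, assuming the newer WatermarkStrategy API, a hypothetical event type MyEvent with its deserialization schema, and a kafkaProps Properties object:

// Attach the watermark strategy to the Kafka source itself, so watermarks
// are generated per Kafka partition before the partition streams are merged.
FlinkKafkaConsumer<MyEvent> consumer = new FlinkKafkaConsumer<>(
    "my-topic", new MyEventDeserializationSchema(), kafkaProps);
consumer.assignTimestampsAndWatermarks(
    WatermarkStrategy.<MyEvent>forMonotonousTimestamps());
DataStream<MyEvent> stream = env.addSource(consumer);

Because no timestamp assigner is set on the strategy, the timestamps carried by the Kafka records themselves are used.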
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
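If you want records like the 5:34 arrival above to be accepted rather than dropped once the watermark has passed, that is exactly what withAllowedLateness configures. A minimal sketch, assuming a PCollection<String> named items:

PCollection<String> windowedItems = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
        // Keep accepting data for one extra minute behind the watermark.
        .withAllowedLateness(Duration.standardMinutes(1)));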

Is there a way to specify infinite allowed lateness in Apache Beam?

I'm using fixed windows to batch data by event time in order to send it to an external API efficiently (batches of 60 seconds); the accumulation mode is set to DISCARDING because it doesn't matter if late data is sent to the external API without the previous data.
Is it possible to specify an infinite allowed lateness, so late data is never discarded?
It is definitely possible: you can set allowed lateness to a very high Duration (for instance, Duration.standardDays(36500)). On the other hand, doing so would result in your state growing indefinitely, which might not be what you want. Every open window (every window ever seen) will have at least one timer, the GC timer, set for the end of the window plus the allowed lateness. Every timer has to be kept in state, and therefore the size of your state will grow over time.
If you do not need batching based on event time, it might be a better option to use GroupIntoBatches, which should not suffer from this problem (you don't need to set allowed lateness, and the size of your state will not grow); see the sketch below.
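A minimal sketch of the GroupIntoBatches alternative, assuming KV<String, String> input and an arbitrary batch size of 100:

PCollection<KV<String, Iterable<String>>> batches =
    input.apply(GroupIntoBatches.<String, String>ofSize(100));

Newer Beam versions also offer withMaxBufferingDuration, which flushes partial batches after a time limit if you still need time-bounded batching.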

Window does not evaluate elements from Kafka source

I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
    .join(env.addSource(createKafkaConsumer()))
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    .apply(new RichJoinFunction<A, B, Out>() { ... });
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
Kafka topic A has 1 record, Kafka topic B has 5. My understanding would be that the JoinFunction is triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A comes in within the 2 hours, another 5 records would be created (2×5 records). However, what comes through in the sink is rather unpredictable; I could not see a pattern. Sometimes there's nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join with the prior records.
My key question:
What is actually happening here? Are the records only emitted after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would this make my TimeWindow obsolete? Do those two concepts exist in parallel, i.e., could the join window be 2 hours, but the join function + output be triggered per element? What about duplicates in that case?
Subsequently, does processing time mean the point in time when the record is consumed from the topic? So if I, for example, setStartFromEarliest() on start, would all messages consumed within the next two hours fall into that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) is set, and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events which fall into its timespan, in your case 2 hours. By default, the window only outputs results once the 2 hours are over, because it needs to know that no other events will be coming for this window.
If you want to output early results (e.g. for every incoming record), then you could specify a custom Trigger which fires on every element. See the Trigger API docs for more information about this.
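For firing on every element you don't even need a fully custom trigger; Flink ships a CountTrigger that fires each time its count threshold is reached. A sketch, reusing the placeholders from the question:

env.addSource(createKafkaConsumer())
    .join(env.addSource(createKafkaConsumer()))
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    // Fire on every incoming element instead of only when the window ends.
    .trigger(CountTrigger.of(1))
    .apply(new RichJoinFunction<A, B, Out>() { ... });

Note that each firing re-evaluates the join over everything collected in the window so far, so downstream consumers must tolerate duplicate results.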
Update
The window does not start with the first element; windows start at multiples of the window length. For example, if your window size is 2 hours, you can only have windows [0, 2), [2, 4), ..., but not [1, 3), [3, 5).
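If you need differently aligned windows, Flink's tumbling window assigners also accept an offset. A sketch:

// Two-hour windows shifted by one hour: [1, 3), [3, 5), ...
.window(TumblingProcessingTimeWindows.of(Time.hours(2), Time.hours(1)))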

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of(Duration.parse("PT1H"))).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a rhombus-like topology that starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamps for my messages are event-time; that is, they get picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
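Such an extractor might look like this sketch (MyEvent and its accessor are hypothetical stand-ins for the actual message body):

// Pull event time out of the message body instead of the record metadata.
public class BodyTimestampExtractor implements TimestampExtractor {
  @Override
  public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
    MyEvent event = (MyEvent) record.value();
    return event.getEventTimeMillis(); // may be several years in the past
  }
}

It would be wired in via StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG.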
That all works well at first sight in my unit tests with a couple of input/output messages, as well as in the runtime environment (with real Kafka).
The problem seems to appear when the number of messages becomes significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records, sharing one key that is different from the key in step 1, get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special to see in those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I exchange the datasets in steps 1 and 3, I get exactly the same situation: ~40K updates coming from the second dataset and the same number, ~100, from the first.
I can easily reproduce this issue locally in unit tests using TopologyTestDriver (but only with larger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
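For reference, that attempt was roughly the following Streams configuration (a sketch, not the asker's exact code):

Properties props = new Properties();
// Disable record caching so every update is forwarded downstream immediately.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);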
UPDATE
I tried both reduce() calls and aggregate() calls instead; the issue persists in both cases.
What I'm also noticing is that, with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, same result.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but it does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So, it seems at the moment that something like "old" windows get dropped and new records for those old windows get ignored by the stream.
There is a window retention setting, but with the default value that I use, I was expecting windows to retain their state and stay active for at least 12 hours (which significantly exceeds the duration of my unit test run).
I tried to amend the left reducer with the following window store config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),  // retention period
        Duration.ofHours(1),       // window size
        false))                    // do not retain duplicates
Still no difference in the results.
The same issue persists even with only the single "left" flow, without the "right" flow and without the join(). It seems the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts from the beginning of those 2 years again. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
The Kafka Streams version is 2.4.0; I am also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. When I load the first dataset, the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows. Instead of
.windowedBy(TimeWindows.of(windowSize))
I had to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure the reducer's store settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),  // retention period, must cover the grace period
        windowSize,                // window size
        false))                    // do not retain duplicates
That's it; the output is as expected.

Flink event-time session windows with max total time

I was wondering if it's possible to create a WindowAssigner that is similar to:
EventTimeSessionWindows.withGap(Time.seconds(1L))
Except that I don't want the window to keep growing in event time with each element. I want the beginning of the window to be defined by the first element received (for that key) and to end exactly 1 second later, no matter how many elements arrive in that second.
So it would probably look like this hypothetically:
EventTimeSessionWindows.withMax(Time.seconds(1L))
Thanks!
There is no built-in window assigner for this use case.
However, you can implement this with a GlobalWindow, which collects all incoming elements, plus a Trigger that registers a timer when an element is received while the window is empty, i.e., on the first element or on the first element after the window was purged. The window collects new elements until the timer fires; at that point, the window is evaluated and purged. A sketch of such a trigger follows.
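The following is one way that could look, with assumed class and state names; it fires and purges on an event-time timer set exactly one window duration after the first element:

public class FixedDurationTrigger<T> extends Trigger<T, GlobalWindow> {

  // Per-key deadline of the currently open window; null means no open window.
  private final ValueStateDescriptor<Long> deadlineDesc =
      new ValueStateDescriptor<>("deadline", Long.class);
  private final long durationMs;

  public FixedDurationTrigger(Time duration) {
    this.durationMs = duration.toMilliseconds();
  }

  @Override
  public TriggerResult onElement(T element, long timestamp, GlobalWindow window,
      TriggerContext ctx) throws Exception {
    ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
    if (deadline.value() == null) {
      // First element (or first after a purge): the window ends exactly
      // durationMs after this element's timestamp.
      long end = timestamp + durationMs;
      deadline.update(end);
      ctx.registerEventTimeTimer(end);
    }
    return TriggerResult.CONTINUE;
  }

  @Override
  public TriggerResult onEventTime(long time, GlobalWindow window,
      TriggerContext ctx) throws Exception {
    ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
    Long end = deadline.value();
    if (end != null && time == end) {
      deadline.clear();
      return TriggerResult.FIRE_AND_PURGE; // evaluate the window, then purge it
    }
    return TriggerResult.CONTINUE;
  }

  @Override
  public TriggerResult onProcessingTime(long time, GlobalWindow window,
      TriggerContext ctx) throws Exception {
    return TriggerResult.CONTINUE;
  }

  @Override
  public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
    ctx.getPartitionedState(deadlineDesc).clear();
  }
}

It would be used on a keyed stream as .window(GlobalWindows.create()).trigger(new FixedDurationTrigger<>(Time.seconds(1L))).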