Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. A Kafka Streams application does some transformations on these messages, then groups them by some key, and then aggregates them by an event-time window with a given grace period.
Each task could process incoming messages at a different speed (e.g., because they run on servers with different performance characteristics). This means that after the groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.
Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark when merging partitions.
Should this be a real concern, especially when processing large amounts of historical data, or am I missing something? Does Kafka Streams offer a solution to deal with this scenario?
UPDATE My question is not about KStream-KStream joins but about a single KStream event-time aggregation preceded by a stream shuffle.
Consider this code snippet:
stream
    .mapValues(...)
    .groupBy(...)
    .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
    .aggregate(...);
I assume the mapValues() operation could be slow for some tasks for whatever reason, and because of that the tasks process messages at a different pace. When the shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).
My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.
Basically, a task/processor maintains a stream-time, which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purposes in Kafka Streams (e.g., Punctuator, Windowed Aggregation, etc.).
[Windowed Aggregation]
As you mentioned, the stream-time is used to determine whether a record should be accepted, i.e. record_accepted = end_window_time(current record) + grace_period > observed stream_time.
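A minimal sketch of that check (illustrative names, not the actual Kafka Streams internals), assuming the 60-second window and 10-second grace period from the snippet above:

// Simplified illustration of the windowed-aggregation acceptance check described above.
// Names and structure are illustrative, not Kafka Streams internals.
class WindowAcceptanceSketch {
    static final long WINDOW_SIZE_MS = 60_000L;   // 60-second windows, as in the snippet above
    static final long GRACE_PERIOD_MS = 10_000L;  // 10-second grace period

    static boolean isAccepted(long recordTimestampMs, long observedStreamTimeMs) {
        long windowStart = recordTimestampMs - (recordTimestampMs % WINDOW_SIZE_MS);
        long windowEnd = windowStart + WINDOW_SIZE_MS;
        // Keep the record only if its window end plus the grace period is still ahead
        // of the highest timestamp seen so far (the observed stream-time).
        return windowEnd + GRACE_PERIOD_MS > observedStreamTimeMs;
    }
}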
As you described it, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline), this will create out-of-order messages. Unfortunately, I'm afraid the only way to deal with that is to increase the grace_period.
This is actually the eternal trade-off between Availability and Consistency.
[Behaviour for KStream/KStream and KStream/KTable Joins]
When you are performing a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned topics. For example, Task 0 will be assigned to TopicA-Partition0 and TopicB-Partition0.
The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.
Then, records are polled one by one from these queues and processed by the topology instance. However, the record that is returned from the poll is the one from the non-empty queue with the lowest timestamp.
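A simplified illustration of that selection (this is just the idea, not the actual Kafka Streams implementation):

import java.util.Deque;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

// Simplified illustration of choosing the next record across per-partition queues.
class LowestTimestampSelector {
    static ConsumerRecord<byte[], byte[]> nextRecord(
            Map<TopicPartition, Deque<ConsumerRecord<byte[], byte[]>>> queues) {
        Deque<ConsumerRecord<byte[], byte[]>> best = null;
        for (Deque<ConsumerRecord<byte[], byte[]>> queue : queues.values()) {
            // Skip empty queues; among non-empty ones, pick the head with the lowest timestamp.
            if (queue.isEmpty()) {
                continue;
            }
            if (best == null || queue.peekFirst().timestamp() < best.peekFirst().timestamp()) {
                best = queue;
            }
        }
        return best == null ? null : best.pollFirst();
    }
}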
In addition, if a queue is empty, the task may become idle for a period of time so that no more records are polled from the queue. The maximum amount of time a Task will stay idle can be configured with the stream config max.task.idle.ms.
This mechanism allows synchronizing co-located partitions. By default, max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition, which may lead to records being filtered because the stream-time will potentially increase more quickly.
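For example, a sketch of raising that config (values are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch: example configuration that lets a task wait for data from lagging/empty
// partitions instead of letting stream-time race ahead. Values are illustrative.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 5000L);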
Related
I have a Kafka Streams application that performs windowing (using the original event time, not wall-clock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data?
For example: if today is 2022-01-09 and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE Using Kafka Streams 2.5.0
Updated answer for Kafka Streams version 2.5:
When using event time, Kafka Streams will behave independently of the wall-clock time, as long as no events contain the wall-clock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which will consume the partitions one event at a time. On any given topic, at most one partition will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked for their timestamp in comparison to the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped. Otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset. It is independent of the execution time of the processing. If events are dropped, the corresponding metrics are changed.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events of the older topic might be dropped if the time window configuration does not have a large enough retention or grace period. See the JavaDoc of TimeWindows on the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
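For illustration, a sketch of a windowed aggregation with an explicit grace period and store retention (2.5 API; the store name, key/value types, and durations are example values on an assumed String-keyed stream):

import java.time.Duration;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

// Sketch: 1-day windows with a generous grace period, and a store retention that
// covers window size + grace so reprocessed old data is not dropped too early.
stream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofDays(1)).grace(Duration.ofDays(2)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("daily-counts")
            .withRetention(Duration.ofDays(3)));  // retention >= window size + grace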
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data arrives for a time window whose end plus the grace period is already behind the observed stream time, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly. So this behaviour should be easy to pick up.
I suggest trying out this reprocessing if feasible and watching the logs and metrics.
Here we go: I've got a quite complicated topology of various joins, aggregations, filters, maps, etc. By default the NUM_STREAM_THREADS_CONFIG parameter equals 1, and that is completely deterministic by definition - thus, the partition's total ordering (which is guaranteed by Kafka itself) is preserved.
Will total ordering be preserved once I set NUM_STREAM_THREADS_CONFIG to 2 or more?
Does it depend on the specific topology? I've checked the docs and gone through the threading model section, yet did not find an answer.
Data is always processed in per-partition offset order, even if you set num.stream.threads to a larger value.
In Kafka Streams, sub-topologies are translated into tasks (based on input topic partitions) and tasks process records of their partitions in offset order. The number of tasks limits the number of threads you can keep busy (similar to the maximum number of consumers in a consumer group). If you configure more threads than available tasks, some threads just stay idle.
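For example (a sketch; the value is illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch: with e.g. 4 input partitions there are at most 4 busy tasks, so setting
// more than 4 threads would just leave the extra threads idle.
Properties props = new Properties();
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);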
If a task processes data from multiple topics/partitions, there is no strict ordering guarantee for data of different partitions. Kafka Streams will take the record timestamps into account though, and process records with smaller timestamps first.
I know that if we have multiple partitions and almost the same number of consumers in the consumer group, the processing will speed up. If we want to maintain the ordering of the events and process each event as and when it is received, how can we achieve this with multiple partitions and consumers?
In my use case processing the events in order is extremely critical or else the system will fall apart. I wanted to use multiple partitions to increase parallelism but somehow 'get them in order'.
Simplest answer: you can't
Once you split data across partitions, you can't guarantee ordering across them at consumption time (even with a single consumer).
Isn't there any logic by which you can shard the data into partitions, so that messages that have to be consumed in order end up in the same partition?
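Yes, that is the usual approach: pick a message key so that related events hash to the same partition, since Kafka preserves order within a partition. A minimal sketch (topic name, key choice, and config values are examples):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: keying records by an entity id (e.g. an order id) so all events for that
// entity land in the same partition and are consumed in order.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("orders", "order-42", "created"));
producer.send(new ProducerRecord<>("orders", "order-42", "paid"));     // same key -> same partition,
producer.send(new ProducerRecord<>("orders", "order-42", "shipped"));  // so relative order is preserved
producer.close();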
I have a simple Dataflow pipeline (job id 2018-05-15_06_17_40-8591349846083543299) with a minimum of 1 worker and a maximum of 7 workers that does the following:
Consume from 4 Kafka topics using KafkaIO. Each topic is represented differently and is a separate PCollection
Perform transformation on each PCollection to output a standard representation PCollection.
Merge the 4 PCollections using Flatten.pCollections.
Window into hourly windows with the following trigger:
Repeatedly
  .forever(
    AfterFirst.of(
      AfterPane.elementCountAtLeast(40000),
      AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))
    )
  )
  .orFinally(AfterWatermark.pastEndOfWindow())
Write these events to GCS using AvroIO windowed writes with 14 shards.
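Roughly how the windowing above is wired up (the element type, PCollection name, allowed lateness, and accumulation mode below are illustrative, not the exact job settings):

import org.apache.beam.sdk.transforms.windowing.*;
import org.joda.time.Duration;

// Sketch: hourly fixed windows with the trigger described above.
merged.apply("HourlyWindows",
    Window.<Event>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(
            Repeatedly.forever(
                AfterFirst.of(
                    AfterPane.elementCountAtLeast(40000),
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(5))))
                .orFinally(AfterWatermark.pastEndOfWindow()))
        .withAllowedLateness(Duration.ZERO)   // illustrative value
        .discardingFiredPanes());             // illustrative accumulation mode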
When the pipeline is launched, everything is fine initially, but several hours later the System Lag increases dramatically in the AvroIO:GroupIntoShards step.
Upon further investigation, one of the topics is lagging behind by many hours (this topic has the greatest events per second when compared to the other 3). Looking at the logs I see Closing idle reader for S12-000000000000000a, which is understandable. However, the topic's consumer group offsets for the 36 partitions are in a state where for some partitions the offset is very low, but for others it is very high. The log-end-offset is more or less evenly distributed, and the records we are producing are around the same size.
Questions:
If the System Lag is high in a certain step, does that prevent the Kafka consumers from consuming?
Any possible reason for the uneven distribution in Kafka offsets?
The PCollections that are merged have different traffic patterns, some low and some high. Would adding the AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)) trigger effectively start writing to GCS for each (window, shard) 5 minutes after an event is first seen in a window?
Updating the pipeline using the same code / configuration brings it back into a normal state where the consumed rate is much higher (due to the lag before the restart) than the produced rate.
Addressing 3 questions raised (I left a comment about the specific job):
No, system lag does not prevent Kafka from consuming.
In general if there is lots of work for downstream stages ready to be processed, that can delay upstream work from starting. But that is not KafkaIO specific.
Does not seem to be the case here. In general, assuming there is no skew among the Kafka partitions themselves, heavy skew in Beam processing can slow down readers assigned to workers that are doing more work than others.
I think yes. I think pastFirstElementInPane() applies to elements from any of the sources.
I have N topics as input, each with messages added in ascending delivery date order. Topics can vary widely in message count, date range, partitioning strategy. But I know that all partitions for every topic will independently be in date order.
I want to merge all N topics priority-queue style into a new single topic T. T can also have whatever partition count and strategy it wants, since the only requirement is that each individual partition of T is itself in date order. I then feed T to partition-aware consumers which consume it and idle between due dates, since I want each message to be delivered on or closely after its delivery date. This whole pipeline can stream forever.
I expect tuning issues with exactly how partitions amongst all the N input topics and the single T output topic are distributed, and advice that affects that specifically is welcome, but right now I'm mainly interested in the overall viability of doing this at all using only Kafka topics, not an RDB or key-value store. So some extra I/O moving messages between non-optimal topic partitions is okay.
Is this doable with the 0.9 consumer, where I can know which partitions are assigned to each consumer, so I can let auto-rebalancing occur while endlessly peeking, merging to T, and committing the offset of the oldest message on each actual partition? I must have partition awareness to have a chance of this working.
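Roughly the loop I have in mind (just a sketch; topic names and configs are placeholders, it uses a newer client API than 0.9 for brevity, and a real version would bound the buffers, e.g. by pausing partitions, and handle the shared "last date written to T" state):

import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

public class MergeSketch {
    public static void main(String[] args) {
        Properties cc = new Properties();
        cc.put("bootstrap.servers", "localhost:9092");
        cc.put("group.id", "merger");
        cc.put("enable.auto.commit", "false");
        cc.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cc.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        Properties pc = new Properties();
        pc.put("bootstrap.servers", "localhost:9092");
        pc.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        pc.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        Map<TopicPartition, Deque<ConsumerRecord<byte[], byte[]>>> buffers = new HashMap<>();

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cc);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pc)) {

            // The rebalance listener is where partition awareness comes in.
            consumer.subscribe(Arrays.asList("inputA", "inputB"), new ConsumerRebalanceListener() {
                public void onPartitionsAssigned(Collection<TopicPartition> parts) { }
                public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                    buffers.keySet().removeAll(parts); // drop buffers for partitions we no longer own
                }
            });

            while (true) {
                // Buffer fetched records per source partition (each is already in date order).
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                    buffers.computeIfAbsent(new TopicPartition(r.topic(), r.partition()),
                                            tp -> new ArrayDeque<>()).addLast(r);
                }
                // Find the buffered head with the lowest timestamp (priority-queue style merge).
                Map.Entry<TopicPartition, Deque<ConsumerRecord<byte[], byte[]>>> oldest = null;
                for (Map.Entry<TopicPartition, Deque<ConsumerRecord<byte[], byte[]>>> e : buffers.entrySet()) {
                    if (e.getValue().isEmpty()) continue;
                    if (oldest == null || e.getValue().peekFirst().timestamp()
                            < oldest.getValue().peekFirst().timestamp()) {
                        oldest = e;
                    }
                }
                if (oldest != null) {
                    ConsumerRecord<byte[], byte[]> head = oldest.getValue().pollFirst();
                    // Forward the oldest message to T, then commit only that partition's offset.
                    producer.send(new ProducerRecord<>("T", head.key(), head.value()));
                    consumer.commitSync(Collections.singletonMap(
                            oldest.getKey(), new OffsetAndMetadata(head.offset() + 1)));
                }
            }
        }
    }
}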
Due to needing shared merge state (the last date added to T), is it better to stick with multiple partition-aware consumers in a single process, parallel processes, or multiple servers, given where that state will need to live? I favor keeping the state onboard in shared memory, not networked in ZK or whatever. On a restart I can fetch it once and maintain it while running if on a single machine.
Am I overlooking any Kafka features that would make what I describe easier or more efficient, like some atomic message move between topics? I know I am going against the grain of its design and this scenario is similar to TS.