Is there an out-of-the-box mechanism in Kafka to produce a record that shall not be processed before a given timestamp regardless of the contents of the topic?
Thanks.
There isn't; Kafka consumers read by offset, not by timestamp. But that doesn't prevent your consumer from handing messages off to a secondary in-memory "priority queue" ordered by timestamp, backed by a scheduler thread that checks for the next event due to be processed.
The only problem: if you process the message at offset O+1 (with time T-1) while still waiting on the message at offset O (with time T), and your consumer crashes and loses that in-memory queue, then you've effectively skipped a message (O+1 was committed without O ever being processed).
Depending on your application, this might be okay, since you could produce the records again to re-queue them. You can also seek to specific offsets, or look up (approximate) offsets for a particular timestamp (the record timestamp defaults to the time the record was produced).
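To make the "priority queue ordered by timestamp" idea concrete, here is a minimal sketch in plain Python. It does not touch Kafka at all; `DelayBuffer`, `add` and `pop_due` are hypothetical names, and a real consumer would call `add` from its poll loop and `pop_due` from the scheduler thread.

```python
import heapq
import itertools


class DelayBuffer:
    """Holds records until their due time. All names here are hypothetical."""

    def __init__(self):
        self._heap = []                # entries: (due_time, seq, offset, value)
        self._seq = itertools.count()  # tie-breaker so values are never compared

    def add(self, due_time, offset, value):
        heapq.heappush(self._heap, (due_time, next(self._seq), offset, value))

    def pop_due(self, now):
        """Return every buffered (offset, value) with due_time <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, offset, value = heapq.heappop(self._heap)
            due.append((offset, value))
        return due

    def __len__(self):
        return len(self._heap)


buf = DelayBuffer()
buf.add(due_time=100, offset=0, value="late")   # offset O, time T
buf.add(due_time=50, offset=1, value="early")   # offset O+1, time T-1
ready_at_60 = buf.pop_due(now=60)     # only offset 1 is due yet
ready_at_100 = buf.pop_due(now=100)   # offset 0 becomes due later
```

Note that offset 1 comes out before offset 0, which is exactly the situation where committing offsets naively can skip a record after a crash.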
Related
Is it possible for a Kafka consumer to pick up messages only after a time defined in the message itself? If not, how can we achieve this in Kafka?
I found a related question, but it didn't help. As I see it, Kafka is based on sequential reads from the file system and can only be used to read topics in order, preserving message ordering. Am I right?
The same is possible with RabbitMQ.
If I understand the question, you would need to consume the data, deserialize it and inspect the time field. Then append to some priority queue data structure and start a background timer thread to check if events from this queue should further be processed, and not block the Kafka consumer.
The only downside to this approach is that you then need to worry about processing and committing "shorter time" events that are read by the consumer while it is still waiting on previously consumed "longer time" events. Otherwise, a restart of your client will drop all events from the in-memory queue and start consuming after the last committed record.
You might be able to work around this using a persistent "outbox pattern" database table, or otherwise by tracking offsets and processed records manually and seeking past any duplicates.
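One way to picture "tracking offsets and processed records manually" is to commit only up to the lowest offset that is still unprocessed. The sketch below is an in-memory illustration under that assumption; `OffsetTracker` and its methods are hypothetical names, and a durable version would persist the same state (for example in the outbox table).

```python
class OffsetTracker:
    """Tracks out-of-order completions for one partition (hypothetical helper)."""

    def __init__(self, start_offset):
        self._next_to_commit = start_offset  # lowest offset not yet known to be done
        self._done = set()                   # finished offsets above that point

    def mark_done(self, offset):
        if offset >= self._next_to_commit:
            self._done.add(offset)
        # Advance over any contiguous run of finished offsets.
        while self._next_to_commit in self._done:
            self._done.remove(self._next_to_commit)
            self._next_to_commit += 1

    def committable(self):
        """Offset that is safe to commit (Kafka commits the next offset to read)."""
        return self._next_to_commit


tracker = OffsetTracker(start_offset=10)
tracker.mark_done(11)                  # the "shorter time" event finished first
after_early = tracker.committable()    # still 10: offset 10 is not processed yet
tracker.mark_done(10)                  # the "longer time" event finally ran
after_both = tracker.committable()     # 12: both records are safely behind us
```

After a restart, consumption resumes at the committable offset, so offset 11 may be seen again; that is the duplicate the answer suggests seeking past or deduplicating.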
What happens to a long-running record when its processing time exceeds max.poll.interval.ms? Will it keep running in the background while a rebalance is triggered?
As per my limited understanding, the Kafka consumer (Spring KafkaListener) service gets halted or restarted, and the records get assigned to other consumers in the group during rebalancing.
You will have records left in memory being processed if the application or processing logic doesn't stop with the consumer thread.
If offsets were committed beforehand, those records would effectively be skipped after a rebalance. Otherwise, the leftover processing ideally shouldn't commit those offsets when it finishes, since other consumers might process the same records again after the rebalance, potentially resulting in duplicated data.
I have a Kafka Streams application that performs windowing (using the original event time, not wallclock time) via stream joins with a window of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE Using Kafka Streams 2.5.0
Updated Answer to OP Kafka Streams version 2.5:
When using event time, Kafka Streams will behave independent of the wallclock time, as long as no events contain the wallclock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which consume the partitions one event at a time. On any given topic, at most one partition will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked for their timestamp in comparison to the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped. Otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset. It is independent of the time at which the processing is executed. If events are dropped, the corresponding metrics are updated.
There is one caveat with this approach, when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events of the older topic might be dropped, if the time window configuration does not have enough retention or grace periods. See the JavaDoc of TimeWindows on the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If an old record belongs to a time window that the stream time has already passed by more than the window size + grace period, Kafka Streams will reject the record. In that case Kafka Streams will also log a message and adjust its metrics accordingly, so this behaviour should be easy to pick up.
I suggest trying out this reprocessing if feasible and watching the logs and metrics.
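The drop rule can be illustrated with a deliberately simplified model. This is not the Kafka Streams implementation: it assumes tumbling windows, a single stream time per task, and ignores store retention. `replay`, `window_size` and `grace` are names chosen for this sketch only.

```python
def replay(records, window_size, grace):
    """records: iterable of (timestamp, key). Returns (accepted, dropped) lists."""
    stream_time = float("-inf")   # plays the role of observedStreamTime
    accepted, dropped = [], []
    for ts, key in records:
        stream_time = max(stream_time, ts)
        window_start = ts - (ts % window_size)   # tumbling window containing ts
        window_end = window_start + window_size
        if window_end + grace <= stream_time:
            dropped.append((ts, key))            # window already closed
        else:
            accepted.append((ts, key))
    return accepted, dropped


DAY = 24 * 60 * 60

# Reprocessing one topic from the start: timestamps only move forward.
in_order = [(0 * DAY + 10, "a"), (1 * DAY + 10, "a"), (300 * DAY + 10, "a")]
ok, lost = replay(in_order, window_size=DAY, grace=0)

# Two topics in one task: the younger topic pushes stream time ahead first.
mixed = [(300 * DAY + 10, "new-topic"), (0 * DAY + 10, "old-topic")]
ok_mixed, lost_mixed = replay(mixed, window_size=DAY, grace=0)
```

In the first run nothing is lost, because every record is the newest seen so far, regardless of today's date. In the second run the old record arrives after stream time has jumped 300 days ahead and is dropped, which is the multi-topic caveat described above.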
I'm using one topic, one partition, and one consumer; the Kafka client version is 0.10.
I got two different results:
If I pause the partition first, then produce a message and call the resume method, KafkaConsumer polls the uncommitted message successfully.
But if I produce a message first without committing its offset, then pause the partition and, after several seconds, call the resume method, KafkaConsumer does not receive the uncommitted message. I checked on the Kafka server using kafka-consumer-groups.sh; it shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure this out for two days. I repeated these tests many times and the results are always the same. I need some suggestions, or for someone to tell me whether this is how Kafka is designed to work.
For your observation #2: if you restart the application, it will be supplied with all records from the last committed offset onwards, i.e. including the missing record. If your consumer again does not commit, the record will be sent again the next time the application registers its consumer with Kafka after a restart. This is expected.
I assume you are using consumer.poll(), which creates a hybrid-streaming interface, i.e. it accumulates data coming into Kafka for the duration specified and provides it to the consumer for processing once the duration is finished. This continuous accumulation happens in the background and does not depend on whether you have committed the offset or not.
KafkaConsumer
The position of the consumer gives the offset of the next record that
will be given out. It will be one larger than the highest offset the
consumer has seen in that partition. It automatically advances every
time the consumer receives messages in a call to poll(long).
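The difference between the in-memory position and the committed offset can be shown with a toy model. This is not the real KafkaConsumer API; `ToyConsumer` is a hypothetical stand-in for a single partition, and the second scenario assumes the message had already been returned by a poll before the partition was paused.

```python
class ToyConsumer:
    """Toy model of one partition; not the real KafkaConsumer API."""

    def __init__(self, log, committed=0):
        self.log = log              # list of records, index == offset
        self.committed = committed  # where a restarted consumer would begin
        self.position = committed   # in-memory position, starts at committed
        self.paused = False

    def poll(self):
        if self.paused:
            return []
        records = self.log[self.position:]
        self.position = len(self.log)  # advances whether or not we commit
        return records

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def restart(self):
        """A new consumer instance starts again from the committed offset."""
        return ToyConsumer(self.log, self.committed)


# Scenario 1: pause, produce, resume. The record has not been fetched yet.
log1 = []
c1 = ToyConsumer(log1)
c1.pause()
log1.append("m0")
c1.resume()
scenario1 = c1.poll()

# Scenario 2: poll (no commit), pause, resume. The position already moved on.
log2 = ["m0"]
c2 = ToyConsumer(log2)
first = c2.poll()
c2.pause()
c2.resume()
second = c2.poll()
lag = len(log2) - c2.committed
after_restart = c2.restart().poll()
```

In scenario 2 the lag stays at 1 because nothing was committed, yet the running consumer will not see the record again until it restarts or seeks back.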
I have a batch job, which populates data to Kafka topic. Every message has data and job identifier.
On the consumer side, I want to only read messages, which belong to this job. After the job has finished and all the messages consumed, consumer side has to do some post processing.
1) If it is guaranteed that no other messages will be produced during the job, how can I tell that the job has finished and that all the messages produced by the job were consumed (taking into consideration multiple partitions and asynchrony)?
2) If it is NOT guaranteed, that no other messages will be produced during the job, noise can be skipped, I believe.
Thanks
I'm assuming the job_id is constant. In that case, you can put a check in your consumer to shut down if n consecutive polls return no records from Kafka. n will depend on your ingestion rate and consumer poll interval.
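A minimal sketch of that check, written against a stand-in `poll` callable rather than a real Kafka client. `consume_until_idle`, `handle` and the record layout (`{"job_id": ..., "data": ...}`) are assumptions made for illustration.

```python
def consume_until_idle(poll, handle, job_id, max_empty_polls):
    """Process records for job_id until max_empty_polls polls in a row are empty."""
    empty_in_a_row = 0
    handled = 0
    while empty_in_a_row < max_empty_polls:
        batch = poll()
        if not batch:
            empty_in_a_row += 1
            continue
        empty_in_a_row = 0                  # any data resets the idle counter
        for record in batch:
            if record["job_id"] == job_id:  # skip noise from other jobs
                handle(record)
                handled += 1
    return handled


# Fake poll(): replays canned batches, then returns empty lists forever.
batches = iter([
    [{"job_id": 7, "data": "a"}],
    [],
    [{"job_id": 8, "data": "x"}, {"job_id": 7, "data": "b"}],
])
seen = []
count = consume_until_idle(lambda: next(batches, []), seen.append,
                           job_id=7, max_empty_polls=3)
```

The single empty poll in the middle does not stop the loop; only three empty polls in a row do, after which the post-processing step could run. This is a heuristic: a slow producer can look the same as a finished one.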
I am only talking about the first case here. Mind you, this is just an idea and I have never tried it myself
You can use endOffsets() to get the last offsets of all the partitions, and then loop over all of them after every message to check whether the current offsets match the end offsets. If they all match, you have reached the end.
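The comparison itself is small. The sketch below uses plain dictionaries in place of the values a client would return from endOffsets() and position(); `reached_end` and the `("jobs", n)` partition keys are hypothetical. It relies on the end offset being the offset of the next record to be written, so a position equal to the end offset means the partition is fully read.

```python
def reached_end(positions, end_offsets):
    """True when every partition's position has caught up with its end offset."""
    return all(positions.get(tp, 0) >= end for tp, end in end_offsets.items())


# Snapshot of end offsets, assumed to be taken after the job finished producing.
end_offsets = {("jobs", 0): 3, ("jobs", 1): 2}

positions = {("jobs", 0): 3, ("jobs", 1): 1}
done_before = reached_end(positions, end_offsets)   # partition 1 is one behind

positions[("jobs", 1)] = 2
done_after = reached_end(positions, end_offsets)    # every partition caught up
```

Using `>=` rather than `==` keeps the check valid if unrelated messages arrive after the snapshot and the consumer reads past it.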