Real-time windowed counters using Kafka - apache-kafka

I would like to build an efficient system for real-time windowed counts of events, for example the number of clicks per country in the last 30 minutes. My idea is the following, using Kafka and Cassandra:
when a click event e happens at time t, an increment is sent to an accumulator (I would like to use Cassandra counters for this purpose); at the same time, the event schedules a decrement at time t+30min, which marks the exit of e from the window of interest. The decrement is applied to the accumulator at time t+30min, so querying the accumulator at any time gives the current correct counter value.
I am not sure how to achieve this using Kafka. I have thought about two approaches:
1. Clicks are sent to a topic C. A consumer reads from C and, for each click, produces a message (Tdelay, key) into a topic Cdelayed. Another consumer reads from Cdelayed and checks the first message it sees: if the timestamp in the message is at or before the current time, the message is consumed and the decrement is sent to the accumulator; otherwise it waits.
2. A main consumer polls from C and, on its first read, triggers a delayed consumer that sleeps for 30 minutes before starting to read from the earliest offset. The delayed consumer is responsible for the decrements, the main one for the increments.
Which of the two solutions is better? Am I using the Kafka API correctly if I (1) block a consumer or (2) delay a consumer?
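For approach 1, here is a minimal sketch of the delaying consumer, assuming the record timestamp of each Cdelayed message is the original click time and the key is the country; the bootstrap address and group name are placeholders, and decrement() stands in for the Cassandra counter update:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DecrementConsumer {
    private static final long WINDOW_MS = Duration.ofMinutes(30).toMillis();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "decrementer");             // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("Cdelayed"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (TopicPartition tp : batch.partitions()) {
                    for (ConsumerRecord<String, String> rec : batch.records(tp)) {
                        long dueAt = rec.timestamp() + WINDOW_MS; // when the click leaves the window
                        if (dueAt > System.currentTimeMillis()) {
                            // Head of this partition is not yet due: rewind so it is
                            // re-read on a later poll. (A real implementation would also
                            // pause the partition to avoid a tight re-fetch loop.)
                            consumer.seek(tp, rec.offset());
                            break; // records behind it are due even later, if written in order
                        }
                        decrement(rec.key()); // e.g. a Cassandra counter decrement per country
                    }
                }
            }
        }
    }

    static void decrement(String country) { /* accumulator update */ }
}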

Related

Handler method for collecting skipped records for expired window in kafka streams

I have a Java application (spring-kafka) with Kafka Streams. It streams from a Kafka topic using a tumbling window. I have implemented a custom timestamp extractor so that windowing is based on timing information in the event. My input records are not in temporal order, and some late records, older than the grace period, are getting skipped in the aggregation. This matches the skip logic in org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate (https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L121): a log entry is recorded when the window close time has passed windowEnd. Now I need to move these skipped records to another topic for late-data processing. Is there a handler with which I can collect these skipped records?
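There is no public handler for the records that KStreamWindowAggregate drops; they only surface in the warn log and in the skipped-records metrics. A common workaround is to detect lateness yourself upstream of the aggregation and branch late records to a side topic. A sketch under stated assumptions: String serdes, a hypothetical "late-records" topic, and window/grace sizes that must mirror the ones used in your aggregation:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class LateRecordRouter {
    // Both values must match the tumbling window used in the aggregation.
    private static final long WINDOW_SIZE_MS = Duration.ofMinutes(5).toMillis(); // assumption
    private static final long GRACE_MS = Duration.ofMinutes(1).toMillis();       // assumption

    static KStream<String, String> wire(StreamsBuilder builder, TimestampExtractor timeExtractor) {
        KStream<String, String> input = builder.stream("input-topic",
                Consumed.with(Serdes.String(), Serdes.String())
                        .withTimestampExtractor(timeExtractor));

        // Tag each record with a late flag, tracking stream time the way the
        // windowed aggregate does: the maximum timestamp observed so far.
        KStream<String, KeyValue<Boolean, String>> tagged = input.transformValues(
            () -> new ValueTransformerWithKey<String, String, KeyValue<Boolean, String>>() {
                private ProcessorContext context;
                private long streamTime = Long.MIN_VALUE;

                @Override public void init(ProcessorContext context) { this.context = context; }

                @Override public KeyValue<Boolean, String> transform(String key, String value) {
                    long ts = context.timestamp(); // set by the custom extractor
                    streamTime = Math.max(streamTime, ts);
                    long windowEnd = ts - (ts % WINDOW_SIZE_MS) + WINDOW_SIZE_MS;
                    boolean late = windowEnd + GRACE_MS <= streamTime; // the aggregate's skip condition
                    return KeyValue.pair(late, value);
                }

                @Override public void close() { }
            });

        tagged.filter((k, v) -> v.key).mapValues(v -> v.value).to("late-records"); // hypothetical topic
        return tagged.filter((k, v) -> !v.key).mapValues(v -> v.value); // feed this into the aggregation
    }
}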

Delaying all messages by 30 minutes in Kafka Streams

We have a use case where we need to write all messages from topic a into topic b, but with a delay of 30 minutes for each message. Why, you ask? Because time is of critical importance for this stream of data, so paying customers get the real-time feed, while for freeloaders we offer the delayed stream.
I guess this would be relatively easy to do in a KafkaConsumer poll() loop: compare system time and message time (using an ordered message time such as producer time or ingestion time), pause() the partitions in question, and resume() them after the appropriate interval of up to 30 minutes, all the while continuing to poll() to avoid being failed over.
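A minimal sketch of that pause/resume loop, assuming record timestamps are producer or ingestion time as described, with hypothetical topic names and with offset commits, rebalance handling, and error handling omitted:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class DelayedForwarder {
    private static final long DELAY_MS = Duration.ofMinutes(30).toMillis();

    static void run(KafkaConsumer<byte[], byte[]> consumer, KafkaProducer<byte[], byte[]> producer) {
        consumer.subscribe(Collections.singletonList("topic-a")); // hypothetical name
        Map<TopicPartition, Long> resumeAt = new HashMap<>();
        while (true) {
            ConsumerRecords<byte[], byte[]> batch = consumer.poll(Duration.ofSeconds(1));
            for (TopicPartition tp : batch.partitions()) {
                for (ConsumerRecord<byte[], byte[]> rec : batch.records(tp)) {
                    long dueAt = rec.timestamp() + DELAY_MS; // message time + 30 min
                    if (dueAt > System.currentTimeMillis()) {
                        consumer.seek(tp, rec.offset());           // re-read when resumed
                        consumer.pause(Collections.singleton(tp)); // stop fetching this partition
                        resumeAt.put(tp, dueAt);
                        break; // later records of tp are due even later
                    }
                    producer.send(new ProducerRecord<>("topic-b", rec.key(), rec.value()));
                }
            }
            // Keep calling poll() so we stay in the group; resume partitions as they come due.
            long now = System.currentTimeMillis();
            resumeAt.entrySet().removeIf(e -> {
                if (e.getValue() <= now) {
                    consumer.resume(Collections.singleton(e.getKey()));
                    return true;
                }
                return false;
            });
        }
    }
}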
As the data, though delayed, still needs to be delivered in a streaming fashion, the delay of the ingestion times of all messages in topic a and b should be as close to 30 minutes as possible.
But is this also easily possible in Kafka Streams, so that we can use its built-in exactly-once guarantees? I wonder whether "it's ok to call Thread.sleep() in Kafka Streams" also applies to longer sleeps of up to 30 minutes. (Of course we don't want a partition rebalance to occur because Kafka thinks something's wrong with our process.)
Assuming we get this to work, is there a way to get proper lag monitoring for this? If we just delay messages, the consumer group lag would always amount to at least 30 minutes' worth of messages. So is it possible to have the lag monitor count only unprocessed messages older than 30 minutes?
(Question 2 is of less importance to us than getting question 1 to work.)
Edit: https://stackoverflow.com/a/59261274/709537 proposes a solution to a somewhat related problem, but that involves state stores and thus looks more complicated than would seem necessary for our simple (?) "delay all messages by x minutes" task.
Regarding 2., I assume we will have to roll our own lag monitoring for this.
A simple way to do something related - measuring latency instead of the number of lagging messages - would be to periodically, for every partition:
get the first unread input message
currentLatency = max(0, systemTime - ingestionTime(firstUnreadMessage) - 30min)
If we wanted to monitor the number of lagging messages, something a little more involved would need to be done:
read input messages backwards from the latest, until there is one with ingestionTime + 30min <= systemTime
the number of unread messages up to and including that one would be the lag
However, reading messages backwards is not exactly one of Kafka's core competencies... A clever binary-search-style scheme could be devised to get the exact value. But nobody really wants to know whether the message lag is 43123 or 40513; what they want to know is the order of magnitude. That keeps the number of seeks down to a handful per partition, and no binary-search back and forth is necessary (see the sketch after this list for a way to get the count without reading backwards at all). The output could e.g. be
lag < 10
lag < 100
lag < 1000
lag < 10000
...
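One way to avoid reading backwards entirely is offsetsForTimes(): look up the offset of the first message ingested after systemTime - 30min; everything between the group's current position and that offset is already due, i.e. lagging. A sketch, assuming the topic uses LogAppendTime so record timestamps are ingestion times, run inside the delaying consumer so position() is available:

import java.time.Duration;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class DelayLagMonitor {

    // Counts unread messages whose ingestion time is already more than 30 minutes old.
    static long lateMessageCount(KafkaConsumer<?, ?> consumer, Collection<TopicPartition> partitions) {
        long cutoff = System.currentTimeMillis() - Duration.ofMinutes(30).toMillis();
        Map<TopicPartition, Long> query = new HashMap<>();
        for (TopicPartition tp : partitions) query.put(tp, cutoff);

        // For each partition: the first message with timestamp >= cutoff, or null
        // if every message in the partition is older than the cutoff.
        Map<TopicPartition, OffsetAndTimestamp> boundaries = consumer.offsetsForTimes(query);
        Map<TopicPartition, Long> ends = consumer.endOffsets(partitions);

        long lag = 0;
        for (TopicPartition tp : partitions) {
            OffsetAndTimestamp firstYoung = boundaries.get(tp);
            long boundary = (firstYoung != null) ? firstYoung.offset() : ends.get(tp);
            lag += Math.max(0, boundary - consumer.position(tp));
        }
        return lag;
    }
}

This gives an exact count with one timestamp lookup per partition, so the order-of-magnitude bucketing would only be needed for presentation, not for cost.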

Kafka partitioning with respect to timestamp to process only messages in the partition of the current hour

I am working on a messaging system where messages are generated by users and have a time parameter.
I have a consumer that runs a job every hour, looks for messages whose time matches the current hour, and sends these messages.
I want to partition the topic "message" based on this time/timestamp, in buckets of one hour per partition, so that my consumer only processes one partition every hour instead of going through all messages every hour.
As of now I have a producer that produces messages as key-value pairs, where the key is the time rounded off to the hour.
I have two questions:
1. I can create 2-3 partitions by specifying this in the Kafka settings, but how can I have a separate partition for every hour slot?
2. Should I rather create a new topic for every hour and ask the consumer to only listen to the topic of the current hour? For example, I could create a topic named, say, "2020-7-23-10", which would contain all messages that need to be delivered between 10 and 11 AM on 23 July 2020, and simply subscribe to this topic and process them. Or I could create a topic named "messages", partition it based on the time, and force my consumer to only process a particular partition, as in the sketch below.
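If you go the single-topic route, a custom Partitioner can map the hour in the key to a partition. A sketch under stated assumptions: the topic "messages" is created with 24 partitions (one per hour of day, so the same partition is reused every day) and the key is the rounded-off time in a format ending with the hour, e.g. "2020-7-23-10":

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class HourPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Key format assumed to be "yyyy-M-d-H"; the hour is the last component.
        String[] parts = ((String) key).split("-");
        int hour = Integer.parseInt(parts[parts.length - 1]);
        return hour % cluster.partitionCountForTopic(topic); // 24 partitions -> one per hour
    }

    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}

The producer registers it with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, HourPartitioner.class.getName()), and the hourly consumer can then use consumer.assign() with just the current hour's partition instead of subscribing to the whole topic. One caveat: a partition holds every day's messages for that hour, so the consumer still has to skip leftovers from previous days if any remain unprocessed.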

Understanding max.task.idle.ms in Kafka Streams for a KStream-KTable join

I need help understanding Kafka Streams behavior when max.task.idle.ms is used in Kafka 2.2.
I have a KStream-KTable join where the KStream has been re-keyed:
KStream stream1 = builder.stream("topic1", Consumed.with(myTimeExtractor));
KStream stream2 = builder.stream("topic2", Consumed.with(myTimeExtractor));
KTable table = stream1
    .groupByKey()
    .aggregate(myInitializer, myAggregator, Materialized.as("myStore"));
stream2.selectKey((k, v) -> v)
    .through("rekeyedTopic")
    .join(table, myValueJoiner)
    .to("enrichedTopic");
All topics have 10 partitions, and for testing I've set max.task.idle.ms to 2 minutes. myTimeExtractor updates the event time of messages only if they are labelled "snapshot": each snapshot message in stream1 gets its event time set to some constant T, and messages in stream2 get their event time set to T+1.
There are 200 messages in each of topic1 and topic2 when I call KafkaStreams#start, all labelled "snapshot", and no message is added thereafter. I can see that within a second or so both myStore and rekeyedTopic get filled up. Since the event time of the messages in the table is lower than the event time of the messages in the stream, my understanding (from reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization) is that I should see the result of the join (in enrichedTopic) shortly after myStore and rekeyedTopic are filled up. In fact, I should be able to fill up rekeyedTopic first, and as long as myStore gets filled up less than 2 minutes after that, the join should still produce the expected result.
This is not what happens. What happens is that myStore and rekeyedTopic get filled up within the first second or so, then nothing happens for 2 minutes and only then enrichedTopic gets filled with the expected messages.
I don't understand why there is a pause of 2 minutes before enrichedTopic gets filled, since everything is "ready" long before. What am I missing?
Based on the documentation, which states:
max.task.idle.ms - Maximum amount of time a stream task will stay idle when not
all of its partition buffers contain records, to avoid potential out-of-order
record processing across multiple input streams.
I would say it is possibly due to some of the partition buffers NOT containing records, so the task is basically idling, up to the time you have configured for the property, to avoid out-of-order processing.
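For reference, a minimal sketch of how this property is configured (the application id and bootstrap address are placeholders):

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");          // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// Let a task idle for up to 2 minutes when some of its input partition
// buffers are empty, instead of immediately processing the non-empty ones.
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, Duration.ofMinutes(2).toMillis());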

Is it possible to consume Kafka messages after arrival?

I would like to consume events from a Kafka topic some time after they arrive. The time at which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives at my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it - look up the "Stream / Table Duality".
The time on which I want the event to be consumed is in the payload of the message
As an aside: since KIP-32, every message has a timestamp outside of the payload.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you want; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead wrap a consumer thread in a TimerTask, or schedule a Spark or MapReduce job with an Oozie/Airflow task that reads a maximum amount of records.
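And if you do want a single consumer to do the waiting itself, a sketch of the simplest variant, assuming the topic is configured with message.timestamp.type=LogAppendTime (so record.timestamp() is the broker arrival time) and a hypothetical topic name; note that max.poll.interval.ms has to exceed the sleep, or the consumer will be kicked from the group:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DelayedReader {
    private static final long DELAY_MS = Duration.ofMinutes(30).toMillis();

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-reader");          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Sleeping up to 30 min between poll() calls: raise the rebalance deadline.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, (int) Duration.ofMinutes(35).toMillis());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    long wait = rec.timestamp() + DELAY_MS - System.currentTimeMillis();
                    if (wait > 0) Thread.sleep(wait); // holds up all partitions of this consumer
                    process(rec);
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> rec) { /* ... */ }
}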