How do I consume Kafka messages older than x minutes, but all messages on restart? - apache-kafka

I need some grace period before consuming Kafka messages.
My approach is to use a hopping window.
e.g. If I want to consume the message after 5 minutes, the hopping window would be 6 minutes and will advance by 1 minute.
Then I'll use a filter to get data older than 5 minutes (there's also a timestamp in the message itself). Hence I will process data from minute 0 to minute 1. Then the hopping window jumps 1 minute forward and I process data from minute 1 to minute 2 and so on.
However I need to consume all messages when starting the application and not just the last 6 minutes.
I'm also open to other suggestions regarding the 5-minute grace period.

I've made wrong assumptions here. All the data in the topic will be consumed, no matter how old it is.
e.g. it's 12:10 now and we start the Kafka Streams application.
The data in the topic that we want to consume was pushed at 12:00, and we have a window of 6 minutes.
I was expecting only data from 12:04 to 12:10 (6 minutes) to be consumed and everything before that to be lost.
But the 12:00 data is consumed anyway; it just falls into an older window.
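For illustration, a minimal sketch of the hopping window described above (6-minute windows advancing by 1 minute), using the classic TimeWindows.of(...).advanceBy(...) API; the topic name, default serdes and the count() aggregation are placeholders, not part of the original question:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class HoppingWindowSketch {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // "input-topic" is a placeholder name; default serdes assumed
        KStream<String, String> input = builder.stream("input-topic");

        // 6-minute windows, advancing by 1 minute (each record lands in 6 windows)
        TimeWindows hopping = TimeWindows.of(Duration.ofMinutes(6))
                .advanceBy(Duration.ofMinutes(1));

        // records older than the current windows are still consumed on startup;
        // they simply fall into older windows (subject to window retention)
        KTable<Windowed<String>, Long> counts = input
                .groupByKey()
                .windowedBy(hopping)
                .count();

        return builder.build();
    }
}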

Related

Kafka delay event processing based on data

I am wondering if there is a way to delay processing of events in a Kafka stream based on the timestamp and some event data. For instance, say there are three events in the stream: event1, event2 and event3. All were produced to the stream at the same time, and based on some data in each event I determine that I need to process event1 after 10 seconds have passed since its timestamp, event2 after 60 seconds and event3 after 15 seconds. Is there a way to achieve this behavior without pausing the consumer? So after 10 seconds I could process event1, after 15 seconds process event3 and then after 60 seconds process event2. I have seen some answers about pausing, but it seems that approach would pause for 10 seconds, process event1, pause 60 seconds, process event2, etc. Any input would be greatly appreciated!
I believe it wouldn't be possible without pausing the container.
We had a similar use case; however, we wanted to wait until a specified time and then start processing the messages. We ended up implementing that using the pause/resume feature.
Having a BackOff might be helpful, but the varying wait time (10, 60, 15) would make things difficult.
I am wondering whether storing the data in a DB view and using a scheduler would be a good choice for your case.
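For what it's worth, a rough sketch of the pause/resume idea with a plain KafkaConsumer, assuming a single fixed start time (the varying per-event delays from the question would still need extra buffering on top of this); the topic name, group id and the processing step are placeholders:

import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DelayedStartConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "delayed-start"); // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Instant startAt = Instant.now().plusSeconds(60); // e.g. start processing in 60 seconds

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // pause right away so nothing is fetched before the start time
                    if (Instant.now().isBefore(startAt)) {
                        consumer.pause(partitions);
                    }
                }

                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                }
            });

            while (true) {
                // keep polling so the group membership stays alive; paused partitions return nothing
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (!Instant.now().isBefore(startAt) && !consumer.paused().isEmpty()) {
                    consumer.resume(consumer.paused());
                }
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processing " + record.value()); // placeholder processing
                }
            }
        }
    }
}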

Kafka consumer configuration to fetch at interval

I am new to kafka and trying to understand the various configuration properties I need to set for my requirement, described below.
I have a kafka consumer which is expected to fetch 'n' records at a time, and only after successfully processing them should another fetch happen.
Example: If my consumer fetches 10 records at a time and every record takes 5 seconds to complete its processing, then after 50 seconds another fetch request should get executed and so on.
Considering the above example, could anyone let me know what the values for the below configs should be?
Below is my current configuration. After processing 10 records, it doesn't wait for a minute as I configured; it keeps fetching and polling.
fetchMaxBytes=50000 //approx size for 10 records
fetchWaitMaxMs=60000 //wait for a minute before a next fetch
requestTimeoutMs= //default value
heartbeatIntervalMs=1000 //acknowledgement to avoid rebalancing
maxPollIntervalMs=60000 //assuming the processing takes one minute
maxPollRecords=10 //we need 10 records to be polled at once
sessionTimeoutMs= //default value
I am using the camel-kafka component to implement this.
It would be really great if someone could help me with this. Thanks much.
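If it helps to reason about what these options do, here is roughly how they map onto the underlying Kafka consumer properties (this is the usual camel-to-Kafka name mapping, not something verified against a particular Camel version; values copied from the question):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RawConsumerConfigSketch {
    // Roughly the raw Kafka consumer properties behind the camel-kafka options above.
    public static Properties fromQuestion() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "50000");      // fetchMaxBytes
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "60000");    // fetchWaitMaxMs
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "1000"); // heartbeatIntervalMs
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "60000"); // maxPollIntervalMs
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10");        // maxPollRecords
        // request.timeout.ms and session.timeout.ms left at their defaults
        return props;
    }
}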

Deduplication using Kafka-Streams

I want to do deduplication in my kafka-streams application, which uses a state store, and I'm using this very good example:
https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
I have a few questions about this example.
As far as I understand, this example briefly does this:
Message comes into the input topic.
Look at the store: if the key does not exist, write it to the state store and forward the record.
If it does exist, drop the record, so the deduplication is applied.
But in the code example there is a time window size that you can configure, as well as a retention time for the messages in the state store. You can also check whether a record is in the store by giving a time range from timeFrom to timeTo:
final long eventTime = context.timestamp();
final WindowStoreIterator<String> timeIterator = store.fetch(
        key,
        eventTime - leftDurationMs,
        eventTime + rightDurationMs
);
What is the actual purpose of timeFrom and timeTo? I am not sure why I am checking the next time interval, since that means checking for future messages that have not arrived in my topic yet.
My second question: is this time interval related to the time window, and should it hit the previous time window?
If I am able to search a time interval by giving timeFrom and timeTo, why is the time window size important?
If I give a window size of 12 hours, am I able to guarantee that messages are deduplicated for 12 hours?
I think of it like this:
The first message with key "A" comes in the first minute after application start-up; after 11 hours, a message with key "A" comes again. Can I catch this duplicated message by giving a large enough time interval, like eventTime - 12 hours?
Thanks for any ideas!
The time window size decides how long you want the deduplication to be in effect: forever, or just for 5 minutes, and so on. Kafka has to store these records, so a large time window may consume a lot of resources on your server.
As for timeFrom and timeTo: your record (event) may arrive or be processed late in Kafka, so the event time of the record may be 1 minute ago, not now. When Kafka processes such an "old" record, it also needs to take care of records that are not that old, i.e. records that are "in the future" relative to the "old" one.
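For reference, a condensed sketch of the duplicate check in the linked Confluent example (field and variable names follow that example loosely; treat this as an illustration rather than the exact source):

// Inside the example's deduplication transformer: store is a WindowStore<String, String>,
// and leftDurationMs/rightDurationMs split the search interval around the record's event time.
private boolean isDuplicate(final String key) {
    final long eventTime = context.timestamp();
    // search for the same key slightly before AND slightly after the event time,
    // because out-of-order records can carry timestamps older or newer than "now"
    try (final WindowStoreIterator<String> timeIterator = store.fetch(
            key,
            eventTime - leftDurationMs,
            eventTime + rightDurationMs)) {
        return timeIterator.hasNext();
    }
}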

Why does Gatling still send requests when the scenario injection is on nothingFor?

So I have the following scenario:
setUp(scenario.inject(
  nothingFor(30 seconds), // 1
  rampUsers(10) during (30 seconds),
  nothingFor(1 minute),
  rampUsers(20) during (30 seconds)
).protocols(httpconf)).maxDuration(3 minutes)
I expected this scenario to start by doing nothing for 30 seconds, ramp up 10 users over 30 seconds, do nothing (pause) for a minute and finish by ramping up 20 users over 30 seconds.
But what I got is a 30 second pause, ramp up 10 users over 30 seconds, steady state of 10 users for a minute and then an additional ramp up of 20 users. (I ended up running 30 users)
What am I missing here?
The injection profiles only specify when users start a scenario, not how long they're active for - that will be determined by how long it takes for a user to finish the scenario. So when you ramp 10 users over 30 seconds one user will start the scenario every 3 seconds, but they keep running until they finish (however long that is). I'm guessing your scenario takes more than a couple of minutes for a user to complete.

Designs for counting occurrences of events in streaming processing?

The following discussion is in the context of Apache Flink:
Imagine that we have a keyedStream whose key is its id and whose event time is its timestamp, and we want to calculate, for each event, how many events arrived within the preceding 10 minutes.
The problems that need to be solved are:
How to design the window?
We can create a window of 10 minutes after each event arrives, but this means that for each event there will be a delay of 10 minutes, because we have to wait for the 10-minute window to close.
We can create a window of 10 minutes which takes the timestamp of each event as the maximum timestamp in the window, which means we don't need to wait 10 minutes, because we take the last 10 minutes of elements before the element arrives. But this kind of window is not easy to define, as far as I know.
How to deal with memory or other resource issues? Even if we succeed in creating such a window, the ids of the events may be very diverse, so there would be many windows like this; how can the system keep all their state in memory? There is a big risk of running out of memory.
Maybe there are some problems that I haven't mentioned here, or maybe there are good solutions other than windows (e.g. Patterns). If you have a good solution, please give me a clue. Thank you.
You could do this with a GlobalWindow, a Trigger that fires on every event, and an Evictor that removes events more than 10 minutes old before counting the remaining events. (A naive implementation could easily perform very poorly, however.)
Yes, this may require keeping a lot of state -- you'll be keeping every event from the past 10 minutes (well, you only need to store the timestamp from each event). If you set up the RocksDB state backend then Flink will spill to disk if need be, but with some obvious performance penalty. Probably better to use a cluster large enough to hold 10 minutes of traffic in memory. Even at one million events per second, each with a 32-bit timestamp, that's only 2.4 GB in 10 minutes (1 million events per second x 600 seconds x 4 bytes per event) -- doesn't seem like a problem at all.
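A rough sketch of that GlobalWindow + Trigger + Evictor combination follows; the Event POJO and the surrounding job setup (including event-time timestamp/watermark assignment) are assumptions made for illustration, while the windowing calls mirror Flink's actual API:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.evictors.TimeEvictor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class OccurrenceCountSketch {

    // hypothetical event type: an id (the key) and an event-time timestamp
    public static class Event {
        public String id;
        public long timestamp;
    }

    // emits, for each incoming event, how many events with the same id arrived in the last 10 minutes
    public static DataStream<Long> countLast10Minutes(DataStream<Event> events) {
        return events
                .keyBy(e -> e.id)
                .window(GlobalWindows.create())
                // fire the window function for every single element...
                .trigger(CountTrigger.of(1))
                // ...after evicting elements older than 10 minutes relative to the newest one
                .evictor(TimeEvictor.of(Time.minutes(10)))
                .process(new ProcessWindowFunction<Event, Long, String, GlobalWindow>() {
                    @Override
                    public void process(String key, Context ctx, Iterable<Event> elements, Collector<Long> out) {
                        long count = 0;
                        for (Event ignored : elements) {
                            count++;
                        }
                        out.collect(count);
                    }
                });
    }
}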