Kafka delay event processing based on data

I am wondering if there is a way to delay processing of events in a Kafka stream based on the timestamp and some event data. For instance, say there are three events in the stream: event1, event2, and event3, all produced to the stream at the same time. Based on some data in each event, I determine that event1 should be processed after 10 seconds have passed, event2 after 60 seconds, and event3 after 15 seconds. Is there a way to achieve this behavior without pausing the consumer? That is, after 10 seconds I could process event1, after 15 seconds process event3, and then after 60 seconds process event2. I have seen some answers about pausing, but it seems that would pause for 10 seconds, process event1, pause 60 seconds, process event2, and so on. Any input would be greatly appreciated!

I believe it wouldn't be possible without pausing the container.
We had a similar use case; however, we wanted to wait until a specified time and then start processing the messages. We ended up implementing it using the pause/resume feature.
Having a BackOff might be helpful, but the varying wait times (10, 60, 15 seconds) would make things difficult.
I wonder whether storing the events in a database and triggering the processing from a scheduler would be a better fit for your case.
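If pausing really is off the table, one alternative worth sketching is to keep polling normally and buffer not-yet-due events in memory, ordered by due time, draining whatever has become due on each loop iteration. A minimal sketch (Scala, plain kafka-clients API); the extractDelayMs and process helpers, the props object, and the topic name are assumptions for illustration, and offset management is deliberately omitted:

import java.time.Duration
import java.util.{Collections, Comparator, PriorityQueue}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import scala.collection.JavaConverters._

case class Pending(dueAtMs: Long, record: ConsumerRecord[String, String])

// Hypothetical helpers: read the per-event delay from the event data,
// and do the actual processing.
def extractDelayMs(r: ConsumerRecord[String, String]): Long = ???
def process(r: ConsumerRecord[String, String]): Unit = ???

val consumer = new KafkaConsumer[String, String](props) // props assumed configured elsewhere
consumer.subscribe(Collections.singletonList("events"))

// Buffer ordered by due time, so event3 (due at +15s) is drained before event2 (+60s).
val pending = new PriorityQueue[Pending](Comparator.comparingLong[Pending](_.dueAtMs))

while (true) {
  val records = consumer.poll(Duration.ofMillis(100))
  for (r <- records.asScala)
    pending.add(Pending(r.timestamp() + extractDelayMs(r), r))

  // Drain everything whose due time has arrived.
  val now = System.currentTimeMillis()
  while (pending.peek() != null && pending.peek().dueAtMs <= now) {
    process(pending.poll().record)
  }
}

The catch is durability: offsets for buffered events cannot safely be committed until they are processed, so a crash loses or replays the buffer. That is exactly why the database-plus-scheduler option above is attractive for longer delays.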


Does Flink hold history of closed event time windows with watermark?

I have a Flink job that aggregates data using keyed tumbling windows with event time and a watermark.
My question is: does Flink hold the state of windows it has already closed?
Otherwise, I have no other explanation for why an event belonging to a window that was never opened before would open a new window instead of being dropped immediately.
Given that our windows are 1 hour long and forBoundedOutOfOrderness is 10 minutes, let's look at an example:
event1 = ("2022-01-01T08:25:00Z") => window fired
event2 = ("2022-01-01T09:25:00Z") => window created but not fired, as expected
event3 = ("2022-01-01T05:25:00Z") => aggregated with event4 instead of dropped (why?)
event4 = ("2022-01-01T05:40:00Z") => aggregated with event3 instead of dropped (why?)
val stream = env
  .fromSource(
    kafkaSource,
    WatermarkStrategy
      .forBoundedOutOfOrderness[(String, EnrichedProcess, KafkaHeaders)](Duration.ofMinutes(outOfOrdernessMinutes)) // Watermark max time for late events
      .withIdleness(Duration.ofSeconds(idleness))
      .withTimestampAssigner(new SerializableTimestampAssigner[(String, EnrichedProcess, KafkaHeaders)] {
        override def extractTimestamp(element: (String, EnrichedProcess, KafkaHeaders), recordTimestamp: Long): Long = {
          logger.info(
            LogMessage(
              element._3.orgId,
              s"Received incoming EnrichedProcess update_time: ${element._3.updateTime}, process time: ${recordTimestamp.asDate}",
              element._3.flowId
            )
          )
          element._3.updateTime.asEpoch
        }
      }),
    s"Source - $kConsumeTopic"
  )

stream
  .keyBy(element => (element._2.orgId -> element._2.procUid))
  .window(TumblingEventTimeWindows.of(Time.hours(tumblingWindowHours), Time.minutes(windowStartingOffsetMinutes)))
  .reduce(new ReduceFunc)
  .name("Aggregated EnrichedProcess")
  .sinkTo(kafkaConnector.createKafkaSink(kProducerServers, kProduceTopic))
  .name(s"Sink -> $kProduceTopic")
Edited:
The way I'm testing this is with integration tests using Docker Compose: I generate an event to Kafka => it is consumed by the Flink job and sunk to Kafka => I check the content of the output topic.
When I put a sleep of 30 seconds between sending the events, event3 and event4 are dropped. This is the behaviour I was expecting:
val producer = new Producer(producerTopic)
val consumer = new Consumer(consumerTopic, groupId)
producer.send(event1)
producer.send(event2)
Thread.sleep(30000)
producer.send(event3)
Thread.sleep(30000)
producer.send(event4)
val received: Iterable[(String, EnrichedProcess)] = consumer.receive[EnrichedProcess]()
But even more curious: why, when I put a sleep of 10 seconds instead of 30, do I get only the first situation? The watermark was supposed to have been updated already (the default interval of the periodic watermark generator is 200 ms).
Executive summary:
Non-determinism in event-time-based logic with Flink comes from mixing processing time with event time -- as happens with periodic watermark generators and idleness detection. Only if you never have late events or idle sources can you be sure of deterministic results.
More details:
While you would expect
event3 = ("2022-01-01T05:25:00Z")
to be late, it will only truly be late if a large enough watermark has managed to arrive first. With the forBoundedOutOfOrderness strategy there's no guarantee of that -- this is a periodic watermark generator, which produces watermarks every 200 msec by default. So it could be that a watermark based on the timestamp of event2 isn't produced until after event3 and event4 have already been processed.
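For concreteness, here is the arithmetic with the question's settings (10 minutes of allowed out-of-orderness): the generator tracks the maximum timestamp seen so far and periodically emits

watermark = maxTimestamp - outOfOrderness - 1 ms
          = 2022-01-01T09:25:00Z - 10 min - 1 ms
          = 2022-01-01T09:14:59.999Z

once event2 has been seen. That watermark is far past the end of event3's window [05:00, 06:00), so event3 would indeed be dropped -- but only if that watermark is actually emitted and reaches the window operator before event3 does.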
That's one possible explanation; there may be others, depending on the exact circumstances. For example, with all that sleeping going on, one of the parallel instances of the watermark generator is idle for at least a minute, which may be involved in producing the behavior being observed (depending on the value of idleness, etc.).
More background:
With parallelism > 1, there are multiple independent instances of the watermark strategy, each doing its own thing based on the events it processes.
Operators with multiple input channels, such as the keyed window, combine these watermarks by taking the minimum of the incoming watermarks (from all non-idle channels) as their own watermark.
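To make that concrete: if one parallel instance has advanced its watermark to 09:14:59.999 while another instance has seen no events, the second instance is still at Long.MIN_VALUE; until it is marked idle, the window operator's combined watermark remains Long.MIN_VALUE, and nothing is ever considered late.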
Answering the original question:
Does Flink retain the state for windows that have already been closed? No. Once the allowed lateness (if any) has expired, the state for an event time window is purged.
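If the goal is the opposite -- keeping a fired window's state around so that late events update its result rather than being dropped -- that is what allowed lateness is for. A minimal sketch of how that could look on the job above, reusing the names from the question's code (the lateTag name is illustrative):

val lateTag = OutputTag[(String, EnrichedProcess, KafkaHeaders)]("late-events")

stream
  .keyBy(element => (element._2.orgId -> element._2.procUid))
  .window(TumblingEventTimeWindows.of(Time.hours(tumblingWindowHours), Time.minutes(windowStartingOffsetMinutes)))
  .allowedLateness(Time.minutes(10)) // keep fired-window state 10 extra minutes for late updates
  .sideOutputLateData(lateTag)       // anything later than that goes to a side output
  .reduce(new ReduceFunc)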

Kafka consumer configuration to fetch at interval

I am new to Kafka and am trying to understand the various configuration properties I need to set for the requirement below.
I have a Kafka consumer which is expected to fetch 'n' records at a time, and only after they have been successfully processed should another fetch happen.
Example: if my consumer fetches 10 records at a time and every record takes 5 seconds to process, then the next fetch request should be executed after 50 seconds, and so on.
Considering the above example, could anyone let me know what the values for the configs below should be?
This is my current configuration. After processing 10 records, it doesn't wait for a minute as I configured; it keeps on fetching and polling.
fetchMaxBytes=50000 //approx size for 10 records
fetchWaitMaxMs=60000 //wait for a minute before a next fetch
requestTimeoutMs= //default value
heartbeatIntervalMs=1000 //acknowledgement to avoid rebalancing
maxPollIntervalMs=60000 //assuming the processing takes one minute
maxPollRecords=10 //we need 10 records to be polled at once
sessionTimeoutMs= //default value
I am using camel-kafka component to implement this.
It would be really great if someone could help me with this. Thanks much!
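For what it's worth, fetch.max.wait.ms only bounds how long the broker waits for fetch.min.bytes of data to accumulate before answering a fetch request; it is not a pause between fetches, which is why the consumer keeps polling as soon as data is available. The fetch-process-fetch cadence normally comes from the poll loop itself: the next poll simply does not happen until processing of the previous batch has finished. A minimal sketch with the plain kafka-clients API rather than camel-kafka (props, the topic name, and the process helper are assumptions):

import java.time.Duration
import java.util.Collections
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val consumer = new KafkaConsumer[String, String](props) // max.poll.records=10 set in props
consumer.subscribe(Collections.singletonList("input-topic"))

while (true) {
  val records = consumer.poll(Duration.ofMillis(500)) // returns at most max.poll.records records
  for (r <- records.asScala) process(r) // ~5s per record => ~50s for a batch of 10
  // The next poll happens only after the loop above finishes; just keep
  // max.poll.interval.ms comfortably above the worst-case batch time.
}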

Flink session window with onEventTime trigger?

I want to create an EventTime-based session window in Flink, such that it triggers when the event time of a new message is more than 180 seconds greater than the event time of the message that created the window.
For example:
t1(0 seconds) : msg1 <-- This is the first message which causes the session-windows to be created
t2(13 seconds) : msg2
t3(39 seconds) : msg3
...
t7(190 seconds) : msg7 <-- The event time (t7) is more than 180 seconds greater than t1 (t7 - t1 = 190), so the window should be triggered and processed now.
t8(193 seconds) : msg8 <-- This message, and all subsequent messages, have to be ignored, as this window was processed at t7.
I want to create a trigger such that the above behavior is achieved through an appropriate watermark or an onEventTime trigger. Can anyone please provide some examples of how to achieve this?
The best way to approach this might be with a ProcessFunction rather than with custom windowing. If, as shown in your example, the events are processed in timestamp order, then this will be pretty straightforward. If, on the other hand, you have to handle out-of-order events (which is common when working with event time data), it will be somewhat more complex. (Imagine that msg6, with timestamp 187, arrives after t8. If that's possible, and if it would affect the results you want to produce, then it has to be handled.)
If the events are in order, then the logic would look roughly like this:
Use an AscendingTimestampExtractor as the basis for watermarking.
Use Flink state (perhaps ListState) to store the window contents. When an event arrives, add it to the window and check whether it has been more than 180 seconds since the first event. If so, process the window contents and clear the list (the in-order case is sketched after these steps).
If your events can be out-of-order, then use a BoundedOutOfOrdernessTimestampExtractor, and don't process the window's contents until currentWatermark indicates that event time has passed 180 seconds past the window's start time (you can use an event time timer for this). Don't completely clear the list when triggering a window, but just remove the elements that belong to the window that is closing.
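A minimal sketch of the in-order case (Scala), assuming a simple Event(key: String, ts: Long) type; the class and state names are illustrative, not from the question:

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

case class Event(key: String, ts: Long) // ts in milliseconds

class SessionFn extends KeyedProcessFunction[String, Event, List[Event]] {
  // Window contents, the timestamp that opened the window, and a fired flag.
  lazy val contents: ListState[Event] =
    getRuntimeContext.getListState(new ListStateDescriptor[Event]("contents", classOf[Event]))
  lazy val windowStart: ValueState[java.lang.Long] =
    getRuntimeContext.getState(new ValueStateDescriptor[java.lang.Long]("window-start", classOf[java.lang.Long]))
  lazy val fired: ValueState[java.lang.Boolean] =
    getRuntimeContext.getState(new ValueStateDescriptor[java.lang.Boolean]("fired", classOf[java.lang.Boolean]))

  override def processElement(
      e: Event,
      ctx: KeyedProcessFunction[String, Event, List[Event]]#Context,
      out: Collector[List[Event]]): Unit = {
    if (fired.value() != null) {
      // Window was already processed (at t7); ignore msg8 and everything after.
    } else {
      if (windowStart.value() == null) windowStart.update(e.ts) // first message opens the window
      contents.add(e)
      if (e.ts - windowStart.value() > 180000L) {
        // More than 180s past the first event: process the contents and close.
        out.collect(contents.get().asScala.toList)
        contents.clear()
        fired.update(true)
      }
    }
  }
}

For production use you would also want to clean up the state for inactive keys eventually (e.g., with state TTL or a timer).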

Designs for counting occurrences of events in streaming processing?

The following discussion is in the context of Apache Flink:
Imagine that we have a keyed stream whose key is an id and whose event time is its timestamp, and we want to calculate, for each event, how many events with the same key arrived within the previous 10 minutes.
The problems that need to be solved are:
How to design the window?
We could create a 10-minute window after each event arrives, but this means that every event incurs a 10-minute delay while we wait for its window to close.
We could instead create a 10-minute window that treats the timestamp of each event as the window's maximum timestamp, so there is no waiting: we count the elements from the 10 minutes preceding the event. But as far as I know, this kind of window is not easy to define.
How to deal with memory and other resource issues? Even if we succeed in creating such a window, the set of event ids may be very diverse, producing many windows; how does the system keep all their state in memory? There is a real risk of running out of memory.
Maybe there are problems I haven't mentioned here, or good solutions other than windows (e.g. patterns). If you have a good solution, please give me a clue. Thank you.
You could do this with a GlobalWindow, a Trigger that fires on every event, and an Evictor that removes events more than 10 minutes old before counting the remaining events; a sketch follows below. (A naive implementation could easily perform very poorly, however.)
Yes, this may require keeping a lot of state -- you'll be keeping every event from the past 10 minutes (well, you only need to store the timestamp of each event). If you set up the RocksDB state backend, then Flink will spill to disk if need be, though with an obvious performance penalty. It is probably better to use a cluster large enough to hold 10 minutes of traffic in memory. Even at one million events per second, each with a 32-bit timestamp, that's only 2.4 GB in 10 minutes (1 million events/second x 600 seconds x 4 bytes per event) -- it doesn't seem like a problem at all.
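A minimal sketch (Scala) of that GlobalWindow approach, using Flink's built-in CountTrigger and TimeEvictor; the Event type and its id field are assumptions. Note that TimeEvictor evicts by element timestamp relative to the newest element currently in the window:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.evictors.TimeEvictor
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector

case class Event(id: String, ts: Long)

def countLast10Minutes(stream: DataStream[Event]): DataStream[(String, Int)] =
  stream
    .keyBy(_.id)
    .window(GlobalWindows.create())
    .trigger(CountTrigger.of[GlobalWindow](1))              // fire on every arriving event
    .evictor(TimeEvictor.of[GlobalWindow](Time.minutes(10))) // evict events older than 10 minutes first
    .process(new ProcessWindowFunction[Event, (String, Int), String, GlobalWindow] {
      override def process(key: String, context: Context,
                           elements: Iterable[Event], out: Collector[(String, Int)]): Unit =
        out.collect((key, elements.size)) // count of events in the last 10 minutes
    })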

How do you handle frame timing when not using a game framework?

If you don't use cocos2d etc., how would you control the frame rate if you had to code this manually?
I.e., if you wanted things to run at 50 frames per second (or whatever the industry best practice is)?
Use CADisplayLink to get called on every frame; it will be at most 60 FPS. If your code does too much work, you'll be called less often, and your UI will feel slow below about 40 FPS.
An alternative is to schedule NSTimers, but that has some issues: if your run loop is not ready to fire the timer on time, calls will be skipped, so no frame rate is guaranteed.
From Apple's docs:
A repeating timer always schedules itself based on the scheduled firing time, as opposed to the actual firing time. For example, if a timer is scheduled to fire at a particular time and every 5 seconds after that, the scheduled firing time will always fall on the original 5 second time intervals, even if the actual firing time gets delayed. If the firing time is delayed so far that it passes one or more of the scheduled firing times, the timer is fired only once for that time period; the timer is then rescheduled, after firing, for the next scheduled firing time in the future.