Kafka Streams session windows

Hello, I am working on a Kafka session window with an inactivity gap of 5 minutes. I want some kind of feedback when the inactivity gap is reached and the session is dropped for the key.
Let's assume I have a record (A,1), where 'A' is the key. If I don't get any record with key 'A' within 5 minutes, the session is dropped.
I want to perform some operation at the end of the session, let's say (value)*2 for that session. Is there any way I can achieve this using the Kafka Streams API?

Kafka Streams does not drop a session after the gap time has passed. Instead, it will create a new session if another record with the same key arrives after the gap time has passed, and it maintains both sessions in parallel. This makes it possible to handle out-of-order data. It can even happen that two sessions get merged if an out-of-order record falls into the gap and "connects" both sessions with each other.
Sessions are maintained for 1 day by default. You can change this via the SessionWindows#until() method. If a session expires, it will be dropped silently. There is no notification. You also need to consider the config parameter window.store.change.log.additional.retention.ms:
The default retention setting is Windows#maintainMs() + 1 day. You can override this setting by specifying StreamsConfig.WINDOW_STORE_CHANGE_LOG_ADDITIONAL_RETENTION_MS_CONFIG in the StreamsConfig.
Thus, if you want to react when time has passed, you should look into punctuations, which allow you to register regular callbacks (some kind of timer) based on either "event-time progress" or wall-clock time. This allows you to react if a session has not been updated for a certain period of time and you consider it "completed".
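The punctuation idea can be illustrated without Kafka at all. The following is only a minimal, library-free sketch: SessionTracker, process and punctuate are hypothetical names, not Kafka Streams API. In a real topology the same logic would presumably live in a Processor backed by a state store, with the periodic callback registered through ProcessorContext#schedule.

```python
from dataclasses import dataclass, field


@dataclass
class SessionTracker:
    """Tracks per-key activity and reports keys that have been idle too long."""
    gap_ms: int
    # key -> (last_seen_timestamp_ms, aggregated_value)
    sessions: dict = field(default_factory=dict)

    def process(self, key, value, timestamp_ms):
        """Called for every record: extend (or start) the session for this key."""
        prev_seen, total = self.sessions.get(key, (timestamp_ms, 0))
        # max() keeps the newest timestamp even if records arrive out of order
        self.sessions[key] = (max(prev_seen, timestamp_ms), total + value)

    def punctuate(self, now_ms):
        """Called periodically by a timer: emit and forget sessions idle for >= gap_ms."""
        closed = []
        for key, (last_seen, total) in list(self.sessions.items()):
            if now_ms - last_seen >= self.gap_ms:
                closed.append((key, total * 2))  # the "(value)*2" end-of-session operation
                del self.sessions[key]
        return closed


tracker = SessionTracker(gap_ms=5 * 60 * 1000)
tracker.process("A", 1, timestamp_ms=0)
early = tracker.punctuate(now_ms=60_000)        # 1 minute idle: session still open
late = tracker.punctuate(now_ms=5 * 60 * 1000)  # 5 minutes idle: session is reported
```

Whether now_ms comes from event time or from the wall clock is exactly the choice between the two punctuation types mentioned above.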

Related

Does Apache Beam stateful Processing consider window lateness constraints (withAllowedLateness) for resetting state?

I'm trying to implement a valueState to filter records in my ParDo transformation. The high level flow is this:
1. Fixed window of 1 hr size, with allowedLateness (10 min).
2. The first message (for a given key) that is processed in the ParDo shall set the valueState (boolean) to true. Subsequent messages for the same key shall be dropped if the corresponding valueState is set to true. (Allow only the first message for a given key in every window.)
3. The messages (that are not dropped in step 2) will be written out as output.
While testing this, however, I see that after the fixed window time period ends (1 hr), the state is reset/lost. Ideally, the state should be available to process late records until the allowedLateness period (10 min) is complete.

These parts are right:
Each 1 hour window expires when the watermark reaches the end of the hour plus 10 minutes.
For a given window, the state is cleaned up after the window expires.
Here are the parts that need correction:
State is never reset.
Elements with timestamps in different windows are processed totally independently. Many windows may be receiving data at the same time. Each hour window happened after another, when the data was generated. But it is not processed after the other.
Allowed lateness will not cause elements from a later window to be processed using the state from the prior window. It will simply allow the state to stay longer and the elements to not be dropped.
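A simplified model may help to see why the state is scoped per key and per window. This is a plain Python sketch, not Beam's API: FirstPerKeyAndWindow and its methods are hypothetical, and the watermark is passed in by hand. The expiry rule (end of window plus allowed lateness) is the one stated above.

```python
WINDOW_MS = 60 * 60 * 1000            # 1 hour fixed windows
ALLOWED_LATENESS_MS = 10 * 60 * 1000  # 10 minutes


def window_start(timestamp_ms):
    """Fixed windows: the window is determined by the element's own timestamp."""
    return timestamp_ms - timestamp_ms % WINDOW_MS


class FirstPerKeyAndWindow:
    def __init__(self):
        self.seen = set()  # holds (key, window_start): one state cell per key per window

    def process(self, key, timestamp_ms, watermark_ms):
        """Return True if the element is emitted, False if it is dropped."""
        start = window_start(timestamp_ms)
        expiry = start + WINDOW_MS + ALLOWED_LATENESS_MS
        if watermark_ms >= expiry:
            return False  # window expired: the element is too late
        if (key, start) in self.seen:
            return False  # not the first element for this key in this window
        self.seen.add((key, start))
        return True

    def garbage_collect(self, watermark_ms):
        """State for a window is cleaned up only once that window has expired."""
        self.seen = {(k, s) for (k, s) in self.seen
                     if watermark_ms < s + WINDOW_MS + ALLOWED_LATENESS_MS}


minute = 60 * 1000
f = FirstPerKeyAndWindow()
results = [
    f.process("k", 5 * minute, watermark_ms=5 * minute),    # first in [0h, 1h): emitted
    f.process("k", 65 * minute, watermark_ms=65 * minute),  # first in [1h, 2h): emitted
    f.process("k", 50 * minute, watermark_ms=66 * minute),  # late for [0h, 1h), state still present: duplicate
    f.process("k", 55 * minute, watermark_ms=71 * minute),  # [0h, 1h) expired at 70 min: too late
]
```

The second element does not see the state written by the first one, because it belongs to a different window even though it is processed afterwards.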

Is the mongo timestamp type atomic with the reads?

I guess the title is confusing, but I could not find a better one.
I have an event stream in MongoDB with multiple producers and one consumer. To ensure that I read each event exactly once in the correct order, I use the MongoDB timestamp type as an incrementing value, populated by the server. In the SQL world I would probably use an auto-incremented integer.
My consumer just polls MongoDB and asks for all events since the last timestamp it has seen. In one of the environments we have realized that the consumer sometimes does not handle all events. It does not happen very often, roughly one in 50,000 events is missed, but ideally it should not happen at all.
My assumption is that MongoDB does something like this internally.
ParseDocument(doc);
lock
{
    SetTimestamp(doc);
}
WriteDocument(doc);
UpdateIndex(doc);
So it could happen that for a very short period of time a document is not available when the consumer queries the events, because only events #1, #2 and #4 have been written so far and event #3 is written a fraction of a millisecond later.
I have seen this with a C# client and MongoDB 4.2 running in Docker, but I guess the client does not matter here.
Is this assumption correct and, if yes, what can I do about it?
My idea is to change my consumer to ask for all events since the last timestamp minus a few seconds and then filter out the already received events in the consumer.
But is there a more elegant solution? Perhaps some way to enforce collection level write locks or could transactions help?

Since you said "consumer" - singular, I suggest:
Use a change stream to be notified of events. Change stream, if correctly iterated, will not skip changes nor will it return the same change twice.
Whenever a document is returned from change stream, when it is processed by the singular consumer, add a counter to it. Since there is only one consumer it is relatively easy to implement the counter without race conditions and such.
Also write the current resume token into each event being processed.
If you wish, you can use the counter to uniquely identify the events.
To iterate events again, use the counter to look up events in the past. Given that each event has both a counter and a resume token, once you get to the most recent event you can seamlessly transition from iterating based on the counter to iterating based on the resume token.
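The bookkeeping described above can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: the feed is a plain list of (resume_token, event) tuples in place of a real change stream, the tokens are made-up strings, and stamp_events and replay_from are hypothetical helpers.

```python
def stamp_events(change_feed, store, next_counter=1):
    """Single consumer: assign a gapless counter and record the resume token per event."""
    for resume_token, event in change_feed:
        store.append({"counter": next_counter, "resume_token": resume_token, **event})
        next_counter += 1
    return next_counter


def replay_from(store, counter):
    """Return past events starting at counter, plus the token to resume the live feed from."""
    past = [e for e in store if e["counter"] >= counter]
    last_token = store[-1]["resume_token"] if store else None
    return past, last_token


feed = [
    ("tok-a", {"type": "created"}),
    ("tok-b", {"type": "updated"}),
    ("tok-c", {"type": "deleted"}),
]
store = []
next_counter = stamp_events(feed, store)
past, resume_after = replay_from(store, counter=2)  # events 2 and 3, then resume after "tok-c"
```

Because only one consumer assigns counters, no locking is needed around next_counter.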

RDBMS Event-Store: Ensure ordering (single threaded writer)

Short description about the setup:
I'm trying to implement a "basic" event store/event-sourcing application using an RDBMS (in my case Postgres). The events are general-purpose events with only some basic fields like eventtime, location, action, formatted as XML. Due to this general structure, there is no way of partitioning them in a useful way. The events are captured via a Java application that validates the events and then stores them in an events table. Each event gets a uuid and a recordtime when it is captured.
In addition, there can be subscriptions for external applications, which should get all events matching custom criteria. When a new matching event is captured, the event should be PUSHED to the subscriber. To ensure that the subscriber does not miss any event, I'm currently forcing the capture process to be single-threaded. When a new event comes in, a lock is set, the event gets a recordtime assigned to the current time and the event is finally inserted into the DB table (explicitly waiting for the commit). Then the lock is released. For a subscription which runs on a schedule, for example every 5 seconds, I track the recordtime of the last sent event and execute a query for new events like where recordtime > subscription_recordtime. When the matching events are successfully pushed to the subscriber, the subscription_recordtime is set to the events' max recordtime.
Everything is actually working, but as you can imagine, a single-threaded capture process does not scale very well. Thus the main question is: how can I optimise this and allow, for example, multiple capture processes running in parallel?
I already thought about setting the recordtime in the DB itself on insert, but since the order of commits cannot be guaranteed (JVM pauses), I think I might lose events when two capture transactions are running at nearly the same time. If I understand the DB-generated timestamp correctly, it is set before the actual commit. Thus a transaction with recordtime t2 can already be visible to the subscription query although another transaction with recordtime t1 (t1 < t2) is still ongoing and so has not been committed. The recordtime for the subscription will be set to t2 and so the event from transaction 1 will be lost...
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/committed? Every newly visible event must have a later timestamp than the event before (strictly monotonically increasing). I know about a full table lock, but I think I would then have the same performance penalties as before.
Is it possible to set the DB to use a single-threaded writer? Then each capture process would also be waiting for another write TX to finish, but on a DB level, which would be much better than a single-instance/single-threaded capture application. Or can I use a different field/id for tracking the current state? Normal sequence ids will suffer from the same problem.

Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/ committed?
You should not be concerned with global ordering of events. Your events should contain a Version property. When writing events, you should always be inserting monotonically increasing Version numbers for a given Aggregate/Stream ID. That really is the only ordering that should matter when you are inserting. For Customer ABC, with events 1, 2, 3, and 4, you should only write event 5.
A database transaction can ensure the correct order within a stream using the rules above.
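As a rough sketch of that rule, the example below uses Python's built-in sqlite3 as a stand-in for Postgres. The table layout and the append_event helper are assumptions made for illustration. The primary key on (stream_id, version) is what rejects two writers that try to append the same version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        stream_id TEXT    NOT NULL,
        version   INTEGER NOT NULL,
        payload   TEXT    NOT NULL,
        PRIMARY KEY (stream_id, version)
    )
""")


def append_event(conn, stream_id, expected_version, payload):
    """Append version expected_version + 1; return False if the writer is out of date."""
    try:
        with conn:  # commits on success, rolls back on exception
            current = conn.execute(
                "SELECT COALESCE(MAX(version), 0) FROM events WHERE stream_id = ?",
                (stream_id,),
            ).fetchone()[0]
            if current != expected_version:
                return False
            conn.execute(
                "INSERT INTO events (stream_id, version, payload) VALUES (?, ?, ?)",
                (stream_id, expected_version + 1, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the primary key rejected a concurrent insert of the same version


ok_first = append_event(conn, "customer-ABC", 0, "created")        # writes version 1
ok_stale = append_event(conn, "customer-ABC", 0, "created-again")  # stale writer is rejected
ok_next = append_event(conn, "customer-ABC", 1, "renamed")         # writes version 2
```

Writers for different streams never conflict with each other, which is why global ordering is not needed on the write side.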
For a subscription which runs scheduled for example every 5 seconds, I track the recordtime of the last sent event, and execute a query for new events like where recordtime > subscription_recordtime.
Reading events is a slightly different story. Firstly, you will likely have a serial column to uniquely identify events. That will give you ordering and allow you to determine if you have read all events. When you read events from the store, you may detect a gap in the sequence. This happens if an insert was in flight when you read the latest events. In this case, simply re-read the data and see if the gap is gone. This requires your subscription to maintain its position in the index. Alternatively or additionally, you can read only events that are at least N milliseconds old, where N is a threshold high enough to compensate for delays in transactions (e.g. 500 or 1000).
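A minimal sketch of the gap check, assuming rows are (serial_id, payload) tuples and that the subscription remembers the last id it delivered. read_contiguous is a hypothetical helper. A rolled-back insert leaves a permanent gap in a serial column, so a real reader would stop waiting after some timeout; that part is left out here.

```python
def read_contiguous(rows, last_position):
    """Return events after last_position, stopping at the first gap in the serial ids."""
    batch = []
    expected = last_position + 1
    for row_id, payload in sorted(r for r in rows if r[0] > last_position):
        if row_id != expected:
            break  # gap: an insert with a lower id may still be in flight
        batch.append((row_id, payload))
        expected += 1
    new_position = batch[-1][0] if batch else last_position
    return batch, new_position


visible = [(1, "a"), (2, "b"), (4, "d")]  # id 3 has not been committed yet
first, pos = read_contiguous(visible, last_position=0)
visible.append((3, "c"))                  # the in-flight insert commits
second, pos2 = read_contiguous(visible, last_position=pos)
```

The first read delivers only ids 1 and 2; the re-read then picks up 3 and 4 in order.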
Also, bear in mind that there are open source RDBMS event stores that you can either use or leverage in your process.
Marten: http://jasperfx.github.io/marten/documentation/events/
SqlStreamStore: https://github.com/SQLStreamStore/SQLStreamStore

How to do Kafka stream transformations (map / flatMap) taking into account values in a Key/Value store?

My task is the following:
I am monitoring time synchronization events from a third-party measuring device. This time synchronization is a bit flaky so I want to detect when synchronization stops and issue an alarm.
For this, I am producing the synchronization events to a Kafka topic. I have three different events going on:
Synchronization request
Synchronization successful
Synchronization failed because other device did not respond
So, what I want to do:
When request is received, and nothing is received after a certain amount of time, I want to issue a "timeout" alarm
When request is received, and within the timeout period, a success event arrives, I want to issue a "timeout" if no request arrives after the timeout time
When a failure event arrives, I want to issue the "other device did not respond" alarm
I am currently in the process of setting up a Kafka Streams application, and I need to store the state in case this application crashes (it should not, but I want to be sure), so I set this up as follows:
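import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.state.Stores

val builder = new StreamsBuilder
val storeBuilder = Stores.keyValueStoreBuilder(
  Stores.persistentKeyValueStore("timesync-alarms"),
  Serdes.String(),
  logEntrySerde)
builder.addStateStore(storeBuilder)
val eventStream = builder.stream(sourceTopic, Consumed.`with`(Serdes.String(), logEntrySerde))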
Now, I am stuck. I think I basically need a flatMap function on the eventStream that, whenever an event arrives:
1. Queries the store for the last event that was processed
2. Decides if an alarm is to be raised
3. Updates the store with the currently received event
4. Produces the alarm, if any
So, how do I achieve steps 1 and 3 here? Or am I conceptually wrong and have to do it differently?

I think you don't need to use state store directly. You can create two streams - one with sync request events, the second one with sync responses (success, fail) and join them:
requestStream.outerJoin(responseStream, (leftVal, rightVal) -> ...,
    JoinWindows.of(timeout), ...);
In the case of timeout rightVal is null.
If you want to send alarms to a separate topic, you can simply filter the joined stream and write all failures (error responses and timeouts) to the topic. Otherwise you can use peek() method and trigger some action inside. Here is a simple example: https://github.com/djarza/football-events/blob/master/football-ui/src/main/java/org/djar/football/ui/projection/StatisticsPublisher.java
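The outcome such a join and filter are meant to produce can be modelled without Kafka. This is a plain Python sketch of the alarm rules only, not of Kafka's join semantics: find_alarms and the event tuples are hypothetical, and time advances only when a new event arrives, so a request at the very end of the input never times out.

```python
def find_alarms(events, timeout_ms):
    """events: (timestamp_ms, device, kind) with kind in {'request', 'success', 'failure'}."""
    alarms = []
    pending = {}  # device -> timestamp of the request still waiting for a response
    for ts, device, kind in sorted(events):
        # a pending request older than the timeout never got a matching response
        for dev, req_ts in list(pending.items()):
            if ts - req_ts > timeout_ms:
                alarms.append((req_ts + timeout_ms, dev, "timeout"))
                del pending[dev]
        if kind == "request":
            pending[device] = ts
        elif kind == "failure":
            pending.pop(device, None)
            alarms.append((ts, device, "other device did not respond"))
        elif kind == "success":
            pending.pop(device, None)
    return alarms


events = [
    (0, "dev1", "request"),
    (1_000, "dev1", "success"),   # answered in time: no alarm
    (10_000, "dev1", "request"),  # never answered: timeout alarm at 15_000
    (20_000, "dev2", "request"),
    (21_000, "dev2", "failure"),  # explicit failure alarm
]
alarms = find_alarms(events, timeout_ms=5_000)
```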

Distributed timer service

I am looking for a distributed timer service. Multiple remote client services should be able to register for callbacks (via REST apis) after specified intervals. The length of an interval can be 1 minute. I can live with an error margin of around 1 minute. The number of such callbacks can go up to 100,000 for now but I would need to scale up later. I have been looking at schedulers like Quartz but I am not sure if they are a fit for the problem. With Quartz, I will probably have to save the callback requests in a DB and poll every minute for overdue requests on 100,000 rows. I am not sure that will scale. Are there any out of the box solutions around? Else, how do I go about building one?

Posting as an answer since I can't comment.
One more option to consider is a message queue, where you publish a message with a scheduled delay so that consumers can consume it only after that delay.
Amazon SQS Delay Queues
Delay queues let you postpone the delivery of new messages in a queue for the specified number of seconds. If you create a delay queue, any message that you send to that queue is invisible to consumers for the duration of the delay period. You can use the CreateQueue action to create a delay queue by setting the DelaySeconds attribute to any value between 0 and 900 (15 minutes). You can also change an existing queue into a delay queue using the SetQueueAttributes action to set the queue's DelaySeconds attribute.
Scheduling Messages with RabbitMQ
https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/
A user can declare an exchange with the type x-delayed-message and then publish messages with the custom header x-delay expressing in milliseconds a delay time for the message. The message will be delivered to the respective queues after x-delay milliseconds.

Out of the box solution
RocketMQ meets your requirements since it supports the Scheduled messages:
Scheduled messages differ from normal messages in that they won’t be delivered until a provided time later.
You can register your callbacks by sending such messages:
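// The message body must be a byte array; an empty body is used here.
Message message = new Message("TestTopic", "".getBytes());
// Delay level 3 is the third entry of the broker's messageDelayLevel setting.
message.setDelayTimeLevel(3);
producer.send(message);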
And then, listen to this topic to deal with your callbacks:
consumer.subscribe("TestTopic", "*");
consumer.registerMessageListener(new MessageListenerConcurrently() {...});
It does well in almost every way, except that the DelayTimeLevel options can only be defined before the RocketMQ server starts. This means that if your MQ server has the configuration messageDelayLevel=1s 5s 10s, you cannot register a callback with delayIntervalTime=3s.
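The limitation can be made concrete with a small helper that maps a requested delay to a configured level. This is only an illustrative sketch: parse_levels and level_for are hypothetical functions, not part of any RocketMQ client, and only the s, m and h suffixes are handled.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_levels(config):
    """Turn a messageDelayLevel string such as '1s 5s 10s' into seconds per level."""
    return [int(item[:-1]) * UNIT_SECONDS[item[-1]] for item in config.split()]


def level_for(config, delay_seconds):
    """Return the 1-based delay level matching delay_seconds exactly, or None."""
    levels = parse_levels(config)
    return levels.index(delay_seconds) + 1 if delay_seconds in levels else None


ten_seconds = level_for("1s 5s 10s", 10)   # third entry of the configuration
three_seconds = level_for("1s 5s 10s", 3)  # no such level is configured
```

A caller that gets None back has to either round to a neighbouring level or use a different mechanism.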
DIY
Quartz plus storage can be used to build such a callback service, as you mentioned. However, I don't recommend storing the callback data in a relational DB: you want high TPS, and in a distributed service it is hard to avoid locks and transactions, which add complexity to the DB code.
I suggest storing the callback data in Redis instead, because it performs better than a relational DB and its ZSET data structure suits this scenario well.
I once developed a timed callback service based on Redis and Dubbo. It provides some more useful features. Maybe you can get some ideas from it: https://github.com/joooohnli/delay-callback
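Why a sorted set fits can be seen from an in-memory stand-in. The sketch below uses Python's heapq in place of Redis, and DelayedCallbacks with its methods is a hypothetical class. With Redis, the assumption is that registration would be a ZADD with the due time as score and polling a range-by-score query followed by removal.

```python
import heapq


class DelayedCallbacks:
    """In-memory stand-in for a sorted set scored by due time."""

    def __init__(self):
        self._heap = []  # (due_time, callback_id), smallest due_time first

    def register(self, callback_id, due_time):
        heapq.heappush(self._heap, (due_time, callback_id))

    def pop_due(self, now):
        """Remove and return the ids of all callbacks whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


timers = DelayedCallbacks()
timers.register("cb-2", due_time=120)
timers.register("cb-1", due_time=60)
timers.register("cb-3", due_time=600)
fired = timers.pop_due(now=130)  # cb-1 and cb-2 are due, cb-3 is not
```

Each poll touches only the entries that are actually due, rather than scanning all registered callbacks.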
