Is there a way to tell which event occurred first in two Kafka topics?

If I have two topics in Kafka, is there a way to tell whether an event in one topic "occurred" before an event in the other topic if they both come in within a millisecond of each other, i.e. they have the same timestamp?
Background:
I am building an event-sourcing based, event-driven architecture. Often, when an event occurs in one topic, I need to scan to find whether a separate event has already occurred in a second topic. Likewise, if the event in the second topic comes in, I need to scan to see if the event in topic one has occurred.
In order to not duplicate processing, I need a deterministic way to order the events. If the events are more than 1 millisecond apart, I can just use the timestamp in the event. But, because Kafka timestamps only go to the millisecond, when two events occur close together, I can no longer use this approach.
In reality, I don't care which topic occurred "first"; i.e. if Kafka posted one before another, even if they came in a different order, I don't care. I just need a deterministic way to order them.
In reality, I can use some method, such as ordering the events by topic name alphabetically, but I was hoping there was a built-in mechanism. (I don't want to introduce weird bugs because I always process event A before event B; unlikely, but I've seen it happen.)
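For illustration, here is a rough sketch of the kind of deterministic tie-break I mean (the field precedence is arbitrary, as long as every consumer uses the same one):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.util.Comparator;

// Rough sketch of the tie-break: compare by record timestamp first, then by topic
// name, partition and offset, so two events in the same millisecond still get a
// stable, deterministic order on every consumer.
public class DeterministicOrder {
    public static final Comparator<ConsumerRecord<?, ?>> BY_TIME_THEN_TOPIC =
        Comparator.<ConsumerRecord<?, ?>>comparingLong(ConsumerRecord::timestamp)
            .thenComparing(ConsumerRecord::topic)
            .thenComparingInt(ConsumerRecord::partition)
            .thenComparingLong(ConsumerRecord::offset);
}
```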
PS I am open to other ideas. I'm considering this approach because it was possible in Redis Streams. However, because of things I can't control, I am restricted to Kafka. I also want to avoid using an external data store, as then I need to start worrying about data synchronization there.

You're going to run into synchronization issues regardless. For example, you could try using a stream-table join in Kafka Streams. If the event doesn't exist for the join, then it hasn't happened yet - but then you're reliant on having absolutely zero lag in the consumer processes building that KTable.
You could try storing nanoseconds as part of the value or a header when you create the record if you need higher precision. But again, you're going to need either absolutely zero lag or very precise consumer poll events with some comparison window, as Kafka does not provide any processing guarantees across multiple topics.
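For example, a minimal sketch of the header idea (the header name "nano-ts" and the epoch-nanosecond encoding are arbitrary choices, and Instant.now() resolution depends on the JVM and OS):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.ByteBuffer;
import java.time.Instant;

// Sketch: attach a higher-precision timestamp as a header at produce time.
// "nano-ts" is a made-up header name, not anything Kafka defines.
public class NanoTimestamped {
    public static ProducerRecord<String, String> create(String topic, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        Instant now = Instant.now();
        long nanosSinceEpoch = now.getEpochSecond() * 1_000_000_000L + now.getNano();
        record.headers().add("nano-ts",
            ByteBuffer.allocate(Long.BYTES).putLong(nanosSinceEpoch).array());
        return record;
    }
}
```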

Related

Schema registry incompatible changes

In all the documentation it is clearly described how to handle compatible changes with Schema Registry using compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation (see image) where the producer is producing the same message in both schema versions:
[image: the producer producing the same message under both schema versions]
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message again (in the new format).
consumers are not allowed to process the same message again (in the new format).
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist a unique id (UUID) as part of each record, across all schema versions, and then query some source of truth for which ids have already been processed. When a record has not been processed, insert its id into that system after the record has been handled.
This may require placing a stream-processing application that filters already-processed records out of a topic before the sink connector consumes it.
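A hypothetical sketch of that check (ProcessedStore stands in for whatever source of truth you query, e.g. a database table with a unique constraint on the id; it is not a Kafka or Connect API):

```java
import java.util.UUID;

// Sketch of UUID-based deduplication against an external source of truth.
public class DedupHandler {

    // Stand-in for your own store (e.g. a DB table keyed by the record id).
    interface ProcessedStore {
        boolean alreadyProcessed(UUID recordId);
        void markProcessed(UUID recordId);
    }

    private final ProcessedStore store;

    public DedupHandler(ProcessedStore store) {
        this.store = store;
    }

    public void handle(UUID recordId, Runnable businessLogic) {
        if (store.alreadyProcessed(recordId)) {
            return;                      // already seen (possibly under the old schema) -> skip
        }
        businessLogic.run();             // process the record
        store.markProcessed(recordId);   // record the id only after successful processing
    }
}
```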
I figure what you are looking for is a kind of equivalent to a topic offset, but one spanning multiple topics. Technically this is not provided by Kafka, and with good reason, I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state regarding which messages have been processed, and when switching to the other topic, filter out the messages that were already processed from the first one. You could use your own sequence numbering or timestamps to keep track of progress across topics. Using a sequence makes it easier to keep track of progress, as only one value needs to be stored on the consumer end; using UUIDs or other non-sequence ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages have to be skipped, and depending on the amount, this might cause a delay that you need to be willing to accept.
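As a minimal sketch of the sequence-based approach (all names here are made up; the producer is assumed to stamp one shared, monotonically increasing sequence number onto every record in both topics):

```java
// Minimal sketch of a "functional offset": the consumer only has to persist the
// highest sequence number it has completed, and skip anything at or below it
// when it switches to the new topic.
public class FunctionalOffsetTracker {

    private long lastProcessedSeq;       // load this from durable storage on startup

    public FunctionalOffsetTracker(long lastProcessedSeq) {
        this.lastProcessedSeq = lastProcessedSeq;
    }

    /** Returns true if the record still needs processing, false if it was already handled. */
    public boolean shouldProcess(long recordSeq) {
        return recordSeq > lastProcessedSeq;
    }

    public void markProcessed(long recordSeq) {
        lastProcessedSeq = Math.max(lastProcessedSeq, recordSeq);
        // persist lastProcessedSeq to durable storage here
    }
}
```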

Event Driven Architectures - Topics / Stream Design

This could be some kind of best-practice question. Could someone who has worked on this clarify with examples, so that all of us can benefit?
For event-driven architectures with Kafka / Redis, when we create topics/streams for events, what are the best practices to follow?
Let's consider an online order-processing workflow.
I read some blogs saying to create topics/streams like order-created-events, order-deleted-events, etc. But my question is: how is the order of the messages guaranteed when we split this into multiple topics?
For ex:
order-created-events could have thousands of events and be processed slowly by a consumer, while order-deleted-events could have only a few records in the queue, assuming only 5-10% of users cancel their order.
Now, let's assume a user first places an order and then immediately cancels it. This will cause the order-deleted-event to be processed first, since that topic/stream does not have many messages queued ahead of it, before some consumer processes the order-created-event for the same order. That will cause data inconsistency.
Hopefully my question is clear. So, how to come up with topics/streams design?
Kafka ensures sequencing for a particular partition only.
So, to take advantage of Kafka's partitioning and load balancing across partitions, you should create multiple partitions for a single topic (like order).
Then use a partitioner class (or a consistent key scheme) to generate a key for every message, such that every message for a given order maps to the same partition.
That way, irrespective of Order A getting created, updated or deleted, all of its events always belong to the same partition.
To properly achieve sequencing, this should be the basis for deciding on topics, instead of two different topics for different activities.
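A sketch of a producer keyed that way might look like this (the "orders" topic name, broker address, and JSON payloads are assumptions; the default partitioner hashes the key, so all events for one order land in the same partition and keep their relative order):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// One topic for all order events, keyed by order id, instead of one topic per activity.
public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-123";
            // Same key -> same partition -> created is consumed before deleted.
            producer.send(new ProducerRecord<>("orders", orderId, "{\"type\":\"ORDER_CREATED\"}"));
            producer.send(new ProducerRecord<>("orders", orderId, "{\"type\":\"ORDER_DELETED\"}"));
        }
    }
}
```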

Kafka: Is it good practice to keep topic offsets in a database?

I have started learning Kafka. I don't have much knowledge of live projects where Kafka is used.
I wanted to know whether the offset can be saved in a database, apart from being committed to the broker?
I think it should always be saved, otherwise some records will be missed or re-processed.
For example, if the offset is not saved in a database: when the application (consumer) is deployed or restarted, and some message is sent to the broker during that window, it may be missed, because when the consumer comes back up it will read from the next record onward (or from the start).
The short answer to your question is "it's complicated" :-)
The long answer to your question is something like:
Kafka (without extra configuration and/or careful design of your code) is an at-least-once system (see the official documentation). This means that yes, your consumer may see a particular set of records more than once. This won't happen on a graceful shutdown/rebalance, but it will definitely happen if your application crashes.
Newer versions of Kafka support so-called "exactly once" semantics. This involves configuring your clients differently (and a significant performance and latency hit), and the guarantees only ever hold if all your inputs and outputs are from/to the exact same Kafka cluster. So if your consumer does anything like call an external HTTP API or insert into a database in response to seeing a Kafka record, we are back to at-least-once.
If your outputs go to a transactional system (like a classic ACID database), a common pattern is to start a transaction and, in that transaction, record both your outputs and the consumer offsets (you would also need to change your code to restore from these DB offsets rather than the Kafka defaults). This has better guarantees, but still won't help if your code interacts with non-transactional systems, like making an HTTP call.
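A rough sketch of that pattern might look like the following (OffsetDao, the "events" topic, and the group id are stand-ins for your own code; the important parts are disabling auto-commit, seeking to the DB-stored offsets on assignment, and writing output and offset in one DB transaction):

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

// Sketch of the "offsets in the database" pattern.
public class DbOffsetConsumer {

    // Stand-in for your own DAO, backed by a transactional database.
    interface OffsetDao {
        long loadOffset(String topic, int partition);                       // next offset to read
        void saveOutputAndOffset(ConsumerRecord<String, String> record);    // one DB transaction
    }

    public static void run(OffsetDao dao) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", "db-offset-group");             // assumed group id
        props.put("enable.auto.commit", "false");             // Kafka no longer owns the offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Restore position from the database instead of the committed Kafka offset.
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, dao.loadOffset(tp.topic(), tp.partition()));
                }
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Output row and (topic, partition, offset + 1) stored atomically together.
                dao.saveOutputAndOffset(record);
            }
        }
    }
}
```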
Another common design pattern to overcome at-least-once is to somehow "tag" every operation you do (every record you produce, HTTP call you make, ...) with some UUID that derives from the original Kafka records consumed to produce this output. This means that if your consumer sees the same record again, it will perform the same operations again and repeat the same "tag" value. This shifts the burden to downstream systems, which must now remember (at least for some period of time) all the "tags" they have seen so they can disregard a repeat operation, or you must design all your operations to be idempotent.
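For instance, a stable "tag" could be derived from the consumed record's coordinates, so a re-delivered record always produces the same tag (a sketch, not a library API):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Derive a deterministic id from topic/partition/offset: reprocessing the same record
// after a crash repeats the same tag, so downstream systems can recognise the duplicate.
public class OperationTag {
    public static UUID forRecord(ConsumerRecord<?, ?> record) {
        String coordinates = record.topic() + "-" + record.partition() + "-" + record.offset();
        return UUID.nameUUIDFromBytes(coordinates.getBytes(StandardCharsets.UTF_8));
    }
}
```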

Kafka Streams Sort Within Processing Time Window

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say the topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's effectively random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream and publish each message to another topic, partitioning the records by user id (i.e. repartitioning the original stream by user id).
Then we consume this repartitioned stream, and we store the consumed records in a local state store, windowed by 10 minutes. All clicks of a particular user are always in the same partition, but their order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time is advanced when new records come in, but I need this window to use processing time, not event time, and then, when the window expires, I need to be able to sort the buffered events and emit them in that order to another topic.
Notice:
We need to be able to flush/process records within the window using processing time, not event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to come for a particular user.
I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what the right way would be to implement what I need (more importantly, taking into account the recovery requirement described above).
You can see my answer to a similar question here:
https://stackoverflow.com/a/44345374/7897191
Since your message keys are already unique you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0, you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
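For illustration, a sketch against the 1.0.x Processor API might look roughly like this (the store name "click-buffer", the Click POJO, and the keying scheme are all assumptions; serde configuration and store registration are left out):

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Assumed event POJO; a real topology would also need serdes for it and for the store.
class Click {
    String userId;
    long eventTime;
}

// Buffer clicks in a changelog-backed store, then every 10 minutes of wall-clock time
// sort whatever is buffered by event time, forward it downstream, and clear the store.
public class SortAndFlushProcessor extends AbstractProcessor<String, Click> {

    private KeyValueStore<String, Click> buffer;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        buffer = (KeyValueStore<String, Click>) context.getStateStore("click-buffer");

        // KIP-138 wall-clock punctuation: fires even if no new records arrive, and the
        // store is restored from its changelog after a crash, so unflushed windows survive.
        context.schedule(10 * 60 * 1000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            List<KeyValue<String, Click>> buffered = new ArrayList<>();
            try (KeyValueIterator<String, Click> it = buffer.all()) {
                it.forEachRemaining(buffered::add);
            }
            buffered.sort(Comparator.comparingLong((KeyValue<String, Click> kv) -> kv.value.eventTime));
            for (KeyValue<String, Click> kv : buffered) {
                context().forward(kv.value.userId, kv.value);   // emit in sorted order
                buffer.delete(kv.key);                          // drop flushed records
            }
            context().commit();
        });
    }

    @Override
    public void process(String key, Click click) {
        // Key entries so they stay unique per user and event (the original key is unique already).
        buffer.put(click.userId + "|" + click.eventTime + "|" + key, click);
    }
}
```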

Delayed message consumption in Kafka

How can I produce/consume delayed messages with Apache Kafka? It seems that standard Kafka (and the Java kafka-client) functionality doesn't have this feature. I know that I could implement it myself with a standard wait/notify mechanism, but it doesn't seem very reliable, so any advice and good practices are appreciated.
I found a related question, but it didn't help.
As I see it: Kafka is based on sequential reads from the file system and can only be used to read topics straight through while keeping message ordering. Am I right?
Indeed, Kafka's lowest-level structure is the partition, which is a sequence of events in a queue with incremental offsets - you can't insert a record anywhere other than at the end at the moment you produce it. There is no concept of delayed messages.
What do you want to achieve exactly?
Some possibilities in your case:
You want to push a message at a specific time (for example, an event "start job"). In this case, use a scheduled task (not from Kafka; use some standard mechanism of your OS / language / custom app / whatever) to send the message at the given time - consumers will receive it at the proper time.
You want to send an event now, but it should not be taken into account by consumers until later. In this case, you can use a custom structure which includes a "time" in its payload. Consumers will have to understand this field and have custom processing to deal with it. For example: "start job at 2017-12-27T20:00:00Z". You could also use headers for this, but headers are not supported by all clients for now (see the sketch after this list).
You can change the timestamp of the message you send. Internally, it will still be read in order, but some time-based functions will behave differently, and consumers could use the timestamp of the message for their action - this is much like the previous suggestion, except the timestamp is metadata of the event rather than part of the event payload itself. I would not use this personally - I only touch the timestamp when I proxy some events.
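For what it's worth, a sketch of the first two options (the "jobs" topic, keys, and the "process-after" header name are made up; headers require clients and brokers on 0.11+):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedSendExamples {

    // Option 1: schedule the produce call itself outside of Kafka.
    public static void scheduleSend(KafkaProducer<String, String> producer,
                                    ScheduledExecutorService scheduler) {
        scheduler.schedule(
            () -> producer.send(new ProducerRecord<>("jobs", "job-42", "start job")),
            30, TimeUnit.MINUTES);
    }

    // Option 2: send now, but carry the intended processing time in a header
    // that consumers read and honor with their own logic.
    public static void sendWithProcessAfterHeader(KafkaProducer<String, String> producer) {
        ProducerRecord<String, String> record = new ProducerRecord<>("jobs", "job-43", "start job");
        record.headers().add("process-after",
            Instant.parse("2017-12-27T20:00:00Z").toString().getBytes(StandardCharsets.UTF_8));
        producer.send(record);
    }
}
```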
For your last question: basically, yes, but with some notes:
Topics are actually split into partitions, and order is only preserved within a partition. All messages with the same key are sent to the same partition.
Most of the time you only read from memory, except when you read old events - in that case, since those are read sequentially from disk, it is still very fast.
You can choose where to begin reading - at a given offset or a given time - and even change it at runtime (see the sketch after these notes).
You can parallelize reads across processes - multiple consumers can read the same topics without ever reading the same message twice (each reads different partitions; see consumer groups).
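As an illustration of seeking to a given time (the "events" topic and partition 0 are assumptions):

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Instant;
import java.util.Collections;
import java.util.Map;

// Rewind one partition of an assumed "events" topic to the first offset
// at or after a given wall-clock instant.
public class SeekByTime {
    public static void rewind(KafkaConsumer<String, String> consumer, Instant since) {
        TopicPartition tp = new TopicPartition("events", 0);
        consumer.assign(Collections.singletonList(tp));

        Map<TopicPartition, OffsetAndTimestamp> offsets =
            consumer.offsetsForTimes(Collections.singletonMap(tp, since.toEpochMilli()));
        OffsetAndTimestamp found = offsets.get(tp);
        if (found != null) {
            consumer.seek(tp, found.offset());   // the next poll() starts from this offset
        }
    }
}
```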