How to design Kafka producer partitioning to guarantee this partial order when consuming? - apache-kafka

Here is my case:
The system produces messages to one topic, and there are two kinds of messages:
A. user data messages
produced: every time any user's data changes.
marked as: U1, U2, ..., Un.
B. user attribute metadata change messages
e.g.: a user has two attributes, name and email, and then a custom attribute profile is added.
produced: every time the user attribute metadata changes.
marked as: M
When we consume this topic, we need to guarantee two partial orders:
The same user's data must be consumed in the order it was produced.
A metadata change message must be consumed after every user data message produced before it, and before every user data message produced after it.
Example:
message natural order:
(0:U1)->(1:U2)->(2:U1)->(3:U3)->(4:U1)->(5:M)->(6:U1)->(7:U2)->(8:U2)->(9:M)->(10:U1)
accepted consuming order:
(0:U1)->(2:U1)->(3:U3)->(1:U2)->(4:U1)->(5:M)->(7:U2)->(6:U1)->...
The question
If there were no M messages, I could put different users' data into different partitions to increase throughput. But given M's ordering requirement, can I still use multiple partitions for this topic?

You can use any kind of user identifier (it should be the same for the user data messages and the user attribute metadata change messages of the same user, and at the same time unique to a particular user) as the key of the Kafka message. That way, the data gets partitioned by the user identifier, and you ensure that all data of one user goes to a single partition while keeping its order. That way you can scale with multiple partitions.
When producing the messages to the topic, make sure to produce them synchronously, e.g. wait until the first message is acknowledged before sending the second.
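For illustration, here is a minimal Java producer sketch along these lines (the topic name, serializers, and sample payload are assumptions, not taken from the question):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by user id: all messages of one user land in the same partition, in order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-events", "user-42", "{\"name\":\"Alice\"}");
            // Block on the returned future so this message is acknowledged
            // before the next one for the same key is sent.
            producer.send(record).get();
        }
    }
}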

Related

Get API requests to be processed sequentially

I am sending messages for two different users from the same API method. How can I ensure that the second user's messages are not read before all of the first user's messages have been read?
This is how it works right now: 1. a message from the first request is read, 2. a message from the second request is read, and so on alternately. How can I make it read all messages from the first request first, and only then those from the second?
Ordering in Kafka is achieved per partition, via the message key. Every topic is internally divided into partitions, and the producer can optionally set the message key when events are published to a topic.
If no key is specified, the message gets assigned to a random partition. But if we want to maintain the order of events, we should use a consistent key, e.g. a customer ID or a session key would be a good option. This makes sure that the published events are appended at back-to-back offsets within one partition.
Your requirement can be achieved by assigning a key as described above, but if it were the other way around, i.e. the second message had to be consumed before the first one, I think that could never be achieved by Kafka alone.
Looking at your input, you are using a web service to publish the events. Check whether the service has an option to set the key of a message. If it does, use the same key for the two events and they will be ordered. Otherwise you may need to enhance the service to accept keys as well, not just values.
A slightly more advanced option for batch processing: https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner
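As a rough illustration of why the same key always lands in the same partition (this is a simplified stand-in, not Kafka's exact default partitioner, which hashes the key bytes with murmur2):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSketch {
    // Simplified: hash the key bytes and take the result modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes) & 0x7fffffff; // keep it non-negative
        return hash % numPartitions;
    }

    public static void main(String[] args) {
        // The same customer id always maps to the same partition...
        System.out.println(partitionFor("customer-123", 6));
        System.out.println(partitionFor("customer-123", 6)); // same result
        // ...but changing the partition count generally changes the mapping.
        System.out.println(partitionFor("customer-123", 12));
    }
}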

Applying KTable enrichment on data prior to filling row

I've got sales messages with timestamps, and several messages belonging to the same sale share the same ID. But only one of them contains a field that I want to store in a KTable in order to enrich the following messages with the corresponding ID.
I cannot be sure that the message with the necessary field will always be sent first.
Is it possible to do a join that also includes the messages that arrived prior to populating the KTable (let's say timestamp - 5 min)?
(What if your data comes in batches with breaks of x min?)
Thank you!
Not 100% sure if I understand the use case, but it seems you only want to store a message if it contains the corresponding field, and drop the message otherwise. For this case, you could read the data as a KStream and apply a filter before you put the records into a table:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("table-topic");
KTable<String, String> table = input
    .filter((key, value) -> value == null || value.contains("\"field\"")) // placeholder check for the field
    .toTable();
Note that you might want to ensure that tombstone messages, i.e., messages with value == null, are not filtered out (as the predicate above does), to preserve delete semantics (this seems to be use-case dependent).

Kafka to store messages on a single partition for a user?

I have an e-commerce-like system which produces user events of different kinds.
I need to store them in Kafka for asynchronous data analysis. I want the events of a specific user to go to one partition, so that a consumer gets all of that user's messages from one partition. This won't be a dedicated partition per user, which means a single partition can store the data of multiple customers. Not sure how I can achieve this in Kafka?
To send the messages of a specific user to the same partition, you can use the key= parameter of the producer's send method. Set this parameter to a byte-encoded value that is unique per user.
For example, in Python:
producer.send("topic", json.dumps(msg).encode(), key=str(user_id).encode())
This will ensure that messages concerning a given user are pushed to the same partition of the topic.
#zebra8844's answer is correct. The same key will always go to the same partition, unless you increase the number of partitions in the future, in which case this will no longer hold. So just keep this in mind for the future.

Consume only topic messages for a given account

I have a service calculating reputation scores for accounts. It puts the calculation results in a Kafka topic called "ReputationScores". Each message looks something like this:
{ "account" : 12345, "repScore" : 98765}
I'd like my consumer to be able to consume only those messages for a specific account.
For example, I'd like to have a single consumer instance consume only those "ReputationScores" messages for account 12345. That instance should probably be the only member of its consumer group.
Can Kafka filter based on message contents? What's the best way to do this?
Thanks for your help.
Can Kafka filter based on message contents?
Since Kafka itself doesn't know what's in your data, it cannot index it, and therefore the data is not readily searchable. You would need to process the full topic and apply an explicit check for which deserialized records you want to keep. This is, for example, what a stream processing application with a simple filter operation would give you.
If you want to preserve the ability to do lookups for a particular item, you will either need a custom partitioner that segments all the data you're interested in, or a topic per item (which really only works for certain use cases, not for things like individual user accounts).
You could also look at inserting all events into an in-memory database and then running queries against that.
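As a sketch of the filter approach mentioned above, assuming the value is the JSON string from the question (the application id, bootstrap server, naive string check, and output topic are all assumptions):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ReputationScoreFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reputation-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> scores = builder.stream("ReputationScores");
        // Naive check on the raw JSON string; a real application would deserialize the value.
        scores.filter((key, value) -> value != null && value.contains("\"account\" : 12345"))
              .to("ReputationScores-12345"); // hypothetical output topic for this account
        new KafkaStreams(builder.build(), props).start();
    }
}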

Kafka Processor API: Different key for Source and StateStore?

We are currently implementing a process (using the Kafka Processor API) where we need to combine information from two correlated events (messages) on a topic and then forward the combined information. The events originate from IoT devices, and since we want to keep them in order, the source topic uses a device identifier as its key. The events also contain a correlation ID:
Key
{ deviceId: "..." }
Message
{ deviceId: "...", correlationId: "...", data: ...}
Our first approach was to create a Processor with a connected state store, which stores every incoming message using the correlation ID as the key. That enables us to query the store for the correlation ID of an incoming message, and if there already is a message with the same ID in the store, we can combine the information, forward a new event, and remove the entry from the store. So for every correlation ID the following happens: at some point the first message with that ID is consumed and stored, and at some later point in time the second message with that ID results in the store entry being removed.
State Key
{ correlationId: "..." }
State Value
{ event: { deviceId: "...", correlationId: "...", data: ... }}
But now we are wondering how Kafka Streams handles the different keys. We are using a microservice approach, and there will be multiple instances of that service running. The store is automatically backed by an internal topic. Consider re-scaling the service instances, such that the partitions of the source topic and the state topic are rebalanced. Is it possible that the partition for a specific correlation ID is assigned to a different service instance than the partition for the corresponding device ID? Could we end up in a situation where the second event with a given correlation ID is consumed by a service instance that does not have access to the already stored first event?
Thanks in advance!
If I understand your setup correctly, then yes, the approach is correct and (re)scaling will just work.
TL;DR: If a stream task is moved from machine A to machine B, then all its state will be moved as well, regardless of how that state is keyed (in your case it happens to be keyed by correlationId).
In more detail:
Kafka Streams assigns processing work to stream tasks. This happens by mapping input partitions to stream tasks in a deterministic manner, based on the message key in the input partitions (in your case: keyed by deviceId). This ensures that, even when stream tasks are being moved across machines/VMs/containers, they will always see "their" input partitions = their input data.
A stream task consists, essentially, of the actual processing logic (in your case: the Processor API code) and any associated state (in your case: one state store that is keyed by correlationId). What's important for your question is that it does not matter how the state is keyed. It's only important how the input partitions are keyed, because that determines which data flows from the input topic to a specific stream task (see the previous bullet point). When a stream task is moved across machines/VMs/containers, all of its state is moved as well, so that it always has "its own" state available.
The store is automatically backed by an internal topic.
As you already suggested, the fact that a store is backed by an internal topic (for fault tolerance and elastic scaling, because that internal topic is used to reconstruct a state store when its stream task has moved from A to B) is an implementation detail. As a developer using the Kafka Streams API, the handling of state store recovery is done automagically and transparently for you.
When a stream task is being moved, and thus its state store(s), then Kafka Streams knows how it needs to reconstruct the state store(s) at the new location of the stream task. You don't need to worry about that.
Is it possible that the partition for a specific correlation ID is assigned to a different service instance than the partition for the corresponding device ID?
No (which is good). A stream task will always know how to reconstruct its state (1+ state stores), regardless of how that state itself is keyed.
Could we end up in a situation where the second event with a given correlation ID is consumed by a service instance that does not have access to the already stored first event?
No (which is good).
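For reference, a minimal sketch of the approach described in the question using the Processor API (the store name, the Event/CombinedEvent types, and the class name are illustrative assumptions, not from the question):
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Input is keyed by deviceId; the attached store is keyed by correlationId.
public class CorrelationProcessor implements Processor<String, Event, String, CombinedEvent> {

    private ProcessorContext<String, CombinedEvent> context;
    private KeyValueStore<String, Event> store;

    @Override
    public void init(ProcessorContext<String, CombinedEvent> context) {
        this.context = context;
        // "correlation-store" must be attached to this processor when building the topology.
        this.store = context.getStateStore("correlation-store");
    }

    @Override
    public void process(Record<String, Event> record) {
        Event incoming = record.value();
        Event earlier = store.get(incoming.correlationId());
        if (earlier == null) {
            // First event for this correlation id: remember it.
            store.put(incoming.correlationId(), incoming);
        } else {
            // Second event: combine, forward downstream, and clean up the store entry.
            context.forward(record.withValue(new CombinedEvent(earlier, incoming)));
            store.delete(incoming.correlationId());
        }
    }
}

// Illustrative value types (in a real application these would come from your serdes).
record Event(String deviceId, String correlationId, String data) {}
record CombinedEvent(Event first, Event second) {}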