In a compacted Kafka topic, how to deal with relationships between entities? - apache-kafka

Let's say I have a compacted Kafka topic and I populate it with entities. The topic has only 1 partition.
For this example I use Employees as the entity. An Employee may have a Superior.
These are the messages in the topic:
Message number: 1
Key: 1
Values:
Employee id: 1
Employee name: Joe A.
Superior employee id: <null>
---
Message number: 2
Key: 2
Values:
Employee id: 2
Employee name: Frank L.
Superior employee id: 1
---
Message number: 3
Key: 1
Values:
Employee id: 1
Employee name: Joe A.-F. // Name of Employee 1 has changed.
Superior employee id: <null>
After some time, the topic is compacted. This means that for the Employee with id 1, the message number 1 is removed.
A client consuming this topic may want to build a relational model of Employees.
When the client now consumes the messages, it receives message number 2 as the first message.
Message number 2 contains a reference to a Superior with the id 1.
However, the Employee with id 1 does not exist yet, because message number 1 was removed. Employee 1 will only be received with the next message.
What's the best way to deal with these inconsistencies?
Sending all Employees in one single message in a hierarchical tree is probably not efficient, since there are many Employees.
When a Superior changes, should I resend all affected subordinate Employees, so that the relations can be consumed in the right order?
Is the consuming client responsible for dealing with inconsistent states? Does the client need to wait until the data is consistent?
Or, in a case such as this one, is it just not possible to use a compacted topic?

I think you should introduce the concept of message types; I guess the messages in your log should be something like this:
EmployeeCreated #1
EmployeeCreated #2
EmployeeNameChanged #1
So, if the key includes the message type and you compact by key, then you keep the most recent message of each type, and because of that, the first message will not be deleted.
Then the question is: do you really need compaction? Disk space is cheap, and the messages are small.
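If you do keep compaction, here is a minimal producer sketch of the composite-key idea above, assuming a compacted employees topic; the topic name, key format, and JSON payloads are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EmployeeEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = "<employeeId>:<eventType>", so compaction keeps the latest record
            // of each event type per employee instead of only the latest record per employee.
            producer.send(new ProducerRecord<>("employees",
                    "1:EmployeeCreated", "{\"id\":1,\"name\":\"Joe A.\"}"));
            producer.send(new ProducerRecord<>("employees",
                    "1:EmployeeNameChanged", "{\"id\":1,\"name\":\"Joe A.-F.\"}"));
        }
    }
}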

Related

How to guarantee message ordering over multiple topics in kafka?

I am creating a system in which I use Kafka as an event store. The problem I am having is not being able to guarantee the message ordering of all the events.
Let's say I have a User entity and an Order entity. Right now I have the topics configured as follows:
user-deleted
user-created
order-deleted
order-created
When consuming these topics from the start (when a new consumer group registers), first the user-deleted topic gets consumed, then user-created, etc. The problem with this is that the events across multiple topics do not get consumed chronologically, only within each topic.
Let's say 2 users get created and after this one gets deleted. The result would be one remaining user.
Events:
user-created
user-created
user-deleted
My system would consume these like:
user-deleted
user-created
user-created
This means the result is 2 remaining users, which is wrong.
I do set the partition key (with the user id), but this seems only to guarantee order within a topic. How does this problem normally get tackled?
I have seen people using a topic per entity, resulting in 2 topics for this example (user and order), but this can still cause issues with related entities.
What you've designed is "request/response topics", and you cannot order between multiple topics this way.
Instead, design "entity topics" or "event topics". This way, ordering will be guaranteed, and you only need one topic per entity. For example,
Topic users
For a key=userId, you can structure events this way.
Creates
userId, {userId: userId, name:X, ...}
Updates
userId, {userId: userId, name:Y, ...}
Deletes
userId, null
Use a compacted topic for an event-store such that all deletes will be tombstoned and dropped from any materialized view.
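A minimal sketch of producing those three shapes onto such a compacted topic, assuming a users topic and plain JSON string values (all names and payloads are illustrative); the delete is simply a null value, i.e. a tombstone:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "42";
            // Create: key = userId, value = full record
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"42\",\"name\":\"X\"}"));
            // Update: same key, new value replaces the old one once the log is compacted
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"42\",\"name\":\"Y\"}"));
            // Delete: same key, null value = tombstone, dropped by compaction
            producer.send(new ProducerRecord<>("users", userId, null));
        }
    }
}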
You could go a step further and create a wrapper record.
userId, {action:CREATE, data:{name:X, ...}} // full-record
userId, {action:UPDATE, data:{name:Y}} // partial record
userId, {action:DELETE} // no data needed
This topic acts as your "event entity topic", but then you need a stream processor to parse and process these events consistently into the above format, e.g. null-ing any action:DELETE record, and to write them to a compacted topic (perhaps automatically, using a Kafka Streams KTable).
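A minimal Kafka Streams sketch of that last step, assuming illustrative topic names user-events (wrapper records) and users-compacted (created with cleanup.policy=compact), and a naive string check for the DELETE action:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UserEventMaterializer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-materializer"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
               // Turn DELETE wrapper records into tombstones so compaction drops them;
               // CREATE/UPDATE values are passed through unchanged (a real app would merge
               // partial updates against the current state instead).
               .mapValues(value -> value != null && value.contains("\"action\":\"DELETE\"") ? null : value)
               .to("users-compacted", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}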
Kafka is not able to maintain ordering across multiple topics. Nor is it able to maintain ordering inside one topic that has several partitions. The only ordering guarantee we have is within each partition of one topic.
What this means is that if the order of user-created and user-deleted as known by a Kafka producer must be the same as the order of those events as perceived by a Kafka consumer (which, as you explain, is understandable), then those events must be sent to the same Kafka partition of the same topic.
Usually, you don't actually need the whole order to be exactly the same for the producer and the consumer (i.e. you don't need total ordering), but you need it to be the same at least for each entity id. That is, for each user id, the user-created and user-deleted events must be in the same order for the producer and the consumer, but it's often acceptable to have events mixed up across users (i.e. you need partial ordering).
In practice this means you must use the same topic for all those events, which means this topic will contain events with different schemas.
One strategy for achieving that is to use union types, i.e. you declare in your event schema that the type can either be a user-created or a user-deleted. Both Avro and Protobuf offer this feature.
Another strategy, if you're using Confluent Schema registry, is to allow a topic to be associated with several types in the registry, using the RecordNameStrategy schema resolution strategy. The blog post Putting Several Event Types in the Same Topic – Revisited is probably a good source of information for that.
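For the second strategy (RecordNameStrategy), a minimal configuration sketch, assuming the Confluent Avro serializer and a local Schema Registry; the broker and registry addresses are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class MultiTypeProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder
        // Register schema subjects by record name instead of by topic name, so one
        // topic can carry e.g. both UserCreated and UserDeleted Avro records.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // producer.send(...) with either a UserCreated or a UserDeleted record
        }
    }
}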

Detected out-of-order KTable update when updating same GlobalKtable from different input topics

I have a 2.4.1 Kafka Streams app which tracks 2 different topics, let's say topics A and B, joins them with one GlobalKTable, and sends messages to the source topic of this GlobalKTable. So the output of the app is one of its input sources.
The problem I am experiencing: with 2 messages close together (< 200 ms) in topics A and B, the output of the app from the first message in topic A is not considered when the next message from topic B is processed. So the GlobalKTable state is overwritten with incorrect results produced by the second join of B with the table, and the following warning is shown in the logs:
WARN org.apache.kafka.streams.kstream.internals.KTableSource - Detected out-of-order KTable update for store at offset 3, partition 5.
My questions are:
Am I understanding correctly from https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization that Kafka Streams buffers some messages for processing and does not consider any updates made by these messages to the app state?
To make the logic behave correctly, do I need to split the app in two, one per topic? Is there any chance the same problem will occur?
Can Kafka Streams be improved to process such situations correctly, e.g. by iterating on every input message rather than every n buffered messages / every n units of time?
UPDATE
I tried to use 2 different Streams apps (point 2):
topic A + join table -> source topic of table.
topic B + join table -> source topic of table.
Now I see different behavior, but the result is the same, even with max.task.idle.ms=1000:
Topic A: timestamp 1626946183424
Topic B: timestamp 1626946183427
Output - source table topic:
Table topic: 0 1 timestamp 1626946183427
Table topic: 0 2 timestamp 1626946183424
Out-of-order messages again:
Detected out-of-order KTable update for store at offset 2, partition 0
Detected out-of-order KTable update for store at offset 2, partition 0
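For reference, the max.task.idle.ms setting mentioned above belongs in the Streams configuration; a minimal sketch of where it is set (application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class JoinAppConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Let a task wait up to 1 s for data on all of its input partitions before
        // processing, to improve timestamp synchronization across topics A and B.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 1000L);
        return props;
    }
}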

How to design the Kafka producer (partitioning) to guarantee this partial order when consuming?

Here is my case:
The system produces messages to one topic, and there are two kinds of messages:
A. user data messages
produced: every time any user's data changes.
marked as: U1, U2, ... Un.
B. user attribute metadata change messages
e.g.: a user has the two attributes name and email, then a custom attribute profile is added.
produced: every time the user attribute metadata changes.
marked as: M
When we consume this topic, we need to guarantee these partial orders:
The same user's data must be consumed in its original order.
A metadata change message must be consumed before any user data messages produced after it, and after any user data messages produced before it.
Example:
message natural order:
(0:U1)->(1:U2)->(2:U1)->(3:U3)->(4:U1)->(5:M)->(6:U1)->(7:U2)->(8:U2)->(9:M)->(10:U1)
accepted consuming order:
(0:U1)->(2:U1)->(3:U3)->(1:U2)->(4:U1)->(5:M)->(7:U2)->(6:U1)->...
The question
If there were no M messages, I could put different users' data into different partitions to increase throughput; but given M's ordering requirement, can I still use multiple partitions for this topic?
You can use any kind of user identifier (it should be the same for "user data messages" and "user attribute metadata change messages" of the same user, and at the same time unique per user) as the key of the Kafka message. That way, the data gets partitioned based on the user identifier, and you ensure that the data of one user goes to a single partition while keeping its order. That way you can scale with multiple partitions.
When producing the messages to the topic, make sure to produce the data synchronously, e.g. wait until the first message is acknowledged before sending the second.
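A minimal sketch of that producer pattern, keyed by the user id and sent synchronously by blocking on the returned future; the topic name and payloads are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserDataProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition => per-user ordering.
            // Blocking on get() waits for the broker ack before the next send.
            producer.send(new ProducerRecord<>("user-topic", "user-1", "U1 data v1")).get();
            producer.send(new ProducerRecord<>("user-topic", "user-1", "U1 data v2")).get();
        }
    }
}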

Order between dependent objects with Kafka Streams

I am reading data from a RESTful API which represents dependent entities,
e.g. from /students I get student objects and from /teachers I get teacher objects.
A Student is connected to a Teacher object (a student has a teacher id).
The problem is that I produce from /students to Kafka into a students topic and from /teachers into a teachers topic, but when I try to join them with Kafka Streams, sometimes a student event arrives before its teacher event has arrived, so I do not get the joined record of student and teacher (due to early-arriving students).
Using a window is not optimal, because I would like to get student updates all the time.
My question is: how do I sync the events so I'll be able to resolve dependent objects?
Currently I'm polling the API service manually and produce the results to Kafka - is there any way to use Kafka Connect instead with the Rest API as a source in a simple way?
The following approach should help:
Create a stream for the Teachers topic, since incoming records will be stable.
To handle an incoming flow of students, create a KTable for Students.
Perform a non-windowed join between teachers and students.
KTable is a changelog stream, so all incoming records will be treated as inserts or updates.
You can refer to this documentation.
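One concrete way to get a non-windowed join in which students that arrive before their teacher are still matched later is a KTable-KTable foreign-key join (available since Kafka 2.4). This sketch is only an illustration of that technique: it simplifies the student value to just its teacher id, and all topic names are made up:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class StudentTeacherJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "student-teacher-join"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Simplification: a student's value is just its teacherId, a teacher's value its name.
        KTable<String, String> students = builder.table("students");
        KTable<String, String> teachers = builder.table("teachers");

        // Foreign-key join: the joined record is (re-)emitted whenever either side updates,
        // so a student that arrives before its teacher is matched once the teacher shows up.
        students.join(teachers,
                      studentValue -> studentValue,  // extract the teacherId from the student value
                      (studentValue, teacherName) -> studentValue + ":" + teacherName)
                .toStream()
                .to("students-with-teachers", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}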

Add a type to messages in Kafka?

We are starting to use Kafka in a backend redevelopment, and have a quick question about how to structure the messages that we produce and consume.
Imagine we have a user microservice that handles CRUD operations on users. The two structures that have been put forward as a possibility are:
1) Four Kafka topics, one for each operation. The message value would just contain the data needed to perform the operation, i.e.
topic: user_created
message value: {
  firstName: 'john',
  surname: 'smith'
}
topic: user_deleted
message value: c73035d0-6dea-46d2-91b8-d557d708eeb1 // A UUID
and so on
2) A single topic for user related events, with a property on the message describing the action to be taken, as well as the data needed, i.e.
// User created
topic: user_events
message value: {
  type: 'user_created',
  payload: {
    firstName: 'john',
    surname: 'smith'
  }
}
// User deleted
topic: user_events
message value: {
  type: 'user_deleted',
  payload: c73035d0-6dea-46d2-91b8-d557d708eeb1 // A UUID
}
I am in favour of the first system described, although my inexperience with Kafka renders me unable to argue strongly why. We would greatly value any input from more experienced users.
Kafka messages don't have a type associated with them.
With a topic-per-event-type you would have to worry about ordering of events pertaining to the same entity read from the different topics. For this reason alone I would recommend putting all the events in the same topic. That way clients just have to consume a single topic to be able to fully track the state of each entity.
I worked on this kind of architecture recently.
We used an API gateway, which was the web service that communicated with our front end (ReactJS in our case). This API gateway used the REST protocol. That microservice, developed with Spring Boot, acted as a producer and consumer in a separate thread.
1- Producing messages: send the message to the Kafka broker on the topic "producer_topic"
2- Consuming messages: listen to the incoming messages from Kafka on the topic "consumer_topic"
For consuming, there was a pool of threads that handled the incoming messages, and an executor service that listened to the Kafka stream and assigned message handling to a thread from the pool.
Below that, there was a DAO microservice that handled the Kafka messages and did the CRUD work.
The message format looked a lot like your second approach.
//content of messages in the consumer_topic
{
  event_type: 'delete',
  message: {
    first_name: 'John Doe',
    user_id: 'c73035d0-6dea-46d2-91b8-d557d708eeb1'
  }
}
This is why I would recommend the second approach. There is less complexity, as you handle all CRUD operations with only one topic. It's really fast thanks to partition parallelism, and you can add replication to be more fault tolerant.
The first approach sounds good in terms of separation of concerns, but it's not really scalable. For instance, let's say you want to add an additional operation; that's one more topic to add. Also look at replication: you will have more replicas to manage, and I think that's pretty bad.
Following Tom's advice, remember that even if you use a single topic, you can choose to have more than one partition for consumer scalability. Kafka provides ordering at the partition level, not at the topic level. This means you should use a "key" identifying the resource you are creating, deleting, or updating, so that all messages related to that "key" always land in the same partition and therefore arrive in the right order; otherwise, even with a single topic, you could lose message ordering when messages are sent to different partitions.
Kafka 0.11 adds message headers, which are an easy way of indicating different message types for the body of the message, even if they all use the same serializer.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
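A minimal sketch combining both points: a single user_events topic, the user id as the record key for per-user ordering, and a record header carrying the event type (the header name and payloads are illustrative):

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventsWithHeaders {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "c73035d0-6dea-46d2-91b8-d557d708eeb1";

            ProducerRecord<String, String> created = new ProducerRecord<>(
                    "user_events", userId, "{\"firstName\":\"john\",\"surname\":\"smith\"}");
            // The header identifies the message type without changing the body schema.
            created.headers().add("event-type", "user_created".getBytes(StandardCharsets.UTF_8));
            producer.send(created);

            ProducerRecord<String, String> deleted = new ProducerRecord<>("user_events", userId, null);
            deleted.headers().add("event-type", "user_deleted".getBytes(StandardCharsets.UTF_8));
            producer.send(deleted);
        }
    }
}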