Add a type to messages in Kafka? - apache-kafka

We are starting to use Kafka in a backend redevelopment, and have a quick question about how to structure the messages that we produce and consume.
Imagine we have a user microservice that handles CRUD operations on users. The two structures that have been put forward as a possibility are:
1) Four Kafka topics, one for each operation. The message value would just contain the data needed to perform the operation, i.e.
topic: user_created
message value: {
  firstName: 'john',
  surname: 'smith'
}
topic: user_deleted
message value: c73035d0-6dea-46d2-91b8-d557d708eeb1 // A UUID
and so on
2) A single topic for user-related events, with a property on the message describing the action to be taken, as well as the data needed, i.e.
// User created
topic: user_events
message value: {
  type: 'user_created',
  payload: {
    firstName: 'john',
    surname: 'smith'
  }
}
// User deleted
topic: user_events
message value: {
  type: 'user_deleted',
  payload: 'c73035d0-6dea-46d2-91b8-d557d708eeb1' // A UUID
}
I am in favour of the first system described, although my inexperience with Kafka renders me unable to argue strongly why. We would greatly value any input from more experienced users.

Kafka messages don't have a type associated with them.
With a topic per event type, you would have to worry about the ordering of events pertaining to the same entity when they are read from different topics. For this reason alone I would recommend putting all the events in the same topic: that way clients only have to consume a single topic to fully track the state of each entity.
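To make that concrete, here is a minimal Java producer sketch under assumed names (the user_events topic, a local broker, and plain JSON strings as values are all assumptions, not anything from the question). Keying every record by the user ID keeps all of one user's events in order on a single partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "c73035d0-6dea-46d2-91b8-d557d708eeb1";

            // Both events share the same key (the user ID), so they land on the
            // same partition of user_events and are therefore consumed in order.
            producer.send(new ProducerRecord<>("user_events", userId,
                    "{\"type\": \"user_created\", \"payload\": {\"firstName\": \"john\", \"surname\": \"smith\"}}"));
            producer.send(new ProducerRecord<>("user_events", userId,
                    "{\"type\": \"user_deleted\", \"payload\": \"" + userId + "\"}"));
        }
    }
}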

I worked on this kind of architecture recently.
We used an API gateway, which was the web service that communicated with our front end (ReactJS in our case) over REST. That gateway, a microservice developed with Spring Boot, acted as both a producer and a consumer, each in a separate thread.
1- Producing: send messages to the Kafka broker on the topic "producer_topic"
2- Consuming: listen for incoming messages from Kafka on the topic "consumer_topic"
On the consuming side there was a pool of threads to handle the incoming messages, and an executor service that listened to the Kafka stream and assigned each message to a thread from the pool.
Beneath that, there was a DAO microservice that handled the Kafka messages and did the CRUD work.
The message format looked very much like your second approach.
// content of messages in the consumer_topic
{
  event_type: 'delete',
  message: {
    first_name: 'John Doe',
    user_id: 'c73035d0-6dea-46d2-91b8-d557d708eeb1'
  }
}
This is why I would recommend the second approach. There is less complexity, since you handle all CRUD operations with a single topic. It is still fast thanks to partition parallelism, and you can add replication to be more fault tolerant.
The first approach sounds good in terms of separation of concerns, but it is not really scalable. For instance, say you want to add an additional operation: that is one more topic to create. Also consider replication: you would have more replicas to maintain, which I think is a real drawback.
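For illustration, a minimal Java consumer sketch of that single-topic dispatch (topic and group names and the println handlers are placeholders, and it assumes Jackson is on the classpath for parsing the JSON):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DaoServiceConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dao-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("consumer_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Dispatch on the event_type property embedded in the message.
                    JsonNode event = mapper.readTree(record.value());
                    switch (event.get("event_type").asText()) {
                        case "create" -> System.out.println("create: " + event.get("message"));
                        case "update" -> System.out.println("update: " + event.get("message"));
                        case "delete" -> System.out.println("delete: " + event.get("message"));
                        default -> System.out.println("unknown event type, skipping");
                    }
                }
            }
        }
    }
}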

Following Tom's advice, remember that even if you use a single topic, you can choose to have more than one partition for consumer scalability. Kafka guarantees ordering at the partition level, not at the topic level. That means you should use a key identifying the resource you are creating, deleting, or updating, so that all messages related to that key always go to the same partition and arrive in the right order; otherwise, even with a single topic, you could lose message ordering when messages are sent to different partitions.
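To see why the key matters, here is a small illustration of how the default partitioner maps a keyed record to a partition. Utils is an internal-but-public Kafka helper class, used here only to demonstrate the mapping, and the partition count is an assumption:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    public static void main(String[] args) {
        int numPartitions = 6; // assumed partition count for the topic
        byte[] key = "c73035d0-6dea-46d2-91b8-d557d708eeb1".getBytes(StandardCharsets.UTF_8);

        // This mirrors what the default partitioner does for keyed records:
        // murmur2-hash the key bytes and take the result modulo the partition count.
        int partition = Utils.toPositive(Utils.murmur2(key)) % numPartitions;
        System.out.println("All events for this resource go to partition " + partition);
    }
}

Because the hash is deterministic, the same key always maps to the same partition, which is exactly what preserves per-resource ordering.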

Kafka 0.11 adds message headers, which are an easy way of indicating different message types for the body of the message, even if they all use the same serializer.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
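For example, a hedged sketch of attaching a type header to a record with the Java client (the topic name and the "event_type" header name are just assumptions):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TypedRecord {
    public static ProducerRecord<String, String> userDeleted(String userId) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("user_events", userId, userId);
        // The header carries the message type out-of-band, so the body can stay
        // a plain payload and all types can share one serializer.
        record.headers().add("event_type", "user_deleted".getBytes(StandardCharsets.UTF_8));
        return record;
    }
}

On the consume side, record.headers().lastHeader("event_type") retrieves the header again.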

Related

How to check if a message is in the queue already

Is there a way to check if a Kafka queue already has a certain message?
I do not want to consume it, but just check if it is already in the queue. E.g. my message is a simple JSON object:
{
id: 123,
name: "message"
}
So I want to check if a message with id: 123 is already in the queue, so my app does not send it a second time.
I have a Node.js service and am using the kafkajs npm library.
I do not want to consume it, but just check if it is already in the queue
That's not possible. You need to consume the topic to check the existence of any event.
You'd need to consume and write to somewhere else (Redis, MongoDB, etc.), then query an index on that store to prevent duplicates in Kafka topics.
Or otherwise, embed this logic in your downstream consumers rather than worrying about what is or is not on the topic, considering that records are eventually removed from the topic by retention policies anyway.
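A minimal Java sketch of that consume-and-index idea, with an in-memory set standing in for the external store (Redis, MongoDB, etc.); class and method names are hypothetical:

import java.util.HashSet;
import java.util.Set;

public class SeenMessageIndex {
    private final Set<Integer> seenIds = new HashSet<>(); // stand-in for Redis/MongoDB

    // Called by a consumer that continuously reads the topic and records each id.
    public void record(int messageId) {
        seenIds.add(messageId);
    }

    // Called by the producing app before sending, instead of scanning the topic.
    public boolean alreadySent(int messageId) {
        return seenIds.contains(messageId);
    }
}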
Maybe ksqlDB can help. It converts a stream into queryable state.
Kafka has support for idempotence through the idempotent producer / exactly-once semantics. This ensures that messages published to Kafka topics are not duplicated from the producer side (for example, on internal retries); it does not help consumers, who still have to deal with reprocessing on their side.
flag: EnableIdempotence = true
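In the Java client the equivalent configuration looks like the following (kafkajs exposes a similar idempotent producer option). Note that this only deduplicates the producer's own internal retries; it does not stop your application from deliberately sending the same message twice:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotentProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // enable.idempotence=true: the broker de-duplicates retries from this
        // producer session, so a send retried after a transient error is written once.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Idempotence requires acks=all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return props;
    }
}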

Kafka topic filtering vs. ephemeral topics for microservice request/reply pattern

I'm trying to implement a request/reply pattern with Kafka. I am working with named services and unnamed clients that send messages to those services, and clients may expect a reply. Many (tens to hundreds of) clients may interact with a single service, or with a consumer group of service instances.
Strategy one: filtering messages
The first thought was to have two topics per service - the "HelloWorld" service would consume the "HelloWorld" topic, and produce replies back to the "HelloWorld-Reply" topic. Clients would consume that reply topic and filter on unique message IDs to know what replies are relevant to them.
The drawback there is that it seems to create unnecessary work for clients, which have to filter out a potentially large number of irrelevant messages when many clients are interacting with one service.
Strategy two: ephemeral topics
The second idea was to create a unique ID per client, and send that ID along with messages. Clients would consume their own unique topic "[ClientID]" and services would send to that topic when they have a reply. Clients would thus not have to filter irrelevant messages.
The drawback there is clients may have a short lifespan, e.g. they may be single use scripts, and they would have to create their topic beforehand and delete it afterward. There might have to be some extra process to purge unused client topics if a client dies during processing.
Which of these seems like a better idea?
We are using Kafka in production as a handler for event-based messages and request/response messages. Our approach to implementing request/response is your first strategy, because when the number of clients grows, the second strategy forces you to create many topics, some of which are completely useless. Another reason for choosing the first strategy was our topic-naming guideline that each service should own only one topic, for tracking. Kafka is not really made for request/response messaging, but I recommend the first strategy because of:
fewer topics
better service tracking
better topic naming
However, you have to be careful with your consumer groups, which can otherwise cause data loss.
A better approach is to use the first strategy with many partitions in one topic per service, where each client sends and receives its messages with a unique key. Kafka guarantees that all messages with the same key go to the same partition. This approach does not require filtering out irrelevant messages, and is arguably a combination of your two strategies.
Update:
As @ValBonn said, in the suggested approach you always have to make sure that the number of partitions >= the number of clients.
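As a hedged sketch of that keyed variant in Java (the topic names come from the question; everything else is assumed): the service echoes the request's key, the client ID, into the reply, so each client's replies always land on the same partition of the shared reply topic.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloWorldService {
    private final KafkaProducer<String, String> producer;

    public HelloWorldService() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producer = new KafkaProducer<>(props);
    }

    // The request's key is the client ID, so a reply keyed the same way always
    // lands on "the client's" partition of the shared HelloWorld-Reply topic.
    public void handle(ConsumerRecord<String, String> request) {
        String clientId = request.key();
        String reply = "reply to: " + request.value();
        producer.send(new ProducerRecord<>("HelloWorld-Reply", clientId, reply));
    }
}

A client could then assign itself only the partition its key hashes to, rather than subscribing to the whole reply topic and filtering everything.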

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming from it. The topic contains one type of event: a JSON message with a type field buried inside it. Some messages will be consumed by some consumer groups and not by others; one consumer group will probably not consume many messages at all.
My question is:
Should I use the Consumer API, then on each event read the type field and drop or process the event based on it?
Or should I filter using the Streams API's filter method and a predicate?
After I consume an event, the plan is to process it (a DB delete, update, or other action depending on the service); if there is a failure, I will produce to a separate queue which I will re-process later.
Thank you.
This seems like more of a matter of opinion. I personally would go with Streams/KSQL: likely less code for you to maintain. You can have another intermediate topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. Using Streams you can scale a single application across different machines, store state, have standby replicas, and more, all of which would be a PITA to build yourself.
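For example, a minimal Kafka Streams sketch of that filter (topic names are made up, and a naive substring check stands in for properly parsing the buried type field):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TypeFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "type-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");

        // Keep only the event type this service cares about, and write the
        // cleaned-up stream to an intermediate topic for downstream consumers.
        events.filter((key, value) -> value.contains("\"type\":\"user_deleted\""))
              .to("user-deleted-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}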

Kafka instead of Rest for communication between microservices

I want to change the communication between (micro)-services from REST to Kafka.
I'm not sure about the topics and wanted to hear some opinions about that.
Consider the following setup:
I have an API-Gateway that provides CRUD functions via REST for web applications. So I have 4 endpoints which users can call.
The API gateway will produce the requests and consume the responses from the second service.
The second service consumes the requests, accesses the database to execute the CRUD operations, and produces the result.
How many topics should I create?
Do I have to create 8 topics (2 per endpoint, request/response), or is there a better way to do it?
I would like to hear about your experience, or links to talks/documentation on the subject.
The short answer to this question is: it depends on your design.
You can use only one topic for all your operations, or you can use several topics for different operations. However, you must know that:
You have to produce messages to Kafka in the order they were created, and you must consume them in the same order to preserve consistency. Messages sent to Kafka are ordered within a topic partition; messages in different topic partitions are not ordered by Kafka at all. Say you created an item and then deleted it: if you try to consume the message for the delete operation before the message for the create operation, you get an error. In this scenario, you must send both messages to the same topic partition to ensure that the delete message is consumed after the create message.
Please note that there is always a trade-off between consistency and throughput. If you use a single topic partition and send all your messages to it, you get consistency, but you cannot consume messages fast: you receive messages from that one partition one by one, and you only get the next message once the previous one has been consumed. To increase throughput, you can use multiple topics, or divide the topic into partitions. With either solution, you must implement some logic on the producer side to preserve consistency: related messages must go to the same topic partition. For instance, you could partition the topic by entity type and send all CRUD operations for the same entity type to the same partition (see the sketch below). I don't know whether that guarantees consistency in your scenario, but it is one alternative; the point is to find the partitioning logic that provides consistency for your case. If you find it, you get both consistency and throughput.
For your case, I would use a single topic with multiple partitions, and on the producer side I would send related messages to the same topic partition.
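As a hedged illustration of that producer-side logic, here is a custom Partitioner that routes every message for the same entity type to one fixed partition; the "entityType:id" key format is an assumption made up for this sketch:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class EntityTypePartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Keys are assumed to look like "user:123" or "order:456"; hashing only
        // the entity-type prefix sends all CRUD messages for one entity type
        // to the same partition, preserving their relative order.
        String entityType = ((String) key).split(":", 2)[0];
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return Utils.toPositive(
                Utils.murmur2(entityType.getBytes(StandardCharsets.UTF_8))) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

It would be registered on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, EntityTypePartitioner.class.getName()).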

Apache Kafka with 100 million topics

I'm trying to replace RabbitMQ with Apache Kafka, and while planning I bumped into several conceptual problems.
First, we use RabbitMQ with a queue-per-user policy, meaning each user gets one queue. This suits our needs because each user represents some job to be done for that particular user, and if one user causes a problem, the other users' queues are never affected, since the queues are separate. (By "problem" I mean: messages in a queue are dispatched to the user via HTTP request; if the user refuses to receive a message (server down, perhaps?), it goes back into a retry queue, so no messages are lost unless the queue itself goes down.)
Now, Kafka is fault tolerant and failure safe because it writes to disk, and that is exactly why I am trying to bring Kafka into our architecture.
But there are problems with my plan:
First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My maximum estimate is around 1~5 million topics.)
Second, if I decide to go for topics based on operation, partitioned by a hash of user IDs: if one user is not consuming messages at the moment, will all users in that partition have to wait? What would be the best way to structure this?
So, in conclusion: 1~5 million users, and we do not want one user to block a large number of other users from being processed. A topic per user would solve this, but it seems like there might be an issue with ZooKeeper if such a large number of topics comes in. (Is this true?)
What would be the best solution for structuring this, considering scalability?
First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My maximum estimate is around 1~5 million topics.)
I would advise against modeling like this.
Google around for "kafka topic limits", and you will find the relevant considerations for this subject. I think you will find you won't want to make millions of topics.
Second, if I decide to go for topics based on operation, partitioned by a hash of user IDs
Yes, have a single topic for these messages, and then route them based on the relevant field, such as user_id or conversation_id. This field can be present as a field on the message and serves as the ProducerRecord key that determines which partition of the topic the message is destined for. I would not include the operation in the topic name, but in the message itself.
if one user is not consuming messages at the moment, will all users in that partition have to wait? What would be the best way to structure this?
This depends on how the users are consuming messages. You could set up a timeout, after which the message is routed to some "failed" topic. Or send messages to users UDP-style, without acks. There are many ways to model this, and it's tough to offer advice without knowing how your consumers forward messages to your clients.
Also, if you are using Kafka Streams, take note of the StreamPartitioner interface. This interface appears in the KStream and KTable methods that materialize messages to a topic, and it may be useful in a chat application where clients idle on a specific TCP connection.
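A sketch of how that could look, with all topic and variable names hypothetical: a StreamPartitioner that pins each user's messages to one partition, so the server instance holding that user's TCP connection can consume just that partition.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.StreamPartitioner;

public class ChatTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> messages = builder.stream("chat-messages");

        // Route by user ID so one user's messages always land on one partition.
        StreamPartitioner<String, String> byUser =
                (topic, userId, value, numPartitions) ->
                        Math.floorMod(userId.hashCode(), numPartitions);

        messages.to("outbound-messages",
                Produced.with(Serdes.String(), Serdes.String())
                        .withStreamPartitioner(byUser));
        return builder;
    }
}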