Kafka: Can a single producer produce 2 different records to 2 different topics?

I have two types of records, let's call them X and Y. I want record X to go to TopicX and record Y to go to TopicY.
1) Do I need two different producers?
2) Is it better to have 2 partitions instead of 2 different topics?
3) How can I avoid having two different producers, for better network usage?
Thank you!

If you are using the same key/value serializers (and other producer properties), you can use the same producer. Each ProducerRecord carries the topic it should be sent to.
The common practice is to have one topic per message type. For partitioning, some ID is typically used as the key (clientId, sessionId, ...). So if the records you want to send follow different logic, it is better to use different topics.
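For illustration, here is a minimal sketch of one producer sending to both topics (broker address and value contents are assumptions, not from the question):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class TwoTopicProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // One producer instance: one connection pool, shared batching
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The topic lives on the record, not on the producer
                producer.send(new ProducerRecord<>("TopicX", "record of type X"));
                producer.send(new ProducerRecord<>("TopicY", "record of type Y"));
            }
        }
    }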

Related

Apache Kafka Data Status (Message Status)

I am a beginner in Kafka and am trying to create a chat application with features like forward, read, and delivered.
Let me give my approaches first so that you have an idea of whether I am on the right path.
Approach 1:
Define Topic 'some_name' having 3 partitions.
These partitions denotes the below,
Partition 1 : Send
Partition 2 : Delivered
Partition 3 : Read
Here the messages first go through partition 1; once the client provides a callback, we dequeue the message from the first partition and enqueue it to the second, and so on for the read part.
Approach 2:
In this approach there would be just 1 topic and one partition. If Kafka provides a flag for each record (a flag denoting whether it has been consumed by any consumer), I can set that flag for read/delivered.
What I have tried:
I have tried the first approach by maintaining 3 partitions, but on the consumer side I wasn't able to read data from all 3 partitions together; the consumer kept returning null instead.
These are the approaches I have in mind, and I am looking forward to exploring more. I could really use help with new approaches or the best way to overcome this.
Thanks.
I'm not sure what you mean by consumer returning nulls. The default behavior of subscribing to a topic is getting assigned all partitions.
But records in Kafka are immutable: you cannot "dequeue" data or move records across topic partitions, so you may want to reconsider your design, for example by using the Transactional Outbox pattern.
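To illustrate the default assignment behavior (a sketch; the broker address and group.id are assumed), note also that poll() returns an empty batch rather than null when nothing is available:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ChatConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "chat-app");                // assumed group name
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // As the only member of its group, this consumer is assigned
                // all 3 partitions of the topic just by subscribing.
                consumer.subscribe(Collections.singletonList("some_name"));
                while (true) {
                    // poll() never returns null; an empty batch means no data yet
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                r.partition(), r.offset(), r.value());
                    }
                }
            }
        }
    }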

Distribute messages on single Kafka topic to specific consumer

Avro-encoded messages arrive on a single Kafka topic with a single partition. Each of these messages is to be consumed by one specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic and 3 consumers named A, B and C, each consumer would see all the messages, but ultimately A would consume a1 and a2, B would consume b1, and C would consume c1.
I want to know how typically this is solved when using avro on Kafka:
leave it to the consumers to deserialize the message, then use some application logic to decide whether to consume or drop the message
use partitioning logic to make each of the messages go to a particular partition, then set up each consumer to listen to only a single partition
set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic to these 3 specific topics
make use of Kafka headers to inject an identifier for downstream consumers to filter on
It looks like each of the options has its pros and cons. I want to know if there is a convention that people follow, or if there are other ways of solving this.
It depends...
If you only have a single partitioned topic, the only option is to let each consumer read all data and filter, client-side, for the data it is interested in. For this case, each consumer would need to use a different group.id to isolate the consumers from each other.
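A sketch of that isolation (the group names here are made up): every application builds its config with its own group.id, so each one independently receives the full stream from the single partition:

    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.util.Properties;

    public class ConsumerConfigs {
        // With different groups, A, B and C all read every record and
        // drop what they don't care about on the client side.
        static Properties propsFor(String groupId) {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            p.put("group.id", groupId);                   // e.g. "app-A", "app-B", "app-C"
            p.put("key.deserializer", StringDeserializer.class.getName());
            p.put("value.deserializer", StringDeserializer.class.getName());
            return p;
        }
    }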
Option 2 is certainly possible, if you can control the input topic you are reading from. You might still want different group.ids for each consumer, as it seems the consumers represent different applications that should be isolated from each other. The question remains whether this is a good model: the idea of partitions is to provide horizontal scale-out and data-parallel processing, and if each application reads from only one partition, that does not align with this model. You also need to know, on both the producer and the consumer side, which data goes into which partition to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics? This is a good approach in general, as topics are a logical categorization of data. However, it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the start, Option 3 does give you a good conceptual setup, but it won't provide much of a performance benefit, because the Kafka Streams application is required to read and write each record once. The saving you gain is that each application only consumes from one topic, so redundant reads are avoided. If you had, let's say, 100 applications (each interested in only 1/100 of the data), you could cut the load down significantly, from a 99x read overhead to a 1x read and 1x write overhead. For your case you don't really cut down much, as you go from a 2x read overhead to a 1x read + 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 seems to be orthogonal, because it answers the question of how the filtering works: headers can be used with Option 1 and Option 3 to do the actual filtering/branching.
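For Option 3, a minimal Kafka Streams sketch (the topic names, application.id, and the key-prefix routing rule are all assumptions; a header lookup, as in Option 4, would work the same way):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import java.util.Properties;

    public class RouterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-router");   // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArraySerde.class);

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, byte[]> input = builder.stream("main-topic");     // assumed name

            // Each record is read once and written once to its target topic.
            // Routing by key prefix here is purely for illustration.
            input.filter((k, v) -> k != null && k.startsWith("a")).to("topic-a");
            input.filter((k, v) -> k != null && k.startsWith("b")).to("topic-b");
            input.filter((k, v) -> k != null && k.startsWith("c")).to("topic-c");

            new KafkaStreams(builder.build(), props).start();
        }
    }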
The data in the topic is just bytes, Avro shouldn't matter.
Since you only have one partition, only one consumer of a group can be actively reading the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets.
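A sketch of the skip-in-the-poll-loop variant (the topic name, the isMine() test, and the process() handler are hypothetical):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class FilteringConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", "app-A");                    // assumed group name
            props.put("enable.auto.commit", "false");          // commit manually below
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());

            try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("main-topic")); // assumed name
                while (true) {
                    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, byte[]> r : records) {
                        if (isMine(r)) {   // hypothetical filter, e.g. on a header value
                            process(r);    // hypothetical handler
                        }                  // otherwise the record is simply skipped
                    }
                    consumer.commitSync(); // offsets advance past skipped records too
                }
            }
        }

        static boolean isMine(ConsumerRecord<String, byte[]> r) {
            return r.headers().lastHeader("target") != null; // placeholder logic
        }

        static void process(ConsumerRecord<String, byte[]> r) { /* ... */ }
    }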

Splitting Kafka into separate topic or single topic/multiple partitions

As usual, it's a bit confusing to see the benefits of one splitting method over the others.
I can't see the difference/pros and cons between having
Topic1 -> P0 and Topic2 -> P0
versus Topic1 -> P0, P1
with a consumer pulling from 2 topics or from a single topic with 2 partitions, where P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs the Topic2 data, it's easy to consume.
Regarding topic auto-creation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics over multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, a single topic with multiple partitions won't allow you to implement, e.g., per-user message ordering, which can only be achieved using keyed messages (messages with the same key are placed in the same partition); see the sketch after this list.
Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by being split into multiple partitions. The Kafka docs shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works at the partition/segment level, and you need to make sure that your partitioning, in conjunction with the desired retention policy, will support your use case.
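Here is the keyed-message sketch referred to under the first point (the topic name, keys, and values are made up): records sharing a key always land in the same partition, which is what gives per-entity ordering:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Same key => same partition => these two events for user-42
                // are consumed in the order they were produced.
                producer.send(new ProducerRecord<>("users", "user-42", "created"));
                producer.send(new ProducerRecord<>("users", "user-42", "updated"));
            }
        }
    }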
Coming to your second question now, I am not sure what your requirement is or how this question relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will be created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
Just adding some things on top of Giorgos's answer:
By choosing the second approach (a single topic with multiple partitions) over the first, you would lose a lot of the features Kafka offers, such as data balancing across brokers, removing individual topics, separate consumer groups, ACLs, joins with Kafka Streams, etc.
I think this flag can easily be compared with automatically creating tables in your database: it's handy in your dev environments, but you never want it to happen in production.
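In production, with auto-creation disabled, topics are created explicitly instead; a sketch using the AdminClient (the topic names, partition counts, replication factor, and retention override are assumptions):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Partition count, replication factor and the per-topic
                // retention override are chosen deliberately, not by defaults.
                NewTopic users = new NewTopic("users", 6, (short) 3)
                        .configs(Map.of("retention.ms", "604800000")); // 7 days
                NewTopic companies = new NewTopic("companies", 6, (short) 3);
                admin.createTopics(List.of(users, companies)).all().get();
            }
        }
    }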

Can we use two different topologies for the same topics in Kafka

I have two events, PingData and OrderEvent (coming from two different producers), and both of them are going to modify the same resources (db, cache).
What is the ideal way to deal with this? I can think of the following two possibilities:
1) a single topic and topology, then filtering on the basis of some data
2) two different topics and two topologies, each performing its own operations
Please guide me to the correct approach!
The real question here is how interrelated your two sources are. Do you want strict ordering guarantees across these 2 streams of data? If so, use a single topic with just one partition and have one consumer consume from it.
Now, this is not a very good setup, as there is little or no parallelism. If your data has some key-based partitioning, you can hash the key and have multiple partitions per topic. You will then have ordering guarantees within a single partition, but no such guarantee across partitions. If the two are logically separate entities, I believe you should be fine.
If there is no dependence between the two datasets you are producing (I mean no strict ordering constraints), you should ideally separate them into 2 different logical namespaces by creating 2 topics.
The key point to remember is that "topic is just a logical entity in Kafka". In terms of performance, a single topic with 2 partitions is the same as 2 topics with 1 partition each.
A lot will therefore depend on how you want to design your system and the relationship between the two sets.
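On the title question itself: yes, two topologies can read the same topic(s), as long as each Kafka Streams application runs under its own application.id; a sketch (topic and application names, broker address, and serdes are assumptions):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.Properties;

    public class TwoTopologies {
        static Properties props(String applicationId) {
            Properties p = new Properties();
            p.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId); // distinct per topology
            p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
            p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
            p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
            return p;
        }

        public static void main(String[] args) {
            StreamsBuilder ping = new StreamsBuilder();
            ping.<String, String>stream("events").foreach((k, v) -> { /* handle PingData */ });

            StreamsBuilder order = new StreamsBuilder();
            order.<String, String>stream("events").foreach((k, v) -> { /* handle OrderEvent */ });

            // Each application.id is its own consumer group, so each topology
            // independently sees every record on the "events" topic.
            new KafkaStreams(ping.build(), props("ping-processor")).start();
            new KafkaStreams(order.build(), props("order-processor")).start();
        }
    }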

Is it possible to create 2 topics with the same partition?

My requirement is to create 2 topics with the same partition, so that if I produce messages to the two different topics, the data is stored in only one partition.
Is it possible to create 2 topics with the same partition?
The goal is to achieve multi-tenancy. For example, there are multiple tenants (Tenant-1, Tenant-2), and each tenant has its own specific topics:
Tenant-1 ---> has Topic1, Topic2, Topic3
Tenant-2 ---> has Topic4, Topic5
I am looking to keep each tenant's data within a single partition,
that is,
Topic1,2,3 (records) ---> in partition-0, and Topic4,5 (records) in partition-1
Is this possible, or what would be the best way to approach it?
Having the same partition shared by more than 1 topic is not possible in Kafka (it is not possible in any comparable system, IMO). The only major benefit you would get from this approach is ordered data.
If you are not bothered about the order of the data, then you can always have multiple topics per tenant and consume from all of those topics at the same time.
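For example (a sketch; the broker address and group name are assumed), a Tenant-1 consumer can subscribe to all of that tenant's topics in one go, with ordering still guaranteed only within each individual partition:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.util.Arrays;
    import java.util.Properties;

    public class TenantConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "tenant-1-app");            // assumed group name
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // One subscription covering all of Tenant-1's topics; ordering is
            // still per partition, not across the three topics.
            consumer.subscribe(Arrays.asList("Topic1", "Topic2", "Topic3"));
        }
    }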
Hope this helps!