Is it possible to create 2 topics with the same partition? - apache-kafka

My requirement is to create 2 topics with the same partition, so that if I produce messages to the two different topics, the data is stored in only one partition.
Is it possible to create 2 topics with the same partition?
This is to achieve multi-tenancy. For example, there are multiple tenants (Tenant-1, Tenant-2), and each tenant has its own specific topics:
Tenant-1 ---> has Topic1, Topic2, Topic3
Tenant-2 ---> has Topic4, Topic5
I am looking to keep each tenant's data within a single partition, that is:
Topic1,2,3 (records) ---> in partition-0, and Topic4,5 (records) ---> in partition-1
Is this possible, or what would be the best way to approach it?

Having the same partition shared by more than one topic is not possible in Kafka (nor in any comparable system, IMO). The only major benefit you would get from such an approach is ordered data.
If you are not bothered about the order of the data, then you can always have multiple topics per tenant and consume from all of those topics at the same time (see the sketch below).
Hope this helps!
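If you take the multiple-topics-per-tenant route, a single consumer can read all of a tenant's topics at once. A minimal Java sketch (broker address and group name are placeholders; topic names come from the question):

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;

    public class TenantConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "tenant1-consumers");       // placeholder group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // One subscription covers all of Tenant-1's topics.
                consumer.subscribe(Arrays.asList("Topic1", "Topic2", "Topic3"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r ->
                            System.out.printf("%s / partition %d: %s%n", r.topic(), r.partition(), r.value()));
                }
            }
        }
    }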

Related

Kafka consumer-group

I am a newbie to Kafka and am learning Kafka internals. Please feel free to correct my understanding as required.
Here is my real-world scenario; I appreciate all responses:
I have a real-time FTP server which receives data files, let's say claims files.
I publish this data to a topic; let's call it claims_topic (2 partitions).
I need to subscribe to claims_topic, read the messages, and write them to an Oracle table and a Postgres table. Let's call the Oracle table Otable and the Postgres table Ptable.
I need to capture every topic message and write it to both Otable and Ptable. Basically, Otable and Ptable have to be in sync.
Assume that I will write two consumers, one for Oracle and the other for Postgres.
Question 1: Should the two consumers be in the same consumer group? I believe not, as that would lead to each consumer getting messages from only one particular partition.
Question 2: If Question 1 is true, then please enlighten me: in what cases are multiple consumers grouped under the same consumer group? A real-world scenario is much appreciated.
A consumer group is a logical name that groups an application's consumers together; they work together towards finishing the processing of the data in a topic. Each partition can be handled by only one consumer of a consumer group, making the partition count the upper limit on parallel consumption/processing power for a topic. Each consumer in a consumer group handles one or more partitions: if you have one consumer on a topic with many partitions, it will handle all the partitions by itself; if you add more consumers to the same consumer group, they will divide ("rebalance") the topic's partitions among themselves. Hope that clears things up.
When setting up a consumer you configure its group id; this is the consumer group. Two separate consumers with the same group id become members of the same consumer group.
In cases where there is high produce throughput and one consumer cannot handle the pressure, you can scale out by running more consumers with the same consumer group to process the topic together; each instance takes ownership of different partitions.
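A minimal sketch of what "same group id" means in code (broker address, group, and topic names are examples). Start this same program twice and the two instances join one consumer group, splitting claims_topic's 2 partitions between them:

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.util.Collections;
    import java.util.Properties;

    public class GroupMember {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Every process started with this group.id joins the SAME consumer group,
            // and the topic's partitions are divided ("rebalanced") among the members.
            props.put("group.id", "claims-processors");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("claims_topic"));
            // ... poll loop as usual
        }
    }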
For your use case, a complete sync of Postgres and Oracle won't be easily achievable. You could use Kafka Connect with relevant sink connectors to read data from your topic into your targets, but then again they will only be "eventually consistent", as they do not share an atomic transaction.
I would explore Spring Data's transactional layer:
Spring @Transactional with a transaction across multiple data sources
No, both consumers should not be in the same consumer group, because each of them needs to consume all of the topic's data separately and write it to Otable and Ptable respectively.
If both consumers were in one consumer group, then Otable would get data from only one partition and Ptable from the other (because you have 2 partitions).
In my opinion, use two consumers with two consumer groups; then, if there is high traffic on your topic, you can scale the number of consumers separately for Otable and Ptable.
If you need two consumers to write to Ptable, use the same group id for those consumers; the traffic will then be shared among them. (In your case, the maximum number of consumers for one group should be 2, because you have only 2 partitions in your topic.) If you need this for Otable, follow the same scenario (see the sketch below).
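To make the two-group setup concrete, a hedged sketch (the group names are invented): because each group id tracks its own offsets, both consumers below independently receive every message from claims_topic.

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.util.Collections;
    import java.util.Properties;

    public class DualSinkConsumers {
        static KafkaConsumer<String, String> consumerFor(String groupId) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", groupId); // a distinct group id gets a full copy of the data
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("claims_topic"));
            return consumer;
        }

        public static void main(String[] args) {
            KafkaConsumer<String, String> oracleConsumer = consumerFor("oracle-writers");     // feeds Otable
            KafkaConsumer<String, String> postgresConsumer = consumerFor("postgres-writers"); // feeds Ptable
            // Run each consumer's poll loop on its own thread (KafkaConsumer is not thread-safe).
        }
    }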

Should I create more topics or more partitions?

Kafka receives orders from other countries.
I need to group these orders by country. Should I create more topics, one per country name, or have one topic with different partitions?
Another way is to have one topic and use Kafka Streams to filter orders and send them to a specific country topic.
What is better if the number of countries is over 180?
I want to distribute orders across executors who are placed in a specific country/city.
Remark:
So, an order has data about its country/city. Kafka must then find the executors in that country/city and send them the same order.
tl;dr
In your case, I would create one topic countries and use the country_id or country_name as the message key, so that messages for the same country are placed in the same partition. This way, each partition will contain information for a specific country (or countries, depending on how the keys hash).
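A minimal producer sketch of that idea (the topic name comes from this answer; the order payloads are made up). The default partitioner hashes the key, so every order for a given country lands in the same partition:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class CountryOrderProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = country name, so all German orders hash to the same partition.
                producer.send(new ProducerRecord<>("countries", "Germany", "{\"orderId\":1,\"city\":\"Berlin\"}"));
                producer.send(new ProducerRecord<>("countries", "France", "{\"orderId\":2,\"city\":\"Paris\"}"));
            }
        }
    }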
I would say this decision depends on multiple factors:
Logic/Separation of concerns: You can decide between multiple topics and multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second holds companies. Also, dedicating the partitions of a single topic to distinct entities in this way won't allow you to implement, e.g., message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit on the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works at the partition/segment level, and you need to make sure that the partitioning you've chosen, in conjunction with the desired retention policy, supports your use case (see the sketch below).
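Since partition count and retention are both fixed at topic creation (or later via config changes), here is a hedged AdminClient sketch of setting them together (topic name, counts, and retention value are arbitrary examples):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.Collections;
    import java.util.Properties;

    public class CreateOrdersTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 12 partitions caps a consumer group at 12 parallel consumers;
                // retention.ms is enforced per partition (segment by segment).
                NewTopic orders = new NewTopic("orders", 12, (short) 3)
                        .configs(Collections.singletonMap("retention.ms", "604800000")); // 7 days
                admin.createTopics(Collections.singleton(orders)).all().get();
            }
        }
    }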

Splitting Kafka into separate topic or single topic/multiple partitions

As usual, it's a bit confusing to see the benefits of one splitting method over another.
I can't see the difference/pros and cons between having
Topic1 -> P0 and Topic2 -> P0
over Topic1 -> P0, P1
with a consumer pulling from 2 topics versus a single topic with 2 partitions, where P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs Topic2's data, then it's easy to consume.
Regarding topic auto-creation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
(The same answer as above applies here: this decision depends on logic/separation of concerns, host storage capabilities, throughput, and retention policy.)
Coming to your second question now, I am not sure what your requirement is or how this question relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will be created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't be created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should depend on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
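For reference, this is a broker-side setting (a sketch of the relevant line in the broker's server.properties):

    # server.properties (broker configuration)
    # Disable implicit topic creation so a typo in a topic name fails fast
    # instead of silently creating a new topic.
    auto.create.topics.enable=false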
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of features that Kafka offers, such as data balancing across brokers, removing individual topics, consumer groups, ACLs, joins with Kafka Streams, etc.
I think that this flag can easily be compared with automatically creating tables in your database: it's handy in your dev environments, but you never want it to happen in production.

Can we use two different topologies for the same topic in Kafka?

I have two events, PingData and OrderEvent (coming from two different producers), and both of them modify the same resources (DB, cache).
What is the ideal way to deal with this? I can think of the following two possibilities:
1) A single topic and topology, then filtering on the basis of some data
2) Two different topics and two topologies, each performing its own operations
Please guide me to the correct approach!
The real question here is how interrelated your two sources are. Do you want strict ordering guarantees across these 2 streams of data? If so, start with a single topic with just one partition and have one consumer consume from it.
Now, this is not a very good setup, as there is little to no parallelism. If your data lends itself to key-based partitioning, you can hash a key and have multiple partitions per topic. You will then have ordering guarantees within a single partition, but no such guarantee across partitions. If the events are logically separate entities, I believe you should be fine.
If there is no dependence between the two datasets you are producing (I mean no strict ordering constraints), you should ideally separate them into 2 different logical namespaces by creating 2 topics.
The key point to remember is that a topic is just a logical entity in Kafka. In terms of performance, a single topic with 2 partitions is the same as 2 topics with 1 partition each.
A lot will therefore depend on how you want to design your system and the relationship between the two datasets.
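If you go with option 1 (a single topic plus a filtering topology), a hedged Kafka Streams sketch might look like this (the topic name events and the JSON type marker are assumptions):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    import java.util.Properties;

    public class EventRouter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-router");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");
            // Route the two event types to separate handlers based on a payload marker.
            events.filter((k, v) -> v.contains("\"type\":\"PingData\""))
                  .foreach((k, v) -> updateResourceForPing(v));
            events.filter((k, v) -> v.contains("\"type\":\"OrderEvent\""))
                  .foreach((k, v) -> updateResourceForOrder(v));

            new KafkaStreams(builder.build(), props).start();
        }

        static void updateResourceForPing(String value)  { /* write to db/cache */ }
        static void updateResourceForOrder(String value) { /* write to db/cache */ }
    }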

Multiple topics on a single partition?

I was just curious and could not find any info on this. My question is: can there be multiple topics on a single partition? If yes, how are messages produced into that partition and consumed later? Or does one partition always hold exactly one topic?
A Kafka partition belongs to exactly one topic. A topic is a higher-level construct that is broken into partitions, so it is guaranteed that a single partition never belongs to more than one topic.
In Kafka, one partition always holds data for one topic. Having multiple topics' data in one partition is a somewhat unusual use case. If I understood your use case correctly: if you want to store multiple datasets in one topic and one partition (which is not recommended, though), you can add a flag field to the input data that reveals which dataset a record belongs to (see the sketch below).
Hope This Helps!
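A minimal illustration of that flag-field approach (the topic name and the "dataset" field are invented for the example):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class FlaggedDatasetProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both datasets share one topic and partition; the "dataset" field
                // lets consumers tell the records apart.
                producer.send(new ProducerRecord<>("shared-topic", "{\"dataset\":\"claims\",\"payload\":\"...\"}"));
                producer.send(new ProducerRecord<>("shared-topic", "{\"dataset\":\"audits\",\"payload\":\"...\"}"));
            }
        }
    }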