Can we use two different topologies for the same topic in Kafka?

I have two events, PingData and OrderEvent (coming from two different producers), and both of them modify the same resources (db, cache).
What is the ideal way to deal with this? I can think of the following two possibilities:
1) A single topic and a single topology, then filter on the basis of some data
2) Two different topics and two topologies, each performing its own operations
Please guide me with the correct approach!

The real question here is how interrelated your two sources are. Do you want strict ordering guarantees across these 2 streams of data? If so, create a single topic with just one partition and have a consumer consume from it.
Now, this is not a very good setup, as there is little or no parallelism. If your data lends itself to key-based partitioning, you can hash the key and have multiple partitions per topic. You will then have ordering guarantees within a single partition but no such guarantee across partitions. If the events are logically separate entities, I believe you should be fine.
If there is no dependence between the two datasets you are producing (that is, no strict ordering constraints), you should ideally separate them into 2 different logical namespaces by creating 2 topics.
The key point to remember is that "topic is just a logical entity in Kafka". In terms of performance, a single topic with 2 partitions is the same as 2 topics with 1 partition each.
A lot will therefore depend on how you want to design your system and the relationship between the two sets.
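For illustration, here is a minimal producer sketch of the keyed approach described above (the topic name events and the choice of the resource id as key are assumptions, not from the question). Records sharing a key are hashed to the same partition, so per-resource ordering is preserved even with multiple partitions:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both event types share one topic, keyed by the resource they modify.
            // The default partitioner hashes the key, so all events for "resource-42"
            // land in the same partition and are strictly ordered relative to each other.
            producer.send(new ProducerRecord<>("events", "resource-42", "{\"type\":\"PingData\"}"));
            producer.send(new ProducerRecord<>("events", "resource-42", "{\"type\":\"OrderEvent\"}"));
        }
    }
}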

Related

In Kafka Streams, how do you parallelize complex operations (or sub-topologies) using multiple topics and partitions?

I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
Can multiple sub-topologies read from the same partition?
How can you parallelise a complex operation (making up a sub-topology) that uses the processor API and requires reading the entire topic?
Can multiple sub-topologies read from the same topic (such that independent and expensive operations on the same topic can be run in different sub-topologies)?
As developers, we don't have direct control over how topologies are divided into sub-topologies. Kafka Streams divides the Topology into multiple sub-topologies, using topics as a "bridge" where possible. Additionally, multiple stream tasks are created, each reading a subset of data from the input topic, divided by partition. The documentation reads:
Slightly simplified, the maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by maximum number of partitions of the input topic(s) the application is reading from.
Assume there was a sub-topology that reads multiple input topics whose numbers of partitions are not identical. If the above excerpt of the documentation is to be believed, then one or more partitions of the topic that has fewer partitions would need to be assigned to multiple stream tasks (if both topics need to be read for the logic to work). However, this should not be possible because, as I understand it, multiple instances of the Streams application (each sharing the same application id) act as one consumer group, where each partition is only assigned once. In such a case, the number of tasks created for a sub-topology should actually be limited by the minimum number of partitions of its input topics, i.e. a single partition is only assigned to one task.
I am not sure if the initial problem, i.e. a non-co-partitioned sub-topology would actually occur. If there is an operation that requires to read both input topics, the data would probably need to be co-partitioned (like in Joins).
Say there was an expensive operation between two topics (possibly built from multiple custom processors) that requires the data of one topic to always be available in its entirety. You would want to parallelise this operation into multiple tasks.
If the topic had just one partition, and a partition could be read multiple times, this would not be a problem. However, as discussed earlier, I don't believe this to work.
Then there are GlobalKTables. However, there is no way to use GlobalKTables with custom processors (toStream is not available).
Another idea would be to broadcast the data into multiple partitions, essentially duplicating it by the partition count. This way, multiple stream tasks could be created for the topology to read the same data. To do this, a custom partitioner could be specified in the Produced instance given to KStream#to. If this data duplication can be accepted, this seems to be the only way to achieve what I have in mind.
Regarding question number three, because the Streams application is one Consumer group, I would also expect this to not be possible. With my current understanding, this would require to write the data into multiple identical topics (again essentially duplicating the data), such that independent sub-topologies can be created. An alternative would be to run separate streaming applications (such that a different consumer group is used).
Without seeing your topology definition, this is a somewhat vague question. You can have repartition and changelog topics. These are duplicated data from the original input topic.
But stateless operators like map, filter, etc. pass data through from the same (assigned) partitions for each thread.
A "sub topology" is still part of only one application.id, thus one consumer group, so no, it cannot read the same topic partitions more than once. For that, you'd need independent streams/tables via branching operations within the whole topology, for example, filtering numbers by even and odd only consumes the topic once; you don't need to "broadcast" records to all partitions, and I'm not sure that's even possible out of the box (to sends one-to-one, and Produced defines serialization, not multiple partitions). If you need to cross reference different operators, then you can use join / statestores / KTables.
None of this is really related to parallelism. You have num.stream.threads, or you can run multiple instances of the same JVM process to scale.
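For reference, these scaling knobs are plain Streams configuration; a minimal sketch (the application id and thread count are illustrative):

import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // defines the consumer group
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Four stream threads in this JVM; tasks (bounded by the input partition count)
// are distributed across these threads and across instances sharing the application.id.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);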

What are the advantages and disadvantages of using several Kafka topics compared to using a big Kafka topic?

I cannot find literature on this; I just found a couple of articles that don't make it any clearer.
For example, say I want to stream tweets. I could create one big Kafka topic called 'tweets' and use it to send all the tweets, but it would also be possible to create several smaller Kafka topics for the different subjects the tweets are about: dogs, cats, horses, etc. (imagine that these subjects are relevant for the project).
What would be the advantages and disadvantages of using several smaller Kafka topics instead of using a general Kafka topic?
imagine that these subjects are relevant for the project
That is the most important consideration. Otherwise, you end up with millions of topics for every possible word, and once you multiply that by different language support, it will not scale.
There is a middle ground, too: routing specific messages to certain partitions.
The only real deciding factor is that keyed records should end up in the same partition to be ordered.
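A sketch of that middle ground, using the producer's pluggable Partitioner interface (the subject names and routing rule are invented for illustration, and non-null keys are assumed):

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

public class SubjectPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Pin one hot subject to a dedicated partition...
        if ("dogs".equals(key)) {
            return 0;
        }
        // ...and spread everything else by a simple key hash (assumes non-null keys).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}

The class would then be registered on the producer via the partitioner.class config property.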
There is also an upper bound on the number of topics that a Kafka cluster can reasonably support. The practical upper limit on the number of partitions for one topic is generally considered higher than the number of topics the cluster can hold.

Should I create more topics or more partitions?

Kafka gets orders from other countries.
I need to group these orders by country. Should I create more topics, one per country name, or have one topic with different partitions?
Another way would be to have one topic and use a Kafka Streams application that filters orders and sends them to a specific country topic?
What is better if the number of countries is over 180?
I want to distribute orders across executors who are placed in a specific country/city.
Remark:
So, an order has data about the country/city. Kafka must then find executors in this country/city and send them the same order.
tl;dr
In your case, I would create one topic, countries, and use the country_id or country_name as the message key so that messages for the same country are placed in the same partition. In this way, each partition will contain information for a specific country (or countries; it depends).
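A minimal sketch of that tl;dr (the country codes and order payloads are invented for illustration):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class CountryOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // country code as the message key: the default partitioner hashes it,
            // so every order for "DE" lands in the same partition, ordered per country
            producer.send(new ProducerRecord<>("countries", "DE", "{\"orderId\":1,\"city\":\"Berlin\"}"));
            producer.send(new ProducerRecord<>("countries", "FR", "{\"orderId\":2,\"city\":\"Paris\"}"));
        }
    }
}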
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, a single topic with multiple partitions won't allow you to implement, e.g., message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions, so that multiple consumers can join the consumer group (see the sketch after this list). Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and, obviously, active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
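Regarding the throughput factor, a sketch of the consumer side (the topic and group names are assumptions): every worker runs the same code with the same group.id, and Kafka assigns each partition to exactly one worker in the group, which is what gives you partition-level parallelism.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CountryOrderWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-executors"); // same group.id on every worker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("countries"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // record.key() is the country; partitions are spread across
                    // all workers in the group, one worker per partition
                    System.out.printf("country=%s order=%s%n", record.key(), record.value());
                }
            }
        }
    }
}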

Splitting Kafka into separate topic or single topic/multiple partitions

As usual, it's a bit confusing to see the benefits of one splitting method over the others.
I can't see the difference/pros-cons between having
Topic1 -> P0 and Topic2 -> P0
over Topic1 -> P0, P1
with a consumer pulling from 2 topics or from a single topic with 2 partitions, while P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs Topic2's data, then it's easy to consume.
Regarding topic auto-generation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
This decision depends on the same factors outlined in my answer above: logic/separation of concerns, host storage capabilities, throughput, and retention policy.
Coming to your second question now, I am not sure what your requirement is or how this question relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, it will automatically create that topic when auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
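With auto-creation disabled, topics are typically created explicitly instead, for example with the AdminClient; a minimal sketch (the topic name, partition count, and replication factor are placeholders):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3 (both values are placeholders)
            admin.createTopics(List.of(new NewTopic("orders", 6, (short) 3))).all().get();
        }
    }
}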
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of features that Kafka offers. Some of the features may be: data balancing per brokers, removing topics, consumer groups, ACLs, joins with Kafka Streams, etc.
I think that this flag can be easily compared with automatically creating tables in your database. It's handy to do it in your dev environments but you never want it to happen in production.

Kafka streams - Multiple topics as same source or one topic per source?

When building a Kafka Streams topology, reads from multiple topics can be modeled in two different ways:
Read all topics with the same source node.
topologyBuilder.addSource("sourceName", ..., "topic1", "topic2", "topic3");
Read each topic using a separate source node.
topologyBuilder.addSource("sourceName1", ..., "topic1")
.addSource("sourceName2", ..., "topic2")
.addSource("sourceName3", ..., "topic3");
Is there a relative advantage of option 1 over option 2, or vice versa? All topics contain the same type of data and share the same data processing logic.
Given that, as you state, all input topics contain the same kind of data and subsequent processing of the data is equivalent, you should most probably go with option 1, for the following two reasons:
1) this will result in a smaller topology
2) you would only need to connect one source node to your subsequent processing steps
In case processing will need to be different for the different source topics at a later point in time, you could then split up the source node into multiple ones.
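A minimal sketch of option 1 with the current Topology API (the node and topic names mirror the snippets in the question; the processor body is a placeholder):

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;

public class SingleSourceTopology {
    public static void main(String[] args) {
        // one shared processor, since the processing logic is identical for all topics
        ProcessorSupplier<String, String, Void, Void> shared =
            () -> (Record<String, String> record) -> {
                System.out.println(record.key() + " -> " + record.value());
            };

        Topology topology = new Topology();
        // one source node reads all three topics...
        topology.addSource("sourceName", "topic1", "topic2", "topic3")
                // ...so only one edge is needed to the downstream processor
                .addProcessor("processorName", shared, "sourceName");
    }
}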
There are several other factors to consider.
If your input data is uniformly distributed between input topics (by the size and the rate of messages), then go for option 1, because of its simplicity.
If not, then the "slow" topics will slow down your overall consumption, so to achieve smaller delays on "fast" topics go for option 2.
If you run several such topologies in parallel on different nodes (for high availability or high throughput), then having one consumer group (option 1) will result in more consumers to coordinate within it. In my experience this also slows down consumption, especially when you restart consumers (or when they drop out). In this case I would also go for option 2: fewer consumers in a group require less coordination effort, and delays are shorter.