Apache Kafka multiple groups - apache-kafka

I was reading about Apache Kafka, and I came across its concept of consumer groups. What I don't understand is the use case. Two different consumers from different groups may read the same published message. Why would one want the same message to be processed by two different consumers? Can someone give a practical use case?

Say you want to write the data to MySQL and to Elasticsearch, and you also have an application that reads the events and flags some as "errors".
Each of these use cases is a separate application that wants to see all the events, so each runs as its own consumer group and each sees every message.

This is actually the most typical scenario in Kafka: an application produces a message and you have two different systems creating two different views on that data (e.g. indexing it in Elasticsearch and caching it in Redis).
Before Kafka it was common to have your app dual-writing its data into both systems, with all the consistency problems dual writes carry.
With Kafka you can spin up as many consumer systems as you like, each in its own consumer group, and you also get parallelism and fault tolerance by running multiple partitions and multiple consumer instances within each group.
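A minimal sketch of this pattern (the topic name "events" and the group ids are made up for illustration): two plain Kafka consumers with different group.id values each receive every record from the same topic, one to index into Elasticsearch and one to flag errors.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TwoGroupsExample {

    // Build a consumer that belongs to the given consumer group.
    static KafkaConsumer<String, String> buildConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId); // a different group id means an independent copy of the stream
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> esIndexer = buildConsumer("es-indexer");
        KafkaConsumer<String, String> errorFlagger = buildConsumer("error-flagger");
        esIndexer.subscribe(List.of("events"));
        errorFlagger.subscribe(List.of("events"));

        // Both groups independently see every message published to "events".
        while (true) {
            ConsumerRecords<String, String> forEs = esIndexer.poll(Duration.ofMillis(100));
            ConsumerRecords<String, String> forErrors = errorFlagger.poll(Duration.ofMillis(100));
            forEs.forEach(r -> System.out.println("index into ES: " + r.value()));
            forErrors.forEach(r -> System.out.println("check for errors: " + r.value()));
        }
    }
}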

Related

Splitting Kafka into separate topics or a single topic with multiple partitions

As usual, it's a bit confusing to see the benefits of one splitting method over the others.
I can't see the difference/pros and cons between having
Topic 1 -> P0 and Topic 2 -> P0
versus Topic 1 -> P0, P1
with a consumer pulling from 2 topics or from a single topic with 2 partitions, where P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs the Topic 2 data, it's easy to consume.
Regarding topic auto-generation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, a single topic whose partitions separate the entities won't let you implement, for example, message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition); see the sketch after this list.
Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs shed some more light on this:
The partitions in the log serve several purposes. First, they allow
the log to scale beyond a size that will fit on a single server. Each
individual partition must fit on the servers that host it, but a topic
may have many partitions so it can handle an arbitrary amount of data.
Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously the number of active consumers).
Retention Policy: Message retention in Kafka works at the partition/segment level, and you need to make sure that the partitioning you've chosen, in conjunction with the retention policy you've picked, supports your use case.
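To illustrate the keyed-messages point above, here is a minimal sketch (the topic name and keys are made up): all events with the same user key go to the same partition of a users topic, so they stay in order for that user.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records share the key "user-42", so the default partitioner
            // hashes them to the same partition of "users" and their relative
            // order is preserved for consumers of that partition.
            producer.send(new ProducerRecord<>("users", "user-42", "user created"));
            producer.send(new ProducerRecord<>("users", "user-42", "email updated"));
            // A different key may land on a different partition; there is no
            // ordering guarantee between the two keys.
            producer.send(new ProducerRecord<>("users", "user-99", "user created"));
        }
    }
}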
Coming to your second question now, I am not sure what your requirement is or how this question relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will automatically be created if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of features that Kafka offers, such as data balancing across brokers, removing topics, consumer groups, ACLs, joins with Kafka Streams, etc.
I think that this flag can be easily compared with automatically creating tables in your database. It's handy to do it in your dev environments but you never want it to happen in production.
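If auto-creation is disabled, topics can be created explicitly instead; here is a minimal sketch using the AdminClient (the topic name, partition count, and replication factor are placeholders):

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Explicitly create a topic with 4 partitions and replication factor 3,
            // instead of relying on auto.create.topics.enable.
            NewTopic topic = new NewTopic("events", 4, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}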

We read data from brokers through multiple consumers in a consumer group, but how is the consumed data combined?

I need data from Kafka brokers, but for fast access I am using multiple consumers with the same group id, known as a consumer group. But after each consumer has read its share, how can we combine the data from multiple consumers? Is there any logic?
By design, different consumers in the same consumer group process data independently from each other. (This behavior is what allows applications to scale well.)
But after each consumer has read its share, how can we combine the data from multiple consumers? Is there any logic?
The short but slightly simplified answer, when you use Kafka's "Consumer API" (also called the "consumer client" library), which I think is what you are using based on the wording of your question: if you need to combine data from multiple consumers, the easiest option is to make this (new) input data available in another Kafka topic, where you do the combining in a subsequent processing step. A trivial example: that second Kafka topic could be set up with just 1 partition, so any subsequent processing step would see all the data that needs to be combined.
If this sounds a bit too complicated, I'd suggest using Kafka's Streams API, which makes it much easier to define such processing flows (e.g. joins or aggregations, like in your question). In other words, Kafka Streams gives you a lot of the built-in "logic" that you are looking for: https://kafka.apache.org/documentation/streams/
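A minimal Kafka Streams sketch of such a combining step (topic names and the aggregation itself are made up for illustration): it reads the input topic and counts events per key across all partitions, writing the combined result to an output topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class CombineWithStreams {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "combine-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");
        // The "combining" logic: count events per key across all partitions.
        KTable<String, Long> countsPerKey = events.groupByKey().count();
        // Write the combined result to an output topic for downstream use.
        countsPerKey.toStream().to("event-counts-per-key",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}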
The aim of Kafka is to provide you with a scalable, performant and fault-tolerant framework. Having a group of consumers reading the data from different partitions asynchronously allows you to achieve the first two goals. The grouping of the data is a bit outside the scope of the standard Kafka flow - you could use a single partition with a single consumer in the simplest case, but I'm sure that is not what you want.
For things such as aggregating a single state from different consumers, I would recommend applying a solution designed specifically for that kind of goal. If you are working in the Hadoop ecosystem, you can use a Storm Trident bolt, which allows you to aggregate data from your Kafka spouts. Or you can use Spark Streaming, which would allow you to do the same but in a slightly different fashion. Or, as an option, you can always implement a custom component with such logic using the standard Kafka libraries.

Hint about Kafka cluster setup

I have the following scenario:
4 wearable sensors attached to individuals.
Potentially infinite individuals.
A Kafka cluster.
I have to perform real-time processing on the data streams on a cluster running an instance of Apache Flink.
Kafka is the data hub between the Flink cluster and the sensors.
Moreover, the subjects' streams are totally independent, and different streams belonging to the same subject are also independent of each other.
I imagine this setup in my mind:
I set up a specific topic for each subject, and each topic is split into 4 partitions, one for each sensor on that specific person.
In this way I thought I would establish a consumer group for every topic.
Actually, my data volume is not that big, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...
My questions are:
Is this setup good? What do you think about it?
In this way I will have 4 Kafka brokers and each one handles a partition, right (without considering potential replicas)?
Destroy me guys,
and thanks in advance
You can't have an infinite number of topics in a Kafka cluster so if you plan to scale beyond 10,000 or more topics then you should consider another design. Instead of giving each individual a dedicated topic, you can use an individual's ID as a key and publish data as a key/value pair to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
Also consider using more partitions. Each of your 4 brokers can handle many partitions. If you only have 4 partitions in a topic, then you can have at most 4 consumers working together in parallel in a consumer group (in your case, in Flink).
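A rough sketch of that design on the consuming side (assuming the Flink Kafka connector's KafkaSource API, a single shared topic named "sensor-readings" keyed by subject ID, and a value format where the subject ID comes first; all of these are illustrative assumptions, not part of the original question):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SensorJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One shared topic for all individuals; records are keyed by subject ID
        // on the producer side, so each person's data stays on one partition.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("sensor-readings")
                .setGroupId("flink-processor")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> readings =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensor-readings-source");

        // Re-key inside Flink by subject ID (assuming values look like "subjectId;sensorId;value"),
        // so per-person processing stays independent and scales with the number of partitions.
        readings
                .keyBy(value -> value.split(";")[0])
                .print();

        env.execute("sensor-processing");
    }
}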

Assign different group id to different consumers in the same application

I am aware of the parallelism advantages that kafka streams offer which are leveraged if your parallelism needs are aligned with the partitioning of the topics.
I am considering having an application subscribe many consumers to different consumer groups so that each consumer consumes a full copy of the whole topic.
Specifically, I am thinking of having multiple threads consume the same topic to produce different results, even though I know that I can express all my computation needs using the "chaining" computation paradigm that KStreams offers.
The reason why I am considering different threads is that I want multiple dynamically created KTable instances of the stream, each one working on the same stream (not a subset) and aggregating different results. Since it's dynamic, it can create really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own streams instance (and consumer group) is valid.
Of course I can also add thread parallelism by having multiple threads consuming smaller subsets of the data and individually doing all the computations (e.g. each one maintaining subsets of all the different KTables) which will still provide concurrency.
So, two main points in my question
Is Kafka Streams generally not suited to this kind of thread parallelism, i.e. is the library not intended to be used that way?
In the case where threads are being used to consume a topic would it be a better idea to make threads follow the general kafka parallelism concept of working on different subsets of the data, therefore making thread parallelism an application-level analogous to scaling up using more instances?
But I am wondering whether it would be okay to have an application that subscribes many consumers to different consumer groups so that each consumer consumes a full copy of the whole topic.
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
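A rough sketch of this multiple-instances idea (the topic name, application ids, and the trivial topology are made up): two KafkaStreams instances in the same JVM, each with its own application.id and therefore its own consumer group, both reading the full input topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class TwoStreamsInstances {

    // Each application.id maps to its own consumer group, so each instance
    // independently sees the whole input topic.
    static KafkaStreams buildInstance(String applicationId) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // A placeholder topology; in practice each instance would build a different one.
        builder.stream("input-topic").to("output-" + applicationId);
        return new KafkaStreams(builder.build(), props);
    }

    public static void main(String[] args) {
        buildInstance("dynamic-aggregation-1").start();
        buildInstance("dynamic-aggregation-2").start();
    }
}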
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
KStream<String, Long> input = ...;
KTable<String, Long> firstChain = input.filter(...).groupByKey().count(...);
KStream<String, Long> secondChain = input.mapValues(...);
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!

Designing Kafka Topics - Many Topics vs. One Big Topic

Considering a stream of different events, the recommended way would be either:
one big topic containing all events
multiple topics for different types of events
Which option would be better?
I understand that if messages are not in the same partition of a topic there is no ordering guarantee, but are there any other factors to consider when making this decision?
A topic is a logical abstraction and should contain messages of the same type. Let's say you monitor a website and capture click stream events, and on the other hand you have a database that populates its changes into a changelog topic. You should have two different topics because click stream events are not related to your database changelog.
This has multiple advantages:
your data will have different formats and you will need different (de)serializers to write and read the data (using a single topic you would need a hybrid serializer, and you would not get type safety when reading the data)
you will have different consumer applications: one application might be interested in click stream events only, while a second application is only interested in the database changelog and a third application is interested in both. With multiple topics, applications one and two only subscribe to the topics they are interested in -- with a single topic, applications one and two need to read everything and filter out the records they are not interested in, increasing broker, network, and client load
As @Matthias J. Sax said before, there is no silver bullet here. But there are several aspects to take into account.
The deciding factor: ordered delivery
If your application needs guaranteed ordered delivery, you need to work with only one topic, plus the same key for those messages that need the guarantee.
If ordering is not mandatory, the game starts...
Is the schema the same for all messages?
Would consumers be interested in the same types of events?
What is going to happen on the consumer side? Are we reducing or increasing complexity in terms of implementation, maintainability, error handling...?
Is horizontal scalability important for us? More topics often means more partitions available, which means more capacity for horizontal scaling. It also allows more precise scalability configuration on the broker side, because we can choose how many partitions to add per event type, and on the consumer side, how many consumers to stand up per event type.
Does it make sense to parallelise consumption per message type?
...
Technically speaking, if we allow consumers to fine-tune which types of events they consume, we potentially reduce the network bandwidth required to send unwanted messages from the broker to the consumer, plus the number of deserialisations for all of them (CPU used, which over time means more free resources and lower energy costs...).
It is also worth remembering that splitting different types of messages into different topics doesn't mean you have to consume them with different Kafka consumers, because a single consumer can read from multiple topics at the same time, as sketched below.
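A minimal sketch of that (the topic names follow the click stream / changelog example above; the group id is a placeholder): one consumer subscribed to both topics, so splitting by type does not force you to run separate consumers.

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "combined-reader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One consumer, two topics: an application interested in both event
            // types subscribes to both; others subscribe to just one of them.
            consumer.subscribe(Arrays.asList("clickstream-events", "db-changelog"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.topic() + ": " + r.value()));
            }
        }
    }
}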
Well, there's no clear answer to this question, but because of all of the above I have the feeling that with Kafka, if ordered delivery is not needed, we should split our messages by type into different topics.