We read data from brokers through multiple consumers using a consumer group, but how is the consumed data combined? - apache-kafka

I need data from Kafka brokers, but for fast access I am using multiple consumers with the same group id, known as a consumer group. But after each consumer has read its data, how can we combine the data from multiple consumers? Is there any logic?

By design, different consumers in the same consumer group process data independently from each other. (This behavior is what allows applications to scale well.)
But after reading by each consumer, how can we combine data from multiple consumers? Is there any logic?
The short but slightly simplified answer, assuming you use Kafka's "Consumer API" (also called the "consumer client" library), which I think is what you are using based on the wording of your question: if you need to combine data from multiple consumers, the easiest option is to have each consumer write its output to another Kafka topic, and to do the combining in a subsequent processing step that reads from that topic. A trivial example: set up this second topic with just 1 partition, so that a single subsequent processing step sees all the data that needs to be combined.
If this sounds a bit too complicated, I'd suggest using Kafka's Streams API, which makes it much easier to define such processing flows (e.g. joins or aggregations, like in your question). In other words, Kafka Streams gives you a lot of the built-in "logic" that you are looking for: https://kafka.apache.org/documentation/streams/
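To make the "second topic with one partition" idea more concrete, here is a minimal plain-Java sketch that models only the combining logic, with no Kafka dependency. It assumes each consumer in the group has already computed partial counts for the partitions it owns, and that one downstream step gets to see all of those partial results. The names `PartialCountMerger`, `countKeys` and `merge` are hypothetical; in a real setup the partial results would travel through an actual Kafka topic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: no Kafka client involved, only the combining logic.
class PartialCountMerger {

    // What one consumer might compute from the records of its own partitions.
    static Map<String, Long> countKeys(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // The single downstream step: it sees every partial result and sums per key.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> fromConsumerA = countKeys(List.of("eu", "us", "eu"));
        Map<String, Long> fromConsumerB = countKeys(List.of("us", "asia"));
        System.out.println(merge(List.of(fromConsumerA, fromConsumerB)));
    }
}
```

The merge step is a single point of serialization, which is exactly the trade-off of a one-partition topic: it is simple, but it does not scale out.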

The aim of Kafka is to provide you with a scalable, performant and fault-tolerant framework. Having a group of consumers read the data from different partitions asynchronously allows you to achieve the first two goals. Grouping the data is a bit outside the scope of the standard Kafka flow: in the simplest case you can use a single partition with a single consumer, but I'm sure that is not what you want.
For things such as aggregating a single state from different consumers, I would recommend a solution designed specifically for that goal. If you are working in the Hadoop ecosystem, you can use a Storm Trident bolt, which allows you to aggregate the data from your Kafka spouts. Or you can use Spark Streaming, which allows you to do the same in a slightly different fashion. As another option, you can always implement a custom component with such logic using the standard Kafka libraries.


Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering whether others with experience using these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is built on top of KafkaProducers/Consumers, so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions about what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option for building your Kafka applications.
Here are some of the scenarios where you might prefer Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. So, let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders by delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders by city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to(), or KTable.toStream().to() for a table), and finally use Kafka Connect to store the messages in the database table and in Elasticsearch. You can do the same thing with the core Producer / Consumer API, but it would require much more code.
In a data processing pipeline, you can do the consume-process-produce cycle in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics from the source topic up to the filtered output topic (carrying that guarantee through to the DB and Elasticsearch additionally depends on the sink connectors writing idempotently). No duplicate messages will be introduced by network glitches and retries. This feature is especially useful when you are computing aggregates such as the count of orders per product. In such scenarios, duplicates will always give you a wrong result.
You can also enrich your incoming data with much lower latency. Let's assume that, in the above example, you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API over the network for each incoming order, which is definitely an expensive operation that hurts your throughput. Instead, you might store the required customer data in a compacted Kafka topic and load it into the streaming application as a KTable or GlobalKTable. Then all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data is stored in the embedded RocksDB that comes with Kafka Streams, and because the KTable is backed by a Kafka topic, the data in your streaming application is continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
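The local-lookup idea can be sketched in plain Java, without any Kafka dependency. The sketch below assumes that customer updates arrive as a changelog in which the newest record per key wins and a null value deletes the key (which is how a compacted topic read as a table behaves). The class `LocalLookupEnricher` and its methods are hypothetical names for illustration, not Kafka Streams API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical plain-Java model of a table-backed lookup; no Kafka dependency.
class LocalLookupEnricher {

    // Latest email per customer id: the "table" view of a compacted changelog topic.
    private final Map<String, String> emailByCustomer = new HashMap<>();

    // Apply one changelog record. A null value acts as a tombstone and deletes the key.
    void applyUpdate(String customerId, String email) {
        if (email == null) {
            emailByCustomer.remove(customerId);
        } else {
            emailByCustomer.put(customerId, email);
        }
    }

    // Enrich an order with a purely local lookup; empty if the customer is unknown.
    Optional<String> enrich(String orderId, String customerId) {
        String email = emailByCustomer.get(customerId);
        return email == null ? Optional.empty() : Optional.of(orderId + "," + email);
    }

    public static void main(String[] args) {
        LocalLookupEnricher enricher = new LocalLookupEnricher();
        enricher.applyUpdate("c1", "old@example.com");
        enricher.applyUpdate("c1", "new@example.com"); // newer record replaces the older one
        System.out.println(enricher.enrich("o1", "c1"));
        System.out.println(enricher.enrich("o2", "unknown"));
    }
}
```

No network call happens per order here, which is the point of the pattern: the cost of keeping the table current is paid once per customer update rather than once per order.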
Let's say you want to join two different streams of data. In the above example, you want to process only the orders that have successful payments, and the payment data arrives through another Kafka topic. Now, the payment may be delayed, or the payment event may arrive before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment event arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store state for a one-hour window, and that state is stored in the RocksDB instance of Kafka Streams.
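The join condition itself is small enough to model in plain Java. The sketch below is not Kafka Streams code: it assumes both event sets are already in memory as maps from order id to event timestamp (epoch milliseconds), and it treats the one-hour bound as inclusive, which is an assumption of this sketch. `WindowedJoinSketch` and `paidWithinWindow` are hypothetical names.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model of a one-hour windowed join; no Kafka dependency.
class WindowedJoinSketch {

    static final Duration WINDOW = Duration.ofHours(1);

    // Both maps go from order id to event timestamp in epoch milliseconds.
    // Returns the sorted ids of orders whose payment lies within the window.
    static List<String> paidWithinWindow(Map<String, Long> orders, Map<String, Long> payments) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> order : orders.entrySet()) {
            Long paymentTs = payments.get(order.getKey());
            if (paymentTs == null) {
                continue; // no payment seen for this order: it is not emitted
            }
            // The payment may come before or after the order, so compare the absolute gap.
            long gapMillis = Math.abs(order.getValue() - paymentTs);
            if (gapMillis <= WINDOW.toMillis()) {
                result.add(order.getKey());
            }
        }
        Collections.sort(result); // deterministic output regardless of map iteration order
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> orders = Map.of("o1", 0L, "o2", 0L);
        Map<String, Long> payments = Map.of("o1", 1_800_000L, "o2", 7_200_000L);
        System.out.println(paidWithinWindow(orders, payments));
    }
}
```

In a real streaming job the events arrive over time, so the engine has to buffer each side for the length of the window; that buffered state is what the paragraph above refers to.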

Kafka Streams vs Kafka consumer: how to decide which to use

I have worked on some Kafka Streams applications and Kafka consumer applications. In the end, Kafka Streams is nothing but a consumer that consumes real-time events from Kafka. So I am not able to figure out when to use Kafka Streams, or why we should use it, as we can perform all transformations on the consumer end.
I want to understand the main differences between Kafka Streams and a Kafka consumer implementation-wise, and how to decide which one to use in different use cases.
Thanks in advance for answers.
It's a question of "ease of use" (or simplicity) versus "flexibility". The two "killer features" of Kafka Streams, compared to a plain consumer/producer, are:
built-in state handling, and
exactly-once processing semantics.
Building a stateful, fault-tolerant application or using Kafka transactions with plain consumers/producers is quite difficult to get right. Furthermore, the higher level DSL provides a lot of built-in operators that are hard to build from scratch, especially:
windowing and
joins (stream-stream, stream-table, table-table)
Another nice feature is punctuations.
However, even if you build a simple stateless application, using Kafka Streams can help you significantly reduce your code base (i.e., avoid boilerplate code). Hence, the recommendation is to use Kafka Streams when possible and to fall back to consumer/producer only if Kafka Streams is not flexible enough for your use case.
They are different ways to do the same thing, with different levels of abstraction and functionality.
Here's a side-by-side comparison of doing the same thing (splitting a string into two separate fields) in Kafka vs in Kafka Streams (for good measure it shows doing it in ksqlDB too)

Assign different group id to different consumers in the same application

I am aware of the parallelism advantages that Kafka Streams offers, which are leveraged if your parallelism needs are aligned with the partitioning of the topics.
I am considering having an application subscribe many consumers to different consumer groups so that each consumer is consuming a replication of the whole topic.
Specifically, I am thinking of having multiple threads consume the same topic to provide different results, even though I know that I can express all my computation needs using the "chaining" computation paradigm that KStreams offers.
The reason I am considering different threads is that I want multiple dynamically created KTable instances of the stream, each one working on the same stream (not a subset) and aggregating different results. Since this is dynamic, it can create a really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own streams instance (and consumer group) is valid.
Of course I can also add thread parallelism by having multiple threads consuming smaller subsets of the data and individually doing all the computations (e.g. each one maintaining subsets of all the different KTables) which will still provide concurrency.
So, two main points in my question:
Is Kafka Streams not generally suited for thread parallelism, meaning the library is not intended to be used that way?
In the case where threads are being used to consume a topic, would it be a better idea to make the threads follow the general Kafka parallelism concept of working on different subsets of the data, therefore making thread parallelism an application-level analogue of scaling up using more instances?
But I am wondering would it be okay to have an application that subscribes many consumers to different consumer groups so that each consumer is consuming a replication of the whole topic.
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
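As a rough sketch of what "its own StreamsConfig" means in practice, the configuration for two such instances could be built like this. Only `java.util.Properties` is used so that the snippet stands alone; `application.id` and `bootstrap.servers` are the Kafka Streams configuration keys, while the helper `configFor`, the two ids and the broker address are placeholders invented for this example.

```java
import java.util.Properties;

// Builds one configuration per Kafka Streams instance; helper name is hypothetical.
class StreamsConfigSketch {

    static Properties configFor(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        // The application.id also serves as the consumer group id of that instance.
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder ids and broker address, chosen only for illustration.
        Properties first = configFor("orders-aggregation-a", "localhost:9092");
        Properties second = configFor("orders-aggregation-b", "localhost:9092");
        System.out.println(first.getProperty("application.id"));
        System.out.println(second.getProperty("application.id"));
    }
}
```

Each `Properties` object would then be handed to its own `KafkaStreams` instance together with a topology. Because the ids differ, each instance forms its own consumer group and therefore reads the whole topic.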
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
KStream<String, Long> input = ...;
KTable<..., ...> firstChain = input.filter(...).groupByKey().count(...);
KStream<..., ...> secondChain = input.mapValues(...);
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!

Implement filtering for Kafka messages

I have started using Kafka recently and am evaluating it for a few use cases.
If we wanted to provide the capability of filtering messages for consumers (subscribers) based on message content, what is the best approach for doing this?
Say a topic named "Trades" is exposed by a producer and carries different trade details such as market name, creation date, price, etc.
Some consumers are interested in trades for specific markets, and others are interested in trades after a certain date, etc. (content-based filtering).
As filtering is not possible on the broker side, what is the best possible approach for implementing the cases below:
If the filtering criteria are specific to a consumer: should we use a ConsumerInterceptor (though interceptors are suggested for logging purposes as per the documentation)?
If the filtering criteria (content-based filtering) are common among consumers, what should the approach be?
Listen to the topic, filter the messages locally, and write to a new topic (using either an interceptor or streams).
If I understand your question correctly, you have one topic and different consumers which are interested in specific parts of the topic. At the same time, you do not own those consumers and want to avoid having them read the whole topic and do the filtering themselves?
For this, the only way to go is to build a new application that reads the whole topic, does the filtering (or actually splitting) and writes the data back into two (or more) different topics. The external consumers would consume from those new topics and only receive the data they are interested in.
Using Kafka Streams for this purpose would be a very good way to go. The DSL should offer everything you need.
As an alternative, you can just write your own application using KafkaConsumer and KafkaProducer to do the filtering/splitting manually in your user code. This would not be much different from using Kafka Streams, as a Kafka Streams application would do the exact same thing internally. However, with Streams your effort to get it done would be way less.
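The splitting step described above is, at its core, a routing decision per message. Here is a minimal plain-Java sketch of that decision, with no Kafka client involved: every trade is read once and grouped under the name of the topic it would be written to. The `Trade` class, the `trades-<market>` naming rule and the method names are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the filtering/splitting application; no Kafka dependency.
class TradeSplitter {

    // Minimal stand-in for a trade message; real messages carry more fields.
    static final class Trade {
        final String market;
        final double price;

        Trade(String market, double price) {
            this.market = market;
            this.price = price;
        }
    }

    // Assumed naming rule: one output topic per market.
    static String targetTopic(Trade trade) {
        return "trades-" + trade.market.toLowerCase(Locale.ROOT);
    }

    // Read every trade once and group it under the topic it would be written to.
    static Map<String, List<Trade>> split(List<Trade> trades) {
        Map<String, List<Trade>> byTopic = new TreeMap<>();
        for (Trade trade : trades) {
            byTopic.computeIfAbsent(targetTopic(trade), topic -> new ArrayList<>()).add(trade);
        }
        return byTopic;
    }

    public static void main(String[] args) {
        List<Trade> trades = List.of(new Trade("NYSE", 10.5), new Trade("LSE", 7.0), new Trade("NYSE", 11.0));
        split(trades).forEach((topic, batch) -> System.out.println(topic + " <- " + batch.size() + " trade(s)"));
    }
}
```

With the plain clients, the loop body would become a `send` to the computed topic; with Kafka Streams, the same routing could be expressed with one `filter(...)` followed by `to(...)` per target topic.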
I would not use interceptors for this. Even if this would work, it does not seem to be a good software design for your use case.
Create your own interceptor class that implements org.apache.kafka.clients.consumer.ConsumerInterceptor, implement your logic in its 'onConsume' method, and then set the 'interceptor.classes' config for the consumer.

Apache Kafka multiple groups

I was reading about Apache Kafka, and I came across its concept of consumer groups. What I don't understand is its use case. Two different consumers from different groups may read the same message being published. Why would one want to process the same message by two different consumers? Can someone give a practical use case?
You want to write the data to MySQL and to Elasticsearch, and you have an application that reads events and flags some as "errors".
Each one of these use-cases is a separate application that will want to see all the events so they will be separate consumer groups and each will see all messages.
This is actually the most typical scenario in Kafka: an application produces a message and you have two different systems creating two different views on that data (e.g. indexing it in ES and caching it in Redis).
Before Kafka it was common to have your app dual-writing its data into both apps, with all the problems dual writes carry in terms of consistency.
With Kafka you can spin off as many consumer systems as you need in the form of groups, and also get parallelism and fault tolerance by having multiple partitions and multiple consumer instances within each group.
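The interplay between groups and partitions can be modelled in a few lines of plain Java. The sketch below is a simplification, not Kafka's actual assignment code: it spreads partitions round-robin over the members of one group, whereas the real assignment strategy is configurable. It illustrates the two properties discussed above: inside a group each partition has exactly one owner, while every group as a whole covers all partitions. `ConsumerGroupSketch`, `assign` and the member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of partition ownership inside one consumer group.
class ConsumerGroupSketch {

    // Round-robin assignment of partitions 0..partitions-1 to the members of ONE group.
    // The member list must not be empty.
    static Map<String, List<Integer>> assign(int partitions, List<String> members) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = members.get(partition % members.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        int partitions = 4;
        // Two independent groups reading the same topic: each group sees every partition.
        System.out.println("indexing group: " + assign(partitions, List.of("es-1", "es-2")));
        System.out.println("caching group:  " + assign(partitions, List.of("redis-1")));
    }
}
```

Adding a member to one group only reshuffles ownership inside that group; the other group is unaffected, which is why separate applications can scale independently.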