Is there any difference between KafkaConsumer and KafkaStreams? [duplicate] - apache-kafka

This question already has answers here:
Kafka: Consumer API vs Streams API
(3 answers)
Closed 5 years ago.
I'm using Apache Kafka 0.8.2.1, planning to upgrade my application to use Apache kafka 1.0.0.
While I inspect about Kafka Streams, I got some question about difference between KafkaConsumer and KafkaStreams.
Basically, KafkaConsumer have to consume from broker using polling method. I can specify some duration while polling, and whenever I got ConsumerRecored I can handle it to product some useful information.
KafkaStream, on the other hand, I don't have to specify any polling duration but just call start() method.
I know that KafkaConsumer basically used to consume literally, from broker and KafkaStreams can do various thing like Map-Reduce or interact with database, even re-produce to other kafka or any other systems.
So, there is my question. Is there any difference between KafkaConsumer and KafkaStream basically(in other words, When it comes to level of apache kafka library.)?

Yes, there is the difference between Kafka Consumer and Kafka Streams.
Kafka Consumer can be used at receiving end to receive data and process for future computation(based on topic and partition)
Kafka Streams API to store and distribute, in real-time, published content to the various applications and systems that make it available to the readers.

As you've said, they offer different functionalities:
KafkaStreams allows to perform complex processing on records
KafkaConsumer allows to receive records from a Kafka Cluster
KafkaStreams uses regular KafkaConsumers and KafkaProducers clients under the cover in order to retrieve records and send the results of processing to the brokers. It uses predefined values for many configurations but still exposes a lot of client configurations.
KafkaStreams is a regular (although pretty advanced) Kafka application using the Kafka clients (Consumer and Producer). Its APIs allow higher level applications to focus on the business logic and not on the Kafka details.
Also being part of the Apache Kafka distribution, it's using best practices and tricks to make the most of Kafka.

Related

Improve performance by using Flink instead of Kafka Streams when Source and Sink are in Kafka?

Assuming I have input data coming in via Kafka topics, and output data to be sent to Kafka topics as well, under what circumstances would Flink be able to process data faster than Kafka Streams? At least when it comes to the time spent consuming and producing, I would not expect Flink to be any faster than Kafka Streams.
Both Flink and Kafka Streams are built on top of the same Producer and Consumer API, so they'll act similarly, up to a point. Once you get into the specific API/DSL, then the stacktrace gets more nested.
Outside of record serialization, Flink can perform more tasks like using Flink SQL compared to Kafka's KSQL, but in those cases, you're managing an external cluster.
Personally, I find Kafka Streams to be faster to develop and maintain because the application itself is the deployable unit, not something to submit to a pool of resources that might be preempted by some scheduler. But if you want to use more than a JVM language, then you will need to venture into Flink or even Beam. And those other languages will be slower because the code will then interface with those native Java libraries.

how do i test Exactly Once Semantics working in my kafka streams application

i have a Kafka Streams DSL application, we have a requirement on exactly once processing, for the same i have added the configuration
streamConfig.put(processing.gurantee, "exactly_once");
I am using kafka 2.7
I have 2 queries
what's the difference between exactly_once and exactly_once_beta
how do i test this functionality to be sure my messages are getting processed only once
Thanks!
exactly_once_beta is an improvement over exactly_once. While exactly_once uses a transactional producer for each stream task (combination of sub-topology and input partition, exactly_once_beta uses a transactional producer for each stream thread of a Kafka Streams client.
Every producer comes with separate memory buffers, a separate thread, separate network connections which might limit scaling the number of input partitions (i.e. number of tasks). A high number of producers might also cause more load on the brokers. Hence, exactly_once_beta has better scaling characteristics. You can find more details in KIP-447.
Note that exactly_once will be deprecated and exactly_once_beta will be renamed to exactly_once_v2 in Apache Kafka 3.0. See KIP-732 for more details.
For tests you can get inspiration from the tests in the Apache Kafka repo:
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EosIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EOSUncleanShutdownIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/streams/streams_eos_test.py
Basically, you need to create a failover scenario and verify that messages are not produced multiple times to the output topics. Note that messages may be processed multiple times, but the results in the output topics must appear as if they were only processed once. You can find a pretty good talk about exactly-once semantics that also explains the failover scenarios here: https://www.confluent.io/kafka-summit-london18/dont-repeat-yourself-introducing-exactly-once-semantics-in-apache-kafka/

Kafka stream vs kafka consumer how to make decision on what to use

I have worked on some Kafka stream application and Kafka consumer application. In the end, Kafka stream is nothing but consumer which consumes real-time events from Kafka. So I am not able to figure out when to use Kafka streams or why we should use Kafka streams as we can perform all transformation on the consumer end.
I want to understand the main difference between Kafka stream and Kafka consumer as implementation wise and how to make a decision about what we should use in different use cases.
Thanks in advance for answers.
It's a question about "easy of use" (or simplicity) and "flexibility". The two "killer features" of Kafka Streams, compared to plain consumer/producer are:
built-in state handling, and
exactly-once processing semantics.
Building a stateful, fault-tolerant application or using Kafka transactions with plain consumers/producers is quite difficult to get right. Furthermore, the higher level DSL provides a lot of built-in operators that are hard to build from scratch, especially:
windowing and
joins (stream-stream, stream-table, table-table)
Another nice feature is punctuations.
However, even if you build a simple stateless application, using Kafka Streams can help you significantly to reduce you code base (ie, avoid boilerplate code). Hence, the recommendation is, to use Kafka Streams when possible and only fall back to consumer/producer if Kafka Streams is not flexible enough for your use case.
It's different ways to do the same thing, with different levels of abstraction and functionality.
Here's a side-by-side comparison of doing the same thing (splitting a string into two separate fields) in Kafka vs in Kafka Streams (for good measure it shows doing it in ksqlDB too)

Is Kafka a message queue and can Kafka be used as the database?

Some places mentioned Kafka is the publish-subscribe messaging. Other sources mentioned Kafka is the Message Queue. May I ask the differences between those and can Kakfa be used as the database?
There are 2 patterns named Publish-Subscribe and Message Queue. There are some places discussed the differences. here
Kafka especially supports both of these 2 patterns. For the publish-subscribe pattern, Kafka has publisher/subscriber which supported this pattern. The publisher sends messages to one topic and the subscriber can subscribes and receives messages on that one. For the queueing pattern, Kafka has a concept named Consumer Group. Within the same consumer group, all consumers will share jobs hence balancing the workload.
Because of the flexible design from the start, Kafka is broadly used for many software patterns while designing the system.
Personally, I would not call Kafka itself a database but you can use Kafka as the storage, especially through some mechanisms such as the log compaction. Ref1 Ref2
Kafka is a storage at base like a database but without indexes, where every query is a full scan of your data. Kafka it store data in files that can not be modified. Ex if you use event sourcing you can save all event of your system in Kafka and reprocess all events when your system have a bug.
Imagine that Kafka can split a very huge file(10TB or more) on multiple server and provide a way to read that file in a distributed manner using partitions( more partition you have, more application can read in parallel).
Because its a storage, Kafka can also be used as a message queue or as a publish-subscribe system.

Kafka Streams use case

I am building a simple application which does below in order -
1) Reads messages from a remote IBM MQ(legacy system only works with IBM MQ)
2) Writes these messages to Kafka Topic
3) Reads these messages from the same Kafka Topic and calls a REST API.
4) There could be other consumers reading from this topic in future.
I came to know that Kafka has the new streams API which is supposed to be better than Kafka consumer in terms of speed/simplicity etc. Can someone please let me know if the streams API is a good fit for my use case and at what point in my process i can plug it ?
It is true that Kafka Streams API has a simple way to consume records in comparison to Kafka Consumer API (e.g. you don't need to poll, manage a thread and loop), but it also comes with a cost (e.g. local data store - if you do stateful processing).
I would say that if you need to consume records one by one and call a REST API use the Consumer API, if you need stateful processing, query the topic state, etc. use the Streams API.
For more info take a look to this blog post: https://balamaci.ro/kafka-streams-for-stream-processing/
Reads messages from a remote IBM MQ (legacy system only works with
IBM MQ)
Writes these messages to Kafka Topic
I'd use Kafka Connect for (1) and (2). It is part of the Kafka project, and there are many free as well as commercial "connectors" available for hundreds of systems.
Reads these messages from the same Kafka Topic and calls a REST API.
You can use Kafka Streams as well as the lower-level Consumer API of Kafka, depending on what you prefer. I'd go with Kafka Streams as it is easier to use and far more powerful. (Both are part of the Kafka project.)
There could be other consumers reading from this topic in future.
This works out-of-the-box -- once data is stored in a Kafka topic according to step 2, many different applications and "consumers" can read this data independently.
Looks like you are not doing any processing/transformation once you consume you message from your IBM MQ or even after your Kafka Topic.
First one -> from IBM Mq to your Kafka Topic is kind of a pipeline and
Secondly -> You are just calling the REST API(I assume w/o any processing)
Considering these facts it seems to be a good fit for using Simple consumer.
Let's not use a technology only because it's there :)