Kafka Streams use case - apache-kafka

I am building a simple application which does below in order -
1) Reads messages from a remote IBM MQ(legacy system only works with IBM MQ)
2) Writes these messages to Kafka Topic
3) Reads these messages from the same Kafka Topic and calls a REST API.
4) There could be other consumers reading from this topic in future.
I came to know that Kafka has the new streams API which is supposed to be better than Kafka consumer in terms of speed/simplicity etc. Can someone please let me know if the streams API is a good fit for my use case and at what point in my process i can plug it ?

It is true that Kafka Streams API has a simple way to consume records in comparison to Kafka Consumer API (e.g. you don't need to poll, manage a thread and loop), but it also comes with a cost (e.g. local data store - if you do stateful processing).
I would say that if you need to consume records one by one and call a REST API use the Consumer API, if you need stateful processing, query the topic state, etc. use the Streams API.
For more info take a look to this blog post: https://balamaci.ro/kafka-streams-for-stream-processing/

Reads messages from a remote IBM MQ (legacy system only works with
IBM MQ)
Writes these messages to Kafka Topic
I'd use Kafka Connect for (1) and (2). It is part of the Kafka project, and there are many free as well as commercial "connectors" available for hundreds of systems.
Reads these messages from the same Kafka Topic and calls a REST API.
You can use Kafka Streams as well as the lower-level Consumer API of Kafka, depending on what you prefer. I'd go with Kafka Streams as it is easier to use and far more powerful. (Both are part of the Kafka project.)
There could be other consumers reading from this topic in future.
This works out-of-the-box -- once data is stored in a Kafka topic according to step 2, many different applications and "consumers" can read this data independently.

Looks like you are not doing any processing/transformation once you consume you message from your IBM MQ or even after your Kafka Topic.
First one -> from IBM Mq to your Kafka Topic is kind of a pipeline and
Secondly -> You are just calling the REST API(I assume w/o any processing)
Considering these facts it seems to be a good fit for using Simple consumer.
Let's not use a technology only because it's there :)

Related

Kafka Consumer and Producer

Can I have the consumer act as a producer(publisher) as well? I have a user case where a consumer (C1) polls a topic and pulls messages. after processing the message and performing a commit, it needs to notify another process to carry on remaining work. Given this use case is it a valid design for Consumer (C1) to publish a message to a different topic? i.e. C1 is also acting as a producer
Yes. This is a valid use case. We have many production applications does the same, consuming events from a source topic, perform data enrichment/transformation and publish the output into another topic for further processing.
Again, the implementation pattern depends on which tech stack you are using. But if you after Spring Boot application, you can have look at https://medium.com/geekculture/implementing-a-kafka-consumer-and-kafka-producer-with-spring-boot-60aca7ef7551
Totally valid scenario, for example you can have connector source or a producer which simple push raw data to a topic.
The receiver is loosely coupled to your publisher so they cannot communicate each other directly.
Then you need C1 (Mediator) to consume message from the source, transform the data and publish the new data format to a different topic.
If you're using a JVM based client, this is precisely the use case for using Kafka Streams rather than the base Consumer/Producer API.
Kafka Streams applications must consume from an initial topic, then can convert(map), filter, aggregate, split, etc into other topics.
https://kafka.apache.org/documentation/streams/

Is Kafka a message queue and can Kafka be used as the database?

Some places mentioned Kafka is the publish-subscribe messaging. Other sources mentioned Kafka is the Message Queue. May I ask the differences between those and can Kakfa be used as the database?
There are 2 patterns named Publish-Subscribe and Message Queue. There are some places discussed the differences. here
Kafka especially supports both of these 2 patterns. For the publish-subscribe pattern, Kafka has publisher/subscriber which supported this pattern. The publisher sends messages to one topic and the subscriber can subscribes and receives messages on that one. For the queueing pattern, Kafka has a concept named Consumer Group. Within the same consumer group, all consumers will share jobs hence balancing the workload.
Because of the flexible design from the start, Kafka is broadly used for many software patterns while designing the system.
Personally, I would not call Kafka itself a database but you can use Kafka as the storage, especially through some mechanisms such as the log compaction. Ref1 Ref2
Kafka is a storage at base like a database but without indexes, where every query is a full scan of your data. Kafka it store data in files that can not be modified. Ex if you use event sourcing you can save all event of your system in Kafka and reprocess all events when your system have a bug.
Imagine that Kafka can split a very huge file(10TB or more) on multiple server and provide a way to read that file in a distributed manner using partitions( more partition you have, more application can read in parallel).
Because its a storage, Kafka can also be used as a message queue or as a publish-subscribe system.

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have straightforward scenario for the ETL job: take data from Kafka topic and put it to HBase table. In the future i'm going to add the support for some logic after reading data from a topic.
I consider two scenario:
use Kafka Streams for reading data from a topic and further writing via native HBased driver each record
Use Kafka -> HBase connector
I have the next concerns about my options:
Is is a goo idea to write data each time it arrives in a Kafka Stream's window? - suggest that it'll downgrade performance
Kafka Hbase connector is supported only by third-party developer, i'm not sure about code quality of this solution and about the option to add custom aggregation logic over data from a topic.
I myself have been trying to search for ETL options for KAFKA to HBase, however, so far my research tells me that it's a not a good idea to have an external system interaction within a KAFKA streams application (check the answer here and here). KAFKA streams are super powerful and great if you have KAFKA->Transform_message->KAFKA kind of use case, and eventually you can have KAFKA connect that will take your data from KAFKA topic and write it to a sink.
Since you do not want to use the third party KAFKA connect for HBase, one option is to write something yourself using the connect API, the other option is to use the KAFKA consumer producer API and write the app using the traditional way, poll the messages, write to sink, commit the batch and move on.

Where to run the processing code in Kafka?

I am trying to setup a data pipeline using Kafka.
Data go in (with producers), get processed, enriched and cleaned and move out to different databases or storage (with consumers or Kafka connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either producer or consumer.
Or you could setup an environment dedicated to something like Kafka Streams processes or a KSQL cluster
It is possible either ways.Consider all possible options , choose an option which suits you best. Lets assume you have a source, raw data in csv or some DB(Oracle) and you want to do your ETL stuff and load it back to some different datastores
1) Use kafka connect to produce your data to kafka topics.
Have a consumer which would consume off of these topics(could Kstreams, Ksql or Akka, Spark).
Produce back to a kafka topic for further use or some datastore, any sink basically
This has the benefit of ingesting your data with little or no code using kafka connect as it is easy to set up kafka connect source producers.
2) Write custom producers, do your transformations in producers before
writing to kafka topic or directly to a sink unless you want to reuse this produced data
for some further processing.
Read from kafka topic and do some further processing and write it back to persistent store.
It all boils down to your design choice, the thoughput you need from the system, how complicated your data structure is.

Is there any difference between KafkaConsumer and KafkaStreams? [duplicate]

This question already has answers here:
Kafka: Consumer API vs Streams API
(3 answers)
Closed 5 years ago.
I'm using Apache Kafka 0.8.2.1, planning to upgrade my application to use Apache kafka 1.0.0.
While I inspect about Kafka Streams, I got some question about difference between KafkaConsumer and KafkaStreams.
Basically, KafkaConsumer have to consume from broker using polling method. I can specify some duration while polling, and whenever I got ConsumerRecored I can handle it to product some useful information.
KafkaStream, on the other hand, I don't have to specify any polling duration but just call start() method.
I know that KafkaConsumer basically used to consume literally, from broker and KafkaStreams can do various thing like Map-Reduce or interact with database, even re-produce to other kafka or any other systems.
So, there is my question. Is there any difference between KafkaConsumer and KafkaStream basically(in other words, When it comes to level of apache kafka library.)?
Yes, there is the difference between Kafka Consumer and Kafka Streams.
Kafka Consumer can be used at receiving end to receive data and process for future computation(based on topic and partition)
Kafka Streams API to store and distribute, in real-time, published content to the various applications and systems that make it available to the readers.
As you've said, they offer different functionalities:
KafkaStreams allows to perform complex processing on records
KafkaConsumer allows to receive records from a Kafka Cluster
KafkaStreams uses regular KafkaConsumers and KafkaProducers clients under the cover in order to retrieve records and send the results of processing to the brokers. It uses predefined values for many configurations but still exposes a lot of client configurations.
KafkaStreams is a regular (although pretty advanced) Kafka application using the Kafka clients (Consumer and Producer). Its APIs allow higher level applications to focus on the business logic and not on the Kafka details.
Also being part of the Apache Kafka distribution, it's using best practices and tricks to make the most of Kafka.