Kafka stream vs Kafka Consumer and Java Stream APIs - apache-kafka

I am new to Kafka Stream. I wanted to know what benefits one get by using kafka stream APIs instead of using standard kafka Producer / Consumer APIs and doing the stream processing in Consumer with Java Stream APIs.

Java Stream API has nothing to do with Kafka APIs. You can do consumerRecords.forEach from the Consumer poll iterator, or call producer send method from a Java stream as well, and that's about it.
The Kafka Streams API allows you to more easily map/filter/branch/join topic data, as well as access persistent StateStores. It also has foreach and peek methods, which I assume you're referring to? You have more direct access when to commit your offsets with the Consumer API than any Kafka Streams methods.
For producer, you cannot use Kafka Streams API to produce original data into a topic.

Related

Process messages pushed through Kafka

I haven't used Kafka before and wanted to know if messages are published through Kafka what are the possible ways to capture that info?
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Haven't used Kafka before and while reading up I did find that Kafka needs ZooKeeper running too.
I don't need to publish info just process data received from Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Kafka has its own TCP based protocol, not a native HTTP client (assuming that's what you actually mean by REST)
Consumers are the only way to get and subsequently process data, however plenty of external tooling exists to make it so you don't have to write really any code if you don't want to in order to work on that data

Apache Kafka StateStore

I am learning Apache Kafka (as a messaging system) and in that process came to know of term StateStore , link here
I am also aware of Apache kafka streams, the client API.
Is StateStore applicable for Apache kafka in the context of messaging systems or it is applicable to Apache Kafka Streams.
Does Apache have their "own" implementation of StateStore or use third party implementation (for example, rockdsb.
Can anyone help me understand this.
Adding an overview to the good concise explanation about StateStore in the context of Kafka Streams and your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell #2 plus fault tolerance and keeping track of the position of your consumer groups' reads (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend Kafka docs as they are built very good and for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see the StateStore is used as a helper for extending the built-in abilities from a single message processing context to multi-message processing, thus enabling more complex functions over a bunch of messages (all the messages passed in a time window, aggregation functions over several messages, etc.)
I'll add to that that RocksDB is the default implementation used by Kafka and can be changed as was mentioned in previous answer.
Also if you want to explore more here is a link to the great intro videos form Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to kafka streams context.
Some processors like reduce or aggregate are stateful operations.
Kafka streams use state stores to manage this. By default, it uses rocksDB, but it is customizable.

Kafka Stream vs Apache Storm?

I know how Kafka can be used as a message queue and can also be connected to Storm for real time data processing, I want to know Is there any advantage of using Kafka stream api over storm ?

Consuming Kafka messages by two separate applications (storm and spark streaming)

We have a developed an ingestion application using Storm which consume Kafka messages (some times series sensor data) and save those messages into Cassandra. We use a Nifi workflow to do this.
I am now going to develop a separate Spark Streaming application which need to consume those Kafka messages as a source. I wonder why if there would be a problem when two application interacting with one Kafka Chanel? Should I duplicate Kafka messages in the Nifi to another Chanel so my Spark Streaming application use them, this is an overhead though.
From Kafka documentation:
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
Which in your case means that your second application just have to use another consumer group, so that these two applications will get same messages.

Kafka Stream API vs Consumer API

I need to read from a specific Kafka topic, do a VERY short processing on the message and pass it on to a different Kafka cluster.
Currently, I'm using a consumer that's also a producer on the other kafka server.
However, the streaming API supposedly offers a more light-weight high-throughput option.
So the questions are:
Assuming my processing code doesn't require much horse power, is the streaming API better?
Does the streaming APi support writing to a different Kafka cluster?
What are the Streaming API cons comparing to the Consumer API?
Unfortunately KafkaStreams doesn't currently support writing to a different Kafka cluster.