Kafka Stream API vs Consumer API - apache-kafka

I need to read from a specific Kafka topic, do a VERY short processing on the message and pass it on to a different Kafka cluster.
Currently, I'm using a consumer that's also a producer on the other kafka server.
However, the streaming API supposedly offers a more light-weight high-throughput option.
So the questions are:
Assuming my processing code doesn't require much horse power, is the streaming API better?
Does the streaming APi support writing to a different Kafka cluster?
What are the Streaming API cons comparing to the Consumer API?

Unfortunately KafkaStreams doesn't currently support writing to a different Kafka cluster.

Related

Process messages pushed through Kafka

I haven't used Kafka before and wanted to know if messages are published through Kafka what are the possible ways to capture that info?
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Haven't used Kafka before and while reading up I did find that Kafka needs ZooKeeper running too.
I don't need to publish info just process data received from Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Kafka has its own TCP based protocol, not a native HTTP client (assuming that's what you actually mean by REST)
Consumers are the only way to get and subsequently process data, however plenty of external tooling exists to make it so you don't have to write really any code if you don't want to in order to work on that data

Apache Kafka StateStore

I am learning Apache Kafka (as a messaging system) and in that process came to know of term StateStore , link here
I am also aware of Apache kafka streams, the client API.
Is StateStore applicable for Apache kafka in the context of messaging systems or it is applicable to Apache Kafka Streams.
Does Apache have their "own" implementation of StateStore or use third party implementation (for example, rockdsb.
Can anyone help me understand this.
Adding an overview to the good concise explanation about StateStore in the context of Kafka Streams and your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell #2 plus fault tolerance and keeping track of the position of your consumer groups' reads (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend Kafka docs as they are built very good and for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see the StateStore is used as a helper for extending the built-in abilities from a single message processing context to multi-message processing, thus enabling more complex functions over a bunch of messages (all the messages passed in a time window, aggregation functions over several messages, etc.)
I'll add to that that RocksDB is the default implementation used by Kafka and can be changed as was mentioned in previous answer.
Also if you want to explore more here is a link to the great intro videos form Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to kafka streams context.
Some processors like reduce or aggregate are stateful operations.
Kafka streams use state stores to manage this. By default, it uses rocksDB, but it is customizable.

How many streams are supported by kafka cluster

How can we check like how many streams are supported by a Kafka cluster with 3 nodes
My project is related to videos, I am transferring video from source to destination in the form of meta-data using Kafka. i.e I am doing some process on the videos then forming metadata and then sending this data to Kafka using Kafka-Producer API through topic 'test-topic'. I am a consumer class which takes that data(meta-data) and do some process. Like this I have implemented 3Kafka processors. I am running each process 6 times for different kind of inputs, so totally I have 18 processors. But here my doubt is not related to Kafka processors. My doubt is related to Kafka-Cluster. How streams are related to kafka-cluster and is there any way to know the number of streams supported by a kafka cluster.
It will hugely depend on what your application is doing, the throughput, and so on. Some general resources to help you:
Elastic Scaling in the Streams API in Kafka
Kafka Streams Capacity planning and sizing
For general deployment sizing of your Kafka clusters, see the Enterprise Reference Architecture

Consuming Kafka messages by two separate applications (storm and spark streaming)

We have a developed an ingestion application using Storm which consume Kafka messages (some times series sensor data) and save those messages into Cassandra. We use a Nifi workflow to do this.
I am now going to develop a separate Spark Streaming application which need to consume those Kafka messages as a source. I wonder why if there would be a problem when two application interacting with one Kafka Chanel? Should I duplicate Kafka messages in the Nifi to another Chanel so my Spark Streaming application use them, this is an overhead though.
From Kafka documentation:
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
Which in your case means that your second application just have to use another consumer group, so that these two applications will get same messages.

Flafka (Http -> Flume->Kafka ->Spark Streaming)

I have one use case for real time streaming, we will be using Kafka(0.9) for message buffer and spark streaming(1.6) for stream processing (HDP 2.4). We will receive ~80-90K/Sec event on Http. Can you please suggest a recommended architecture for data ingestion into Kafka topics which will be consumed by spark streaming.
We are considering flafka architecture.
Is Flume listening to Http and sending to Kafka (Flafka )for real time streaming a good option?
Please share other possible approaches if any.
One approach could be Kafka Connect. Look for a source that fit in your needs or develop a custom new one.