Which consumer API to use for kafka 0.10.1? - apache-kafka

I am new to Kafka. I have a Kafka leader at version 0.10.0 and a ZooKeeper at version 3.4.6. There are two types of Kafka consumer APIs that I came across:
1. Kafka Polling
2. Kafka Streams
I am not able to find a significant difference between these two. What is the difference between Kafka polling and Kafka Streams consumers? What are the use cases suited to each?
Any help is appreciated.

Kafka Streams:
Kafka Streams is used to perform computations on data from one topic and send the computed data to another topic.
Internally, Kafka Streams uses both a Producer and a Consumer.
Kafka polling:
Polling in a Kafka consumer fetches data from a topic; it is part of the consumer process.
From my point of view, if you just want to consume data from a topic, go for the Kafka consumer; if you want to do some computation and save the result for further use, go for Kafka Streams.
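As a rough illustration of the plain consumer side, here is a minimal poll-loop sketch (the broker address, group id, and topic name "events" are assumptions, not from the question):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimplePollingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "my-consumer-group");       // assumed group id
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // assumed topic name
                while (true) {
                    // poll() fetches the next batch of records from the subscribed topic
                    ConsumerRecords<String, String> records = consumer.poll(500);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }

A Kafka Streams application, by contrast, is defined as a topology of transformations between topics, and the library manages the consuming, processing, and producing for you.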

Related

Kafka throttle producer based on consumer lag

Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Would the producer need to determine itself whether there is consumer lag and then perform the throttling itself?
Kafka is built on a pub/sub design. Producers publish messages to a centralized topic, and multiple consumers can subscribe to that topic. Since multiple consumers are involved, you cannot base the producer's speed on any one of them: one consumer can be slow while another is fast. It is also against the design principle; otherwise both systems would become tightly coupled. If you have a throttling use case, maybe you should evaluate another approach, such as direct REST calls.
Producer and Consumer are decoupled.
Producers push data to Kafka topics (partitioned topics), which are stored on Kafka brokers. A producer doesn't know who consumes the messages or how often.
Consumers consume data from the brokers. A consumer doesn't know how many producers produce the messages. The same messages can even be consumed by several consumers that are in different groups; for example, one consumer can consume faster than another.
You can read more about producers and consumers on the Apache Kafka webpage.
It is not possible to throttle the producer(s) based on the performance of the consumers.
In my scenario I don't want to lose events if the disk size is exceeded before a message is consumed
To tackle your issue, you have to rely on the parallelism offered by Kafka. Your Kafka topic should have multiple partitions, and producers have to use different keys to populate the topic, so the data will be distributed across multiple partitions; by introducing a consumer group you can then manage the load within that group of consumers. All data within a partition is processed in order, which may be relevant since you are dealing with event processing.
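For illustration, a producer sketch along those lines (the topic name, key scheme, and broker address are assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    // Different keys hash to different partitions, spreading the load,
                    // while records with the same key stay in order within one partition.
                    String key = "source-" + (i % 10); // hypothetical key scheme
                    producer.send(new ProducerRecord<>("events", key, "event-" + i)); // assumed topic
                }
            }
        }
    }

A consumer group with up to as many members as the topic has partitions can then split that load among its consumers.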

Kafka Streams: Internal topic partitions

Kafka version: 1.0.0
Let's say the streams application uses the low-level Processor API, which maintains state, and reads from a topic with 10 partitions. Please clarify whether the internal topic is expected to be created with the same number of partitions, or whether it uses the broker default. If it's the latter and we need to increase the partitions of the internal topic, is there any option?
Kafka Streams will create the topic for you. And yes, it will create it with the same number of partitions as your input topic. During startup, Kafka Streams also checks if the topic has the expected number of partitions and fails if not.
The internal topic is basically a regular topic like any other, and you can change its number of partitions via the command-line tools just as for any other topic. However, this should never be required. Also note that dropping/adding partitions will mess up your state.
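If you only want to check the partition count of such an internal topic programmatically rather than via the CLI, a sketch using the AdminClient could look like this (the broker address and topic name are hypothetical; changelog topics follow the <application.id>-<storeName>-changelog naming convention):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class DescribeInternalTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical internal topic name following the
                // <application.id>-<storeName>-changelog convention.
                String internalTopic = "my-streams-app-my-store-changelog";
                TopicDescription description = admin
                        .describeTopics(Collections.singletonList(internalTopic))
                        .all().get().get(internalTopic);
                System.out.println(internalTopic + " has "
                        + description.partitions().size() + " partitions");
            }
        }
    }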

Kafka Stream program is reprocessing the already processed events

I forwarded a few events to Kafka and started my Kafka Streams program. My program started processing the events and completed. After some time I stopped my Kafka Streams application and started it again. I observed that my Kafka Streams program was reprocessing the already processed events.
As per my understanding, Kafka Streams internally maintains the offsets for its input topics per application id, but here it is reprocessing the already processed events.
How can I verify up to which offset Kafka Streams processing was done? How does Kafka Streams persist these bookmarks? On what basis, and from which Kafka offset, will Kafka Streams start reading events from Kafka?
If Kafka Streams throws exceptions, does it then reprocess already processed events?
Please clarify my doubts and help me understand this better.
Kafka Streams internally uses a KafkaConsumer, and all running instances form a consumer group using application.id as group.id. Offsets are committed to the Kafka cluster at regular intervals (configurable). Thus, on restart with the same application.id, Kafka Streams should pick up the latest committed offsets and continue processing from there.
You can check the committed offsets as for any other consumer group using the bin/kafka-consumer-groups.sh tool.
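A minimal configuration sketch showing the pieces involved (the application id, commit interval, and topic names are assumptions): application.id doubles as the consumer group.id, and commit.interval.ms controls how often offsets are committed.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsOffsetConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // The application.id is also used as the group.id, so committed
            // offsets are stored under this name in the Kafka cluster.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // assumed app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // How often offsets are committed; the default is 30000 ms.
            props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic"); // assumed topic names

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

With a setup like this, something along the lines of bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-streams-app should show the committed offsets per partition.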

starting stream before producer in kafka

Is it possible to start a Kafka Streams application before a producer writes data to a topic? In other words, can a stream be started and wait for a producer to start writing to a topic, and when it (the producer) does, the stream automatically starts consuming it? Is that feasible?
Yes, of course you can start a Kafka Streams application before producing.
As soon as you start producing to the topics the Streams application reads from, it will start processing the records.
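A minimal sketch of that scenario (the application id and topic name are assumptions): the application can be started while the topic is still empty and simply sits idle until a producer writes records.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class StartBeforeProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wait-for-producer-app"); // assumed app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // The input topic may be empty (not yet written to) at startup;
            // records are processed as soon as a producer writes them.
            builder.<String, String>stream("input-topic") // assumed topic name
                   .foreach((key, value) -> System.out.println(key + " -> " + value));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }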

Kafka consumer group.id multiple messages

I have developed a Kafka consumer, and there will be multiple instances of this consumer running in production. I know how we can use group.id so as not to duplicate the processing of data. Is there a way to have all the consumers receive the message but send one consumer a leader bit?
Is there a way to have a group.id per topic or even per key in a topic?
Looks like this has nothing to do with Kafka. You already know that by providing a unique group.id for each consumer, all consumer instances will get all messages from the topic. Now, as far as the push to the DB is concerned, you can factor out that logic and try using a distributed lock so that the push-to-DB part of your application can only be executed by one of the consumers. Is this a Java-based setup?
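For the "every instance sees every message" part, a minimal sketch (broker address, topic name, and the UUID-based group naming are assumptions): giving each instance its own group.id puts it in a single-member consumer group, so every instance receives the full stream.

    import java.util.Collections;
    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class BroadcastConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            // A unique group.id per instance means each instance forms its own
            // consumer group and therefore receives every message.
            props.put("group.id", "my-app-" + UUID.randomUUID());
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // assumed topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(500);
                    for (ConsumerRecord<String, String> record : records) {
                        // Only the instance currently holding the distributed lock
                        // (not shown here) would actually write to the DB.
                        System.out.println("received: " + record.value());
                    }
                }
            }
        }
    }

Note that a brand-new group id starts from whatever auto.offset.reset specifies (latest by default).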