Kafka Stream program is reprocessing the already processed events - apache-kafka

I forwarded a few events to Kafka and started my Kafka Streams program. The program processed the events and completed. After some time I stopped my Kafka Streams application and started it again, and I observed that it is reprocessing the events that were already processed.
As per my understanding, Kafka Streams internally maintains the offsets for the input topics itself, per application id. But here it is reprocessing the already processed events.
How can I verify up to which offset Kafka Streams has processed? How does Kafka Streams persist these bookmarks? On what basis, and from which Kafka offset, will Kafka Streams start reading events?
If Kafka Streams throws an exception, does it reprocess the already processed events?
Please clarify my doubts and help me understand this better.

Kafka Streams internally uses a KafkaConsumer, and all running instances form a consumer group using application.id as group.id. Offsets are committed to the Kafka cluster at regular intervals (configurable). Thus, on restart with the same application.id, Kafka Streams should pick up the latest committed offsets and continue processing from there.
You can check the committed offsets, as for any other consumer group, using the bin/kafka-consumer-groups.sh tool.
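For example, assuming a broker at localhost:9092 and my-streams-app as the application.id (both placeholders; older tool versions may need --zookeeper instead of --bootstrap-server):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-streams-app

The CURRENT-OFFSET column is the last committed offset per partition; LOG-END-OFFSET and LAG show how far behind the group currently is.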

Related

Should Kafka consumers be started before producers?

When I have a Kafka console producer produce some messages and then start a consumer, I am not getting the messages.
However, I am receiving messages produced by the producer after a consumer has been started. Should Kafka consumers be started before producers?
--from-beginning seems to give all messages, including ones that have already been consumed.
Please help me with this at both the console level and with a Java client example, for starting the producer first and then consuming by starting a consumer.
Kafka stores messages for a configurable amount of time; the default is a week. Consumers do not need to be "available" to receive messages, but they do need to know where they should start reading from.
The console consumer defaults to the latest offset for all partitions, so if you're not actively producing data you see nothing as a consumer. You can specify a group flag for the console consumer or a Java client; the group is what tracks which offsets have been read within the Kafka protocol and where a read will resume from if you stop a consumer in that group.
Otherwise, I think you can only give an offset along with a single partition to consume from.
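At the console level, a rough sketch (test-topic, my-console-group, and localhost:9092 are placeholders; older distributions take --broker-list for the producer instead of --bootstrap-server):

# produce a few messages first
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic

# replay everything already stored in the topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning

# or join a consumer group so the read position survives restarts
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --group my-console-group

With the --group variant, the first run still starts at the latest offset unless --from-beginning is also given, but after that a restarted consumer resumes from its last committed offset instead of replaying everything.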

starting stream before producer in kafka

Is it possible to start a Kafka stream before a producer writes data to a topic? In other words, can a stream be started and wait for a producer to start and write to a topic, and when it (the producer) does, the stream automatically starts consuming it - is that feasible?
Yes of course you can start a Kafka Streams application before producing.
As soon as you start producing to the topics Streams reads from, it will start processing the records.
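A minimal Kafka Streams sketch of this, assuming a recent Streams version; the application id, broker address, and topic names are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EarlyStartStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "early-start-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "input-topic" may already exist but be empty at this point;
        // records are processed as soon as a producer writes them.
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start(); // the application simply waits for data to arrive
    }
}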

Which consumer API to use for kafka 0.10.1?

I am new to Kafka, and I have a Kafka broker version 0.10.0 and a ZooKeeper version 3.4.6. There are two types of Kafka consumer APIs that I came across:
1. Kafka Polling
2. Kafka Streams
I am not able to find a significant difference between these two. What is the difference between Kafka polling and Kafka Streams consumers? What are the use cases suited to each?
Any help is appreciated.
KafkaStreams:
Kafka Streams is used to do computation on data from one topic and send the computed data to another topic.
Internally, Kafka Streams uses both a Producer and a Consumer.
KafkaPolling:
Polling in a Kafka consumer fetches data from a topic and is part of the consumer process.
From my point of view, if you just want to consume data from a topic, go for the Kafka consumer; if you want to do some computation and save it for further use, use Kafka Streams.
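As a rough sketch of the plain-consumer ("polling") option, assuming a recent Java client; the topic, group id, and broker address are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            while (true) {
                // poll() fetches the next batch of records for the partitions assigned to this group member
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}

The Kafka Streams equivalent would instead declare a topology (as in the earlier sketch) and leave the polling, committing, and producing to the library.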

Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data?

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2ed, (2015). Chapter 3, paragraph Kafka Design fundamentals says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand it correctly that Kafka sends the offsets first, and then the consumer uses the list of offsets to send a pull request for the data it has not consumed yet?
Thanks in advance,
Nick
On startup, the KafkaConsumer issues an offset lookup request to the brokers for the specific consumer group that was configured on this consumer. If valid offsets are returned, those are used. Otherwise, the consumer uses an initial offset according to the auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in memory within the consumer. Each poll() sends the current offset to the broker, and on the broker's reply the consumer updates its in-memory offsets.
Additionally, the in-memory offsets are committed/acknowledged to the broker from time to time. This can happen automatically within poll() if auto commit is enabled, or commitSync()/commitAsync() must be called explicitly to send the offsets to the broker for reliable storage.
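A rough sketch of that behavior with a recent Java client (topic, group id, and broker address are placeholders): auto.offset.reset only applies when no committed offsets are found for the group, and with auto commit disabled the position is only stored on the broker when commitSync() is called.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo-group");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // used only if no committed offset exists
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // commit explicitly instead

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
                // Persist the in-memory position on the broker; on restart the group resumes from here.
                consumer.commitSync();
            }
        }
    }
}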

spark streaming cannot receive data from kafka if send some message to kafka beforehand

I produce some messages first, and these messages are persisted on disk by Kafka's brokers. Then I start the Spark Streaming program to process this data, but I can't receive anything in Spark Streaming, and there is no error in the logs.
However, if I produce messages while the Spark Streaming program is running, it can receive the data.
Can Spark Streaming only receive real-time data from Kafka?
To control the behavior of what data is consumed at the start of a new consumer stream, you should provide auto.offset.reset as part of the properties used to create the kafka stream.
auto.offset.reset can take the following values:
earliest => the kafka topic will be consumed from the earliest offset available
latest => the kafka topic will be consumed, starting at the current latest offset
Also note that depending on the Kafka consumer model you are using (receiver-based or direct), the behavior of a restarted Spark Streaming job will be different.
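For the direct approach with the spark-streaming-kafka-0-10 integration, a Java sketch might look like the following (the topic, group id, and broker address are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaSparkStreaming {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-offset-demo").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-demo-group");          // placeholder
        kafkaParams.put("auto.offset.reset", "earliest");         // read messages produced before the job started

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreach(r -> System.out.println(r.value())));

        jssc.start();
        jssc.awaitTermination();
    }
}

With auto.offset.reset set to earliest and a group id that has no committed offsets yet, the first batch will include the messages that were produced before the job started.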