Is it possible to start a Kafka Streams application before a producer writes data to a topic? In other words, can a stream start, wait for the producer to begin writing to a topic, and then automatically start consuming once it does? Is that feasible?
Yes, of course you can start a Kafka Streams application before producing.
As soon as you start producing to the topics that Streams uses, it will start processing the records.
Related
When I use the Kafka console producer to produce some messages and then start a consumer, I do not get those messages.
However, I do receive messages that are produced after the consumer has been started. Should Kafka consumers be started before producers?
--from-beginning seems to give all messages, including ones that have already been consumed.
Please help me with this, both at the console level and with a Java client example, for starting the producer first and then consuming by starting a consumer.
Kafka stores messages for a configurable amount of time. The default is a week. Consumers do not need to be "available" to receive messages, but they do need to know where they should start reading from.
By default, the console consumer starts at the latest offset of every partition, so if you're not actively producing data you see nothing as a consumer. You can specify a group flag for the console consumer, or a group for a Java client. The group is what tracks, within the Kafka protocol, which offsets have been read and where reading will resume if you stop a consumer in that group.
Otherwise, I think you can only give an offset along with a single partition to consume from.
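To make the rule above concrete, here is a minimal sketch in plain Python. It is a toy model, not Kafka client code: `resolve_start_offset` and its parameters are hypothetical names made up for illustration. It only mirrors the decision described above: resume from the group's committed offset if there is one, otherwise fall back to a reset policy (with `latest` as the default).

```python
from typing import Optional


def resolve_start_offset(committed: Optional[int],
                         log_start: int,
                         log_end: int,
                         reset_policy: str = "latest") -> int:
    """Return the offset a consumer would begin reading from (toy model).

    committed    -- offset stored for the consumer group, or None if the
                    group has never committed for this partition
    log_start    -- oldest offset still retained in the partition
    log_end      -- offset that the next produced message will receive
    reset_policy -- what to do when there is no committed offset

    Simplification: a committed offset is always trusted, even if it
    would fall outside [log_start, log_end].
    """
    if committed is not None:
        return committed
    if reset_policy == "earliest":
        return log_start
    if reset_policy == "latest":
        return log_end
    raise ValueError(f"unknown reset policy: {reset_policy!r}")
```

For a partition holding offsets 0 to 9 (so `log_end` is 10), `resolve_start_offset(None, 0, 10)` gives 10, which is why a fresh consumer sees nothing until new messages arrive. With `"earliest"` it gives 0, and a group that committed offset 4 resumes at 4.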
I forwarded a few events to Kafka and started my Kafka Streams program. It processed the events and finished. After some time I stopped my Kafka Streams application and started it again, and observed that it was processing the previously processed events again.
As per my understanding, Kafka Streams internally maintains the offsets for its input topics per application ID. But here it is reprocessing already processed events.
How can I verify up to which offset Kafka Streams has processed? How does Kafka Streams persist these bookmarks? On what basis, and from which offset, will Kafka Streams start reading events from Kafka?
If Kafka Streams throws an exception, does it reprocess already processed events?
Please clarify my doubts.
Please help me understand more.
Kafka Streams internally uses a KafkaConsumer, and all running instances form a consumer group using application.id as group.id. Offsets are committed to the Kafka cluster at regular intervals (configurable via commit.interval.ms). Thus, on restart with the same application.id, Kafka Streams should pick up the latest committed offset and continue processing from there.
You can check the committed offsets, as for any other consumer group, using the bin/kafka-consumer-groups.sh tool.
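One way to see why periodic commits can lead to some reprocessing is a small simulation in plain Python. This is only a sketch under stated assumptions: `run_until_stop` is a hypothetical helper, it commits every few records rather than on a timer as Kafka Streams does, and it models an abrupt stop (for example a crash) that happens between two commits.

```python
def run_until_stop(records, start_offset, commit_every, stop_after):
    """Process records from start_offset, committing every `commit_every` records.

    The run ends abruptly once `stop_after` records have been processed,
    without a final commit. Returns (processed_offsets, last_committed).
    """
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(records)):
        if len(processed) == stop_after:
            break  # abrupt stop: work since the last commit is not recorded
        processed.append(offset)
        if len(processed) % commit_every == 0:
            # As in Kafka, the committed value is the next offset to read.
            committed = offset + 1
    return processed, committed
```

With ten records, `run_until_stop(records, 0, 3, 5)` processes offsets 0 to 4 but only commits offset 3. A second run started from that committed offset begins at 3, so offsets 3 and 4 are processed twice. In this model, records handled after the last commit are the ones that get reprocessed on restart.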
I have one Kafka producer and one consumer. The producer publishes to one topic, and the data is picked up and some processing is done on it. The consumer reads from another topic that reports whether the processing of the data from topic 1 was successful, i.e. topic 2 holds success or failure messages. Right now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous: once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka, and Publish/Subscribe messaging in general, seek to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC), where the Producer and Consumer are explicitly coupled together. The standard Apache Kafka Producer/Consumer APIs do not support this Message Exchange Pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses Correlation IDs, Consumption ACKs, and Request/Response messages to make your own interface that behaves as you wish.
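As a rough illustration of the correlation-ID idea, here is a minimal sketch in plain Python. It does not use Kafka at all: two in-memory `queue.Queue` objects stand in for topic 1 and topic 2, and `processor` and `send_and_wait` are hypothetical names invented for this example. The "success if the payload is non-negative" rule is also just a placeholder for real processing.

```python
import queue
import threading
import uuid


def processor(requests: queue.Queue, responses: queue.Queue) -> None:
    """Stand-in for the service that reads topic 1 and reports to topic 2."""
    while True:
        msg = requests.get()
        if msg is None:  # sentinel value: shut the worker down
            break
        # Placeholder processing rule, purely for illustration.
        status = "success" if msg["payload"] >= 0 else "failure"
        responses.put({"correlation_id": msg["correlation_id"],
                       "status": status})


def send_and_wait(requests: queue.Queue, responses: queue.Queue,
                  payload: int, timeout: float = 5.0) -> str:
    """Publish one request, then block until its matching reply arrives."""
    correlation_id = str(uuid.uuid4())
    requests.put({"correlation_id": correlation_id, "payload": payload})
    while True:
        # Raises queue.Empty if no reply shows up within `timeout` seconds.
        reply = responses.get(timeout=timeout)
        if reply["correlation_id"] == correlation_id:
            return reply["status"]
        # Replies belonging to other requests are skipped in this sketch.
```

The caller starts `processor` in a background thread and then calls `send_and_wait` once per item, so the next item is only published after the status for the previous one has been read. A real wrapper would do the same matching on a field or header carried in the Kafka messages.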
Short answer: You can't do that; Kafka doesn't provide that support.
Long answer: As Hans explained, the Publish/Subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means waiting until your message has been received and acknowledged by the broker.
If you want to do that, flush after every send.
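The idea of a synchronous send can be sketched with a toy producer in plain Python. This is not the Kafka client: `ToyProducer` and `send_sync` are hypothetical, and a single background thread stands in for the broker acknowledging a write. It mirrors the shape of the Java client, where `send()` returns a `Future` and blocking on that future (or calling `flush()`) waits for the acknowledgement.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ToyProducer:
    """In-memory stand-in for a producer whose send() is asynchronous."""

    def __init__(self) -> None:
        # One worker thread plays the role of the broker-side I/O.
        self._io = ThreadPoolExecutor(max_workers=1)
        self.log = []  # the "partition": appended values in order

    def send(self, value) -> Future:
        """Queue the value and return immediately with a Future."""
        return self._io.submit(self._append, value)

    def _append(self, value) -> int:
        self.log.append(value)
        return len(self.log) - 1  # offset assigned to this value

    def close(self) -> None:
        self._io.shutdown(wait=True)


def send_sync(producer: ToyProducer, value, timeout: float = 5.0) -> int:
    """Send one value and block until it has been acknowledged."""
    return producer.send(value).result(timeout=timeout)
```

Calling `send_sync` for each message means message N+1 is not handed over until message N has been acknowledged. That gives ordering and delivery confirmation from the broker, but it still says nothing about whether any consumer has read the message.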
I am new to Kafka and I have a Kafka leader version 0.10.0 and a ZooKeeper version 3.4.6. There are two types of Kafka consumer API that I came across:
1. Kafka Polling
2. Kafka Streams
I am not able to find the significant difference between these two. What is the difference between Kafka polling and Kafka Streams consumers? What use cases are suitable for each?
Any help is appreciated.
Kafka Streams:
Kafka Streams is used to do computation on data from one topic and send the computed data to another topic.
Internally, Kafka Streams uses both a Producer and a Consumer.
Kafka polling:
Polling is how a Kafka consumer fetches data from a topic, and it is part of the consumer process.
From my point of view, if you just want to consume data from a topic, go for the Kafka consumer; if you want to do some computation and save the result for further use, use Kafka Streams.
I produce some messages first, and these messages are persisted on disk by Kafka's brokers. Then I start the Spark Streaming program to process this data, but I can't receive anything in Spark Streaming, and there is no error in the log.
However, if I produce messages while the Spark Streaming program is running, it can receive the data.
Can Spark Streaming only receive real-time data from Kafka?
To control what data is consumed at the start of a new consumer stream, you should provide auto.offset.reset as part of the properties used to create the Kafka stream.
auto.offset.reset can take the following values:
earliest => the kafka topic will be consumed from the earliest offset available
latest => the kafka topic will be consumed, starting at the current latest offset
Also note that, depending on the Kafka consumer model you are using (receiver-based or direct), the behavior of a restarted Spark Streaming job will be different.
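As a small sketch of the auto.offset.reset property described above, the helper below builds the consumer properties as a plain Python dict. `build_kafka_params` is a hypothetical function, and how such a dict is handed to Spark depends on the Kafka integration and version you use, so treat this as an illustration of the settings rather than a ready-made Spark call. The property keys themselves (`bootstrap.servers`, `group.id`, `auto.offset.reset`) are standard Kafka consumer settings.

```python
VALID_RESET_POLICIES = ("earliest", "latest")


def build_kafka_params(bootstrap_servers: str,
                       group_id: str,
                       auto_offset_reset: str = "latest") -> dict:
    """Return consumer properties, rejecting unknown reset policies early."""
    if auto_offset_reset not in VALID_RESET_POLICIES:
        raise ValueError(
            f"auto.offset.reset must be one of {VALID_RESET_POLICIES}, "
            f"got {auto_offset_reset!r}"
        )
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # "earliest": a new group also reads messages produced before it started.
        # "latest":   a new group only reads messages produced after it started.
        "auto.offset.reset": auto_offset_reset,
    }
```

For the situation in the question, something like `build_kafka_params("localhost:9092", "spark-demo", "earliest")` would be the variant to try, where the broker address and group name are placeholders. With `earliest`, messages produced before the job started are still picked up by a group that has no committed offsets yet.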