I'm consuming messages with Spark Streaming from Kafka topics and it works just fine.
However, I would like to consume from a specific topic with a given delay (as if the records were arriving one by one with a delay).
For example: consume records from the Test_1 topic every 2 minutes and from the Test_2 topic every 5 seconds.
Is it possible to achieve that?
Thank you for your time.
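One possible approach (a minimal sketch, not a definitive answer) is to run two separate Spark Streaming applications, each with its own batch interval, since a single StreamingContext only has one batch duration. The snippet below shows the Test_2 side with a 5-second batch, assuming the spark-streaming-kafka-0-10 integration and a broker at localhost:9092; the Test_1 application would look the same with Durations.minutes(2).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class Test2Consumer {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("test2-every-5s").setMaster("local[*]");
    // Each application gets its own batch interval: 5 seconds here,
    // Durations.minutes(2) in the separate Test_1 application.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "test2-group");
    kafkaParams.put("auto.offset.reset", "latest");

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("Test_2"), kafkaParams));

    // Records accumulated during each 5-second batch are processed here.
    stream.foreachRDD(rdd -> rdd.foreach(record -> System.out.println(record.value())));

    ssc.start();
    ssc.awaitTermination();
  }
}
```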
Can someone explain how polling records from a Kafka topic works in the Kafka Streams API (KStream)?
My challenge is that I'm processing records in batches of 100 and I have set the commit interval to 2 hours. There are fewer records than that in Kafka: my code processed 50 records and keeps trying to fetch more from the topic, but there are none. Now I'm not able to insert those 50 records because I only write in batches of 100.
I need to detect that no more records are coming from Kafka and insert the 50 records anyway, so that we can process the data in near real time.
Can someone please give a suggestion on this?
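One commonly used pattern for this kind of problem is a wall-clock punctuator in the Kafka Streams Processor API: buffer records until you either hit the batch size or a timer fires, then flush whatever you have. A minimal sketch (not the exact code from this setup), assuming string records, a hypothetical insertBatch() sink call, and a 30-second flush interval:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class BatchingProcessor implements Processor<String, String, Void, Void> {
  private static final int BATCH_SIZE = 100;
  private final List<String> buffer = new ArrayList<>();

  @Override
  public void init(ProcessorContext<Void, Void> context) {
    // Fire on wall-clock time so a partially filled batch is still flushed
    // even when no new records arrive from the topic.
    context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
        timestamp -> flush());
  }

  @Override
  public void process(Record<String, String> record) {
    buffer.add(record.value());
    if (buffer.size() >= BATCH_SIZE) {
      flush();
    }
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      insertBatch(buffer);   // hypothetical call to whatever stores the batch
      buffer.clear();
    }
  }

  private void insertBatch(List<String> batch) {
    // placeholder: write the batch to the downstream system
  }
}
```

The processor would be wired in with KStream#process or Topology#addProcessor; the 2-hour commit interval then mainly governs offset commits rather than when your batch is written.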
I'm new to Kafka and I have a setup with a source Kafka topic whose messages use the default retention of 7 days. I have 3 brokers, with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic and produce them to my target Kafka topic, I receive them in the same order. Now my question is: if I try to reprocess all the messages from my source topic into my target topic, the target topic doesn't receive anything. I know duplication should normally be avoided, but say I have a scenario with 100 messages in my source topic and I expect 200 messages in my target topic after running the job twice. Instead I get 100 messages on the first run and nothing on the second run.
Can someone please explain why this is happening and what the mechanism behind it is?
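If the goal is to re-read all messages from the source topic on a second run, one common approach (a sketch under assumptions, not necessarily what this exact setup needs) is to consume with a fresh group.id, or reset the group's offsets, combined with auto.offset.reset=earliest. Topic name and broker address below are hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplaySourceTopic {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
    // A group id with no committed offsets yet, so the consumer starts
    // wherever auto.offset.reset points it -- here, the beginning of the topic.
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-" + UUID.randomUUID());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("source-topic")); // assumed topic name
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        // here each record would be forwarded to the target topic with a producer
        System.out.println(record.offset() + ": " + record.value());
      }
    }
  }
}
```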
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read and its offset committed, the same consumer group will not read it again by default. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages; Kafka then moves the current offset to 100.
The current offset is a pointer to the last record that Kafka has sent to the consumer in the most recent poll; the committed offset is the one that is saved so the group can resume from it. Because of these offsets, the consumer doesn't get the same record twice. Please go through the following URL for a diagram and a complete explanation.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
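To watch the two offsets described above move, you can poll, commit, and then compare position() (the current offset) with committed() (the committed offset). A small sketch, assuming a hypothetical topic demo-topic with a single partition 0 and a local broker:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetInspector {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    TopicPartition tp = new TopicPartition("demo-topic", 0); // assumed topic/partition
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.assign(Collections.singletonList(tp));

      int fetched = consumer.poll(Duration.ofSeconds(5)).count();
      // position() is the current offset: it has already advanced past the
      // records returned by poll(), even before anything is committed.
      System.out.println("fetched=" + fetched + ", current offset=" + consumer.position(tp));

      consumer.commitSync();
      OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
      // The committed offset is what the group resumes from after a restart.
      System.out.println("committed offset=" + (committed == null ? "none" : committed.offset()));
    }
  }
}
```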
I have a use case where 5k messages per second are sent to a Kafka topic. On the consumer side I need to aggregate all the messages for each hour and write files hourly. We are just getting started with Kafka Streams and are wondering whether this hourly aggregation use case is a good fit for Kafka Streams.
Thanks!
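Kafka Streams handles this kind of time-based aggregation with windowed operations. Below is a minimal sketch of an hourly tumbling-window count, assuming string events on a hypothetical "events" topic and default serdes; the per-hour file writing would typically be done by a downstream consumer or a connector rather than inside the topology.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class HourlyAggregation {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-aggregation"); // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> events = builder.stream("events"); // assumed input topic

    events
        .groupByKey()
        // one-hour tumbling windows based on event time
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
        .count()
        .toStream()
        // flatten the windowed key (key + window start in epoch millis)
        // so the result can be written with plain string serdes
        .map((windowedKey, count) -> KeyValue.pair(
            windowedKey.key() + "@" + windowedKey.window().start(), count.toString()))
        .to("hourly-counts"); // assumed output topic; a separate job writes the files

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```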
In my project we have a total of 11 WSMQ queues as sources for Flume agents, a Kafka topic as the channel (partitioned on message type), and HDFS as the sink.
Requirement: we want to read from multiple queues using a Flume agent and write to specific partitions of a Kafka topic, so that at a later point in time we can read data from those partitions.
Right now we have 11 Kafka topics, one per queue, where the agents write the messages. We want a single Kafka topic partitioned on message type, with incoming messages written to the corresponding partitions.
Can anybody suggest what the best approach for this use case would be?
Thanks!
We solved it with Spark Streaming.
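The answer above doesn't include details, but one way such a Spark Streaming job could look (a hypothetical sketch, not the actual implementation) is: subscribe to all 11 per-queue topics, extract the message type from each record, and republish to a single topic keyed by type, so the default partitioner keeps each type in one partition. Topic names, broker address, and the extractType() parser are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class MergeByMessageType {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setAppName("merge-by-type").setMaster("local[*]"), Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "merge-by-type");

    // the 11 per-queue topics (names are hypothetical)
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("queue1-topic", "queue2-topic" /* ... */), kafkaParams));

    stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
      Properties producerProps = new Properties();
      producerProps.put("bootstrap.servers", "localhost:9092");
      producerProps.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      producerProps.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
        while (records.hasNext()) {
          ConsumerRecord<String, String> r = records.next();
          String msgType = extractType(r.value()); // hypothetical: parse the type from the payload
          // Keying by message type sends every message of one type to the same
          // partition of the single merged topic.
          producer.send(new ProducerRecord<>("merged-topic", msgType, r.value()));
        }
      }
    }));

    ssc.start();
    ssc.awaitTermination();
  }

  private static String extractType(String payload) {
    return payload.split(",")[0]; // placeholder parsing logic
  }
}
```

If a fixed partition number per type is required rather than hash-by-key, the ProducerRecord constructor that takes an explicit partition, or a custom Partitioner, could be used instead.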
I am running a Spark Streaming job (meaning data keeps getting pushed to a Kafka topic and is read continuously by the Spark consumer). My Kafka input topic has retention.ms set to 60000 (1 minute). However, the input topic doesn't clear messages after 1 minute; it takes approximately 1 minute 26 seconds to clear when no new data is added to the topic.
If I add data continuously for two minutes, I would expect half of the old data to be cleared because retention.ms is set to 1 minute, but I just see all of the data doubled.
Has anyone seen a similar pattern? How can I resolve this? Would you need more details?
You need to set the broker property log.retention.check.interval.ms, which controls how frequently (in milliseconds) the log cleaner checks whether any log is eligible for deletion.
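log.retention.check.interval.ms is a broker-level setting (typically in server.properties), so it cannot be changed per topic; the topic-level retention.ms from the question can be inspected or adjusted with the AdminClient. A small sketch, assuming a hypothetical topic named input-topic and a local broker:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfig {
  public static void main(String[] args) throws ExecutionException, InterruptedException {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

    try (AdminClient admin = AdminClient.create(props)) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "input-topic");

      // Show the current topic-level retention.ms (60000 in the question).
      Config current = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
      System.out.println("retention.ms = " + current.get("retention.ms").value());

      // Adjust it if needed; the broker-level log.retention.check.interval.ms
      // (set in server.properties) still decides how often deletion is attempted,
      // which is why data can outlive retention.ms by a while.
      AlterConfigOp op = new AlterConfigOp(
          new ConfigEntry("retention.ms", "60000"), AlterConfigOp.OpType.SET);
      admin.incrementalAlterConfigs(
          Collections.singletonMap(topic, Collections.singletonList(op))).all().get();
    }
  }
}
```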