I'm consuming messages with Spark Streaming from Kafka topics and it works just fine.
However, I would like to consume from a specific topic with a given delay (as if the records were arriving one by one with a delay).
For example: consume records from the Test_1 topic every 2 minutes and from the Test_2 topic every 5 seconds.
Is it possible to achieve that?
Thank you for your time.
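One possible approach (a minimal sketch, not a definitive answer) is to run two separate Spark Streaming applications, each with its own batch interval, since a single StreamingContext only has one batch duration. The snippet below shows the Test_2 side with a 5-second batch, assuming the spark-streaming-kafka-0-10 integration and a broker at localhost:9092; the Test_1 application would look the same with Durations.minutes(2).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class Test2Consumer {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("test2-every-5s").setMaster("local[*]");
    // Each application gets its own batch interval: 5 seconds here,
    // Durations.minutes(2) in the separate Test_1 application.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "test2-group");
    kafkaParams.put("auto.offset.reset", "latest");

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("Test_2"), kafkaParams));

    // Records accumulated during each 5-second batch are processed here.
    stream.foreachRDD(rdd -> rdd.foreach(record -> System.out.println(record.value())));

    ssc.start();
    ssc.awaitTermination();
  }
}
```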
Can someone explain how polling records from a Kafka topic works in the Kafka Streams API (KStream)?
My challenge is that I'm processing records in batches of 100 and I have set the commit interval to 2 hours. There are fewer records than that in Kafka: my code processed 50 records and keeps trying to fetch more from the topic, but there are none. Now I'm not able to insert those 50 records because I only write in batches of 100.
I need to detect that no more records are coming from Kafka and insert the 50 records anyway, so that we can process the data in near real time.
Can someone please give a suggestion on this?
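One commonly used pattern for this kind of problem is a wall-clock punctuator in the Kafka Streams Processor API: buffer records until you either hit the batch size or a timer fires, then flush whatever you have. A minimal sketch (not the exact code from this setup), assuming string records, a hypothetical insertBatch() sink call, and a 30-second flush interval:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class BatchingProcessor implements Processor<String, String, Void, Void> {
  private static final int BATCH_SIZE = 100;
  private final List<String> buffer = new ArrayList<>();

  @Override
  public void init(ProcessorContext<Void, Void> context) {
    // Fire on wall-clock time so a partially filled batch is still flushed
    // even when no new records arrive from the topic.
    context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
        timestamp -> flush());
  }

  @Override
  public void process(Record<String, String> record) {
    buffer.add(record.value());
    if (buffer.size() >= BATCH_SIZE) {
      flush();
    }
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      insertBatch(buffer);   // hypothetical call to whatever stores the batch
      buffer.clear();
    }
  }

  private void insertBatch(List<String> batch) {
    // placeholder: write the batch to the downstream system
  }
}
```

The processor would be wired in with KStream#process or Topology#addProcessor; the 2-hour commit interval then mainly governs offset commits rather than when your batch is written.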
I'm new to Kafka and I have a setup with a source Kafka topic whose messages use the default retention of 7 days. I have 3 brokers, with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic and produce them to my target Kafka topic, I receive them in the same order. Now my question is: if I try to reprocess all the messages from my source topic into my target topic, the target topic doesn't receive anything. I know duplication should normally be avoided, but say I have a scenario with 100 messages in my source topic and I expect 200 messages in my target topic after running the job twice. Instead I get 100 messages on the first run and nothing on the second run.
Can someone please explain why this is happening and what the mechanism behind it is?
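If the goal is to re-read all messages from the source topic on a second run, one common approach (a sketch under assumptions, not necessarily what this exact setup needs) is to consume with a fresh group.id, or reset the group's offsets, combined with auto.offset.reset=earliest. Topic name and broker address below are hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplaySourceTopic {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
    // A group id with no committed offsets yet, so the consumer starts
    // wherever auto.offset.reset points it -- here, the beginning of the topic.
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-" + UUID.randomUUID());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("source-topic")); // assumed topic name
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        // here each record would be forwarded to the target topic with a producer
        System.out.println(record.offset() + ": " + record.value());
      }
    }
  }
}
```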
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read and its offset committed, the same consumer group will not read it again by default. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages; Kafka then moves the current offset to 100.
The current offset is a pointer to the last record that Kafka has sent to the consumer in the most recent poll; the committed offset is the one that is saved so the group can resume from it. Because of these offsets, the consumer doesn't get the same record twice. Please go through the following URL for a diagram and a complete explanation.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
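To watch the two offsets described above move, you can poll, commit, and then compare position() (the current offset) with committed() (the committed offset). A small sketch, assuming a hypothetical topic demo-topic with a single partition 0 and a local broker:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetInspector {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    TopicPartition tp = new TopicPartition("demo-topic", 0); // assumed topic/partition
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.assign(Collections.singletonList(tp));

      int fetched = consumer.poll(Duration.ofSeconds(5)).count();
      // position() is the current offset: it has already advanced past the
      // records returned by poll(), even before anything is committed.
      System.out.println("fetched=" + fetched + ", current offset=" + consumer.position(tp));

      consumer.commitSync();
      OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
      // The committed offset is what the group resumes from after a restart.
      System.out.println("committed offset=" + (committed == null ? "none" : committed.offset()));
    }
  }
}
```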
I have a use case where 5k messages per second are sent to a Kafka topic. On the consumer side I need to aggregate all the messages for each hour and write files hourly. We are just getting started with Kafka Streams and are wondering whether this hourly aggregation use case is a good fit for Kafka Streams.
Thanks!
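Kafka Streams handles this kind of time-based aggregation with windowed operations. Below is a minimal sketch of an hourly tumbling-window count, assuming string events on a hypothetical "events" topic and default serdes; the per-hour file writing would typically be done by a downstream consumer or a connector rather than inside the topology.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class HourlyAggregation {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-aggregation"); // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> events = builder.stream("events"); // assumed input topic

    events
        .groupByKey()
        // one-hour tumbling windows based on event time
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
        .count()
        .toStream()
        // flatten the windowed key (key + window start in epoch millis)
        // so the result can be written with plain string serdes
        .map((windowedKey, count) -> KeyValue.pair(
            windowedKey.key() + "@" + windowedKey.window().start(), count.toString()))
        .to("hourly-counts"); // assumed output topic; a separate job writes the files

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```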
In my project we have a total of 11 WSMQ queues as sources for Flume agents, a Kafka topic as the channel (partitioned on message type), and HDFS as the sink.
Requirement: we want to read from multiple queues using a Flume agent and write to specific partitions of a Kafka topic, so that at a later point in time we can read data from those partitions.
Right now we have 11 Kafka topics, one per queue, where the agents write the messages. We want a single Kafka topic partitioned on message type, with incoming messages written to the corresponding partitions.
Can anybody suggest what the best approach for this use case would be?
Thanks!
We solved it with Spark Streaming.
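The answer above doesn't include details, but one way such a Spark Streaming job could look (a hypothetical sketch, not the actual implementation) is: subscribe to all 11 per-queue topics, extract the message type from each record, and republish to a single topic keyed by type, so the default partitioner keeps each type in one partition. Topic names, broker address, and the extractType() parser are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class MergeByMessageType {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setAppName("merge-by-type").setMaster("local[*]"), Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "merge-by-type");

    // the 11 per-queue topics (names are hypothetical)
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("queue1-topic", "queue2-topic" /* ... */), kafkaParams));

    stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
      Properties producerProps = new Properties();
      producerProps.put("bootstrap.servers", "localhost:9092");
      producerProps.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      producerProps.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
        while (records.hasNext()) {
          ConsumerRecord<String, String> r = records.next();
          String msgType = extractType(r.value()); // hypothetical: parse the type from the payload
          // Keying by message type sends every message of one type to the same
          // partition of the single merged topic.
          producer.send(new ProducerRecord<>("merged-topic", msgType, r.value()));
        }
      }
    }));

    ssc.start();
    ssc.awaitTermination();
  }

  private static String extractType(String payload) {
    return payload.split(",")[0]; // placeholder parsing logic
  }
}
```

If a fixed partition number per type is required rather than hash-by-key, the ProducerRecord constructor that takes an explicit partition, or a custom Partitioner, could be used instead.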
I am running a Spark Streaming job (meaning data keeps getting pushed to a Kafka topic and is read continuously by the Spark consumer). My Kafka input topic has retention.ms set to 60000 (1 minute). However, the input topic doesn't clear messages after 1 minute; it takes approximately 1 minute 26 seconds to clear when no new data is added to the topic.
If I add data continuously for two minutes, I would expect half of the old data to be cleared because retention.ms is set to 1 minute, but I just see all of the data doubled.
Has anyone seen a similar pattern? How can I resolve this? Would you need more details?
You need to set the broker property log.retention.check.interval.ms, which controls how frequently (in milliseconds) the log cleaner checks whether any log is eligible for deletion.
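log.retention.check.interval.ms is a broker-level setting (typically in server.properties), so it cannot be changed per topic; the topic-level retention.ms from the question can be inspected or adjusted with the AdminClient. A small sketch, assuming a hypothetical topic named input-topic and a local broker:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfig {
  public static void main(String[] args) throws ExecutionException, InterruptedException {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

    try (AdminClient admin = AdminClient.create(props)) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "input-topic");

      // Show the current topic-level retention.ms (60000 in the question).
      Config current = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
      System.out.println("retention.ms = " + current.get("retention.ms").value());

      // Adjust it if needed; the broker-level log.retention.check.interval.ms
      // (set in server.properties) still decides how often deletion is attempted,
      // which is why data can outlive retention.ms by a while.
      AlterConfigOp op = new AlterConfigOp(
          new ConfigEntry("retention.ms", "60000"), AlterConfigOp.OpType.SET);
      admin.incrementalAlterConfigs(
          Collections.singletonMap(topic, Collections.singletonList(op))).all().get();
    }
  }
}
```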