Stop Spark Streaming context (Kafka direct stream) - Scala

I want to end stream processing once all messages from the Kafka topic have been received and processed. The stop should not be time-based like awaitTerminationOrTimeout. Is there a way to stop the StreamingContext once the topic is exhausted? Is there a way to compare a DStream[T] against T values to dictate the control flow?

I'm about 80% certain that isEmpty should return true and headOption should be None on KafkaMessageStream if the stream is empty.

The best approach is, before you start reading the stream, to get the latest offsets for all of the partitions in the topic, and then check whether the received offsets have reached them. If you want to know how to get the offsets of a topic, see my previous answer on this.
The flow ends up being:
1. Get the partitions and brokers for the topic.
2. For each broker, create a SimpleConsumer.
3. For each partition, issue an OffsetRequest that returns the earliest and latest offsets (see the previous answer).
4. Then, as you read messages, check the offset of each received message against the known last offset for its partition.
5. Once the offsets received for every partition match the latest offsets returned by your OffsetRequest, you are done (see the sketch below).
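A minimal Scala sketch of that termination check against a spark-streaming-kafka-0-10 direct stream. It uses the newer KafkaConsumer#endOffsets call in place of SimpleConsumer/OffsetRequest, and endOffsetsFor / stopWhenExhausted are illustrative helper names, not part of any API:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Fetch the end ("latest") offsets for every partition of the topic up front.
def endOffsetsFor(topic: String, kafkaParams: Map[String, Object]): Map[TopicPartition, Long] = {
  val props = new Properties()
  kafkaParams.foreach { case (k, v) => props.put(k, v) }
  val consumer = new KafkaConsumer[String, String](props)
  try {
    val partitions = consumer.partitionsFor(topic).asScala
      .map(pi => new TopicPartition(pi.topic, pi.partition))
    consumer.endOffsets(partitions.asJava).asScala
      .map { case (tp, off) => tp -> off.longValue }.toMap
  } finally consumer.close()
}

// After each batch, compare the offsets read so far with the recorded end offsets
// and stop the StreamingContext once every partition has caught up.
def stopWhenExhausted(stream: DStream[ConsumerRecord[String, String]],
                      ssc: StreamingContext,
                      endOffsets: Map[TopicPartition, Long]): Unit = {
  stream.foreachRDD { rdd =>
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the rdd here ...
    val caughtUp = endOffsets.forall { case (tp, end) =>
      ranges.exists(r => r.topic == tp.topic && r.partition == tp.partition && r.untilOffset >= end)
    }
    // Note: in some Spark versions it is safer to set a flag here and call stop() from another thread.
    if (caughtUp) ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}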

Related

How to determine that a topic has been read completely by a Kafka Streams application, from the very first offset to the last offset, from a Java application

I need some help with Kafka Streams. I have started a Kafka Streams application, which is streaming one topic from the very first offset. The topic is very large, so I want to implement a mechanism in my application, using Kafka Streams, so that I can get notified when the topic has been read completely up to the very last offset.
I have read the Kafka Streams 2.8.0 API and found a method, allLocalStorePartitionLags, which returns a map of store names to another map of partitions containing the lag information for each partition. This method returns lag information for all store partitions (active or standby) local to this Streams instance. It is quite useful in the above case, when I have one node running that streams application.
But in my case the system is distributed: there are 3 application nodes and 10 topic partitions, which means each node has at least 3 partitions of the topic to read from.
I need help here. How can I implement this functionality so that I get notified when the topic has been read completely, from partition 0 to partition 9? Please note that I don't have the option to use a database here as of now.
Other approaches to achieve the goal are also welcome. Thank you.
I was able to obtain lag information from the AdminClient API. The code below retrieves the end offsets and current offsets for each partition of the topics read by the given streams application, i.e. its applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);
// Current (committed) offsets per partition for this application's consumer group.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
// All topic partitions the group is reading.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// End offsets for each partition.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
        .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = listOffsetsResult.all().get();
// The topic has been read completely when, for every partition, end offset minus committed offset is 0.

Is it possible in Kafka to read messages in reverse order?

Can a new consumer group be created with a consumer assigned to an existing topic, but with some preference to consume backwards, so that the offset moves from the latest message at that moment towards the earliest in every partition?
Kafka topics are meant to be consumed sequentially, in the order of appearance within the topic partitions.
However, I see two options to solve your issue:
You can steer which data the consumer polls from the topic partition: have your consumer seek to the latest offset, consume it, then seek to the latest offset minus one and read only that single record. Then seek to the previous offset, and so on. Although I have never seen it done, this should be possible with consumer.seek and the consumer configuration max.poll.records (see the sketch after these options).
You could use some kind of state store, ordered descending by offset for each partition, and then have another consumer read the state store in the desired order.
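A minimal Scala sketch of the first option, assuming a plain KafkaConsumer with max.poll.records set to 1 (topic name, partition, bootstrap servers, and group id are illustrative):

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")      // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "backwards-reader")             // illustrative
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1")                    // one record per poll

val consumer = new KafkaConsumer[String, String](props)
val partition = new TopicPartition("my-topic", 0)                         // illustrative
consumer.assign(java.util.Collections.singletonList(partition))

// Walk backwards from the latest offset to the earliest, one record at a time.
val endOffset = consumer.endOffsets(java.util.Collections.singletonList(partition)).get(partition).longValue
val beginOffset = consumer.beginningOffsets(java.util.Collections.singletonList(partition)).get(partition).longValue
var offset = endOffset - 1
while (offset >= beginOffset) {
  consumer.seek(partition, offset)
  val records = consumer.poll(Duration.ofSeconds(1)).records(partition).asScala
  records.headOption.foreach(r => println(s"offset=${r.offset} value=${r.value}"))
  offset -= 1
}
consumer.close()

Each poll returns at most one record because of max.poll.records; in practice you would probably seek back in larger windows and reverse each window in memory instead of issuing one poll per offset.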

Kafka to Kafka -> reading a source Kafka topic multiple times

I am new to Kafka and I have a configuration where I have a source Kafka topic whose messages have the default retention of 7 days. I have 3 brokers with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic into my target Kafka topic, I am able to consume the messages in the same order. Now my question is: if I try to reprocess all the messages from my source Kafka and consume them into my target Kafka, I see that my target Kafka does not receive any messages. I know that duplication should be avoided, but let's say I have a scenario where I have 100 messages in my source Kafka and I expect 200 messages in my target Kafka after running it twice. But I just get 100 messages in my first run, and my second run returns nothing.
Can someone please explain why this is happening and what the functionality behind it is?
A Kafka consumer reads data from the partitions of a topic; within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read by the consumer and its offset committed, the same group won't read it again. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Let us assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages; Kafka then moves the current offset to 100.
The current offset is a pointer to the last record that Kafka has sent to the consumer in the most recent poll and that has been committed. So the consumer doesn't get the same records again because of the committed current offset, which is why your second run returns nothing. Please go through the following URL for a complete explanation.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
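If the goal is to re-read the source topic on a second run, a minimal Scala sketch with a plain KafkaConsumer is to use a fresh group.id (or explicitly seek to the beginning) so the committed current offset of the previous run does not apply (group id and topic name are illustrative):

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "reprocess-run-2")           // a new group id => no committed offsets
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")         // start from the beginning when no offset exists
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("source-topic")) // illustrative topic name

// Alternatively, keep the same group id and rewind explicitly:
// consumer.poll(Duration.ofMillis(0))            // force partition assignment
// consumer.seekToBeginning(consumer.assignment())

val records = consumer.poll(Duration.ofSeconds(1)).asScala
records.foreach(r => println(s"offset=${r.offset} value=${r.value}"))
consumer.close()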

Reading messages for a specific timestamp in Kafka

I want to read all the messages starting from a specific time in Kafka.
Say I want to read all messages between 0600 and 0800.
The question Request messages between two timestamps from Kafka suggests using offsetsForTimes as the solution.
The problem with that solution is:
Say my consumer is switched on every day at 1300. The consumer would not have read any messages that day, which effectively means no offset was committed at/after 0600, which means offsetsForTimes(<partition>, <0600 for that day in millis>) will return null.
Is there any way I can read messages that were published to the Kafka topic at a certain time, irrespective of offsets?
offsetsForTimes() returns the offsets of messages that were produced at or after the requested time. It works regardless of whether any offsets were committed, because the offsets are fetched directly from the partition logs.
So yes, you should use this method to find the first offset produced after 0600, seek to that position, and consume messages until you reach 0800 (a sketch follows below).
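A minimal Scala sketch of that approach with a plain KafkaConsumer (topic name, bootstrap servers, group id, and the simplified 0600/0800 window handling are illustrative):

import java.time.{Duration, LocalDate, LocalTime, ZoneId}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "time-window-reader")        // illustrative
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("my-topic").asScala
  .map(pi => new TopicPartition(pi.topic, pi.partition))
consumer.assign(partitions.asJava)

val zone = ZoneId.systemDefault()
val startMs = LocalDate.now.atTime(LocalTime.of(6, 0)).atZone(zone).toInstant.toEpochMilli
val endMs   = LocalDate.now.atTime(LocalTime.of(8, 0)).atZone(zone).toInstant.toEpochMilli

// Find the first offset at or after 0600 for each partition and seek there.
val startOffsets = consumer.offsetsForTimes(
  partitions.map(tp => tp -> java.lang.Long.valueOf(startMs)).toMap.asJava)
startOffsets.asScala.foreach {
  case (tp, oat) if oat != null => consumer.seek(tp, oat.offset)
  case (tp, _)                  => consumer.seekToEnd(java.util.Collections.singletonList(tp)) // no data after 0600
}

// Consume until the record timestamps pass 0800 (simplified stop condition for the sketch).
var done = false
while (!done) {
  val records = consumer.poll(Duration.ofSeconds(1)).asScala
  if (records.isEmpty) done = true
  records.foreach { r =>
    if (r.timestamp <= endMs) println(s"${r.timestamp} ${r.value}") else done = true
  }
}
consumer.close()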

Recover lost messages in Kafka using offsets

I have been asked this question in an interview.
Imagine a message was lost because of a failure (not sure whether it was a consumer failure or a broker failure). What should be done (in terms of code) to recover the lost messages using offsets?
I am sorry if my question is not clear; it was asked to me in a similar way.
Thanks.
If you know the offset of the message you want to recover, and which partition it belonged to, you can use the KafkaConsumer seek method:
consumer.seek(new TopicPartition("topic-name", partNumber), offsetNumber);
as detailed here
The next call to poll() will give you the message you missed, first in the returned list.
This only works in a scenario where you are managing your offsets yourself in the first place. If you are letting Kafka manage the offsets, you probably don't know the offset number, and you will most likely just end up with messages being consumed twice (a call to poll() will begin consuming from the last committed offset).
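A minimal Scala sketch of that seek-then-poll recovery, assuming you tracked the partition and offset yourself (topic name, partition, and offset value are illustrative):

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "recovery-consumer")         // illustrative
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
val partition = new TopicPartition("topic-name", 0)                    // partition you tracked yourself
consumer.assign(java.util.Collections.singletonList(partition))

consumer.seek(partition, 42L)                                          // the offset you recorded before the failure
val records = consumer.poll(Duration.ofSeconds(1)).records(partition).asScala
records.headOption.foreach(r => println(s"recovered offset=${r.offset} value=${r.value}"))
consumer.close()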
Kafka follows at-least-once message delivery semantics, which means you might get duplicates at the time of a broker failure, but you will not lose data.
But when you create a Kafka producer, if you set the following property to 0, the producer will try to send only once; even in the case of a broker failure it will not try to resend, so you might lose data if the broker fails.
props.put("retries", 0);
So you can change this property value to 1 (or higher) so that the producer tries to send again. Also, offsets are managed in ZooKeeper automatically, and they are only updated in ZooKeeper once a message has been delivered successfully.
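A short Scala sketch of the producer retry configuration being described (broker address, topic, and the retry count are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // illustrative
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
// retries = 0 -> send once only; data may be lost on broker failure.
// retries >= 1 -> the producer resends on transient failures.
props.put(ProducerConfig.RETRIES_CONFIG, "3")                          // illustrative retry count

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "key", "value"))  // illustrative topic
producer.close()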
Also, since you mentioned Spark Streaming as the consumer, Spark Streaming supports two different approaches:
1. Receiver-based: offsets are handled in ZooKeeper.
2. Direct approach: offsets are handled locally, where the messages are stored, and this approach supports exactly-once message delivery.
For more info check this link
After reading a lot of articles and documentation, I felt the best answer might be:
Use the new Spark Kafka consumer with no receivers (spark-streaming-kafka-0-10_2.11). In this approach we can specify the starting offset from which we want to read.
val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),
  OffsetRange("test", 1, 0, 100)
)
val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)
Once your messages have been read and processed, get the offsets you read and store them in Kafka, ZooKeeper, or an external transactional database.
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
Every time we start the job, fetch the offsets from the database and pass them to createDirectStream (via its consumer strategy) to have an exactly-once mechanism; a sketch follows below.
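A minimal Scala sketch of restarting the direct stream from stored offsets, assuming the spark-streaming-kafka-0-10 API; loadOffsetsFromStore and saveOffsetsToStore are hypothetical helpers standing in for your database code:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies, OffsetRange}

// Hypothetical helpers backed by your transactional store.
def loadOffsetsFromStore(): Map[TopicPartition, Long] = ???   // e.g. SELECT topic, partition, offset ...
def saveOffsetsToStore(ranges: Array[OffsetRange]): Unit = ???

def buildStream(ssc: StreamingContext, kafkaParams: Map[String, Object], topics: Seq[String]): Unit = {
  val fromOffsets = loadOffsetsFromStore()
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    // Subscribe accepts the offsets to start from; an empty map means "use the committed/auto-reset position".
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets)
  )

  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd and write results ...
    // Save the offsets in the same transaction as the results to get exactly-once behaviour.
    saveOffsetsToStore(offsetRanges)
  }
}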
More reading
http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html