I have been asked this question in an interview.
Imagine a message (packet) was lost because of a failure (not sure whether it was a consumer failure or a broker failure). What should be done, in terms of code, to recover the lost messages using the offset?
I am sorry if my question is not clear; it was asked to me in a similar way.
Thanks.
If you know the offset of the message you want to recover, and which partition it belonged to, you can use the KafkaConsumer seek method:
consumer.seek(new TopicPartition("topic-name", partNumber), offsetNumber);
as detailed here
The next call to poll() will return the message you missed as the first record in the list.
This only works in a scenario where you are managing your offsets yourself in the first place. If you are letting Kafka manage the offsets, you probably don't know the offset number, and the best you can do will likely end with messages being consumed twice (a call to poll() will begin consuming from the last committed offset).
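For completeness, a minimal, hedged sketch of that recovery path (the broker address, group id, topic name, and the 42L offset are placeholders; it assumes you already know the partition and offset of the lost message):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
props.put("group.id", "recovery-group")            // hypothetical consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false")           // we manage offsets ourselves

val consumer = new KafkaConsumer[String, String](props)
val partition = new TopicPartition("topic-name", 0)   // the partition the lost message belonged to
consumer.assign(Collections.singletonList(partition))
consumer.seek(partition, 42L)                          // 42L is a placeholder for the known offset

// The next poll returns records starting at the offset we seeked to
consumer.poll(1000L).asScala.foreach { record =>       // poll(Duration) in newer client versions
  println(s"offset=${record.offset} value=${record.value}")
}
consumer.close()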
Kafka follows at-least-once message delivery semantics: in the event of a broker failure you might get duplicates, but you will not lose data.
However, when you create the Kafka producer, if you set the following property to 0, the producer will try to send each message only once; even in the case of a broker failure it will not resend, so you might lose data if the broker fails.
props.put("retries", 0);
You can change this property value to 1 so that the producer will try to send again. Also, offsets are managed in ZooKeeper automatically; the offset in ZooKeeper is updated only when a message has been delivered successfully.
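For illustration, a hedged producer sketch with retries enabled (broker address, topic, and values are placeholders; acks is shown because retries only help when the producer actually waits for an acknowledgement):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("acks", "all")      // wait for the broker to acknowledge the write
props.put("retries", "1")     // retry the send once instead of giving up after the first failure

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("topic-name", "key", "value"))
producer.close()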
Also, since you mentioned using Spark Streaming to consume, note that Spark Streaming supports two different approaches:
1. Receiver based:
Offsets are handled in Zookeeper.
2. Direct Approach:
Offsets are handled by the application itself and can be stored alongside the processed results; this approach also supports exactly-once message delivery.
For more info check this link
After reading a lot of articles and documentation, I feel the best answer might be:
Use the new Spark Kafka consumer with no receivers (spark-streaming-kafka-0-10_2.11). In this approach we can supply, per partition, the starting offset from which we want to read (passed in through the consumer strategy):
// Inclusive starting offset for each partition of the topic
val fromOffsets = Map(
  new TopicPartition("test", 0) -> 0L,
  new TopicPartition("test", 1) -> 0L)
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](Seq("test"), kafkaParams, fromOffsets))
Once your messages have been read and processed, get the offsets you read and store them in Kafka, ZooKeeper, or an external transactional database:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
Every time we start the job, fetch the offsets from the database and pass them to createDirectStream to get an exactly-once mechanism.
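A hedged sketch of that save-and-restore cycle; saveOffsetsToStore and loadOffsetsFromStore are hypothetical helpers backed by whatever store you choose, and stream is the direct stream created above:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Hypothetical helpers backed by your external store (Kafka, ZooKeeper, or a transactional database)
def saveOffsetsToStore(ranges: Array[OffsetRange]): Unit = ???          // placeholder
def loadOffsetsFromStore(): Map[TopicPartition, Long] = ???             // placeholder

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch and persist the results ...
  saveOffsetsToStore(offsetRanges)   // ideally in the same transaction as the results
}

// On the next start, loadOffsetsFromStore() provides the fromOffsets map passed to createDirectStream above.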
More reading
http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
I have a requirement where I want to sample a Kafka topic (to check its data quality, etc.) before triggering a streaming job on it. One of the parameters for sampling could be the number of messages.
I am referencing "http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries" but only found the "startingOffsets" and "endingOffsets" options. It won't be possible to read the first N messages with these, as offsets need not be contiguous (in case of compaction or deletion of messages).
Looking for any suggestions or help. Thanks.
Referencing your post: you don't need to build a streaming application. You want to create a batch job that performs the data quality control.
The KafkaUtils.createRDD method can help here: it lets you pass the offset ranges you want to extract from Kafka, as an array with one OffsetRange per Kafka partition. With this method it is possible to control the number of messages you extract from the topic.
The code below will read 10 messages from the Kafka topic (5 messages from each partition):
val offsetRanges = Array(
// topic, partition, inclusive starting offset, exclusive ending offset
OffsetRange("topic_name", 0, 0, 5),
OffsetRange("topic_name", 1, 0, 5)
)
val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)
Note: Kafka's retention (TTL) will delete messages from the topic, so you need to set the offsets carefully; otherwise your application will try to fetch offsets that no longer exist. Ideally, use the beginningOffsets method from the Kafka consumer API to find the earliest offset still available in each partition.
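A hedged sketch of that check, using a throwaway KafkaConsumer to look up the earliest offsets still available before building the ranges (broker address, topic name, and the per-partition count of 5 are placeholders; beginningOffsets requires client version 0.10.1 or later):

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val probe = new KafkaConsumer[String, String](props)
val partitions = Arrays.asList(new TopicPartition("topic_name", 0), new TopicPartition("topic_name", 1))
val earliest = probe.beginningOffsets(partitions).asScala   // first offset still present in each partition
probe.close()

// Sample at most 5 messages per partition, starting from offsets that actually exist
// (assumes each partition currently holds at least 5 messages)
val offsetRanges = earliest.map { case (tp, from) =>
  OffsetRange(tp.topic, tp.partition, from.longValue, from.longValue + 5)
}.toArray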
You can find more in Spark documentation: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-an-rdd
I am currently writing a Spark streaming application that reads data from Kafka and tries to decode it before applying some transformations.
The current code structure looks like this:
val stream = KafkaUtils.createDirectStream[String, String](...)
  .map(record => decode(record.value()))
  .filter(...)
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
The decoding and filtering of failures happens on the DStream, and the offset management is done inside the foreachRDD, which means that I will only commit successful records.
To commit the failed records, I could move everything inside the foreachRDD loop:
val stream = KafkaUtils.createDirectStream[String, String](...)
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ...
    // Decoding and filtering here
    ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
However, I am wondering whether there is another way to commit the failed records. Maybe it would be acceptable to not commit the failed records?
I am assuming you are using the spark-streaming-kafka library.
Reading the documentation of OffsetRange: it stores a range of offsets from a topic partition. It does not filter out or mark individual offsets within that range based on the client's ".filter(…)" actions. So if you commit that offsetRanges, it will commit the highest offset number per partition, regardless of your filter actions.
That makes sense, as your Consumer is telling the Kafka Broker, or more precisely the Group Coordinator, that it consumed these messages. The coordinator is not interested in what you are actually doing with the data; it just wants to know whether that particular Consumer Group has read a message/offset or not.
Coming back to your questions...
I am wondering whether there is another way to commit the failed records.
Although it doesn't look like you need it: yes, there is another way of committing "failed" records. You can enable auto-commit. Together with the consumer configuration auto.commit.interval.ms, you can periodically commit the offsets your consumer polled from the topic.
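For example, a hedged sketch of a plain consumer configuration with auto-commit turned on (values are illustrative, not a recommendation for your job):

import java.util.Properties

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
props.put("group.id", "my-group")                   // hypothetical consumer group
props.put("enable.auto.commit", "true")             // commit the polled offsets automatically ...
props.put("auto.commit.interval.ms", "5000")        // ... roughly every 5 seconds, regardless of processing outcome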
Maybe it would be acceptable to not commit the failed records?
I don't know your particular use case, but it can be acceptable not to commit the failed records. As mentioned above, the Group Coordinator is interested in the highest offset per partition that your consumer has committed. If you consume a topic with 10 messages (offsets 0 through 9), start reading from the beginning, and commit offset 10 (the position after the last record you processed), then the next time you start your consumer it will skip those first ten messages.
You could check out the Kafka internal topic __consumer_offsets to see what is stored for each Consumer Group: Topic, Partition, Offset (… among others).
While implementing manual offset management (using the 0.9 consumer), I encountered the following issue:
In order to manage the offsets manually, for each consumed record, I retrieve the current offset of the record and commit the new offset (currentOffset + 1, since the offset reset strategy is "latest").
When a new consumer group is created, it has no explicit offsets (the offset is "unknown"). Therefore, if it did not consume messages from all existing partitions before it was stopped, it will have committed offsets for only some of the partitions (the ones it got messages from), while the offset for the remaining partitions will still be "unknown".
When the consumer is started again, it gets only some of the messages that were produced while it was down (only the ones from the partitions that had a committed offset); the messages from partitions with an "unknown" offset are lost and will never be consumed, due to the offset reset strategy.
Since it's unacceptable in my case to miss any messages once a consumer group is created, I'd like to explicitly commit an offset for each partition before starting consumption.
To do that I found two options:
1. Use the low-level consumer to send an OffsetRequest.
2. Use the high-level consumer: call consumer.poll(0) (to trigger the assignment), then consumer.assignment(), and for each TopicPartition call consumer.committed(topicPartition), consumer.seekToEnd(topicPartition), and consumer.position(topicPartition), and eventually commit all offsets (sketched below).
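For reference, a hedged sketch of option 2 (assumes an already-configured, subscribed consumer and the 0.9-style seekToEnd(TopicPartition...) varargs signature; newer clients take a collection instead):

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.OffsetAndMetadata

// consumer: KafkaConsumer[String, String] already subscribed to the topic
consumer.poll(0)                                      // trigger partition assignment
val toCommit = consumer.assignment().asScala.flatMap { tp =>
  if (consumer.committed(tp) == null) {               // no offset stored yet for this partition
    consumer.seekToEnd(tp)                            // jump to the log end ("latest")
    Some(tp -> new OffsetAndMetadata(consumer.position(tp)))
  } else None
}.toMap
consumer.commitSync(toCommit.asJava)                  // pin an explicit offset for every partition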
Both are more complex and noisy than I'd expect (I'd expect a simpler API I could use to get the log end position for all partitions assigned to a consumer).
Any thoughts or ideas for a better implementation would be appreciated.
Thanks.
Which consumer API to use depends entirely on where you are committing the offsets.
If your offsets are stored in the Kafka broker, then you should definitely use the high-level consumer API; it will give you more control over offsets.
If you are keeping offsets in ZooKeeper, then you can use the old consumer API, for example:
List<KafkaStream<byte[], byte[]>> streams =
    consumer.createMessageStreamsByFilter(new Whitelist(topicRegex), 1);
I see in some answers on Stack Overflow, and in general on the web, the idea that Kafka does not support consumption acknowledgements, or that exactly-once consumption is hard to achieve.
In the following entry, as a sample: Is there any reason to use RabbitMQ over Kafka?, I can read the following statements:
RabbitMQ will keep all states about consumed/acknowledged/unacknowledged messages while Kafka doesn't
or
Exactly once guarantees are hard to get with Kafka.
This is not what I understand by reading the official Kafka documentation at:
https://kafka.apache.org/documentation/#design_consumerposition
That documentation states that Kafka does not use a traditional acknowledgement implementation (as RabbitMQ does). Instead, it relies on the relationship between partition, consumer, and offset...
This makes the equivalent of message acknowledgements very cheap
Could somebody please explain why an "exactly-once consumption guarantee" is difficult to achieve in Kafka, and how this differs between Kafka and more traditional message brokers such as RabbitMQ? What am I missing?
If you mean exactly-once, the problem is this:
As you may know, a Kafka consumer uses a polling mechanism; that is, consumers ask the server for messages. Also, recall that consumers commit message offsets; that is, a consumer tells the cluster what the next expected offset is. So, imagine what could happen.
The consumer polls for messages and gets a message with offset = 1.
A) If the consumer commits that offset immediately, before processing the message, it can then crash and will never receive that message again, because it was already committed; on the next poll, Kafka will return the message with offset = 2. This is what they call at-most-once semantics.
B) If the consumer processes the message first and then commits the offset, what could happen is that after processing the message but before committing, the consumer crashes; in that case the next poll will return the same message with offset = 1 again, and that message will be processed twice. This is what they call at-least-once.
In order to achieve exactly-once, you need to process the message and commit the offset as one atomic operation, where you always do both or neither of them. This is not so easy. One way to do this (if possible) is to store the result of the processing along with the offset of the message that generated that result. Then, when the consumer starts, it looks up the last processed offset outside Kafka and seeks to that offset.
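A hedged sketch of that idea (the storage helpers are hypothetical and stand in for writes done in a single database transaction; the broker address, topic, partition, and consumer configuration are assumptions, with auto-commit disabled):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// Hypothetical helpers: result and offset are written in one atomic transaction
def lastProcessedOffset(tp: TopicPartition): Option[Long] = ???                       // placeholder
def saveResultAndOffset(result: String, tp: TopicPartition, offset: Long): Unit = ??? // placeholder

val props = new Properties()                         // assumed consumer config, auto-commit disabled
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "exactly-once-demo")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false")

val tp = new TopicPartition("topic-name", 0)
val consumer = new KafkaConsumer[String, String](props)
consumer.assign(Collections.singletonList(tp))

// Resume from the offset recorded next to the last stored result, not from Kafka's committed offset
lastProcessedOffset(tp).foreach(last => consumer.seek(tp, last + 1))

while (true) {
  for (record <- consumer.poll(1000L).asScala) {
    val result = record.value().toUpperCase            // placeholder "processing"
    saveResultAndOffset(result, tp, record.offset())   // result and offset persisted together
  }
}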
I want to end the stream processing after receiving and processing from the Kafka topic is complete. The stop should not be time-specific (like awaitTerminationOrTimeout). Is there a way to stop the StreamingContext after the topic is exhausted? Is there a way for the DStream[T] to be compared with T values to dictate the control flow?
I'm about 80% certain that isEmpty should return true and headOption should be None on KafkaMessageStream if the stream is empty.
The best way is, before you start reading the stream, to get the latest offsets for all of the partitions in the topic, and then check, as you consume, whether the offsets you have received have reached them. If you want to find out how to get the offsets of a topic, see my previous answer on this.
The flow ends up being:
1. Get the partitions and brokers for the topic.
2. For each broker, create a SimpleConsumer.
3. For each partition, issue an OffsetRequest that returns the earliest and latest offsets (see the previous answer).
4. Then, as you read messages, check the offset of each received message relative to the known last offset for the partition.
5. Once the offsets received for each partition are the same as the latest ones returned by your OffsetRequest, you are done.
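If you are on a newer client (0.10.1+), a hedged sketch of the same idea is simpler with KafkaConsumer.endOffsets instead of the SimpleConsumer OffsetRequest flow described above (broker address, group id, and topic name are placeholders):

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
props.put("group.id", "drain-check")                // hypothetical consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("topic-name"))
consumer.poll(0)                                    // trigger assignment; may need repeating until assignment() is non-empty

// Capture the log-end offset of every assigned partition before draining
val target: Map[TopicPartition, Long] =
  consumer.endOffsets(consumer.assignment()).asScala.map { case (tp, end) => tp -> end.longValue }.toMap

var done = false
while (!done) {
  consumer.poll(1000L).asScala.foreach(record => println(record.value()))   // placeholder "processing"
  // Finished once our position has reached the captured log-end offset on every partition
  done = target.forall { case (tp, end) => consumer.position(tp) >= end }
}
consumer.close()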