Manually Commit offsetRanges to Kafka when using KafkaUtils.createDirectStream - apache-kafka

I am trying to consume data from Kafka (0.10.0.0) into a Spark (1.6.0) Streaming application using the KafkaUtils API:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, inputTopicsSet)
The requirement is to commit the offset ranges manually to Kafka itself.
I am aware that when using KafkaConsumer (or Consumer) objects in Java we can achieve this with the commitAsync or commitSync methods after setting "enable.auto.commit" = "false" in the params.
I am not able to figure out how to do the same when using KafkaUtils.

You can pass "enable.auto.commit" = "false" as part of kafkaParams. In fact, you can pass any Kafka consumer setting that way.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092,anotherhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
kafkaParams.put("enable.auto.commit", false); // must be false for manual offset commits
// more Kafka consumer params go here if needed

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(inputTopicsSet, kafkaParams));
And then commit the offsets manually like this:
stream.foreachRDD(rdd -> {
  OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
  // some time later, after outputs have completed
  ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});
Reference: https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html#creating-a-direct-stream
The above code is for Spark 2.2.0 with the spark-streaming-kafka-0-10 integration; it may not be applicable to Spark 1.6.0.
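For readers on the Scala API (as in the question's createDirectStream call), the equivalent pattern from the 0-10 integration guide looks roughly like the minimal sketch below; it assumes the stream was created with the 0-10 ConsumerStrategies/LocationStrategies API (available from Spark 2.0 onward), not the 0-8 API that ships with Spark 1.6.
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: grab the offset ranges of each batch, do the output work,
// then commit the ranges back to Kafka asynchronously.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch's results somewhere ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}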

Related

Kafka 2.3.0 producer and consumer

I'm new to Kafka and want to use Kafka 2.3 to implement a producer/consumer app.
I have downloaded and installed Kafka 2.3 on my Ubuntu server.
I found some code online and built it on my laptop in IDEA, but the consumer can't get any data.
I checked the topic info on my server, and the topic exists.
I used kafka-console-consumer to check this topic and got the topic's values successfully, but not with my consumer.
So what's wrong with my consumer?
Producer
package com.phitrellis.tool
import java.util.Properties
import java.util.concurrent.{Future, TimeUnit}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer._
object MyKafkaProducer extends App {
  def createKafkaProducer(): Producer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "*:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("producer.type", "async")
    props.put("acks", "all")
    new KafkaProducer[String, String](props)
  }

  def writeToKafka(topic: String): Unit = {
    val producer = createKafkaProducer()
    val record = new ProducerRecord[String, String](topic, "key", "value22222222222")
    println("start")
    producer.send(record)
    producer.close()
    println("end")
  }

  writeToKafka("phitrellis")
}
Consumer
package com.phitrellis.tool
import java.util
import java.util.Properties
import java.time.Duration
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
object MyKafkaConsumer extends App {
  def createKafkaConsumer(): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "*:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // props.put("auto.offset.reset", "latest")
    props.put("enable.auto.commit", "true")
    props.put("auto.commit.interval.ms", "1000")
    props.put("group.id", "test")
    new KafkaConsumer[String, String](props)
  }

  def consumeFromKafka(topic: String) = {
    val consumer: KafkaConsumer[String, String] = createKafkaConsumer()
    consumer.subscribe(util.Arrays.asList(topic))
    while (true) {
      val records = consumer.poll(Duration.ofSeconds(2)).asScala.iterator
      println("true")
      for (record <- records) {
        print(record.value())
      }
    }
  }

  consumeFromKafka("phitrellis")
}
Two lines in your consumer code are crucial:
props.put("auto.offset.reset", "latest")
props.put("group.id", "test")
To read from the beginning of the topic you have to set auto.offset.reset to earliest (latest causes you to skip messages produced before your consumer started).
group.id is responsible for group management. If you start processing data with some group.id and then restart your application, or start a new one with the same group.id, only new messages will be read.
For your tests I would suggest adding auto.offset.reset -> earliest and changing the group.id:
props.put("auto.offset.reset", "earliest")
props.put("group.id", "test123")
Additionally:
You have to remember that KafkaProducer::send returns Future<RecordMetadata>; messages are sent asynchronously, and if your program finishes before the Future completes, the messages might not be sent.
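For example, a minimal sketch (reusing the producer and record from the question's writeToKafka; the 10-second timeout is an arbitrary choice) that blocks on the returned Future so the JVM cannot exit before the record is acknowledged:
// Sketch: wait for the broker acknowledgement before closing, so an
// async send is not lost when the program exits immediately afterwards.
val metadata = producer.send(record).get(10, TimeUnit.SECONDS)
println(s"written to ${metadata.topic()}-${metadata.partition()} at offset ${metadata.offset()}")
producer.flush() // close() also flushes any remaining buffered records
producer.close()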
There are two parts here: the producing side and the consuming side.
You don't say anything about the producer, so we're assuming it did work. However, did you check on the servers? You could check the Kafka log files to see whether there is any data for those particular topic partitions.
On the consumer side, to validate, you should try to consume from that same topic using the command line, to make sure the data is in there. Look for "Kafka Consumer Console" at the following link, and follow those steps.
http://cloudurable.com/blog/kafka-tutorial-kafka-from-command-line/index.html
If there is data on the topic, then running that command should return it. If not, it will just "hang" because it is waiting for data to be written to the topic.
In addition, you can try producing to the same topic using those command-line tools, to make sure your cluster is configured correctly, that you have the right addresses and ports, that the ports are not blocked, and so on.

Apache Kafka: How to receive latest message from Kafka?

I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka message queue. At that point I need to consume only the latest message, ignoring the earlier ones that have been published by the producer but not yet consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {

  def main(args: scala.Array[String]): Unit = {
    val properties = readProperties()
    val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
    val ssc = new StreamingContext(streamConf, Seconds(1))
    val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
    val kafkaParams = Map(
      "metadata.broker.list" -> properties.getProperty("broker_connection_str"),
      "zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "auto.offset.reset" -> properties.getProperty("offset_reset"),
      "zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))
    val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc,
      kafkaParams,
      Map("moved_object" -> 1),
      StorageLevel.MEMORY_ONLY_SER
    ).map(_._2)
    msgStream.foreachRDD { x =>
      x.foreach { msg =>
        println("Message: " + msg)
        processMessage(msg)
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
Is there any way to make sure the consumer always gets the most recent message? Or do I need to set some property in the Kafka configuration to achieve this?
Any help on this would be greatly appreciated. Thank you.
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek all of them to the end. There is a similar method, seekToBeginning.
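A minimal sketch of that approach (assuming a plain KafkaConsumer named consumer that has already subscribed to the topic; the short initial poll is only there to trigger partition assignment):
// Sketch: skip everything already in the topic and read only new records.
import java.time.Duration

consumer.poll(Duration.ofMillis(100))      // lets the group coordinator assign partitions
consumer.seekToEnd(consumer.assignment())  // move every assigned partition to its end offset
while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  // process only records produced after the seek
}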
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset that will be assigned to the next message written, i.e. one past the last existing message.
public void seek(TopicPartition partition, long offset): Run this for each partition and provide its end offset from the above call minus 1 (assuming it is greater than 0).
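Put together, a minimal sketch (again assuming a KafkaConsumer named consumer with partitions already assigned):
// Sketch: position each assigned partition on its last existing record.
import scala.collection.JavaConverters._

val partitions = consumer.assignment()
consumer.endOffsets(partitions).asScala.foreach { case (tp, end) =>
  if (end > 0) consumer.seek(tp, end - 1) // end is one past the last record
}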
You can always generate a new (random) group.id when connecting to Kafka; that way, combined with auto.offset.reset = latest, you will start consuming only new messages when you connect.
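A minimal sketch of that idea, reusing the props style from the consumer code above:
// Sketch: a fresh consumer group has no committed offsets, so with
// auto.offset.reset = latest it starts from the newest messages only.
props.put("group.id", "consumer-" + java.util.UUID.randomUUID().toString)
props.put("auto.offset.reset", "latest")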
Yes, you can set startingOffsets to latest to consume the latest messages. (Note that this is the Structured Streaming Kafka source, not the DStream API used in the question.)
val spark = SparkSession
  .builder
  .appName("kafka-reading")
  .getOrCreate()

import spark.implicits._

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "latest")
  .option("subscribe", topicName)
  .load()

kafka and Spark: Get first offset of a topic via API

I am playing with Spark Streaming and Kafka (with the Scala API), and would like to read messages from a set of Kafka topics with Spark Streaming.
The following method:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "smallest")
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
reads from Kafka up to the latest available offset, but doesn't give me the metadata that I need (since I am reading from a set of topics, I need the topic of every message I read). The other method, KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple2[String, String]](ssc, kafkaParams, currentOffsets, messageHandler), explicitly wants offsets that I don't have.
I know that there is this shell command that gives you the last offset.
kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list <broker>: <port>
--topic <topic-name> --time -1 --offsets 1
and KafkaCluster.scala is a developer-facing API that used to be public and gives you exactly what I would like.
Hint?
You can adapt the code from GetOffsetShell.scala in the Kafka source:
val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
val topicAndPartition = TopicAndPartition(topic, partitionId)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, nOffsets)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
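For the first offset specifically, time would be OffsetRequest.EarliestTime; a minimal sketch reusing the consumer and topicAndPartition from the snippet above:
// Sketch: request the single earliest offset of this partition
// (OffsetRequest.EarliestTime is -2L, LatestTime is -1L).
val earliestRequest = OffsetRequest(
  Map(topicAndPartition -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
val firstOffset = consumer
  .getOffsetsBefore(earliestRequest)
  .partitionErrorAndOffsets(topicAndPartition)
  .offsets
  .headOption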
Or you can create a new consumer with a unique groupId and use it to get the first offset:
import scala.collection.JavaConverters._

val consumer = new KafkaConsumer[String, String](createConsumerConfig(config.brokerList))
consumer.partitionsFor(config.topic).asScala.foreach { pi =>
  val topicPartition = new TopicPartition(pi.topic(), pi.partition())
  val partitions = java.util.Collections.singletonList(topicPartition)
  consumer.assign(partitions)
  consumer.seekToBeginning(partitions)
  val firstOffset = consumer.position(topicPartition) // position() resolves the lazy seek
  ...

Spark Streaming + Kafka: how to check name of topic from kafka message

I am using Spark Streaming to read from a list of Kafka Topics.
I am following the official API at this link. The method I am using is:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "largest")
val topics = Set(configuration.getKafkaInputTopic())
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
I am wondering how the executors will read messages from the list of topics. What will their policy be? Will they read one topic and then, once they finish its messages, move on to the other topics?
And most importantly, after calling this method, how can I check which topic a message in the RDD came from?
stream.foreachRDD(rdd => rdd.map(t => {
  val key = t._1
  val json = t._2
  val topic = ???
}))
I am wondering how the executors will read messages from the list of topics. What will their policy be? Will they read one topic and then, once they finish its messages, move on to the other topics?
In the direct streaming approach, the driver is responsible for reading the offsets of the Kafka topics you want to consume. What it does is create a mapping between topics, partitions, and the offset ranges that need to be read. Once that happens, the driver assigns each worker a range of offsets to read from a specific topic and partition. This means that if a single worker can run 2 tasks simultaneously (just for the sake of the example; it can usually run many more), it can potentially read from two separate Kafka topics concurrently.
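One way to see this mapping at runtime (a hedged sketch, not part of the original answer) is to inspect the offset ranges that back each batch; every RDD partition of a direct stream corresponds to exactly one Kafka (topic, partition, fromOffset, untilOffset) range:
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Sketch: print the Kafka range behind each RDD partition of a batch.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} offsets=[${r.fromOffset}, ${r.untilOffset})")
  }
}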
after calling this method, how can I check which topic a message in the RDD came from?
You can use the overload of createDirectStream which takes a messageHandler function of type MessageAndMetadata[K, V] => R:
val topicsToPartitions: Map[TopicAndPartition, Long] = ???
val stream: DStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    topicsToPartitions,
    (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()))

Spark-Stream Differentiating Kafka Topics

If I feed my Spark-Streaming Application more than one topic like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2")
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application, how can I figure out from my stream which topic it is pulling from? Is there a way to do this? If I do something like
val lines = stream.print()
I get nothing that differentiates the topics. Is the only way to do it to make the Kafka message key a distinguishing factor?
Yes, you can use the MessageAndMetadata version of createDirectStream, which allows you to access message metadata.
You can find an example implementation here.
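For reference, a minimal sketch of that approach (the fromOffsets map is assumed to be built elsewhere, e.g. from previously stored offsets; the messageHandler tags every record with its topic):
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata

// Sketch: the MessageAndMetadata overload lets each record carry its topic.
val fromOffsets: Map[TopicAndPartition, Long] = ???
val messageHandler =
  (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.key(), mam.message())

val tagged = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

tagged.foreachRDD(_.foreach { case (topic, key, value) =>
  println(s"[$topic] $key -> $value")
})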