I am using Spark Streaming to read from a list of Kafka topics.
I am following the official API at this link. The method I am using is:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "largest")
val topics = Set(configuration.getKafkaInputTopic())
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
I am wondering how the executors will read messages from the list of topics. What will their policy be? Will they read one topic and, once they finish its messages, move on to the other topics?
And most importantly, how can I, after calling this method, check what the topic of a message in the RDD is?
stream.foreachRDD(rdd => rdd.map(t => {
  val key = t._1
  val json = t._2
  val topic = ???
}))
I am wondering how the executors will read messages from the list of topics. What will their policy be? Will they read one topic and, once they finish its messages, move on to the other topics?
In the direct streaming approach, the driver is responsible for reading the offsets of the Kafka topics you want to consume. What it does is create a mapping between topics, partitions, and the offsets that need to be read. After that happens, the driver assigns each worker a range of offsets to read from a specific Kafka topic and partition. This means that if a single worker can run 2 tasks simultaneously (just for the sake of the example, it can usually run many more), then it can potentially read from two separate Kafka topics concurrently.
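To see that assignment explicitly, the RDDs produced by the direct stream expose the offset ranges the driver computed. A minimal sketch, assuming the stream from the question and the spark-streaming-kafka 0.8 connector:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // one OffsetRange per RDD partition, i.e. per task the driver scheduled
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} offsets=[${r.fromOffset}, ${r.untilOffset})")
  }
}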
how can I, after calling this method, check what the topic of a message in the RDD is?
You can use the overload of createDirectStream which takes a message handler of type MessageAndMetadata[K, V] => R:
val topicsToPartitions: Map[TopicAndPartition, Long] = ???
val stream: DStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    topicsToPartitions,
    (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()))
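Each element of the stream then already carries its topic, so, as a sketch, the loop from the question could read it directly:

stream.foreachRDD { rdd =>
  rdd.foreach { case (topic, json) =>
    // hypothetical handling: just print which topic each message came from
    println(s"topic=$topic payload=$json")
  }
}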
Related
Running with a Consumer.plainSource, nothing happens. Isn't this one way of reading from a Kafka topic through streams?
val consumerSettings2 = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:3333")
  .withGroupId("ssss")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val source: Source[ConsumerRecord[String, String], Consumer.Control] =
  Consumer.plainSource(consumerSettings2, Subscriptions.topics("candy"))

val sink =
  Sink.foreach[ConsumerRecord[String, String]](x => println("consumed " + x))

source.runWith(sink)
I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka message queue. At that point I need to consume only the latest message, ignoring the earlier ones which have been published by the producer but are yet to be consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {

  def main(args: scala.Array[String]): Unit = {

    val properties = readProperties()

    val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
    val ssc = new StreamingContext(streamConf, Seconds(1))

    val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
    val kafkaParams = Map(
      "metadata.broker.list" -> properties.getProperty("broker_connection_str"),
      "zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "auto.offset.reset" -> properties.getProperty("offset_reset"),
      "zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))

    val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc,
      kafkaParams,
      Map("moved_object" -> 1),
      StorageLevel.MEMORY_ONLY_SER
    ).map(_._2)

    msgStream.foreachRDD { x =>
      x.foreach { msg =>
        println("Message: " + msg)
        processMessage(msg)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set some property in the Kafka configuration to achieve this?
Any help on this would be greatly appreciated. Thank you
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek to the end for all of them. There is a similar method, seekToBeginning.
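A minimal sketch of that approach with the plain consumer API (the broker address and group id are placeholders; the topic name is reused from the question's code):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "latest-only")               // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("moved_object").asJava)
consumer.poll(0)                               // forces the initial partition assignment
consumer.seekToEnd(consumer.assignment())      // skip everything already published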
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition and provide its end offset from above call minus 1 (assuming it's greater than 0).
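A minimal sketch of combining the two calls, assuming an already-assigned KafkaConsumer[String, String] named consumer:

import scala.collection.JavaConverters._

val partitions = consumer.assignment()
consumer.endOffsets(partitions).asScala.foreach { case (tp, end) =>
  if (end > 0) consumer.seek(tp, end - 1)   // position on the last existing record
}
val lastRecords = consumer.poll(1000)        // should now return the last record of each partition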
You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.
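As a sketch with the direct-stream parameters used earlier (the broker address is a placeholder): a fresh random group id combined with auto.offset.reset = "largest" means there are no committed offsets for the group, so consumption starts at the newest messages.

val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092",                   // placeholder broker address
  "group.id" -> s"latest-only-${java.util.UUID.randomUUID()}",  // fresh group on every start
  "auto.offset.reset" -> "largest")                             // no committed offsets -> start at the end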
Yes, you can set startingOffsets to latest to consume only the latest messages.
val spark = SparkSession
.builder
.appName("kafka-reading")
.getOrCreate()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "latest")
.option("subscribe", topicName)
.load()
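As a quick follow-up sketch, the Kafka source exposes binary key and value columns plus per-message metadata, so you typically cast them before processing:

// project the Kafka source columns into readable strings plus metadata
val messages = df.selectExpr(
  "CAST(key AS STRING)",
  "CAST(value AS STRING)",
  "topic", "partition", "offset")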
I am playing with Spark Streaming and Kafka (with the Scala API), and would like to read message from a set of Kafka topics with Spark Streaming.
The following method:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "smallest")
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
reads from Kafka up to the latest available offset, but doesn't give me the metadata I need (since I am reading from a set of topics, I need to know, for every message I read, which topic it came from). The other overload, KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple2[String, String]](ssc, kafkaParams, currentOffsets, messageHandler), explicitly requires offsets that I don't have.
I know that there is this shell command that gives you the last offset.
kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list <broker>: <port>
--topic <topic-name> --time -1 --offsets 1
and KafkaCluster.scala is an API, intended for developers, that used to be public and gives you exactly what I would like.
Hint?
You can use the code from GetOffsetShell.scala (see the Kafka API documentation):
val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
val topicAndPartition = TopicAndPartition(topic, partitionId)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, nOffsets)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
Or you can create a new consumer with a unique groupId and use it to get the first offset:
import scala.collection.JavaConverters._

val consumer = new KafkaConsumer[String, String](createConsumerConfig(config.brokerList))
consumer.partitionsFor(config.topic).asScala.foreach { pi =>
  val topicPartition = new TopicPartition(pi.topic(), pi.partition())
  consumer.assign(List(topicPartition).asJava)          // assign expects a Java collection
  consumer.seekToBeginning(List(topicPartition).asJava)
  val firstOffset = consumer.position(topicPartition)
  ...
}
I'm trying to consume a specific partition of a Kafka topic using Spark Streaming.
I don't see any methods for this use case in KafkaUtils class.
There is a method called createRDD, which basically expects offsets and is useful only for non-streaming applications. Is there any other way I can consume a specific partition of a Kafka topic using Spark Streaming?
There isn't a way to consume a single partition; the most granular unit we can work with is a topic. But there is a way to tell which partition a given message originated from. You can do this when using the overload of createDirectStream which takes a Function1[MessageAndMetadata[K, V], R].
For example, let's assume we have a key and message of type String, and that we're currently only consuming from a single topic. We can do:
val topicAndPartition: Map[TopicAndPartition, Long] = ???
val kafkaProperties: Map[String, String] = ???
val kafkaDStream =
  KafkaUtils.createDirectStream[String,
                                String,
                                StringDecoder,
                                StringDecoder,
                                (Int, String)](
    streamingContext,
    kafkaProperties,
    topicAndPartition,
    (mam: MessageAndMetadata[String, String]) =>
      (mam.partition, mam.message()))
This way, I'm outputting a tuple of the partition (1) and the underlying message (2). Then, I can filter this DStream[(Int, String)] to contain only messages from a specific partition:
val filteredStream = kafkaDStream.filter { case (partition, _) => partition == 4 }
If we're consuming from multiple topics, we'll need to output a tuple of both topic and partition so we can filter on the partition of the right topic. Luckily, there's already a handy case class called TopicAndPartition we can use. We'd have:
(mam: MessageAndMetadata[String, String]) =>
  (TopicAndPartition(mam.topic, mam.partition), mam.message())
And then:
val filteredStream = kafkaDStream.filter {
case (tap, _) => tap.topic == "mytopic" && tap.partition == 4
}
If I feed my Spark Streaming application more than one topic, like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2")
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application, how can I figure out from my stream which topic a given message is being pulled from? Is there a way to do this? If I do something like
val lines = stream.print()
I get nothing that differentiates the topics. Is the only way to do it to make the Kafka message key a distinguishing factor?
Yes, you can use the MessageAndMetadata version of createDirectStream, which allows you to access message metadata.
You can find an example implementation here.
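As a sketch (reusing ssc, kafkaParams, and the topic names from the question; the fromOffsets map, its partition numbers, and the starting offset 0L are assumptions for illustration, not the linked implementation), it could look like:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// assumed starting point: partition 0 of each topic, from offset 0
val fromOffsets: Map[TopicAndPartition, Long] = Map(
  TopicAndPartition("raw_1", 0) -> 0L,
  TopicAndPartition("raw_2", 0) -> 0L)

// the message handler keeps the topic alongside the payload, so the stream's
// elements are (topic, message) pairs that can be told apart downstream
val taggedStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc,
  kafkaParams,
  fromOffsets,
  (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()))

taggedStream.foreachRDD { rdd =>
  rdd.foreach { case (topic, message) => println(s"[$topic] $message") }
}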