Exception while accessing KafkaOffset from RDD - scala

I have a Spark consumer which streams from Kafka.
I am trying to manage offsets for exactly-once semantics.
However, while accessing the offset it throws the following exception:
"java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD
cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges"
The part of the code that does this is as below :
var offsetRanges = Array[OffsetRange]()
dataStream
.transform {
rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}
.foreachRDD(rdd => { })
Here dataStream is a direct stream(DStream[String]) created using KafkaUtils API something like :
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+"_"+t)).map(_._2)
If somebody can help me understand what I am doing wrong here.
transform is the first method in the chain of methods performed on datastream as mentioned in the official documentation as well
Thanks.

Your problem is:
.map(._2)
Which creates a MapPartitionedDStream instead of the DirectKafkaInputDStream created by KafkaUtils.createKafkaStream.
You need to map after transform:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+""+t))
kafkaStream
.transform {
rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}
.map(_._2)
.foreachRDD(rdd => // stuff)

Related

spark .txt Kafka message to Data Frame

I am getting error CDRS.toDF() error
case class CDR(phone:String, first_type:String,in_out:String,local:String,duration:String,date:String,time:String,roaming:String,amount:String,in_network:String,is_promo:String,toll_free:String,bytes:String,last_type:String)
// Create direct Kafka stream with brokers and topics
//val topicsSet = Set[String] (kafka_topic)
val topicsSet = Set[String] (kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" ->
kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](
ssc, kafkaParams, topicsSet).map(_._2)
//===============================================================================================
//Apply Schema Of Class CDR to Message Coming From Kafka
val CDRS = messages.map(_.split('|')).map(x=> CDR
(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13).repla ceAll("\n","")))

Spark Streaming + Kafka: how to check name of topic from kafka message

I am using Spark Streaming to read from a list of Kafka Topics.
I am following the official API at this link. The method I am using is:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "largest")
val topics = Set(configuration.getKafkaInputTopic())
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
I am wondering how will the executor read from the message from the list of topics ? What will be their policy? Will they read a topic and then when they finish the messages pass to the other topics?
And most importantly, how can I, after calling this method, check what is the topic of a message in the RDD?
stream.foreachRDD(rdd => rdd.map(t => {
val key = t._1
val json = t._2
val topic = ???
})
I am wondering how will the executor read from the message from the
list of topics ? What will be their policy? Will they read a topic and
then when they finish the messages pass to the other topics?
In the direct streaming approach, the driver is responsible for reading the offsets into the Kafka topics you want to consume. What it does it create a mapping between topics, partitions and the offsets that need to be read. After that happens, the driver assigns each worker a range to read into a specific Kafka topic. This means that if a single worker can run 2 tasks simultaneously (just for the sake of the example, it usually can run many more), then it can potentially read from two separate topics of Kafka concurrently.
how can I, after calling this method, check what is the topic of a
message in the RDD?
You can use the overload of createDirectStream which takes a MessageHandler[K, V]:
val topicsToPartitions: Map[TopicAndPartition, Long] = ???
val stream: DStream[(String, String)] =
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
topicsToPartitions,
mam: MessageAndMetadata[String, String]) => (mam.topic(), mam.message())

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val slice=30
val lines = messages.map(_._2)
val dStreamDst=lines.transform(rdd => {
val y= rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)),1)).reduceByKey(_ + _)
})
dStreamDst.print()
on which I get the following error :
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it means? How could I solve it?
Any kind of help is truly appreciated..thanks in advance
Update:
Solved. Don't use transform or print() method. Use foreachRDD, is the best solution.
You are encountering this b/c you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents that snapshot of data in time, in your case the 10 second window. Your code is failing because at a particular time window, there was no data, and the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use the rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
if (rdd.isEmpty)
rdd
else {
// rest of transformation
}
})

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map((x$2) => x$2._2)
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
val message = rdd.map(parse)
println(message.collect())
}
}
The parse function simply extracts relevant fields from the Json message into a tuple.
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
rdd.foreachPartition { partition =>
partition.foreach{msg =>
val message = parse(msg)
println(message)
}
}
}
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using spark 2.0.0, scala 2.11.8 and spark streaming kafka 0.8.
Here is the sample Streaming app which converts each message for the batch in to upper case inside for each loop and prints them. Try this sample app and then recheck your application. Hope this helps.
object SparkKafkaStreaming {
def main(args: Array[String]) {
//Broker and topic
val brokers = "localhost:9092"
val topic = "myTopic"
//Create context with 5 second batch interval
val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
//Create direct kafka stream with brokers and topics
val topicsSet = Set[String](topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
//Message
val msg = msgStream.map(_._2)
msg.print()
//For each
msg.foreachRDD { rdd =>
if (!rdd.isEmpty) {
println("-----Convert Message to UpperCase-----")
//convert messages to upper case
rdd.map { x => x.toUpperCase() }.collect().foreach(println)
} else {
println("No Message Received")
}
}
//Start the computation
ssc.start()
ssc.awaitTermination()
}
}

Spark streaming if(!rdd.partitions.isEmpty) not working

I'm trying to create a dStream from a kafka server and then do some transformations on that stream. I have included a catch for if the stream is empty (if(!rdd.partitions.isEmpty)); however, even when no events are being published to the kafka topic, the else statement is never reached.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
stream.foreachRDD { rdd =>
if(!rdd.partitions.isEmpty) {
val message = rdd.map((x$2) => x$2._2).collect().toList.map(parser)
val val = message(0)
} else println("empty stream...")
ssc.start()
ssc.awaitTermination()
}
Is there an alternative statement I should use to check if the stream is empty when using KafkaUtils.createDirectStream rather than createStream?
Use RDD.isEmpty instead of RDD.partitions.isEmpty which adds a check to see if the underlying partition actually has elements:
stream.foreachRDD { rdd =>
if(!rdd.isEmpty) {
// Stuff
}
}
The reason RDD.partitions.isEmpty isn't working is that there exists a partition inside the RDD, but that partition itself is empty. But from the view of partitions which is an Array[Partition], it isn't empty.