Spark Streaming if(!rdd.partitions.isEmpty) not working - Scala

I'm trying to create a DStream from a Kafka server and then do some transformations on that stream. I have included a check for the stream being empty (if(!rdd.partitions.isEmpty)); however, even when no events are being published to the Kafka topic, the else branch is never reached.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
stream.foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {
    val message = rdd.map(_._2).collect().toList.map(parser)
    val first = message(0)
  } else println("empty stream...")
}
ssc.start()
ssc.awaitTermination()
Is there an alternative statement I should use to check if the stream is empty when using KafkaUtils.createDirectStream rather than createStream?

Use RDD.isEmpty instead of RDD.partitions.isEmpty; it adds a check that the underlying partitions actually contain elements:
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Stuff
  }
}
The reason RDD.partitions.isEmpty isn't working is that the RDD still has at least one partition; that partition just happens to contain no elements. Viewed as an Array[Partition], partitions is therefore non-empty even when the data is.
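A minimal sketch (plain Spark rather than streaming, assuming an existing SparkContext sc) that illustrates the difference:
val empty = sc.parallelize(Seq.empty[Int], numSlices = 4)
println(empty.partitions.isEmpty) // false: the RDD has 4 (empty) partitions
println(empty.isEmpty)            // true: none of those partitions contains an element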

Related

How to compare two RDDs in Spark, when data of second stream may not yet be available?

I am working on a Spark app that streams data from two different topics, topic_a and topic_b, of a Kafka server. I want to consume both streams and check whether the data coming from both topics is equal.
val streamingContext = new StreamingContext(sparkContext, Seconds(batchDuration))
val eventStream = KafkaUtils.createDirectStream[String, String](streamingContext, PreferConsistent, Subscribe[String, String](topics, consumerConfig))

def start(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = {
  val (rdd, _) = record
  // ...
  def cmp(rddA: RDD[ConsumerRecord[String, String]], rddB: RDD[ConsumerRecord[String, String]]): Unit = {
    // Do compare...
    // but rddA or rddB may be empty! :-(
  }
  val rddTopicA = rdd.filter(_.topic == "topic_a")
  val rddTopicB = rdd.filter(_.topic == "topic_b")
  cmp(rddTopicA, rddTopicB)
}

eventStream.foreachRDD((x, y) => start((x, y)))
streamingContext.start()
streamingContext.awaitTermination()
The problem is that, when comparing both RDDs in cmp, one of the RDDs may be empty, as the data stream may not yet be available in Kafka. Is it possible to somehow wait until both RDDs have the same number of rows and then start the comparison? Or to first convert the RDD that has data into a Dataset and temporarily store it for later comparison?

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform { rdd =>
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1)).reduceByKey(_ + _)
}
dStreamDst.print()
on which I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use the transform or print() methods; use foreachRDD instead. It is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When you use that method, you are given the RDD representing that snapshot of data in time, in your case the 10-second window. Your code is failing because at a particular time window there was no data, so the RDD you are operating on is empty, and invoking reduce() on it gives the "empty collection" error.
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation:
lines.transform { rdd =>
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
}
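Applied to the transformation in the question, a sketch might look like the following. Note that the empty branch has to produce an RDD of the same type as the non-empty branch, so here it returns an empty RDD[(String, Int)] rather than the input rdd:
val dStreamDst = lines.transform { rdd =>
  if (rdd.isEmpty) {
    // keep the output type consistent with the else branch
    rdd.sparkContext.emptyRDD[(String, Int)]
  } else {
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1))
      .reduceByKey(_ + _)
  }
}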

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map(_._2)
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val message = rdd.map(parse)
    println(message.collect())
  }
}
The parse function simply extracts the relevant fields from the JSON message into a tuple.
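(For illustration only, a hypothetical stand-in for parse; the JSON field names are assumptions, and json4s is used here just as an example library:)
import org.json4s._
import org.json4s.jackson.JsonMethods.{parse => parseJson}

// Hypothetical: pulls two assumed fields, "id" and "value", out of the JSON message
def parse(msg: String): (String, String) = {
  implicit val formats: Formats = DefaultFormats
  val json = parseJson(msg)
  ((json \ "id").extract[String], (json \ "value").extract[String])
}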
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.foreachPartition { partition =>
      partition.foreach { msg =>
        val message = parse(msg)
        println(message)
      }
    }
  }
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using Spark 2.0.0, Scala 2.11.8, and spark-streaming-kafka 0.8.
Here is a sample streaming app which converts each message in the batch to upper case inside foreachRDD and prints it. One thing worth checking first: println(message.collect()) in your first example prints the array's default toString rather than its elements; the sample below uses collect().foreach(println) to print each message individually. Try this sample app and then recheck your application. Hope this helps.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkKafkaStreaming {
  def main(args: Array[String]) {
    // Broker and topic
    val brokers = "localhost:9092"
    val topic = "myTopic"

    // Create context with 5 second batch interval
    val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create direct kafka stream with brokers and topics
    val topicsSet = Set[String](topic)
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    // Message
    val msg = msgStream.map(_._2)
    msg.print()

    // For each batch
    msg.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        println("-----Convert Message to UpperCase-----")
        // convert messages to upper case
        rdd.map { x => x.toUpperCase() }.collect().foreach(println)
      } else {
        println("No Message Received")
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Exception while accessing KafkaOffset from RDD

I have a Spark consumer which streams from Kafka.
I am trying to manage offsets for exactly-once semantics. However, while accessing the offsets, the following exception is thrown:
java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges
The part of the code that does this is below:
var offsetRanges = Array[OffsetRange]()
dataStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .foreachRDD(rdd => { })
Here dataStream is a direct stream (DStream[String]) created using the KafkaUtils API, something like:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+"_"+t)).map(_._2)
Can somebody help me understand what I am doing wrong here? transform is the first method in the chain of methods applied to dataStream, as recommended in the official documentation.
Thanks.
Your problem is:
.map(_._2)
This creates a MappedDStream, whose RDDs are MapPartitionsRDDs, instead of the DirectKafkaInputDStream created by KafkaUtils.createDirectStream; the mapped RDDs no longer implement HasOffsetRanges.
You need to map after transform:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema + "_" + t))
kafkaStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .map(_._2)
  .foreachRDD(rdd => { /* stuff */ })
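For completeness, a sketch of how the captured ranges might then be consumed inside foreachRDD (the println is illustrative only; persisting the offsets to your own store is omitted):
kafkaStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .map(_._2)
  .foreachRDD { rdd =>
    // process the batch first, then inspect the ranges captured in transform above
    offsetRanges.foreach { o =>
      println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
    }
  }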

kafka directstream dstream map does not print

I have this simple Kafka Stream
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Each Kafka message is a flight
val flights = messages.map(_._2)
flights.foreachRDD { rdd =>
  println("--- New RDD with " + rdd.partitions.length + " partitions and " + rdd.count() + " flight records")
  rdd.map { flight =>
    val flightRows = FlightParser.parse(flight)
    println("Parsed num rows: " + flightRows)
  }
}
ssc.start()
ssc.awaitTermination()
Kafka has messages, and Spark Streaming is able to get them as RDDs, but the second println in my code does not print anything. I looked at the driver console logs when running in local[2] mode, and checked the YARN logs when running in yarn-client mode.
What am I missing?
Instead of rdd.map, the following code prints fine in the Spark driver console:
for (flight <- rdd.collect().toArray) {
  val flightRows = FlightParser.parse(flight)
  println("Parsed num rows: " + flightRows)
}
But I'm afraid that processing of this flight object might then happen in the Spark driver process instead of on the executors. Please correct me if I'm wrong.
Thanks
rdd.map is a lazy transformation. It won't be materialized unless an action is called on that RDD.
In this specific case, we could use rdd.foreach, one of the most generic actions on an RDD, which gives us access to each element of the RDD.
flights.foreachRDD { rdd =>
  rdd.foreach { flight =>
    val flightRows = FlightParser.parse(flight)
    println("Parsed num rows: " + flightRows) // prints on the stdout of each executor independently
  }
}
Given that this RDD action is executed in the executors, we will find the println output in the executor's STDOUT.
If you would like to print the data on the driver instead, you can collect the data of the RDD within the DStream.foreachRDD closure (bear in mind that collect() brings the whole batch to the driver, so this is only suitable for small batches).
flights.foreachRDD { rdd =>
  val allFlights = rdd.collect()
  println(allFlights.mkString("\n")) // prints to the stdout of the driver
}