kafka directstream dstream map does not print - scala

I have this simple Kafka Stream
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

// Each Kafka message is a flight
val flights = messages.map(_._2)

flights.foreachRDD(rdd => {
  println("--- New RDD with " + rdd.partitions.length + " partitions and " + rdd.count() + " flight records")
  rdd.map { flight => {
      val flightRows = FlightParser.parse(flight)
      println("Parsed num rows: " + flightRows)
    }
  }
})

ssc.start()
ssc.awaitTermination()
Kafka has messages, and Spark Streaming is able to get them as RDDs. But the second println in my code does not print anything. I looked at the driver console logs when running in local[2] mode, and checked the YARN logs when running in yarn-client mode.
What am I missing?
Instead of rdd.map, the following code prints fine in the Spark driver console:
for (flight <- rdd.collect().toArray) {
  val flightRows = FlightParser.parse(flight)
  println("Parsed num rows: " + flightRows)
}
But I'm afraid that processing of this flight object might happen in the Spark driver process instead of on the executors. Please correct me if I'm wrong.
Thanks

rdd.map is a lazy transformation. It won't be materialized unless an action is called on that RDD.
In this specific case, we could use rdd.foreach which is one of the most generic actions on RDD, giving us access to each element in the RDD.
flights.foreachRDD { rdd =>
  rdd.foreach { flight =>
    val flightRows = FlightParser.parse(flight)
    println("Parsed num rows: " + flightRows) // prints on the stdout of each executor independently
  }
}
Given that this RDD action is executed in the executors, we will find the println output in the executor's STDOUT.
If you would like to print the data on the driver instead, you can collect the data of the RDD within the DStream.foreachRDD closure.
flights.foreachRDD { rdd =>
  val allFlights = rdd.collect()
  println(allFlights.mkString("\n")) // prints to the stdout of the driver
}
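If the batches can carry a lot of data, collecting the whole RDD to the driver may exhaust driver memory. A minimal sketch of a safer variant, assuming a sample of 10 records is enough for inspection:
flights.foreachRDD { rdd =>
  // bring at most 10 flight records back to the driver and print them there
  rdd.take(10).foreach(flight => println("Sample flight: " + flight))
}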

Related

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1)).reduceByKey(_ + _)
})
dStreamDst.print()
on which I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use transform or the print() method; use foreachRDD instead. It is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents the snapshot of data for that point in time, in your case the 10-second window. Your code is failing because in a particular window there was no data, so the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
})
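Note that if the rest of the transformation changes the element type (as the reduceByKey above does, producing (String, Int) pairs), returning the original rdd in the empty branch will not type-check. A sketch of one way to handle that, reusing lines and slice from the question and assuming the (String, Int) output type, is to return an empty RDD of the target type:
val dStreamDst = lines.transform { rdd =>
  if (rdd.isEmpty) {
    rdd.sparkContext.emptyRDD[(String, Int)] // nothing to aggregate in this batch
  } else {
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1))
      .reduceByKey(_ + _)
  }
}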

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map((x$2) => x$2._2)

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val message = rdd.map(parse)
    println(message.collect())
  }
}
The parse function simply extracts the relevant fields from the JSON message into a tuple.
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.foreachPartition { partition =>
      partition.foreach { msg =>
        val message = parse(msg)
        println(message)
      }
    }
  }
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using Spark 2.0.0, Scala 2.11.8 and spark-streaming-kafka 0.8.
Here is a sample streaming app which converts each message in the batch to upper case inside foreachRDD and prints it. Try this sample app and then recheck your application. Hope this helps.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkKafkaStreaming {
  def main(args: Array[String]) {
    // Broker and topic
    val brokers = "localhost:9092"
    val topic = "myTopic"

    // Create context with 5 second batch interval
    val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create direct kafka stream with brokers and topics
    val topicsSet = Set[String](topic)
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    // Message
    val msg = msgStream.map(_._2)
    msg.print()

    // For each batch
    msg.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        println("-----Convert Message to UpperCase-----")
        // convert messages to upper case
        rdd.map { x => x.toUpperCase() }.collect().foreach(println)
      } else {
        println("No Message Received")
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
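To stay at the RDD level in the original example, the same collect-then-print pattern can be applied to the parsed messages. A minimal sketch, reusing the messages stream and parse function from the question:
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // print every parsed message, rather than the Array's default toString
    rdd.map(parse).collect().foreach(println)
  }
}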

Spark streaming if(!rdd.partitions.isEmpty) not working

I'm trying to create a DStream from a Kafka server and then do some transformations on that stream. I have included a check for whether the stream is empty (if (!rdd.partitions.isEmpty)); however, even when no events are being published to the Kafka topic, the else branch is never reached.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {
    val message = rdd.map((x$2) => x$2._2).collect().toList.map(parser)
    val first = message(0)
  } else println("empty stream...")
}

ssc.start()
ssc.awaitTermination()
Is there an alternative statement I should use to check if the stream is empty when using KafkaUtils.createDirectStream rather than createStream?
Use RDD.isEmpty instead of RDD.partitions.isEmpty; isEmpty adds a check to see whether the underlying partitions actually contain any elements:
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Stuff
  }
}
The reason RDD.partitions.isEmpty isn't working is that the RDD does contain a partition, but that partition itself holds no elements. From the view of partitions, which is an Array[Partition], the array isn't empty.
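A quick way to see the difference, assuming a SparkContext named sc (e.g. in the spark-shell):
val empty = sc.parallelize(Seq.empty[String], 2) // RDD with 2 partitions but no elements
println(empty.partitions.isEmpty) // false: the partition array has 2 entries
println(empty.isEmpty)            // true: no partition contains any element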

KafkaRDD scala minimal example

I'm trying to get a running example using KafkaRDD:
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val offsetRanges = Array(
  OffsetRange("topic", 0, 0, 2)
)
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)
rdd.map(x => println(x)).collect()

res: Array[Unit] = Array((), ())
I have been careful to create "topic" with a single partition and to write two messages, hello and world.
I can get what looks like a correct RDD, but how can I access its content? Am I missing something?
Thanks, E.
The problem is this line, I believe:
rdd.map(x => println(x)).collect()
The way an RDD works, rdd.map runs on the executors. When you call println there, it prints to the stdout of each executor. To print to stdout in the driver application, try this instead:
rdd.collect().map(x => println(x))
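Since the side effect is all we want once the data is on the driver, foreach reads slightly more naturally than map; an equivalent sketch:
rdd.collect().foreach(println) // collect to the driver, then print each message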

Spark Streaming MQTT

I've been using Spark to stream data from Kafka, and it's pretty easy.
I thought using the MQTT utils would also be easy, but for some reason it is not.
I'm trying to execute the following piece of code.
val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))

val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])

MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    println("got rdd: " + rdd.toString())
    rdd.foreach { msg =>
      println("got msg: " + msg)
    }
  }

ssc.start()
ssc.awaitTermination()
The weird thing is that Spark logs the msg I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG o.a.s.s.receiver.BlockGenerator - Last element in input-0-1435790298600 is SOME MESSAGE
foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built-in print function on the DStream, or, instead of your foreachRDD, collect (or take) some of the elements back to the driver and print them there. Hope that helps, and best of luck with Spark Streaming :)
If you just wish to print incoming messages, try something like this instead of the foreach (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()
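And if the goal is to see a sample of the payloads on the driver while keeping the foreachRDD structure, here is a minimal sketch of the take-based approach mentioned above, assuming the same ssc, broker URL and topic as in the question:
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest").foreachRDD { rdd =>
  // bring at most 10 messages back to the driver and print them there
  rdd.take(10).foreach(msg => println("got msg: " + msg))
}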