Spark Streaming MQTT - Scala

I've been using Spark to stream data from Kafka, and that was pretty easy.
I thought using the MQTT utils would be just as easy, but for some reason it is not.
I'm trying to execute the following piece of code:
val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    println("got rdd: " + rdd.toString())
    rdd.foreach { msg =>
      println("got msg: " + msg)
    }
  }
ssc.start()
ssc.awaitTermination()
The weird thing is that Spark logs the message I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG
o.a.s.s.receiver.BlockGenerator - Last element in
input-0-1435790298600 is SOME MESSAGE

foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built-in print function on the DStream, or inside your foreachRDD collect (or take) some of the elements back to the driver and print them there. Hope that helps and best of luck with Spark Streaming :)
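For example, here is a minimal sketch of the take-and-print-on-the-driver variant (the take(10) limit is an arbitrary choice for illustration, not something from the original code):

MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    // take() pulls a small sample back to the driver, so this println
    // runs in the driver JVM rather than on the workers
    rdd.take(10).foreach(msg => println("got msg: " + msg))
  }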

If you wish to just print incoming messages, try something like this instead of the foreachRDD (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()

Related

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application, and I'm just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))
  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs makes me believe this is not the intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.
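Independent of how the job is launched, the timed busy-wait in the question could also be replaced with the StreamingContext's own timeout API; a minimal sketch, assuming the 5-second window only exists to let a few batches run:

ssc.start()
// Block for up to 5 seconds (returns early if the context stops on its own),
// then stop gracefully so any in-flight batches finish before shutdown.
ssc.awaitTerminationOrTimeout(5 * 1000)
ssc.stop(stopSparkContext = true, stopGracefully = true)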

Kafka producer.send() is Stopped by producer.close()

I am trying to send the output of a word-count problem (in Spark/Scala) to a Kafka topic named "test". See the code below:
val Dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = Dstream.map(f => f._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.foreachRDD(rdd =>
  rdd.foreach { f =>
    val sendProps = new Properties()
    sendProps.put("metadata.broker.list", brokers)
    sendProps.put("serializer.class", "kafka.serializer.StringEncoder")
    sendProps.put("producer.type", "async")

    val config = new ProducerConfig(sendProps)
    val producer = new Producer[String, String](config)
    producer.send(new KeyedMessage[String, String]("test", f._1 + " " + f._2))
    producer.close()
  })
The problem is that some words are randomly missing from the output. I also noticed that if I remove the statement
producer.close()
there is no data loss.
Does this mean producer.close() interrupts producer.send() before it actually puts the data in the buffer, so that the particular tuple never reaches the consumer? If yes, how should I close the producer without risking data loss?
The above was my initial problem, and Vale's answer solved it.
Now, when I change the producer.type property, data goes missing randomly again:
sendProps.put("producer.type", "sync")
To clarify: producer.send runs for every word I need to put into the output topic, but some words go missing and never show up in the output Kafka topic.
This is weird. The close() method should wait for the sends to finish, which is why a close(time) method was introduced, as you can see here.
So, I use Java 7. Is rdd.foreach operating on each partition inside it, or on each tuple (as I think it is)?
If the latter, could you try rdd.foreachPartition (refer to this)? You are creating a producer for each line you take, and I fear this could be causing problems (although theoretically it shouldn't).
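A rough sketch of that suggestion, reusing the legacy Producer/KeyedMessage API and the "test" topic from the question (everything else here is illustrative, not a confirmed fix):

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One producer per partition instead of one per record
    val sendProps = new Properties()
    sendProps.put("metadata.broker.list", brokers)
    sendProps.put("serializer.class", "kafka.serializer.StringEncoder")
    sendProps.put("producer.type", "sync")
    val producer = new Producer[String, String](new ProducerConfig(sendProps))

    partition.foreach { case (word, count) =>
      producer.send(new KeyedMessage[String, String]("test", word + " " + count))
    }

    // Close once, after every record in the partition has been sent
    producer.close()
  }
}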

kafka directstream dstream map does not print

I have this simple Kafka stream:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

// Each Kafka message is a flight
val flights = messages.map(_._2)

flights.foreachRDD(rdd => {
  println("--- New RDD with " + rdd.partitions.length + " partitions and " + rdd.count() + " flight records")
  rdd.map { flight =>
    val flightRows = FlightParser.parse(flight)
    println("Parsed num rows: " + flightRows)
  }
})

ssc.start()
ssc.awaitTermination()
Kafka has messages, and Spark Streaming is able to get them as RDDs. But the second println in my code does not print anything. I looked at the driver console logs when running in local[2] mode, and checked the YARN logs when running in yarn-client mode.
What am I missing?
Instead of rdd.map, the following code prints fine in the Spark driver console:
for (flight <- rdd.collect().toArray) {
  val flightRows = FlightParser.parse(flight)
  println("Parsed num rows: " + flightRows)
}
But I'm afraid that processing of this flight object might then happen in the Spark driver process instead of on the executors. Please correct me if I'm wrong.
Thanks
rdd.map is a lazy transformation: it won't be materialized unless an action is called on that RDD.
In this specific case, we could use rdd.foreach, one of the most generic actions on an RDD, which gives us access to each element of the RDD.
flights.foreachRDD { rdd =>
  rdd.foreach { flight =>
    val flightRows = FlightParser.parse(flight)
    println("Parsed num rows: " + flightRows) // prints on the stdout of each executor independently
  }
}
Given that this RDD action is executed in the executors, we will find the println output in the executor's STDOUT.
If you would like to print the data on the driver instead, you can collect the data of the RDD within the DStream.foreachRDD closure.
flights.foreachRDD { rdd =>
  val allFlights = rdd.collect()
  println(allFlights.mkString("\n")) // prints to the stdout of the driver
}

Pushing Spark Streaming RDDs to Neo4j - Scala

I need to establish a connection from Spark Streaming to the Neo4j graph database. The RDDs are of the form ((is,I),(am,Hello),(sam,happy),...). I need to create an edge between each pair of words in Neo4j.
In the Spark Streaming documentation I found
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
as the pattern for pushing data to an external database.
I am doing this in Scala, and I am a little confused about how to go about it. I found the AnormCypher and Neo4jScala wrappers. Can I use these to get the work done? If so, how? If not, are there any better alternatives?
Thank you all.
I did an experiment with AnormCypher, like this:
implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")

val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(FILE, 4).cache()

val count = logData
  .flatMap(_.split(" "))
  .map(w =>
    Cypher("CREATE(:Word {text:{text}})")
      .on("text" -> w).execute()
  )
  .filter(identity) // keep only the statements that executed successfully
  .count()
Neo4j 2.2.x has great concurrent write performance that you can use from Spark, so the more concurrent threads you have writing to Neo4j, the better. If you can batch statements into batches of 100 to 1000 per request, even better.
Take a look at MazeRunner (http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html), as it will give you some ideas.
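Putting the pieces together, here is a rough sketch of how the streaming word pairs might be written as edges with AnormCypher, following the documented foreachPartition pattern. The DStream name pairs, the NEXT relationship type, and the MERGE-based Cypher are illustrative assumptions, not something from the original post:

// Sketch only: assumes a DStream[(String, String)] named `pairs` and the
// AnormCypher setup shown in the experiment above.
pairs.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One REST connection per partition, mirroring the documented pattern
    implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
    partitionOfRecords.foreach { case (from, to) =>
      Cypher(
        "MERGE (a:Word {text:{from}}) " +
        "MERGE (b:Word {text:{to}}) " +
        "MERGE (a)-[:NEXT]->(b)"
      ).on("from" -> from, "to" -> to).execute()
    }
  }
}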

How to apply RDD functions on a DStream while writing code in Scala

I am trying to write some simple Spark code in Scala.
Here I am getting a DStream, and I am able to print it successfully. But when I try any kind of "foreach", "foreachRDD" or "transform" function on this DStream, my console just freezes during execution. I don't get any error; the console simply becomes unresponsive until I manually terminate the Eclipse console operation. I am attaching the code here. Kindly tell me what I am doing wrong.
My main objective is to apply RDD operations on the DStream, and as far as I know that means getting at the underlying RDDs with a "foreach", "foreachRDD" or "transform" function.
I have already achieved the same thing in Java, but in Scala I am having this problem.
Is anybody else facing the same issue? If not, then kindly help me out. Thanks.
Here is a sample of the code:
object KafkaStreaming {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args
    val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val splitLines: DStream[String] = lines.flatMap(_.split("\n"))

    val pairAlarm = splitLines.map { x =>
      // Some Code
      val alarmPair = new Tuple2(key, value)
      alarmPair
    }

    //pairAlarm.print

    pairAlarm.foreachRDD { x =>
      println("1 : " + x.first)
      x.collect // When execution reaches this part, it freezes
      println("2: " + x.first)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
I don't know if this is your problem, but I had a similar one: my program just stopped printing after several iterations. No exceptions or anything; it just stopped printing after 5-6 prints.
Changing this:
val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
to this:
val ssc = new StreamingContext("local[2]", "KafkaWordCount", Seconds(2))
solved the problem. Spark Streaming needs at least two threads to run locally (one for the receiver and one for processing), and the documentation examples are misleading in that they also use just local.
Hope this helps!
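As for the original goal of applying RDD operations to the DStream: once the context has enough threads, transform is usually the most direct way to do it. Here is a minimal sketch reusing the pairAlarm stream from the question; the distinct() call is just an arbitrary example of an RDD operation, not something from the original code:

// transform applies an RDD-to-RDD function to every batch and
// returns a new DStream built from the results.
val distinctAlarms = pairAlarm.transform { rdd =>
  rdd.distinct() // any RDD operation could go here
}
distinctAlarms.print()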