How to apply RDD function on DStream while writing code in scala - scala

I am trying to write a simple Spark code in Scala.
Here I am getting a DStream. I am successfully able to print this DStream. But when I am trying to do any kind of "foreach" ,"foreachRDD" or "transform" function on this DStream then during execution of my program my console is just getting freeze. Here I am not getting any error but the console is just becoming non-responsive until I manually terminate eclipse console operation. I am attaching the the code here. Kindly tell me what am I doing wrong.
My main objective is to apply RDD operations on DStream and in order to do it as per my knowledge I need to convert my DStream into RDD by using "foreach" ,"foreachRDD" or "transform" function.
I have already achieved same by using Java. But in scala I am having this problem.
Is anybody else facing the same issue? If not then kindly help me out. Thanks
I am Writing a sample code here
object KafkaStreaming {
def main(args: Array[String]) {
if (args.length < 4) {
System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
val topicpMap = topics.split(",").map((_,numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
val splitLines:DStream[String] = lines.flatMap(_.split("\n"))
val pairAlarm = splitLines.map(
x=>{
//Some Code
val alarmPair = new Tuple2(key, value)
alarmPair
}
)
//pairAlarm.print
pairAlarm.foreachRDD(x=>{
println("1 : "+x.first)
x.collect // When the execution reaching this part its getting freeze
println("2: "+x.first)
})
ssc.start()
ssc.awaitTermination()
}
}

I don't know if this is your problem, but I had a similar one. My program just stopped printing after several iterations. No exceptions etc. just stops printing after 5-6 prints.
Changing this:
val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
to this:
val ssc = new StreamingContext("local[2]", "KafkaWordCount", Seconds(2))
solved the problem. Spark requires at least 2 threads to run and the documentation examples are misleading as they use just local as well.
Hope this helps!

Related

Spark Streaming 1.6 + Kafka: Too many batches in "queued" status

I'm using spark streaming to consume messages from a Kafka topic, which has 10 partitions. I'm using direct approach to consume from kafka and the code can be found below:
def createStreamingContext(conf: Conf): StreamingContext = {
val dateFormat = conf.dateFormat.apply
val hiveTable = conf.tableName.apply
val sparkConf = new SparkConf()
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.driver.allowMultipleContexts", "true")
val sc = SparkContextBuilder.build(Some(sparkConf))
val ssc = new StreamingContext(sc, Seconds(conf.batchInterval.apply))
val kafkaParams = Map[String, String](
"bootstrap.servers" -> conf.kafkaBrokers.apply,
"key.deserializer" -> classOf[StringDeserializer].getName,
"value.deserializer" -> classOf[StringDeserializer].getName,
"auto.offset.reset" -> "smallest",
"enable.auto.commit" -> "false"
)
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
conf.topics.apply().split(",").toSet[String]
)
val windowedKafkaStream = directKafkaStream.window(Seconds(conf.windowDuration.apply))
ssc.checkpoint(conf.sparkCheckpointDir.apply)
val eirRDD: DStream[Row] = windowedKafkaStream.map { kv =>
val fields: Array[String] = kv._2.split(",")
createDomainObject(fields, dateFormat)
}
eirRDD.foreachRDD { rdd =>
val schema = SchemaBuilder.build()
val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)
eirDF
.select(schema.map(c => col(c.name)): _*)
.write
.mode(SaveMode.Append)
.partitionBy("year", "month", "day")
.insertInto(hiveTable)
}
ssc
}
As it can be seen from the code, I used window to achieve this (and please correct me if I'm wrong): Since there's an action to insert into a hive table, I want to avoid writing to HDFS too often, so what I want is to hold enough data in memory and only then write to the filesystem. I thought that using window would be the right way to achieve it.
Now, in the image below, you can see that there are many batches being queued and the batch being processed, takes forever to complete.
I'm also providing the details of the single batch being processed:
Why are there so many tasks for the insert action, when there aren't many events in the batch? Sometimes having 0 events also generates thousands of tasks that take forever to complete.
Is the way I process microbatches with Spark wrong?
Thanks for your help!
Some extra details:
Yarn containers have a max of 2gb.
In this Yarn queue, the maximum number of containers is 10.
When I look at details of the queue where this spark application is being executed, the number of containers is extremely large, around 15k pending containers.
Well, I finally figured it out. Apparently Spark Streaming does not get along with empty events, so inside the foreachRDD portion of the code, I added the following:
eirRDD.foreachRDD { rdd =>
if (rdd.take(1).length != 0) {
//do action
}
}
That way we skip empty micro-batches. the isEmpty() method does not work.
Hope this help somebody else! ;)

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark-Streaming application, I'm just trying to get a simple example of a Kafka Direct Stream working:
package com.username
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object MyApp extends App {
val topic = args(0) // 1 topic
val brokers = args(1) //localhost:9092
val spark = SparkSession.builder().master("local[2]").getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(1))
val topicSet = topic.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
// Just print out the data within the topic
val parsers = directKafkaStream.map(v => v)
parsers.print()
ssc.start()
val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
while(System.currentTimeMillis() < endTime){
//write something to the topic
Thread.sleep(1000) // 1 second pause between iterations
}
ssc.stop()
}
This mostly works, whatever I write into the kafka topic, it gets included into the streaming batch and gets printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail nor exit though. I know I could wrap ssc.stop() into a try/catch block to suppress it, but looking into the API docs has me believe that this is not its intended behavior. I've been looking around online for a solution but nothing involving Spark has mentioned this exception, is there anyway for me to properly fix this?
I encountered the same problem with starting the process directly with sbt run. But if I packaged the project and start with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this would help.

Structured Streaming - Foreach Sink

I am basically reading from a Kafka source, and dumping each message through to my foreach processor (Thanks Jacek's page for the simple example).
If this actually works, i shall actually perform some business logic in the process method here, however, this doesn't work. I believe that the println doesn't work since its running on executors and there is no way for getting those logs back to driver. However, this insert into a temp table should at least work and show me that the messages are actually consumed and processed through to the sink.
What am I missing here ?
Really looking for a second set of eyes to check my effort here:
val stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
.option("subscribe", src_topic)
.load()
val rec = stream.selectExpr("CAST(value AS STRING) as txnJson").as[(String)]
val df = stream.selectExpr("cast (value as string) as json")
val writer = new ForeachWriter[Row] {
val scon = new SConConnection
override def open(partitionId: Long, version: Long) = {
true
}
override def process(value: Row) = {
println("++++++++++++++++++++++++++++++++++++" + value.get(0))
scon.executeUpdate("insert into rs_kafka10(miscCol) values("+value.get(0)+")")
}
override def close(errorOrNull: Throwable) = {
scon.closeConnection
}
}
val yy = df.writeStream
.queryName("ForEachQuery")
.foreach(writer)
.outputMode("append")
.start()
yy.awaitTermination()
Thanks for comments from Harald and others, I found out a couple of things, which led me to achieve normal processing behaviour -
test code with local mode, yarn isnt the biggest help in debugging
for some reason, the process method of foreach sink doesnt allow calling other methods. When i put my business logic directly in there, it works.
hope it helps others.

Spark Streaming with Kafka

I want to do simple machine learning in Spark.
First the application should do some learning from historical data from a file, train the machine learning model and then read input from kafka to give predictions in real time. To do that I believe I should use spark streaming. However, I'm afraid that I don't really understand how spark streaming works.
The code looks like this:
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("test App")
val sc = new SparkContext(conf)
val fromFile = parse(sc, Source.fromFile("my_data_.csv").getLines.toArray)
ML.train(fromFile)
real_time(sc)
}
Where ML is a class with some machine learning things in it and train gives it data to train. There also is a method classify which calculates predictions based on what it learned.
The first part seems to work fine, but real_time is a problem:
def real_time(sc: SparkContext) : Unit = {
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val topic = "my_topic".split(",").toSet
val params = Map[String, String](("metadata.broker.list", "localhost:9092"))
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)
var lin = dstream.map(_._2)
val str_arr = new Array[String](0)
lin.foreach {
str_arr :+ _.collect()
}
val lines = parse(sc, str_arr).map(i => i.features)
ML.classify(lines)
ssc.start()
ssc.awaitTermination()
}
What I would like it to do is check the Kafka stream and compute it if there are any new lines. This doesn't seem to be the case, I added some prints and it is not printed.
How to use spark streaming, how should it be used in my case?

Spark Streaming MQTT

I've been using spark to stream data from kafka and it's pretty easy.
I thought using the MQTT utils would also be easy, but it is not for some reason.
I'm trying to execute the following piece of code.
val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
.foreachRDD { rdd =>
println("got rdd: " + rdd.toString())
rdd.foreach { msg =>
println("got msg: " + msg)
}
}
ssc.start()
ssc.awaitTermination()
The weird thing is that spark logs the msg I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG
o.a.s.s.receiver.BlockGenerator - Last element in
input-0-1435790298600 is SOME MESSAGE
foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built in print function on the DStream or instead of your foreachRDD collect (or take) some of the elements back to the driver and print them there. Hope that helps and best of luck with Spark Streaming :)
If you wish to just print incoming messages, try something like this instead of the for_each (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()