Spark streaming: How to write cumulative output? - scala

I have to write a single output file for my streaming job.
Question: when will my job actually stop? I killed the server, but the job did not stop. I want to stop my job from the command line (if that is possible).
Code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MAYUR_BELDAR_PROGRAM5_V1 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", args(0).toInt)
    val words = lines.flatMap(_.split(" "))

    // Split words into four classes by parity of the first character and of the length
    val class1 = words.filter(a => a.charAt(0).toInt % 2 == 0).filter(a => a.length % 2 == 0)
    val class2 = words.filter(a => a.charAt(0).toInt % 2 == 0).filter(a => a.length % 2 == 1)
    val class3 = words.filter(a => a.charAt(0).toInt % 2 == 1).filter(a => a.length % 2 == 0)
    val class4 = words.filter(a => a.charAt(0).toInt % 2 == 1).filter(a => a.length % 2 == 1)

    class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1", "txt")
    class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
    class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3", "txt")
    class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4", "txt")

    ssc.start()            // Start the computation
    ssc.awaitTermination() // Block until the streaming context is stopped
  }
}

A stream by definition does not have an end, so it will not stop unless you call the method that stops it. In my case I have a business condition that tells me when the process is finished, so when I reach that point I call JavaStreamingContext.close(). I also have a monitor that checks whether the process has received any data in the past few minutes; if not, it also closes the stream.
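For reference, with the Scala API used in the question, the corresponding stop call is a one-liner (a minimal sketch; whether to stop gracefully and whether to also stop the SparkContext is up to you):

// Stop the streaming context once your stop condition is met,
// letting already-received data finish processing first.
ssc.stop(stopSparkContext = true, stopGracefully = true)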
In order to accumulate data across batches you have to use updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
I checked the Spark code and found that saveAsTextFiles uses foreachRDD, so in the end it saves each RDD separately and previous RDDs are not taken into account. With updateStateByKey it will still write multiple files, but each file reflects all the RDDs processed before it.
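A minimal sketch of that approach, assuming a word-count style pair stream; the checkpoint directory, output path and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("CumulativeWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs://hadoop1:9000/mbeldar/checkpoint") // required for stateful operations

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Carry a running count per word across batches, so every saved file
// reflects all data seen so far, not just the current batch.
val cumulativeCounts = words.map(w => (w, 1L)).updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) => Some(newValues.sum + state.getOrElse(0L))
}

cumulativeCounts.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/cumulative", "txt")

ssc.start()
ssc.awaitTermination()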

Related

How to check if the batches are empty in Spark streaming (wordcount with socketTextStream)

I am working on a simple Spark Streaming word-count example, counting the number of words in text data received from a data server listening on a TCP socket.
I would like to check whether the batch from the streaming source is empty before I save the content of each transformation to text files. Currently I am using the Spark shell. This is my code.
I have tried this code, and it works fine without checking whether the batch is empty:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
ssc.start()
I have tried to use foreachRDD, but it gives me the error: value saveAsTextFiles is not a member of org.apache.spark.rdd.RDD[String]
This is my code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    lines.saveAsTextFiles("/stream_test/testLine.txt")
    val words = lines.flatMap(_.split(" "))
    words.saveAsTextFiles("/stream_test/testWords.txt")
    val pairs = words.map((_, 1))
    pairs.saveAsTextFiles("/stream_test/testPairs.txt")
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
    wordCounts.print()
  }
})
ssc.start()
I need to check whether the batch from the streaming source is empty before I save the content to text files. I appreciate your help.
I used to do it with the following code: loop over each RDD in the stream and use rdd.count() to judge whether it is empty. If all RDDs are empty, nothing happens. Hope it can help you.
kafkaStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    // do something
  }
}
You can try the below code snippet to check whether your streaming batches are empty:
if (!rdd.partitions.isEmpty)
  rdd.saveAsTextFile(outputDir)
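Putting the answers together with the original shell code, a minimal sketch (assuming the same localhost:9999 source and the /stream_test output prefix from the question) keeps the transformations on the DStream and only guards the save inside foreachRDD:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// foreachRDD hands you one RDD per batch; only write it out when it has data.
wordCounts.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"/stream_test/testWordsCounts-${time.milliseconds}")
  }
}

ssc.start()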

How to make it pure?

I have the following Scala code:
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ConsumerMessage.CommittableOffsetBatch
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

object TestConsumer {
  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("KafkaConsumer")
    implicit val materializer = ActorMaterializer()

    val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("group1")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val result = Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .mapAsync(2)(rec => Future.successful(rec.record.value()))
      .runWith(Sink.foreach(ele => {
        print(ele)
        system.terminate()
      }))
  }
}
As you can see, the application consumes messages from Kafka and prints them to the shell.
runWith is not pure: it produces side effects, printing the received message and shutting down the actor system.
The question is, how do I make this pure with cats IO effects? Is it possible?
You don't need cats IO to make it pure. Note that your sink is already pure, because it's just the value that describes what will happen when it's used (in this case using means "connecting to the Source and running the stream").
val sink: Sink[String, Future[Done]] = Sink.foreach(ele => {
  print(ele)
  // system.terminate() // PROBLEM: terminating the system before the stream completes!
})
The problem you described has nothing to do with purity. The problem is that the sink above closes over the value of system, and then tries to terminate it when processing each element of the source.
Terminating the system means that you are destroying the whole runtime environment (used by ActorMaterializer) that is used to run the stream. This should only be done when your stream completes.
val result: Future[Done] = Consumer
  .committableSource(consumerSettings, Subscriptions.topics("test"))
  .mapAsync(2)(rec => Future.successful(rec.record.value()))
  .runWith(sink)

import system.dispatcher // ExecutionContext for onComplete
result.onComplete(_ => system.terminate())
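If you do want to push the effects all the way to the edge with cats-effect, a minimal sketch (assuming cats-effect 2.x on the classpath, and reusing system, materializer and consumerSettings from the question) could suspend the whole stream run in IO:

import cats.effect.{ContextShift, IO}

implicit val cs: ContextShift[IO] = IO.contextShift(system.dispatcher)

// Nothing happens when this value is constructed; connecting, printing and
// shutting down are all deferred until the IO is actually run.
val program: IO[Done] =
  IO.fromFuture(IO {
    Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .mapAsync(2)(rec => Future.successful(rec.record.value()))
      .runWith(Sink.foreach[String](print))
  }).guarantee(IO.fromFuture(IO(system.terminate())).map(_ => ()))

// program.unsafeRunSync() // the single place where side effects are performed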

Count calls of UDF in Spark

Using Spark 1.6.1, I want to count the number of times a UDF is called. I want to do this because I have a very expensive UDF (~1 sec per call) and I suspect the UDF is being called more often than there are records in my dataframe, making my Spark job slower than necessary.
Although I could not reproduce this situation, I came up with a simple example showing that the number of calls to the UDF seems to be different (here: less) than the number of rows. How can that be?
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("Demo")
  val sc = new SparkContext(conf)
  sc.setLogLevel("WARN")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val callCounter = sc.accumulator(0)

  val df = sc.parallelize(1 to 10000, numSlices = 100).toDF("value")
  println(df.count) // gives 10000

  val myudf = udf((d: Int) => { callCounter.add(1); d })
  val res = df.withColumn("result", myudf($"value")).cache

  println(res.select($"result").collect().size) // gives 10000
  println(callCounter.value) // gives 9941
}
If using an accumulator is not the right way to count the calls to the UDF, how else could I do it?
Note: In my actual Spark job, I get a call count that is about 1.7 times higher than the actual number of records.
Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // [...]
  }
}
This should solve your problem.

How to write Iterator[String] result from mapPartitions into one file?

I am new to Spark and Scala, which is why I am having quite a hard time getting through this.
What I intend to do is to pre-process my data with Stanford CoreNLP using Spark. I understand that I have to use mapPartitions in order to have one StanfordCoreNLP instance per partition, as suggested in this thread. However, I lack the knowledge/understanding of how to proceed from here.
In the end I want to train word vectors on this data, but for now I would be happy to find out how I can get my processed data out of here and write it into another file.
This is what I got so far:
import java.util.Properties
import com.google.gson.Gson
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap
import masterthesis.code.wordvectors.Review
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

object ReviewPreprocessing {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("ReviewPreprocessing"))
    val resourceUrl = getClass.getResource("amazon-reviews/reviews_Electronics.json")
    val file = sc.textFile(resourceUrl.getPath)

    val linesPerPartition = file.mapPartitions { lineIterator =>
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      // One CoreNLP pipeline instance per partition
      val pipeline = new StanfordCoreNLP(props)
      val gson = new Gson()
      // mutable buffer instead of an immutable List (List has no add method)
      val sentencesAsTextList = ListBuffer[String]()

      while (lineIterator.hasNext) {
        val line = lineIterator.next
        val review = gson.fromJson(line, classOf[Review])
        val doc = new Annotation(review.getReviewText)
        pipeline.annotate(doc)

        val sentences: java.util.List[CoreMap] = doc.get(classOf[SentencesAnnotation])
        val sb = new StringBuilder()
        sentences.foreach { sentence =>
          val tokens = sentence.get(classOf[TokensAnnotation])
          tokens.foreach { token =>
            sb.append(token.get(classOf[LemmaAnnotation]))
            sb.append(" ")
          }
        }
        sb.setLength(sb.length - 1)
        sentencesAsTextList += sb.toString
      }
      sentencesAsTextList.iterator
    }

    System.exit(0)
  }
}
How would I, for example, write this result into one single file? The ordering does not matter here; I guess the ordering is lost at this point anyway.
If you used saveAsTextFile directly on your RDD, you would end up with as many output files as you have partitions. To get just one, you can either coalesce everything into a single partition, like
sc.textFile("/path/to/file")
  .mapPartitions(someFunc())
  .coalesce(1)
  .saveAsTextFile("/path/to/another/file")
Or (just for fun) you could fetch all partitions to the driver one by one and save all the data yourself:
val it = sc.textFile("/path/to/file")
  .mapPartitions(someFunc())
  .toLocalIterator

while (it.hasNext) {
  writeToFile(it.next())
}
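The writeToFile helper above is not defined anywhere; its name and the output path are placeholders, but a minimal sketch using plain java.io could look like this:

import java.io.{BufferedWriter, File, FileWriter}

// Hypothetical helper: appends each record as one line to a single local file.
val writer = new BufferedWriter(new FileWriter(new File("/path/to/another/file"), true))

def writeToFile(line: String): Unit = {
  writer.write(line)
  writer.newLine()
}

// Remember to close the writer once the iterator is exhausted:
// writer.close()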

Is my code implicitly concurrent?

I have an implementation of WordCount that I submit on an apache-spark cluster.
I was wondering, if tasks are launched on executors that have two cores, will they run concurrently on those two cores?
I've seen this question, but I'm not sure whether or not I can apply the answer to my case.
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val spark = new SparkContext(conf)

    val filename = if (args(0).length > 0) args(0) else "hdfs://x.x.x.x:60070/tortue/wordcount"
    val textFile = spark.textFile(filename)

    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://x.x.x.x:60070/tortue/wcresults")
    spark.stop()
  }
}
It depends on how many cores Spark is configured to use on the executors; spark.executor.cores is the parameter, and it is documented at http://spark.apache.org/docs/latest/configuration.html.
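For example, the setting can be passed on submit or in the SparkConf; the value 2 and the jar name below are only illustrative:

// On submit:
//   spark-submit --executor-cores 2 --class WordCount wordcount.jar ...

// Or in code, before creating the SparkContext:
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.executor.cores", "2")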