Spark Streaming word count is not printing the results - Scala

I am trying to run this Spark Streaming app. The code is from a book I am reading, but I am not getting the expected results. There is a Java class that opens a socket and waits for input. I run the socket code first, then submit the following job, and I get a message that the connection succeeded. When I type something into the socket, I expect a word count to be printed in the terminal. Instead, I only get this message:
INFO BlockManagerInfo: Added input-0-1480077969600 in memory on 192.168.1.4:38818 (size: 7.0 B, free: 265.1 MB)
Where is the problem? See the code below. Thanks in advance.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.dstream.ForEachDStream

object ScalaFirstStreamingExample {
  def main(args: Array[String]){
    println("Creating Spark Configuration")
    //Create an Object of Spark Configuration
    val conf = new SparkConf()
    //Set the logical and user defined Name of this Application
    conf.setAppName("My First Spark Streaming Application")
    println("Retreiving Streaming Context from Spark Conf")
    //Retrieving Streaming Context from SparkConf Object.
    //Second parameter is the time interval at which
    //streaming data will be divided into batches
    val streamCtx = new StreamingContext(conf, Seconds(2))
    //Define the type of Stream. Here we are using TCP
    //Socket as textstream,
    //It will keep watching for the incoming data from a
    //specific machine (localhost) and port (9087)
    //Once the data is retrieved it will be saved in the
    //memory and in case memory
    //is not sufficient, then it will store it on the Disk
    //It will further read the Data and convert it into DStream
    val lines = streamCtx.socketTextStream("localhost", 9087, MEMORY_AND_DISK_SER_2)
    //Apply the Split() function to all elements of DStream
    //which will further generate multiple new records from
    //each record in Source Stream
    //And then use flatmap to consolidate all records and
    //create a new DStream.
    val words = lines.flatMap(x => x.split(" "))
    //Now, we will count these words by applying a using map()
    //map() helps in applying a given function to each
    //element in an RDD.
    val pairs = words.map(word => (word, 1))
    //Further we will aggregate the value of each key by
    //using/applying the given function.
    val wordCounts = pairs.reduceByKey(_ + _)
    //Lastly we will print all Values
    //wordCounts.print(20)
    myPrint(wordCounts,streamCtx)
    //Most important statement which will initiate the
    //Streaming Context
    streamCtx.start();
    //Wait till the execution is completed.
    streamCtx.awaitTermination();
  }

  def myPrint(stream:DStream[(String,Int)],streamCtx: StreamingContext){
    stream.foreachRDD(foreachFunc)
    def foreachFunc = (rdd: RDD[(String,Int)]) => {
      val array = rdd.collect()
      println("---------Start Printing Results----------")
      for(res<-array){
        println(res)
      }
      println("---------Finished Printing Results----------")
    }
  }
}

Related

Why does foreachRDD not populate DataFrame with new content using StreamingContext.textFileStream?

My problem is that when I change my code to streaming mode and build my data frame inside the foreachRDD loop, the data frame shows an empty table. It doesn't get filled. I also cannot pass it to assembler.transform(). The error is:
Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U].
Unspecified value parameter mapFunc.
val dataFrame = Train_DStream.map()
My train.csv file is like below:
Please help me.
Here is my code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/22/17.
*/
object ML_Test {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
// Create the context
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("column0", StringType, true),
StructField("column1", StringType, true),
StructField("column2", StringType, true)))
//val Test_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv").map(LabeledPoint.parse)
val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
val DStream =Train_DStream.map(line => line.split(">")).map(array => {
val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
Row.fromSeq(Seq(first, second, third))
})
DStream.foreachRDD { Test_DStream =>
val dataFrame = sqlContext.createDataFrame(Test_DStream, customSchema)
dataFrame.groupBy("column1", "column2").count().show()
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
val featureCol = Array("column1", "column2")
val assembler=new VectorAssembler().setInputCols(featureCol).setOutputCol("features")
dataFrame.show()
val df_new=assembler.transform(dataFrame)
}
ssc.start()
ssc.awaitTermination()
}
}
My guess is that all the files under the /Users/saeedtkh/Desktop/sharedsaeed/train.csv directory have already been processed, so there are no files left and hence the DataFrame is empty.
Please note that the sole input parameter for StreamingContext.textFileStream is a directory not a file.
textFileStream(directory: String): DStream[String] Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
Please also note that once a file has been processed in a Spark Streaming application, it should not be changed (or appended to), since the file has already been marked as processed and Spark Streaming will ignore any modifications.
Quoting the official documentation of Spark Streaming in Basic Sources:
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
Note that
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
For simple text files, there is an easier method streamingContext.textFileStream(dataDirectory). And file streams do not require running a receiver, hence does not require allocating cores.
Please also replace setMaster("local") with setMaster("local[*]") to make sure your Spark Streaming application will have enough threads to process incoming data (you have to have at least 2 threads).
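The requirement quoted above, that files be moved or renamed atomically into the monitored directory, could be met on a local filesystem roughly as follows. This is only a sketch: the object name AtomicDrop, the helper dropFile, the file name and the temporary directories are all made up for illustration, and it assumes the staging and watched directories are on the same filesystem (otherwise an atomic move is generally not possible). On HDFS you would use the Hadoop filesystem's rename instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicDrop {
  // Write the content to a staging file first, then move it into the watched
  // directory in a single step, so a reader never sees a half-written file.
  // Both directories are assumed to live on the same filesystem.
  def dropFile(stagingDir: Path, watchedDir: Path, name: String, content: String): Path = {
    val staged = stagingDir.resolve(name + ".tmp")
    Files.write(staged, content.getBytes(StandardCharsets.UTF_8))
    Files.move(staged, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical directories; in the question, the watched directory would be
    // the one passed to ssc.textFileStream(...).
    val staging = Files.createTempDirectory("staging")
    val watched = Files.createTempDirectory("watched")
    val target = dropFile(staging, watched, "train-0001.csv", "a > b > c\n")
    println(s"Moved into watched directory: $target")
  }
}
```

Each new batch of data should go in under a new file name while the streaming application is running, since files that were already picked up are not read again.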

Spark streaming: How to write cumulative output?

I have to write a single output file for my streaming job.
Question: when will my job actually stop? I killed the server, but the job kept running.
I want to stop my job from the command line (if that is possible).
Code:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
object MAYUR_BELDAR_PROGRAM5_V1 {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", args(0).toInt)
val words = lines.flatMap(_.split(" "))
val class1 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==0)
val class2 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==1)
val class3 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==0)
val class4 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==1)
class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1","txt")
class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3","txt")
class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4","txt")
ssc.start() // Start the computation
ssc.awaitTermination()
ssc.stop()
}
}
A stream by definition does not have an end, so it will not stop unless you call the method to stop it. In my case I have a business condition that tells me when the process is finished, so when I reach that point I call JavaStreamingContext.close(). I also have a monitor that checks whether the process has received any data in the past few minutes; if it has not, the monitor closes the stream as well.
In order to accumulate data you have to use the method updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
I have checked the Spark code and found that saveAsTextFiles uses foreachRDD, so at the end it will save each RDD separately, so previous RDDs will not be taken into account. Using updateStateByKey it will still save multiple files, but each file will consider all RDDs that were processed before.
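As a rough illustration of the updateStateByKey approach, here is a minimal sketch that keeps a running count per word. It is not the asker's classification logic: the object name CumulativeWordCount, the function updateTotal, the checkpoint directory and the output prefix are all assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CumulativeWordCount {
  // Merge the counts seen for a key in the current batch into its running total.
  def updateTotal(newCounts: Seq[Int], runningTotal: Option[Int]): Option[Int] =
    Some(runningTotal.getOrElse(0) + newCounts.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CumulativeWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    // updateStateByKey requires checkpointing; this path is a placeholder.
    ssc.checkpoint("/tmp/cumulative-checkpoint")

    val lines = ssc.socketTextStream("localhost", args(0).toInt)
    val totals = lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .updateStateByKey(updateTotal _)

    // Still one output directory per batch, but each one holds the totals so far.
    totals.saveAsTextFiles("/tmp/cumulative-output/totals", "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this shape, the most recent output directory contains the cumulative result, so a downstream step only has to read the latest one.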

How do i pass Spark context to a function from foreach

I need to pass the SparkContext to my function. Please suggest how to do that in the scenario below.
I have a sequence in which each element refers to a specific data source; we get an RDD from each source and process it. I have defined a function that takes the Spark context and the data source and does the necessary work. I am currently using a while loop, but I would like to do it with foreach or map so that the processing runs in parallel. The function needs the Spark context, but how can I pass it from within the foreach?
This is just sample code, as I cannot share the actual code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object RoughWork {
def main(args: Array[String]) {
val str = "Hello,hw:How,sr:are,ws:You,re";
val conf = new SparkConf
conf.setMaster("local");
conf.setAppName("app1");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val rdd = sc.parallelize(str.split(":"))
rdd.map(x => {println("==>"+x);passTest(sc, x)}).collect();
}
def passTest(context: SparkContext, input: String) {
val rdd1 = context.parallelize(input.split(","));
rdd1.foreach(println)
}
}
You cannot pass the SparkContext around like that. passTest will be run on an/the executor(s), while the SparkContext runs on the driver.
If I had to do a double split like that, one approach would be to use flatMap:
rdd
  .zipWithIndex
  .flatMap(l => {
    val parts = l._1.split(",")
    List.fill(parts.length)(l._2) zip parts
  })
  .countByKey
There may be prettier ways, but the basic idea is to use zipWithIndex to keep track of which line an item came from, and then use the key-value pair RDD methods to work on your data.
If you have more than one key, or just more structured data in general, you can look into using Spark SQL with DataFrames (or Datasets in the latest versions) and explode instead of flatMap.
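To make the zipWithIndex idea concrete, here is a small sketch that runs on the sample string from the question in local mode. The object name DoubleSplit and the helper tagParts are made up for the example. The only SparkContext in play stays on the driver, and the function called inside flatMap is plain Scala.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DoubleSplit {
  // Tag every comma-separated part with the index of the record it came from.
  // This is ordinary Scala, so it can safely run on the executors.
  def tagParts(record: String, index: Long): Seq[(Long, String)] =
    record.split(",").toSeq.map(part => (index, part))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DoubleSplit")
    val sc = new SparkContext(conf)

    val str = "Hello,hw:How,sr:are,ws:You,re"
    val tagged = sc.parallelize(str.split(":").toSeq)
      .zipWithIndex
      .flatMap { case (record, index) => tagParts(record, index) }

    // Each element is (index of the source record, one part of that record).
    tagged.collect().foreach(println)
    // Number of parts per source record, returned to the driver as a Map.
    println(tagged.countByKey())

    sc.stop()
  }
}
```

The second split happens inside a single distributed job, so there is no need to create nested RDDs or to ship the context to the executors.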

Spark scala running

Hi, I am new to Spark and Scala. I am running Scala code at the Spark shell prompt. The program seems fine: it shows "defined module MLlib", but it's not printing anything on screen. What have I done wrong? Is there another way to run this program in the Spark Scala shell and get the output?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
object MLlib {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName(s"Book example: Scala")
val sc = new SparkContext(conf)
// Load 2 types of emails from text files: spam and ham (non-spam).
// Each line has text from one email.
val spam = sc.textFile("/home/training/Spam.txt")
val ham = sc.textFile("/home/training/Ham.txt")
// Create a HashingTF instance to map email text to vectors of 100 features.
val tf = new HashingTF(numFeatures = 100)
// Each email is split into words, and each word is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples ++ negativeExamples
trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
// Create a Logistic Regression learner which uses the SGD optimizer.
val lrLearner = new LogisticRegressionWithSGD()
// Run the actual learning algorithm on the training data.
val model = lrLearner.run(trainingData)
// Test on a positive example (spam) and a negative one (ham).
// First apply the same HashingTF feature transformation used on the training data.
val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
// Now use the learned model to predict spam/ham for new emails.
println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
println(s"Prediction for negative test example: ${model.predict(negTestExample)}")
sc.stop()
}
}
A couple of things:
You defined your object in the Spark shell, so its main method won't get called automatically. You'll have to call it explicitly after you define the object:
MLlib.main(Array())
In fact, if you continue to work on the shell/REPL you can do away with the object altogether; you can define the function directly. For example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
def MLlib {
//the rest of your code
}
However, you shouldn't initialize a SparkContext within the shell. From the documentation:
In the Spark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work
So you have to either remove that bit from your code, or compile it into a jar and run it using spark-submit.
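For the first option, one possible shape is a function that takes the shell's existing sc as a parameter instead of creating its own context. This is a sketch meant to be pasted into spark-shell (for example with :paste). The names runSpamExample and tokenize are made up, the paths are the ones from the question, and it assumes the same Spark 1.x MLlib API that the question uses.

```scala
// Paste into spark-shell; `sc` below is the context the shell already created.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Split an email into words, as in the original program.
def tokenize(email: String): Seq[String] = email.split(" ").toSeq

def runSpamExample(sc: SparkContext, spamPath: String, hamPath: String): Unit = {
  val tf = new HashingTF(numFeatures = 100)
  // Label spam as 1 and ham as 0, using the same feature hashing for both.
  val spam = sc.textFile(spamPath).map(email => LabeledPoint(1, tf.transform(tokenize(email))))
  val ham = sc.textFile(hamPath).map(email => LabeledPoint(0, tf.transform(tokenize(email))))
  val trainingData = (spam ++ ham).cache()
  val model = new LogisticRegressionWithSGD().run(trainingData)

  val posTestExample = tf.transform(tokenize("O M G GET cheap stuff by sending money to ..."))
  println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
  // No sc.stop() here: the shell owns the context.
}

// Call it explicitly, passing the shell's own context.
runSpamExample(sc, "/home/training/Spam.txt", "/home/training/Ham.txt")
```

Leaving out sc.stop() matters in the shell, because stopping the shared context would break every later command in the same session.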

Find size of data stored in rdd from a text file in apache spark

I am new to Apache Spark (version 1.4.1). I wrote a small program that reads a text file and stores its data in an RDD.
Is there a way to get the size of the data in the RDD?
This is my code :
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row
object RddSize {
def main(args: Array[String]) {
val sc = new SparkContext("local", "data size")
val FILE_LOCATION = "src/main/resources/employees.csv"
val peopleRdd = sc.textFile(FILE_LOCATION)
val newRdd = peopleRdd.filter(str => str.contains(",M,"))
//Here I want to find whats the size remaining data
}
}
I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).
There are multiple ways to get the RDD size.
1. Add a Spark listener to your Spark context (sc here stands for your SparkContext):
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    // rddInfos describes the RDDs involved in the completed stage
    stageCompleted.stageInfo.rddInfos.foreach { info =>
      println("rdd memSize " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    }
  }
})
2. Save your RDD as a text file:
myRDD.saveAsTextFile("person.txt")
and call the Apache Spark REST API:
/applications/[app-id]/stages
3. You can also try SizeEstimator:
val rddSize = SizeEstimator.estimate(myRDD)
Be aware that this measures the RDD object itself on the driver, which may be far from the size of the data it represents.
I'm not sure you need to do this. You could cache the RDD and check its size in the Spark UI. But let's say you do want to do this programmatically; here is a solution.
def calcRDDSize(rdd: RDD[String]): Long = {
  // map each line to its size in bytes when encoded as UTF-8
  rdd.map(_.getBytes("UTF-8").length.toLong)
    .fold(0L)(_ + _) // add the sizes together; fold also handles an empty RDD
}
You can then call this function for your two RDDs:
println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")
This solution should work even if the file size is larger than the memory available in the cluster.
The Spark API doc says that:
You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
The RDD info includes memory and disk size: RDDInfo doc
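A minimal sketch of the getRDDStorageInfo route, using the file path from the question, might look like this. The object name RddStorageReport and the helper formatInfo are made up for the example. As far as I know, only RDDs that are actually cached show up in this report, so both RDDs are cached and materialized with count() first, and the numbers describe the cached representation rather than the size of the file on disk.

```scala
import org.apache.spark.SparkContext

object RddStorageReport {
  // Render one storage entry as a single line of text.
  def formatInfo(name: String, memSize: Long, diskSize: Long): String =
    s"$name: memory=$memSize B, disk=$diskSize B"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "rdd storage report")

    // Name and cache both RDDs so they can be told apart in the report.
    val peopleRdd = sc.textFile("src/main/resources/employees.csv").setName("peopleRdd").cache()
    val newRdd = peopleRdd.filter(_.contains(",M,")).setName("newRdd").cache()

    // Caching is lazy; an action is needed before anything is stored.
    peopleRdd.count()
    newRdd.count()

    sc.getRDDStorageInfo.foreach { info =>
      println(formatInfo(info.name, info.memSize, info.diskSize))
    }

    sc.stop()
  }
}
```

The same numbers appear on the Storage tab of the Spark UI while the application is running.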