Is my code implicitly concurrent? - scala

I have an implementation of WordCount that I submit on an apache-spark cluster.
I was wondering, if tasks are launched on executors that have two cores, will they run concurrently on those two cores?
I've seen this question, but I'm not sure whether or not I can apply the answer to my case.
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("WordCount")
val spark = new SparkContext(conf)
val filename = if (args(0).length > 0) args(0) else "hdfs://x.x.x.x:60070/tortue/wordcount"
val textFile = spark.textFile(filename)
val counts = textFile.flatMap(line => line.split(" "))
.map (word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://x.x.x.x:60070/tortue/wcresults")
spark.stop()
}
}

It depends on how many cores Spark is configured to use on the executors, spark.executor.cores is the parameter and its documented in http://spark.apache.org/docs/latest/configuration.html .

Related

Streaming from HDFS folder

I am trying to implement a scala + spark solution to streaming a word count information from new values from a HDFS folder, like this:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
object HdfsWordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
val sparkConf = new SparkConf().setAppName("HdfsWordCount")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
I tried with spark-shell running HdfsWordCount.main(Array('hdfs:///user/cloudera/sparkStreaming/'))
and it just give a | and leaving me to type. Am I doing something wrong?

How to check if the batches are empty in Spark streaming (wordcount with socketTextStream)

I working on simple SparkStreaming wordcount example to to count the number of words in text data received from a data server listening on a TCP socket.
I would like to check if the batch from streaming source is empty or not before I save the content of every transformation to a text files. Currently, I am using Spark Shell. This is my code
I have tried this code, and it works fine without checking if the batch is empty or not:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
ssc.start()
I have tried to use foreachRDD but it gives me an error error: value saveAsTextFiles is not a member of org.apache.spark.rdd.RDD[String]
This is my code
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.foreachRDD(rdd => {
if(!rdd.partitions.isEmpty)
{
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
}
})
ssc.start()
I need to to check if the batch from streaming source is empty or not before I save the content text files. I appreciate your help
I used to do it using following code. I will loop each rdd in stream and then use rdd.count() to judge if a rdd is empty. if all rdds is empty, nothing happened, hope it can help you.
kafkaStream.foreachRDD(rdd -> {
if(rdd.count() > 0) {
// do something
}
})
You can try the below code snippet to check your streaming batches are empty or not:
if(!rdd.partitions.isEmpty)
rdd.saveAsTextFile(outputDir)

Spark streaming: How to write cumulative output?

I have to write a single output file for my streaming job.
Question : when will my job actually stop? I killed the server but did not work.
I want to stop my job from commandline(If it is possible)
Code:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
object MAYUR_BELDAR_PROGRAM5_V1 {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", args(0).toInt)
val words = lines.flatMap(_.split(" "))
val class1 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==0)
val class2 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==1)
val class3 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==0)
val class4 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==1)
class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1","txt")
class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3","txt")
class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4","txt")
ssc.start() // Start the computation
ssc.awaitTermination()
ssc.stop()
}
}
A stream by definition does not have an end so it will not stop unless you call the method to stop it. In my case I have a business condition that tell when the process is finished, so when I reach this point I'm calling the method JavaStreamingContext.close(). I also have a monitor that checks if the process has not received any data in the past few minutes in which case it will also close the stream.
In order to accumulate data you have to use the method updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
I have checked the Spark code and found that saveAsTextFiles uses foreachRDD, so at the end it will save each RDD separately, so previous RDDs will not be taken into account. Using updateStateByKey it will still save multiple files, but each file will consider all RDDs that were processed before.

Spark scala input/output directories

I am new to Spark/Scala Programming.I am able to do the set up using the maven and able to run the sample word count program.
I am having 2 questions over here for both running in spark environment/ in Windows local:
1.How the scala program is identifying the input.
2.How to write the output into text file.
Here is my code
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
def main(args: Array[String]) = {
//Start the Spark context
val conf = new SparkConf()
.setAppName("WordCount")
.setMaster("local")
val sc = new SparkContext(conf)
//Read some example file to a test RDD
val textFile = sc.textFile("file:/home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.foreach(println)
counts.collect()
counts.saveAsTextFile("file:/home/root1/Avinash/output")
}
}
When I place the file in file:/home/root1/Avinash/data.txt and try to run it didnt work.Only when i place the data.txt in /home/root1/softs/spark-1.6.1/bin or inside the project folder in workspace it is trying to take the input.
Similarly, when I am trying to write into output using counts.saveAsTextFile("file:/home/root1/Avinash/output"), it is not writing and instead it is throwing the error as
Exception in thread "main" java.io.IOException: No FileSystem for scheme: D.
Please help me in resolving this!!.
you suppose to use /// on file. this is an example
val textFile = sc.textFile("file:///home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _).cache()
counts.foreach(println)
//counts.collect()
counts.saveAsTextFile("file:///home/root1/Avinash/output")
use cache to avoid compute every time you are doing action on RDD if the file is big

display the content of clusters after clustering in streaming-k-means.scala code source in spark

i want to run the streaming k-means-example.scala code source (mllib) on spark , someone tell me how i can how I can display the content of clusters after clustering (for example i want to clustering data into 3 clusters , how i can display the cntent of the 3 clusters in 3 files and the content of centers in file.txt)
package org.apache.spark.examples.mllib
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingKMeansExample {
def main(args: Array[String]) {
if (args.length != 5) {
System.err.println( "Usage: StreamingKMeansExample " +
"<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
System.exit(1)
}
val conf = new SparkConf().setMaster("localhost").setAppName
("StreamingKMeansExample")
val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val model = new StreamingKMeans().setK(args(3).toInt)
.setDecayFactor(1.0)
.setRandomCenters(args(4).toInt, 0.0)
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
You would have to use the predict method on your RDD( look here for reference)
Then you could zip your Rdd containing values and your RDD of predicted clusters they fall in.