Spark Application
Deploy mode: standalone
I want to know why, for the same input data, the computing time of a task differs so much between two different "WordCount" programs.
For example:
1. The original "WordCount" code:
import org.apache.spark.{SparkConf, SparkContext}
// IOCommon is the HDFS I/O helper used here (it appears to come from the HiBench benchmark suite).

object ScalaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"Usage: $ScalaWordCount <INPUT_HDFS> <OUTPUT_HDFS>"
      )
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(sparkConf)
    val io = new IOCommon(sc)

    // Load the input, split it into words, and count with a single shuffle (one reduceByKey).
    val data = io.load[String](args(0))
    val counts = data.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    io.save(args(1), counts)

    sc.stop()
  }
}
The task result: [screenshot of task durations]
2. The other "WordCount" code:
object ScalaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"Usage: $ScalaWordCount <INPUT_HDFS> <OUTPUT_HDFS>"
      )
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(sparkConf)
    val io = new IOCommon(sc)
    val data = io.load[String](args(0))

    val flatRdd = data.flatMap(line => line.split(" "))

    // Salt each word with a random 0-9 prefix, count the salted keys,
    // then strip the prefix and count again (two reduceByKey shuffles).
    val mapRdd = flatRdd.map(word => {
      val pre = scala.util.Random.nextInt(10).toString
      val key = pre + "_" + word
      (key, 1)
    })
    val shuffleRdd = mapRdd.reduceByKey(_ + _)
    val shuffleMapRdd = shuffleRdd.map { case (k, v) => (k.split("_")(1), v) }
    val counts = shuffleMapRdd.reduceByKey(_ + _)

    io.save(args(1), counts)
    sc.stop()
  }
}
The task result: [screenshot of task durations]
So I want to know what causes this difference.
Thanks a lot.
I am trying to count words with Spark's combineByKey. I am not sure, but I guess the merge and combiner functions could be the same, because for a count the operation can be the same in the combiner and in the reducer. This would not be the case if I were taking an average.
How can I implement this word count using the same function for the merge and the combine?
Another thing: why is my result showing the value I am counting twice? How can I implement combineByKey so that it shows the key and the sum of the values only once?
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf}

import scala.collection.mutable.Queue

object TestStreamCombineByKey {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("QueueStreamWordCount")
      .setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a DStream backed by a queue of RDDs that the thread below pushes data into
    val rddQueue = new Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)

    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .combineByKey(
        (v) => (v, 1), // createCombiner
        (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), // mergeValue
        (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2), // mergeCombiners
        new HashPartitioner(3)
      )

    wordCounts.print()

    ssc.start() // Start the computation

    // Create and push some RDDs into the queue
    val thread = new Thread("pool data source") {
      override def run() {
        while (true) {
          rddQueue.synchronized {
            rddQueue += ssc.sparkContext.makeRDD(List("to be or not to be , that is the question , or what would be the question ?"))
          }
          Thread.sleep(100)
        }
      }
    }
    thread.start()

    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
My current output is below. Why is it showing the sum of the values twice?
(or,(2,2))
(would,(1,1))
(?,(1,1))
(the,(2,2))
(not,(1,1))
(is,(1,1))
(that,(1,1))
(be,(3,3))
(what,(1,1))
(question,(2,2))
That is because of what you are doing in your combineByKey block: instead of initializing the combiner to the tuple (v, 1), just keep the value as it is, i.e. v => v.
You should change your code like this to get the value only once:
val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .combineByKey(
    (v: Int) => v, // createCombiner
    (acc: Int, v: Int) => acc + v, // mergeValue
    (acc1: Int, acc2: Int) => acc1 + acc2, // mergeCombiners
    new HashPartitioner(3)
  )
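To connect this to the average case mentioned in the question: counting works with one function because the combiner and the reducer do the same thing, but an average needs a (sum, count) accumulator, so createCombiner necessarily differs from the final step. A minimal sketch, assuming the same lines DStream and imports as above (the per-word "measure" here, word length, is made up purely for illustration):
// Hypothetical sketch: per-word average of some numeric measure (word length),
// where createCombiner / mergeValue / mergeCombiners cannot all be the same function.
val wordAverages = lines
  .flatMap(_.split(" "))
  .map(word => (word, word.length))
  .combineByKey(
    (v: Int) => (v, 1),                                    // createCombiner: start a (sum, count) pair
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1), // mergeValue: fold a value into (sum, count)
    (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2), // mergeCombiners
    new HashPartitioner(3)
  )
  .mapValues { case (sum, count) => sum.toDouble / count } // final division to get the average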
I have implemented a Scala program to find the most popular hashtags on Twitter using Spark Streaming. I wrote it in the Eclipse Scala IDE. I have access to a cluster called Comet, operated by SDSC, and I want to run my Scala program on this cluster.
Please guide me through the steps to do the above, as I have very limited knowledge of Linux.
Below is the code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object PopularHashtags {

  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  def setupTwitter() = {
    import scala.io.Source
    for (line <- Source.fromFile("../twitter.txt").getLines) {
      val fields = line.split(" ")
      if (fields.length == 2) {
        System.setProperty("twitter4j.oauth." + fields(0), fields(1))
      }
    }
  }

  def main(args: Array[String]) {
    setupTwitter()
    val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))
    setupLogging()

    val tweets = TwitterUtils.createStream(ssc, None)
    val statuses = tweets.map(status => status.getText())
    val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
    val hashtags = tweetwords.filter(word => word.startsWith("#"))
    val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))

    // Count hashtags over a 5-minute sliding window, updated every second
    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y, Seconds(300), Seconds(1))
    val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print

    ssc.checkpoint("C:/checkpoint/")
    ssc.start()
    ssc.awaitTermination()
  }
}
P.S.: The Twitter API keys are stored in a text file in my Eclipse workspace.
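One detail worth noting for cluster execution (a hedged sketch only, not anything specific to Comet, whose documentation or admins would have the authoritative submission steps): the master is hardcoded to "local[*]" above, so the job would run only on the machine it is launched from. A common pattern is to build the StreamingContext from a SparkConf and let the launcher supply the master:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: let the launcher (e.g. spark-submit with a --master option) decide the master
// instead of hardcoding it; setIfMissing keeps "local[*]" as a fallback for runs inside the IDE.
val conf = new SparkConf()
  .setAppName("PopularHashtags")
  .setIfMissing("spark.master", "local[*]")
val ssc = new StreamingContext(conf, Seconds(1))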
I have the following situation: I have a bunch of directories, each of which contains a bunch of files. I'm processing them using Akka Streams, but for some reason only the last sequence is processed. Here is the code of the methods I have; let me know if you see something wrong.
def read(): Unit = {
  implicit val system = ActorSystem("LiveS3Parser")
  implicit val materializer = ActorMaterializer()

  val reader = new LiveSequenceFileReader(conf.getString("s3url"))
  val dateList = generateDates(conf.getString("startDate"), conf.getString("endDate"))

  reader.readAllFilesFromPath(conf.getString("s3url"))

  val seqElements = generateURLS(dateList, conf).via(readDataFromS3(reader)).via(parseJsonSeq())
  val sinkseq = Sink.fold(0)(persistDataSeq)
  val dataCounter = seqElements.toMat(sinkseq)(Keep.right)
  val sum: Future[Int] = dataCounter.run()

  sum.andThen({
    case _ =>
      sum.foreach(c => println(s"Total records Loaded: $c"))
  })

  Await.result(sum, Duration.Inf)
}

def generateURLS(data: Seq[Long], conf: Config): Source[String, NotUsed] = {
  val s3URL = conf.getString("s3url")
  val dataWithURLs = data.map(x => s3URL.concat("dt=").concat(DateUtils.formatDate(new Date(x), "yyyy-MM-dd")))
  Source(dataWithURLs.to[scala.collection.immutable.Seq])
}

def readDataFromS3(lv: LiveSequenceFileReader)(implicit ec: ExecutionContext): Flow[String, Seq[KeyValue], NotUsed] = {
  Flow[String].mapAsyncUnordered(Runtime.getRuntime().availableProcessors())(url => Future(readFiles(url, lv)))
}

def parseJsonSeq()(implicit ec: ExecutionContext): Flow[Seq[KeyValue], Seq[Try[OptimizedSearchQueryEventMessage]], NotUsed] = {
  Flow[Seq[KeyValue]].mapAsyncUnordered(Runtime.getRuntime().availableProcessors())(line => Future(parseAllItems(line)))
}

def readFiles(url: String, lv: LiveSequenceFileReader): Seq[KeyValue] = {
  println("Reading Files from " + url)
  val files = lv.readAllFilesFromPath(url)
  println("Records to process" + files.size())
  files
}

def parseAllItems(seq: Seq[KeyValue]) = {
  seq.map(kv => parseItem(kv.getValue))
}

def parseItem(data: String): Try[OptimizedSearchQueryEventMessage] = {
  val retVal = Try(mapper.readValue(data, classOf[OptimizedSearchQueryEventMessage]))
  retVal
}

def generateDates(startingDate: String, endDate: String): Seq[Long] = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val startDate = fmt.parse(startingDate).getTime
  val endingDate = fmt.parse(endDate).getTime
  val list = for (currentDate <- startDate to endingDate by TimeUnit.DAYS.toMillis(1)) yield currentDate
  list
}
I am trying to transform the input text file into a key/value RDD, but the code below doesn't work. (The text file is a tab-separated file.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records).
You should map the result of the split into a tuple.
Altogether:
val counts = text
  .map(line => line.split("\t"))
  .map(arr => (arr(0), arr(1).toInt))
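To make the map/flatMap distinction concrete, here is a small sketch (the sample lines and names are made up, and it assumes the SparkContext sc from the snippets above) showing what each produces for the same tab-separated input:
// Hypothetical two-line input, just for illustration
val sample = sc.parallelize(Seq("apple\t3", "banana\t5"))

// map keeps one output element per input line
val arrays = sample.map(line => line.split("\t"))      // RDD[Array[String]]: Array(apple, 3), Array(banana, 5)

// flatMap flattens all tokens into a single stream of strings
val tokens = sample.flatMap(line => line.split("\t"))  // RDD[String]: apple, 3, banana, 5

// so only the map version can be turned into (word, count) pairs
val pairs = arrays.map(arr => (arr(0), arr(1).toInt))  // RDD[(String, Int)]: (apple,3), (banana,5)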
EDIT per clarification in comment: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you have some errors in the for-comprehension syntax; here's the entire thing:
def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}
I don't know if it is possible, but I'd like my mapPartitions to split the variable a into two lists: a list l that stores all the numbers, and another list, say b, that stores all the words. Something like a.mapPartitions((p, v) => { val l = p.toList; val b = v.toList; ... }), so that in my for loop, for example, l(i) = 1 and b(i) = "score".
import scala.io.Source
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

a.mapPartitions(p => {
  val l = p.toList
  val ret = new ListBuffer[Int]
  val words = new ListBuffer[String]
  for (i <- 0 to l.length - 1) {
    words += b(i)
    ret += l(i)
  }
  ret.toList.iterator
})
Spark is a distributed computing engine: you perform operations on data that is partitioned across the nodes of the cluster, and you then need a reduce() step that performs a summary operation.
Please see this code, which should do what you want:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {

  class MyResponseObj(var numbers: List[Int] = List[Int](), var words: List[String] = List[String]()) extends java.io.Serializable {
    def +=(str: String, int: Int) = {
      numbers = numbers :+ int
      words = words :+ str
      this
    }
    def +=(other: MyResponseObj) = {
      numbers = numbers ++ other.numbers
      words = words ++ other.words
      this
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    val myResponseObj = a.mapPartitions[MyResponseObj](it => {
      var myResponseObj = new MyResponseObj()
      it.foreach {
        case (str: String, int: Int) => myResponseObj += (str, int)
        case _                       => println("unexpected data")
      }
      Iterator(myResponseObj)
    }).reduce((myResponseObj1, myResponseObj2) => myResponseObj1 += myResponseObj2)

    println(myResponseObj.words)
    println(myResponseObj.numbers)
  }
}
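As a side note on the design choice: if the goal is simply to end up with one list of words and one list of numbers on the driver, a shorter (though less instructive) alternative is to project the pair RDD twice and collect. A rough sketch, assuming the same RDD a as above:
// Sketch: split the pair RDD into its two components and collect them on the driver.
// This trades the single mapPartitions/reduce pass for two simple collect jobs.
val words: Array[String] = a.map { case (word, _) => word }.collect()
val numbers: Array[Int] = a.map { case (_, number) => number }.collect()

println(words.toList)   // e.g. List(score, chicken, magnacarta)
println(numbers.toList) // e.g. List(1, 2, 2)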