Running a Spark Streaming program on a cluster - Scala

I have implemented a Scala program that finds the most popular hashtags on Twitter using Spark Streaming. I developed it in the Eclipse Scala IDE. I have access to a cluster called Comet, operated by SDSC, and I want to run my program on it.
Please guide me through the steps to do this, as I have very limited experience with Linux.
Below is the code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._

object PopularHashtags {

  /** Set the log level to only print errors. */
  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  /** Configure Twitter credentials from twitter.txt ("name value" pairs). */
  def setupTwitter() = {
    import scala.io.Source
    for (line <- Source.fromFile("../twitter.txt").getLines) {
      val fields = line.split(" ")
      if (fields.length == 2) {
        System.setProperty("twitter4j.oauth." + fields(0), fields(1))
      }
    }
  }

  def main(args: Array[String]) {
    setupTwitter()
    val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))
    setupLogging()

    val tweets = TwitterUtils.createStream(ssc, None)
    val statuses = tweets.map(status => status.getText())
    val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
    val hashtags = tweetwords.filter(word => word.startsWith("#"))
    val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
    // Count hashtags over a 5-minute window, sliding every second
    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
      (x, y) => x + y, (x, y) => x - y, Seconds(300), Seconds(1))
    val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print

    ssc.checkpoint("C:/checkpoint/")
    ssc.start()
    ssc.awaitTermination()
  }
}
P.S.: The Twitter API keys are stored in a text file in my Eclipse workspace.
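For reference, here is a minimal sketch (not Comet-specific, and the argument layout is my own assumption) of how the driver is usually adapted for a cluster: the master is no longer hardcoded, so it can be supplied by spark-submit, and the credentials and checkpoint locations become cluster-visible paths passed as arguments instead of "../twitter.txt" and "C:/checkpoint/":

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PopularHashtagsOnCluster {
  def main(args: Array[String]) {
    // Assumed convention: args(0) = path to twitter.txt on the cluster,
    //                     args(1) = checkpoint directory (HDFS or scratch space)
    val conf = new SparkConf().setAppName("PopularHashtags") // master is supplied by spark-submit
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(args(1))
    // ... the same setupTwitter/setupLogging and tweets -> hashtags ->
    // reduceByKeyAndWindow pipeline as above, with setupTwitter reading args(0) ...
    ssc.start()
    ssc.awaitTermination()
  }
}

The program would then typically be packaged as an assembly jar (so spark-streaming-twitter and twitter4j are bundled) and launched with something like spark-submit --class PopularHashtagsOnCluster <assembly jar> <twitter.txt> <checkpoint dir>; the Comet-specific steps (loading the Spark environment, requesting nodes through the batch scheduler) are best taken from SDSC's own documentation.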

Related

Spark application: why is the "executor computing time" so long?

Spark application, deploy mode: standalone.
I want to know why, with the same input data, the computing time for a task is so different between two different "WordCount" programs.
For example:
1. The original "WordCount" code:
import org.apache.spark.{SparkConf, SparkContext}

object ScalaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"Usage: $ScalaWordCount <INPUT_HDFS> <OUTPUT_HDFS>"
      )
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(sparkConf)
    val io = new IOCommon(sc) // I/O helper from the HiBench benchmark suite
    val data = io.load[String](args(0))
    val counts = data.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    io.save(args(1), counts)
    sc.stop()
  }
}
The task result: [screenshot of task durations]
2. The other "WordCount" code:
object ScalaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"Usage: $ScalaWordCount <INPUT_HDFS> <OUTPUT_HDFS>"
      )
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(sparkConf)
    val io = new IOCommon(sc)
    val data = io.load[String](args(0))
    val flatRdd = data.flatMap(line => line.split(" "))
    // Prefix each word with a random digit ("salting"), count the salted keys,
    // then strip the prefix and count again.
    val mapRdd = flatRdd.map(word => {
      val pre = scala.util.Random.nextInt(10).toString
      val key = pre + "_" + word
      (key, 1)
    })
    val shuffleRdd = mapRdd.reduceByKey(_ + _)
    val shuffleMapRdd = shuffleRdd.map { case (k, v) => (k.split("_")(1), v) }
    val counts = shuffleMapRdd.reduceByKey(_ + _)
    io.save(args(1), counts)
    sc.stop()
  }
}
The task result: [screenshot of task durations]
So I want to know what causes this difference.
Thanks a lot.

Spark unit test with mock SparkSession

I have a method which transforms a DataFrame into a Dataset. The method looks like:
def dataFrameToDataSet[T](sourceName: String, df: DataFrame)
                         (implicit spark: SparkSession): Dataset[T] = {
  import spark.implicits._
  sourceName match {
    case "oracle_grc_asset" =>
      val ds = df.map(row => grc.Asset(row)).as[grc.Asset]
      ds.asInstanceOf[Dataset[T]]
    case "oracle_grc_asset_host" =>
      val ds = df.map(row => grc.AssetHost(row)).as[grc.AssetHost]
      ds.asInstanceOf[Dataset[T]]
    case "oracle_grc_asset_tag" =>
      val ds = df.map(row => grc.AssetTag(row)).as[grc.AssetTag]
      ds.asInstanceOf[Dataset[T]]
    case "oracle_grc_asset_tag_asset" =>
      val ds = df.map(row => grc.AssetTagAsset(row)).as[grc.AssetTagAsset]
      ds.asInstanceOf[Dataset[T]]
    case "oracle_grc_qg_subscription" =>
      val ds = df.map(row => grc.QgSubscription(row)).as[grc.QgSubscription]
      ds.asInstanceOf[Dataset[T]]
    case "oracle_grc_host_instance_vuln" =>
      val ds = df.map(row => grc.HostInstanceVuln(row)).as[grc.HostInstanceVuln]
      ds.asInstanceOf[Dataset[T]]
    case _ =>
      throw new RuntimeException("Function dataFrameToDataSet doesn't support provided case class type!")
  }
}
Now I want to test this method. For this, I have created a test class which looks like:
"A dataFrameToDataSet function" should "return DataSet from dataframe" in {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
implicit val ss :SparkSession= SparkSession.builder().config(conf).getOrCreate()
import ss.implicits._
//val sourceName = List("oracle_grc_asset", "oracle_grc_asset_host", "oracle_grc_asset_tag", "oracle_grc_asset_tag_asset", "oracle_grc_qg_subscription", "oracle_grc_host_instance_vuln")
val sourceName1 = "oracle_grc_asset"
val df = Seq(grc.Asset(123,"bat", Some("abc"), "cat", Some("abc"), Some(1), java.math.BigDecimal.valueOf(3.4) , Some(2), Some(2),Some("abc"), Some(2), Some("abc"), Some(java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456")), Some(6), Some(4), java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456"), java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456"), "India", "Test","Pod01")).toDF()
val ds = Seq(grc.Asset(123,"bat", Some("abc"), "cat", Some("abc"), Some(1), java.math.BigDecimal.valueOf(3.4) , Some(2), Some(2),Some("abc"), Some(2), Some("abc"), Some(java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456")), Some(6), Some(4), java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456"), java.sql.Timestamp.valueOf("2011-10-02 18:48:05.123456"), "India", "Test","Pod01")).toDS()
assert(dataFrameToDataSet(sourceName1, df) == ds)
}
}
This test case fails and I am getting a FileNotFoundException: HADOOP_HOME not found, even though I have set HADOOP_HOME (pointing to winutils.exe) in my system environment variables.
Please give me a suitable solution for this.
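Not a definitive fix, but a workaround commonly used on Windows is to set the hadoop.home.dir system property inside the test itself, before the SparkSession is created, so Hadoop can locate winutils.exe (the C:\hadoop path below is an assumption; it must contain bin\winutils.exe). Also note that an environment variable added to the system settings is only visible to the IDE or sbt after they are restarted.

import org.apache.spark.sql.SparkSession

trait SparkTestSession {
  // Assumed layout: winutils.exe lives at C:\hadoop\bin\winutils.exe
  System.setProperty("hadoop.home.dir", "C:\\hadoop")

  implicit lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("dataFrameToDataSetTest")
    .getOrCreate()
}

Mixing this trait into the test class (and using its spark instead of building a new session in every test) keeps the property set before any Hadoop code runs.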

Spark Streaming Twitter createStream Issue

I was trying to stream data from Twitter using Spark Streaming, but I ran into the issue below.
import org.apache.spark.streaming.twitter._
import twitter4j.auth._
import twitter4j.conf._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(10))
val cb = new ConfigurationBuilder
cb.setDebugEnabled(true)
  .setOAuthConsumerKey("")
  .setOAuthConsumerSecret("")
  .setOAuthAccessToken("")
  .setOAuthAccessTokenSecret("")
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc, auth)
ERROR SCREEN:
val tweets = TwitterUtils.createStream(ssc,auth)
<console>:49: error: overloaded method value createStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,twitterAuth: twitter4j.auth.Authorization)org.apache.spark.streaming.api.java.JavaReceiverInputDStream[twitter4j.Status] <and>
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,filters: Array[String])org.apache.spark.streaming.api.java.JavaReceiverInputDStream[twitter4j.Status] <and>
(ssc: org.apache.spark.streaming.StreamingContext,twitterAuth: Option[twitter4j.auth.Authorization],filters: Seq[String],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.dstream.ReceiverInputDStream[twitter4j.Status]
cannot be applied to (org.apache.spark.streaming.StreamingContext, twitter4j.auth.OAuthAuthorization)
val tweets = TwitterUtils.createStream(ssc,auth)
The method in the question has this signature:
def createStream(
    ssc: StreamingContext,
    twitterAuth: Option[Authorization],
    filters: Seq[String] = Nil,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
)
We can see that ssc: StreamingContext and twitterAuth: Option[Authorization] are mandatory; the other two parameters are optional.
In your case, the type of twitterAuth is incorrect: it has to be an Option[Authorization]. The call should therefore look like this:
val tweets = TwitterUtils.createStream(ssc, Some(auth))
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._

object TwitterStream {

  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  /** Configures Twitter service credentials using twitter.txt in the main
    * workspace directory. */
  def setupTwitter() = {
    import scala.io.Source
    for (line <- Source.fromFile("/Users/sampy/twitter.txt").getLines) {
      val fields = line.split(" ")
      if (fields.length == 2) {
        System.setProperty("twitter4j.oauth." + fields(0), fields(1))
      }
    }
  }

  /** Our main function where the action happens */
  def main(args: Array[String]) {
    setupTwitter()
    val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(5))
    setupLogging()

    val tweets = TwitterUtils.createStream(ssc, None)
    val engTweets = tweets.filter(x => x.getLang() == "en")
    val statuses = engTweets.map(status => status.getText)
    val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
    val hashtags = tweetwords.filter(word => word.startsWith("#"))
    val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
      (x: Int, y: Int) => x + y, Seconds(5), Seconds(20))
    val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))

    sortedResults.saveAsTextFiles("/Users/sampy/tweetsTwitter", "txt")
    sortedResults.print
    ssc.checkpoint("/Users/sampy/checkpointTwitter")
    ssc.start()
    ssc.awaitTermination()
  }
}

Transform data using Scala in Spark

I am trying to transform the input text file into a key/value RDD, but the code below doesn't work. (The text file is a tab-separated file.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records).
You should map the result of the split into a tuple.
Altogether:
val counts = text
.map(line => line.split("\t"))
.map(arr => (arr(0), arr(1).toInt))
EDIT, per clarification in the comment: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you have some errors in the for-comprehension syntax. Here's the entire thing:
def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}

Split key value in map scala

I don't know if it is possible, but I'd like, inside mapPartitions, to split the variable a into two lists: a list l that stores all the numbers and another list, say b, that stores all the words. Something like a.mapPartitions((p, v) => { val l = p.toList; val b = v.toList; ... }), so that, for example, in my for loop l(i) = 1 and b(i) = "score".
import scala.io.Source
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))
a.mapPartitions(p => {
  val l = p.toList
  val ret = new ListBuffer[Int]
  val words = new ListBuffer[String]
  for (i <- 0 to l.length - 1) {
    words += b(i)
    ret += l(i)
  }
  ret.toList.iterator
})
Spark is a distributed computing engine: you can perform operations on partitioned data across the nodes of the cluster, and then you need a reduce() step that performs a summary operation.
Please see this code, which should do what you want:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {

  class MyResponseObj(var numbers: List[Int] = List[Int](),
                      var words: List[String] = List[String]()) extends java.io.Serializable {
    def +=(str: String, int: Int) = {
      numbers = numbers :+ int
      words = words :+ str
      this
    }
    def +=(other: MyResponseObj) = {
      numbers = numbers ++ other.numbers
      words = words ++ other.words
      this
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    // Accumulate the words and numbers of each partition into one MyResponseObj,
    // then merge the per-partition objects with reduce.
    val myResponseObj = a.mapPartitions[MyResponseObj](it => {
      var myResponseObj = new MyResponseObj()
      it.foreach {
        case (str: String, int: Int) => myResponseObj += (str, int)
        case _                       => println("unexpected data")
      }
      Iterator(myResponseObj)
    }).reduce((myResponseObj1, myResponseObj2) => myResponseObj1 += myResponseObj2)

    println(myResponseObj.words)
    println(myResponseObj.numbers)
  }
}
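As a side note (not part of the original answer): if the end goal is simply the two aligned lists on the driver, the keys and values methods on a pair RDD give the same split without a custom serializable class. A minimal sketch, assuming the same sample data:

import org.apache.spark.{SparkConf, SparkContext}

object SplitPairs {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SplitPairs").setMaster("local[2]"))
    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    // keys/values come from PairRDDFunctions; collect() brings the results to the driver
    val words: Array[String] = a.keys.collect()
    val numbers: Array[Int] = a.values.collect()

    println(words.toList)   // List(score, chicken, magnacarta)
    println(numbers.toList) // List(1, 2, 2)
    sc.stop()
  }
}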