Spark Streaming Twitter createStream Issue - scala

I was trying to stream data from Twitter using Spark Streaming, but I get
the issue below.
import org.apache.spark.streaming.twitter._
import twitter4j.auth._
import twitter4j.conf._
import org.apache.spark.streaming.{Seconds,StreamingContext}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(10))
val cb = new ConfigurationBuilder
cb.setDebugEnabled(true).setOAuthConsumerKey("").setOAuthConsumerSecret("").setOAuthAccessToken("").setOAuthAccessTokenSecret("")
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc,auth)
ERROR SCREEN:
val tweets = TwitterUtils.createStream(ssc,auth)
<console>:49: error: overloaded method value createStream with alternatives:
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,twitterAuth: twitter4j.auth.Authorization)org.apache.spark.streaming.api.java.JavaReceiverInputDStream[twitter4j.Status] <and>
(jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,filters: Array[String])org.apache.spark.streaming.api.java.JavaReceiverInputDStream[twitter4j.Status] <and>
(ssc: org.apache.spark.streaming.StreamingContext,twitterAuth: Option[twitter4j.auth.Authorization],filters: Seq[String],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.dstream.ReceiverInputDStream[twitter4j.Status]
cannot be applied to (org.apache.spark.streaming.StreamingContext, twitter4j.auth.OAuthAuthorization)
val tweets = TwitterUtils.createStream(ssc,auth)

The method in the question has this signature:
def createStream(
ssc: StreamingContext,
twitterAuth: Option[Authorization],
filters: Seq[String] = Nil,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
)
We can see that ssc: StreamingContext and twitterAuth: Option[Authorization] are mandatory; the other two parameters are optional.
In your case, the type of the twitterAuth argument is incorrect: the method expects an Option[Authorization], but you pass an OAuthAuthorization directly. The call should look like this:
val tweets = TwitterUtils.createStream(ssc, Some(auth))
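The same mismatch can be reproduced without Spark. The following is a minimal sketch, assuming stand-in types: `Authorization`, `OAuthAuthorization` and `CreateStreamSketch` here are hypothetical placeholders defined for illustration, not the twitter4j or Spark classes. It only shows why a bare value is rejected where an Option is expected, and how the default arguments behave.

```scala
// Stand-in types for this sketch only; NOT the twitter4j or Spark classes.
trait Authorization
final class OAuthAuthorization extends Authorization

object CreateStreamSketch {
  // Same parameter shape as the quoted signature: the auth argument is an
  // Option, and the remaining parameter has a default value.
  def createStream(
      twitterAuth: Option[Authorization],
      filters: Seq[String] = Nil
  ): String =
    twitterAuth match {
      case Some(_) => s"explicit auth, ${filters.size} filter(s)"
      case None    => s"auth from system properties, ${filters.size} filter(s)"
    }

  val auth = new OAuthAuthorization

  // createStream(auth) would not compile: an OAuthAuthorization is not an
  // Option[Authorization]. Wrapping it in Some(...) fixes the type.
  val withAuth: String = createStream(Some(auth))

  // Passing None is also valid; the optional filters can still be supplied.
  val withoutAuth: String = createStream(None, Seq("#scala"))
}
```

In the real API, passing None makes Twitter4J fall back to its default OAuth configuration, read from the twitter4j.oauth.* system properties.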

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
object TwitterStream {
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
/** Configures Twitter service credentials using twitter.txt in the main
workspace directory */
def setupTwitter() = {
import scala.io.Source
for (line <- Source.fromFile("/Users/sampy/twitter.txt").getLines) {
val fields = line.split(" ")
if (fields.length == 2) {
System.setProperty("twitter4j.oauth." + fields(0), fields(1))
}
}
}
/** Our main function where the action happens */
def main(args: Array[String]) {
setupTwitter()
val ssc = new StreamingContext("local[*]",
"PopularHashtags",Seconds(5))
setupLogging()
val tweets = TwitterUtils.createStream(ssc, None)
val engTweets = tweets.filter(x => x.getLang() == "en")
val statuses = engTweets.map(status => status.getText)
val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
val hashtags = tweetwords.filter(word => word.startsWith("#"))
val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
val hashtagCounts =
hashtagKeyValues.reduceByKeyAndWindow((x:Int,y:Int)=>x+y, Seconds(5),
Seconds(20))
val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x =>
x._2, false))
sortedResults.saveAsTextFiles("/Users/sampy/tweetsTwitter","txt")
sortedResults.print
ssc.checkpoint("/Users/sampy/checkpointTwitter")
ssc.start()
ssc.awaitTermination()
}
}
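The transformation chain in this program (split into words, keep hashtags, count, sort by count) can be tried out on plain Scala collections before running it against a live stream. This is only a sketch under that assumption: `HashtagCountSketch` and `topHashtags` are hypothetical names, and a Seq[String] stands in for one batch of tweet texts; no windowing is modelled.

```scala
object HashtagCountSketch {
  // Mirrors the DStream steps on a plain Seq: split each text into words,
  // keep the words starting with '#', count each hashtag, then sort by
  // count in descending order (ties broken alphabetically).
  def topHashtags(tweets: Seq[String]): Seq[(String, Int)] =
    tweets
      .flatMap(_.split(" "))
      .filter(_.startsWith("#"))
      .groupBy(identity)
      .map { case (tag, occurrences) => (tag, occurrences.size) }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }
}
```

For example, `HashtagCountSketch.topHashtags(Seq("#scala is fun", "learning #spark and #scala"))` yields the pair for "#scala" (count 2) ahead of the pair for "#spark" (count 1).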

Related

Put elements in stream and return an object

In Akka, I want to put elements into a stream and get an object back for each one. I know the elements could be used as a Source to run a graph, but how can I push an element and get the result back at runtime?
import akka.actor.ActorSystem
import akka.stream.QueueOfferResult.{Dropped, Enqueued, Failure, QueueClosed}
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.Array.range
import scala.util.Success
object StreamElement {
implicit val system = ActorSystem("StreamElement")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
def main(args: Array[String]): Unit = {
val (queue, value) = Source
.queue[Int](10, OverflowStrategy.backpressure)
.map(x => {
x * x
})
.toMat(Sink.asPublisher(false))(Keep.both)
.run()
range(0, 10)
.map(x => {
queue.offer(x).onComplete {
case Success(Enqueued) => {
}
case Success(Dropped) => {}
case _ => {
println("others")
}
}
})
}
}
How can I get the value returned?
Actually, you want to get the Int value back for each element.
So you could define the Flow once, then connect it to a Source and a Sink for each element.
package tech.parasol.scala.akka
import akka.actor.ActorSystem
import akka.stream.QueueOfferResult.{Dropped, Enqueued, Failure, QueueClosed}
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.Array.range
import scala.util.Success
object StreamElement {
implicit val system = ActorSystem("StreamElement")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val flow = Flow[Int]
.buffer(16, OverflowStrategy.backpressure)
.map(x => x * x)
def main(args: Array[String]): Unit = {
range(0, 10)
.map(x => {
Source.single(x).via(flow).runWith(Sink.head)
}.map( v => println("v ===> " + v)
))
}
}
It's unclear to me why the Scala collection isn't fed to the Stream as a Source in your sample code. Given that you've already composed a Stream with materialized values to be captured in a Source Queue and a publisher Sink, you could create a subscriber Source using Source.fromPublisher to collect the wanted values, as shown below:
import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.stream._
implicit val system = ActorSystem("system")
implicit val materializer = ActorMaterializer() // Not needed for Akka 2.6+
val (queue, pub) = Source
.queue[Int](10, OverflowStrategy.backpressure)
.map(x => x * x)
.toMat(Sink.asPublisher(false))(Keep.both)
.run()
val fromQueue = Source(0 until 10).runForeach(queue.offer(_))
val source = Source.fromPublisher(pub)
source.runForeach(x => print(x + " "))
// Output:
// 0 1 4 9 16 25 36 49 64 81

Running spark streaming program on cluster

I have implemented a Scala program to find the most popular hashtags on Twitter using Spark Streaming. I wrote it in the Eclipse Scala IDE. I have access to a cluster called Comet, operated by SDSC, and I want to run my Scala program on it.
Please guide me through the steps, as I have very limited knowledge of Linux.
Below is the code:
object PopularHashtags {
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
def setupTwitter() = {
import scala.io.Source
for (line <- Source.fromFile("../twitter.txt").getLines) {
val fields = line.split(" ")
if (fields.length == 2) {
System.setProperty("twitter4j.oauth." + fields(0), fields(1))
}
}
}
def main(args: Array[String]) {
setupTwitter()
val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))
setupLogging()
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(status => status.getText())
val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
val hashtags = tweetwords.filter(word => word.startsWith("#"))
val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow( (x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))
val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
sortedResults.print
ssc.checkpoint("C:/checkpoint/")
ssc.start()
ssc.awaitTermination()
}
}
P.S.: The Twitter API keys are stored in a text file in my Eclipse workspace.

Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that is loaded fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer
object Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
Now I want to put it in a class to pass parameters. I use the same code with class instead.
import org.apache.spark.ml.feature.StringIndexer
class Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
This fails with the following errors:
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this version, which builds its own SparkSession and imports spark.implicits._ (the source of the encoders and the toDF conversion). It seems to work fine.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
def trainModel() ={
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
import spark.implicits._
val data = spark.read.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}

Streaming CSV with akka-http in scala

I am very new to akka-http, and I would like to stream a csv with an arbitrary number of lines.
For instance, I would like to return:
a,1
b,2
c,3
with the following code
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()
val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)
val route =
path("test") {
complete {
HttpEntity(`text/csv`, ??? using map)
}
}
Http().bindAndHandle(route,"localhost",8080)
Thanks for your help
EDIT: Thanks to Ramon J Romero y Vigil
package test
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream._
import akka.util.ByteString
import scala.collection.mutable
object Test{
def main(args: Array[String]) {
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()
val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
val mapStream = Stream.fromIterator(() => map.toIterator)
.map((k: String, v: Int) => s"$k,$v")
.map(ByteString.apply)
val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)
val route =
path("test") {
complete {
HttpEntity(`text/csv`, mapStream)
}
}
Http().bindAndHandle(route, "localhost", 8080)
}
}
With this code I get two compile errors:
Error:(29, 28) value fromIterator is not a member of object scala.collection.immutable.Stream
val mapStream = Stream.fromIterator(() => map.toIterator)
Error:(38, 11) overloaded method value apply with alternatives:
(contentType: akka.http.scaladsl.model.ContentType,file: java.io.File,chunkSize: Int)akka.http.scaladsl.model.UniversalEntity <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.stream.scaladsl.Source[akka.util.ByteString,Any])akka.http.scaladsl.model.HttpEntity.Chunked <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.util.ByteString)akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType,bytes: Array[Byte])akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType.NonBinary,string: String)akka.http.scaladsl.model.HttpEntity.Strict
cannot be applied to (akka.http.scaladsl.model.ContentType.WithCharset, List[akka.util.ByteString])
HttpEntity(`text/csv`, mapStream)
I used a List of tuples to get around the first issue (however, I do not know how to stream a map in Scala).
I have no idea about the second.
Thanks for your help.
(I am using Scala 2.11.8.)
Use the apply function in HttpEntity that takes in a Source[ByteString,Any]. The apply creates a Chunked entity. You can read your file using code based on the documentation for streaming file IO using an akka stream Source:
import java.nio.file.Paths
import akka.stream.scaladsl._
val file = Paths.get("yourFile.csv")
val entity = HttpEntity(`text/csv`, FileIO.fromPath(file))
The stream will break your file up into chunks; the default chunk size is currently 8192 bytes.
To stream the map that you've created you can use a similar trick:
val mapStream = Source.fromIterator(() => map.toIterator)
.map { case (k, v) => s"$k,$v\n" } // each entry is one tuple; "\n" ends each CSV line
.map(ByteString.apply)
val mapEntity = HttpEntity(`text/csv`, mapStream)
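Each entry of a Scala Map is a single tuple, so the mapping step has to destructure it with a `case` pattern; in Scala 2 (the question uses 2.11.8) a two-parameter lambda such as `(k: String, v: Int) => ...` is rejected there. The following is a minimal sketch of just that step using only the standard library; `CsvLinesSketch` and `toCsvLines` are hypothetical names, and the sort by key is added only to make the output order predictable.

```scala
object CsvLinesSketch {
  val data: Map[String, Int] = Map("a" -> 1, "b" -> 2, "c" -> 3)

  // Turn each (key, value) entry into one CSV line terminated by "\n".
  // The `case (k, v)` pattern destructures the single Tuple2 argument.
  def toCsvLines(m: Map[String, Int]): Iterator[String] =
    m.toSeq.sortBy(_._1).iterator.map { case (k, v) => s"$k,$v\n" }

  // Concatenating the lines gives the body a client would receive.
  val csv: String = toCsvLines(data).mkString
}
```

An iterator of this shape is what Source.fromIterator expects to be handed (wrapped in a function), with each line then converted to a ByteString.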

How to access main method variables from scalatest-scala

I am a couple of days into Scala programming.
I was playing with the Scala code here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args:Array[String]) {
val logFile = "../samplelog.txt"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
def getNumAs : Long = numAs
def getNumBs : Long = numBs
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
My scalatest is as follows
import org.scalatest._
class simpleAppTest extends FunSuite with Matchers {
test ("number of a's and b's")
val acount=SimpleApp.numAs
val bcount=SimpleApp.numBs
acount should be(2)
bcount should be(1)
}
The issue is that I am unable to get the values of getNumAs and getNumBs. The compiler complains that getNumAs/getNumBs is not a value of Unit. I tried looking for Scaladocs or sample code but have no clue how to resolve this.
Kindly help.
Thanks.
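One possible direction, sketched here without Spark: values declared inside main are local to that method, so nothing outside it can read them. Moving the computation into a method that returns its results makes them reachable from a test. The names `LineCounts`, `Counts` and `count` are hypothetical, and a plain Seq[String] stands in for the RDD returned by sc.textFile.

```scala
object LineCounts {
  // Result type introduced for this sketch only.
  final case class Counts(numAs: Long, numBs: Long)

  // The counting logic returns its results instead of keeping them as
  // local values inside main, so other code (including tests) can call it.
  def count(lines: Seq[String]): Counts =
    Counts(
      numAs = lines.count(_.contains("a")).toLong,
      numBs = lines.count(_.contains("b")).toLong
    )

  def main(args: Array[String]): Unit = {
    val counts = count(Seq("alpha", "beta", "gamma"))
    println(s"Lines with a: ${counts.numAs}, Lines with b: ${counts.numBs}")
  }
}
```

A test can then call `LineCounts.count(...)` on a small, known input and compare the returned fields with the expected numbers.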