twitterStream not found - scala

I'm trying to compile my first scala program and I'm using twitterStream to get tweets, here is a snippet of my code:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._
object Tutorial {
def main(args: Array[String]) {
// Location of the Spark directory
val sparkHome = "/home/shaza90/spark-1.1.0"
// URL of the Spark cluster
val sparkUrl = TutorialHelper.getSparkUrl()
// Location of the required JAR files
val jarFile = "target/scala-2.10/tutorial_2.10-0.1-SNAPSHOT.jar"
// HDFS directory for checkpointing
val checkpointDir = TutorialHelper.getHdfsUrl() + "/checkpoint/"
// Configure Twitter credentials using twitter.txt
TutorialHelper.configureTwitterCredentials()
val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
val tweets = ssc.twitterStream()
val statuses = tweets.map(status => status.getText())
statuses.print()
ssc.checkpoint(checkpointDir)
ssc.start()
}
}
When compiling I'm getting this error message:
value twitterStream is not a member of org.apache.spark.streaming.StreamingContext
Do you know if I'm missing any library or dependency?

In this case you want a stream of tweets. We all know that Sparks provides Streams. Now, lets check if Spark itself provides something for interacting with twitter specifically.
Open Spark API-docs -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#package
Now search for twitter and bingo... there is something called TwitterUtils in package org.apache.spark.streaming. Now since it is called TwitterUtils and is in package org.apache.spark.streaming, I think it will provide helpers to create stream from twitter API's.
Now lets click on TwitterUtils and goto -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream
And yup... it has a method with following signature
def createStream(
ssc: StreamingContext,
twitterAuth: Option[Authorization],
filters: Seq[String] = Nil,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[Status]
It returns a ReceiverInputDStream[ Status ] where Status is twitter4j.Status.
Parameters are further explained
ssc
StreamingContext object
twitterAuth
Twitter4J authentication, or None to use Twitter4J's default OAuth authorization; this uses the system properties twitter4j.oauth.consumerKey, twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and twitter4j.oauth.accessTokenSecret
filters
Set of filter strings to get only those tweets that match them
storageLevel
Storage level to use for storing the received objects
See... API docs are simple. I believe, now you should be a little more motivated to read API docs.
And... This means you need to look a little( at least getting started part ) into twitter4j documentation too.
NOTE :: This answer is specifically written to explain "Why not to shy
away from API docs ?". And was written after careful thoughts. So
please, do not edit unless your edit makes some significant
contribution.

Related

What FileOutputCommitter should be used in when writing AVRO files in Spark?

When saving an RDD to S3 in AVRO, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile and therefore I've dug around and came to this:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
object AvroUtil {
def write[T](
path: String,
schema: Schema,
avroRdd: RDD[T],
job: Job = Job.getInstance()): Unit = {
val intermediateRdd = avroRdd.mapPartitions(
f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
preservesPartitioning = true
)
job.getConfiguration.set("avro.output.codec", "snappy")
job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
AvroJob.setOutputKeySchema(job, schema)
intermediateRdd.saveAsNewAPIHadoopFile(
path,
classOf[AvroKey[T]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[T]],
job.getConfiguration
)
}
}
I'm rather baffled as I don't see what is incorrect because the AVRO files seem to be outputted correctly.
You can override behaviour of existing FileOutputCommitter by implementing own OutputFileCommitter to make it more efficient and safe.
Follow this link where author has explained similar with example.

Simple Spark Scala Post to External Rest API Example

New to Spark Scala, I just want to read a json file and post the content to an external rest api server. Can anyone provide a simple example? or provide guidelines?
You probably do not want to use Spark for this. Spark is an analytical engine for processing large amounts of data - unless you're reading in massive amounts of json from hdfs, this task is more suitable for scala. You should look up ways to read a json file in scala, and send that content to a server in scala.
Here are some great places to get started:
Scala Read JSON file
https://alvinalexander.com/scala/how-to-send-json-post-data-to-restful-url-in-scala
The following is from the above URL:
import java.io._
import org.apache.commons._
import org.apache.http._
import org.apache.http.client._
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient
import java.util.ArrayList
import org.apache.http.message.BasicNameValuePair
import org.apache.http.client.entity.UrlEncodedFormEntity
import com.google.gson.Gson
case class Person(firstName: String, lastName: String, age: Int)
object HttpJsonPostTest extends App {
// create our object as a json string
val spock = new Person("Leonard", "Nimoy", 82)
val spockAsJson = new Gson().toJson(spock)
// add name value pairs to a post object
val post = new HttpPost("http://localhost:8080/posttest")
val nameValuePairs = new ArrayList[NameValuePair]()
nameValuePairs.add(new BasicNameValuePair("JSON", spockAsJson))
post.setEntity(new UrlEncodedFormEntity(nameValuePairs))
// send the post request
val client = new DefaultHttpClient
val response = client.execute(post)
println("--- HEADERS ---")
response.getAllHeaders.foreach(arg => println(arg))
}

JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS

When I ask this question, I have already read many many article through google. Many answers show that is the mismatch version between client-side and server-side. So I decide to copy the jars from server-side to client-side directly, and the result is .... as you know, same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It goes well when I connect to hiveserver2 through beeline :)
see my connection.
So, I think it will work when I use jdbc too. But, unfortunately, it throws that exception, below is my jars in my project.
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
those hive jars are copied from server-side.
def connect_hive(master:String){
val conf = new SparkConf()
.setMaster(master)
.setAppName("Hive")
.set("spark.local.dir", "./tmp");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val url = "jdbc:hive2://192.168.40.138:10000";
val prop= new Properties();
prop.setProperty("user", "hive");
prop.setProperty("password", "hive");
prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver");
val conn = DriverManager.getConnection(url, prop);
sc.stop();
}
The configment of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Does anyone encounter the same situation when connecting hive through spark-JDBC?
Since beeline works, it is expected that your program also should execute correctly.
print current project class path
you can try some thing like this to understand your self.
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._
object App {
def main(args: Array[String]) {
val cl = ClassLoader.getSystemClassLoader
val urls = cl.asInstanceOf[URLClassLoader].getURLs
for (url <- urls) {
println(url.getFile)
}
}
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.

Running Mlib via Spark Job Server

I was practising developing sample model using online resources provided in spark website. I managed to create the model and run it for sample data using Spark-Shell , But how to do actually run the model in production environment ? Is it via Spark Job server ?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfect when i run it in spark-shell , But i have no idea how do we actually run model in production environment. I tried to run it via spark jobserver but i get error ,
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure its because am passing a String value whereas the program expects it be vector elements , Can someone guide me on how to achieve this . And also is this how the data being passed to Model in production environment ? Or is it some other way.
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is you do not need a third-party library to do so. You only need to construct a SparkContext object, and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is;
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
*
*/
object SparkRunner {
def main (args: Array[String]){
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.

Will standalone scala program takes advantage of distributed/parallel processing? or does spark Scala require separate code?

First of all sorry for asking the basic doubt over here, but still explanation for the below will be appreciable..
i am very new to scala and spark, so my doubt is if i write a standalone scala program, and execute it on spark(1 master 3 worker), will the scala program takes advantage of disturbed/parallel processing, or should i need to write a separate program to get an advantage of distributed processing??
For example, we have a scala code that process a particular formatted file to comma separated file, it takes a directory as input and parses all file and write an output to single file(each file will be usually 100-200MB). So here is the code.
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf
object Parser {
def main(args:Array[String]) {
//val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
//val sc = new SparkContext(conf)
var inp = new File(args(0))
var ext: String = ""
if(args.length == 1)
{ ext = "log" } else { ext = args(1) }
var files: List[String] = List("")
if (inp.exists && inp.isDirectory) {
files = getListOfFiles(inp,ext)
}
else if(inp.exists ) {
files = List(inp.toString)
}
else
{
println("Enter the correct Directory/File name");
System.exit(0);
}
if(files.length <=0 )
{
println(s"No file found with extention '.$ext'")
}
else{
var out_file_name = "output_"+Calendar.getInstance().getTime.toString.replace(" ","-").replace(":","-")+".log"
var data = getHeader(files(0))
var writer=new PrintWriter(new File(out_file_name))
var record_count = 0
//var allrecords = data.mkString(",")+("\n")
//writer.write(allrecords)
for(eachFile <- files)
{
record_count += parseFile(writer,data,eachFile)
}
writer.close()
println(record_count +s" processed into $out_file_name")
}
//all func are defined here.
}
Files from the specific dir are read using scala.io
Source.fromFile(file).getLines
So my doubt is will the above code(standalone prg) can be executed on distributed spark system? will i get an advantage of parallel processing??
ok, how about using sc to read file, will it then uses distributed processing
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if i edit the top code to make use of sc then will it process on distributes spark system.
No it won't. To make use of distributed computing using Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit you will not be using the Spark cluster at all. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on Youtube for getting to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
Is is very important to understand it for using Spark.
"advantage of distributed processing"
Using Spark can give you advantages of distributing processing on multiple server cluster. So if you are going to move your application later to the cluster, it makes sense to develop application using Spark model and corresponding API.
Well, you can run Spark application locally on your local machine but in this case you won't get all the advantages the Spark can provide.
Anyway, as it is said before, Spark is a special framework with its own libraries for developtment. So you have to rewrite your application using Spark context and Spark API, i.e. special objects like RDDs or Dataframes and corresponding methods.