spark streaming Twitter - filter language - scala

I know that this question has been treated in several posts, but I can't figure out how to solve my problem. I am trying to filter tweets by language. I have read on this forum that I have to use the twitter4j API. I have already added it to my dependencies:
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.3</version>
</dependency>
My code is:
import twitter4j.api
[....]
val sc = new SparkContext("local", "Simple", "$SPARK_HOME", List("target/streamingTwitter-1.0.jar"))
val ssc = new StreamingContext(sc, Seconds(10))
var filter = Array("filter")
val tweets = TwitterUtils.createStream(ssc, None, filter).filter(status => _.getLang == "es")
The error is:
cannot resolve symbol getlang
Why doesn't the compiler recognize the getLang method? This method is supposed to be implemented in the twitter4j API, right? Isn't it enough to import twitter4j and set the dependencies in order to use its methods?
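A hedged note on the error: the mixed lambda syntax is the immediate problem, since status => _.getLang == "es" creates a nested anonymous function whose parameter type the compiler cannot infer, which is why getLang cannot be resolved; Status.getLang may also require a newer twitter4j release than 3.0.3. A minimal sketch of the filter, reusing ssc and the filter array from above:
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.Status
// Use either an explicit parameter name...
val tweetsEs = TwitterUtils.createStream(ssc, None, filter)
  .filter((status: Status) => status.getLang == "es")
// ...or the underscore shorthand, but not both at the same time:
// val tweetsEs2 = TwitterUtils.createStream(ssc, None, filter).filter(_.getLang == "es")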

Related

How to write and update with the Kudu API in Spark 2.1

I want to write and update data with the Kudu API.
This is the maven dependency:
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.1.0</version>
</dependency>
In the following code, I have no idea what parameter the KuduContext takes.
My code in spark2-shell:
val kuduContext = new KuduContext("master:7051")
I also get the same error in Spark 2.1 streaming:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
val sparkConf = new SparkConf().setAppName("DirectKafka").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu //good
  val kuduContext = new KuduContext("master:7051") //error
})
Then the error:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243). To ignore this error, set
spark.driver.allowMultipleContexts = true. The currently running
SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
Update your version of Kudu to the latest one (currently 1.5.0). In later versions the KuduContext takes the SparkContext as an input parameter, which should prevent this problem.
Also, do the initial Spark setup outside of the foreachRDD: in the code you provided, move both spark and kuduContext out of the foreach. You also do not need to create a separate SparkConf; you can use the newer SparkSession on its own.
val spark = SparkSession.builder.appName("DirectKafka").master("local[*]").getOrCreate()
import spark.implicits._
val kuduContext = new KuduContext("master:7051", spark.sparkContext)
val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
  // do something with the bb table and messages
})
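For the "write and update" part specifically, here is a minimal, hedged sketch of what could go inside that foreachRDD once spark and kuduContext live outside it. KuduRecord and its columns are hypothetical and must match the real Kudu table schema, and converting the Kafka messages into a DataFrame depends on the actual record type, which is elided in the question:
// Hypothetical record type; the real schema must match the Kudu table's columns
case class KuduRecord(key: Int, value: String)
messages.foreachRDD(rdd => {
  // A stand-in DataFrame is built here purely for illustration
  val df = Seq(KuduRecord(1, "a"), KuduRecord(2, "b")).toDF()
  kuduContext.upsertRows(df, "table") // insert-or-update rows into the Kudu table
})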

Parsing JSON in Spark

I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
val requestJson=JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson=requestJson.get.asInstanceOf[Map[String,Any]].get("Request").get.asInstanceOf[Map[String,Any]]
val currency=mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser by pointing to an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
... 128 elided
How can I use the JSON.parseFull library to get data from an HDFS file location?
Thanks
Spark has built-in support for parsing JSON documents, available in the spark-sql_${scala.version} jar.
In Spark 2.0+:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.read.format("json").json("json/file/location/in/hdfs")
df.show()
With the df object you can do all supported SQL operations on it, and its data processing will be distributed among the nodes, whereas requestJson is computed on a single machine only.
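For example, assuming the JSON has the same shape as in the first snippet (a top-level "Request" object with a "currency" field), the value can be pulled out with ordinary DataFrame operations; this is only a sketch:
// Nested struct fields are addressed with dot notation
df.select("Request.currency").show()
val currency = df.select("Request.currency").first().getString(0)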
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
Edit (as per a comment, to read the file from HDFS):
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.{FileSystem, Path}

// Connect to the HDFS namenode
val hdfs = FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
// x is a placeholder for the directory name used in the original snippet
val path = new Path("/user/zhc/" + x + "/")
val t = hdfs.listStatus(path)
// Open the first file in the directory and read it line by line
val in = hdfs.open(t(0).getPath)
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()
Code credits: from another SO question.
Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
One can use the following in Spark to read the file from HDFS:
val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
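If you still want JSON.parseFull specifically, the collected string can then be parsed exactly as in the question; a sketch, assuming the same "Request"/"currency" structure and a file small enough to collect to the driver:
import scala.util.parsing.json.JSON

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]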

JDBC-HiveServer: 'client_protocol is unset!' - both 1.1.1 in CS

Before asking this question, I had already read many, many articles through Google. Many answers say that the cause is a version mismatch between the client side and the server side. So I decided to copy the jars from the server side to the client side directly, and the result is... as you know, the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It works well when I connect to hiveserver2 through beeline :)
So I thought it would work when I use JDBC too. But, unfortunately, it throws that exception. Below are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
Those Hive jars were copied from the server side.
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

def connect_hive(master: String) {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  // The 'client_protocol is unset!' exception is thrown here
  val conn = DriverManager.getConnection(url, prop)
  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through Spark JDBC?
Since beeline works, it is expected that your program should also execute correctly.
Print the current project classpath; you can try something like the following to check it yourself:
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._
object App {
  def main(args: Array[String]) {
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to see which jars are present on the classpath.
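As an additional small check (plain JVM API, nothing Hive-specific), you can print which jar actually provides the Hive JDBC driver at runtime and compare its version with the server's Hive 1.1.1; a sketch:
// Prints the jar (or directory) the HiveDriver class was loaded from
val driverClass = Class.forName("org.apache.hive.jdbc.HiveDriver")
println(driverClass.getProtectionDomain.getCodeSource.getLocation)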

twitterStream not found

I'm trying to compile my first Scala program, and I'm using twitterStream to get tweets. Here is a snippet of my code:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._
object Tutorial {
  def main(args: Array[String]) {
    // Location of the Spark directory
    val sparkHome = "/home/shaza90/spark-1.1.0"
    // URL of the Spark cluster
    val sparkUrl = TutorialHelper.getSparkUrl()
    // Location of the required JAR files
    val jarFile = "target/scala-2.10/tutorial_2.10-0.1-SNAPSHOT.jar"
    // HDFS directory for checkpointing
    val checkpointDir = TutorialHelper.getHdfsUrl() + "/checkpoint/"
    // Configure Twitter credentials using twitter.txt
    TutorialHelper.configureTwitterCredentials()
    val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
    val tweets = ssc.twitterStream()
    val statuses = tweets.map(status => status.getText())
    statuses.print()
    ssc.checkpoint(checkpointDir)
    ssc.start()
  }
}
When compiling I'm getting this error message:
value twitterStream is not a member of org.apache.spark.streaming.StreamingContext
Do you know if I'm missing any library or dependency?
In this case you want a stream of tweets. We all know that Spark provides streams. Now, let's check whether Spark itself provides something for interacting with Twitter specifically.
Open the Spark API docs -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#package
Now search for twitter and bingo... there is something called TwitterUtils in the package org.apache.spark.streaming.twitter. Since it is called TwitterUtils and sits under org.apache.spark.streaming, I think it will provide helpers to create a stream from the Twitter APIs.
Now let's click on TwitterUtils and go to -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream
And yes... it has a method with the following signature:
def createStream(
ssc: StreamingContext,
twitterAuth: Option[Authorization],
filters: Seq[String] = Nil,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[Status]
It returns a ReceiverInputDStream[Status], where Status is twitter4j.Status.
The parameters are further explained:
ssc: StreamingContext object
twitterAuth: Twitter4J authentication, or None to use Twitter4J's default OAuth authorization; this uses the system properties twitter4j.oauth.consumerKey, twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and twitter4j.oauth.accessTokenSecret
filters: Set of filter strings to get only those tweets that match them
storageLevel: Storage level to use for storing the received objects
See... API docs are simple. I believe you should now be a little more motivated to read API docs.
And... this means you need to look a little (at least the getting-started part) into the twitter4j documentation too.
NOTE: This answer is specifically written to explain "Why not shy away from API docs?". It was written after careful thought, so please do not edit unless your edit makes a significant contribution.
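Putting that together, a minimal sketch of how the code in the question could be adapted; sparkUrl, sparkHome and jarFile are the values already defined there, and it assumes the spark-streaming-twitter artifact is on the classpath and that TutorialHelper.configureTwitterCredentials() has set the twitter4j.oauth.* system properties:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
// None -> fall back to the twitter4j.oauth.* system properties for OAuth
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(_.getText)
statuses.print()
ssc.start()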

Integrate Spark Streaming and Kafka, but get a bad symbolic reference error

I am trying to integrate Spark Streaming and Kafka. I wrote my source code in the IntelliJ IDEA IDE; the compiler compiled the code without any error, but when I try to build the jar file, an error message is generated that shows:
Error:scalac: bad symbolic reference. A signature in KafkaUtils.class
refers to term kafka in package which is not available. It may
be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when
compiling KafkaUtils.class.
I did some research on Google; many people say this is caused by a version mismatch between the Scala version and the spark_streaming_kafka jar file. But I have checked the versions, and they are the same.
Does anyone know why this error happens?
Here are more details:
Scala version: 2.10
Spark Streaming Kafka jar versions: spark_streaming_kafka_2.10-1.20.jar, spark_streaming_2.10-1.20.jar
My source code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Kafka {
  def main(args: Array[String]) {
    val master = "local[*]"
    val zkQuorum = "localhost:2181"
    val group = ""
    val topics = "test"
    val numThreads = 1
    val conf = new SparkConf().setAppName("Kafka").setMaster(master)
    val ssc = new StreamingContext(conf, Seconds(2))
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Receive (topic, message) pairs from Kafka and keep only the message payload
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    words.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
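For what it's worth, the quoted message says the kafka package itself is missing from the classpath, which usually means the Kafka client classes that KafkaUtils depends on were never pulled in as a transitive dependency. A hedged sketch of the dependencies in build.sbt form (assuming sbt is used and that "1.20" in the jar names means Spark 1.2.0; adjust to your actual Spark version):
// build.sbt fragment (sketch): spark-streaming-kafka transitively pulls in the kafka_2.10 client
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.2.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
)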