Integrate Spark Streaming and Kafka, but get a bad symbolic reference error - scala

I am trying to integrate Spark Streaming and Kafka. I wrote my source code in the IntelliJ IDEA IDE; the compiler compiled the code without any error, but when I try to build the jar file, the following error message is generated:
Error:scalac: bad symbolic reference. A signature in KafkaUtils.class
refers to term kafka in package <root> which is not available. It may
be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when
compiling KafkaUtils.class.
I did some research on Google; many people say this happens when the Scala version and the spark-streaming-kafka jar version differ. But I have checked the versions, and they are the same.
Does anyone know why this error happens?
Here are more details:
Scala version: 2.10
Spark Streaming Kafka jars: spark_streaming_kafka_2.10-1.20.jar, spark_streaming_2.10-1.20.jar
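For what it's worth, if the project were built with sbt, consistently matched versions would be declared roughly like the sketch below (the 1.2.0 version and artifact coordinates are assumptions inferred from the jar names above, not my exact setup):

// build.sbt -- a sketch only; version numbers are assumptions
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // The Scala suffix (_2.10) and the Spark version must match across all Spark artifacts,
  // and spark-streaming-kafka must be allowed to pull in its own kafka_2.10 dependency.
  "org.apache.spark" % "spark-core_2.10"            % "1.2.0",
  "org.apache.spark" % "spark-streaming_2.10"       % "1.2.0",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.2.0"
)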
My source code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Kafka {
  def main(args: Array[String]) {
    val master = "local[*]"
    val zkQuorum = "localhost:2181"
    val group = ""
    val topics = "test"
    val numThreads = 1

    val conf = new SparkConf().setAppName("Kafka").setMaster(master)
    val ssc = new StreamingContext(conf, Seconds(2))
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    words.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Related

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application, and I'm just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))
  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe this is not its intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.
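One more note, not from the answer above: StreamingContext.stop also has an overload that stops the SparkContext and waits for in-flight batches to finish. Whether it suppresses that cleaner-thread warning I can't confirm (the warning comes from a Hadoop FileSystem daemon thread being interrupted at shutdown and is generally harmless). A minimal, self-contained sketch; the queue stream here is purely illustrative:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopSketch extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("GracefulStopSketch")
  val ssc = new StreamingContext(conf, Seconds(1))

  // A trivial in-memory input stream so the sketch is runnable on its own.
  val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq("hello", "world")))
  ssc.queueStream(queue).print()

  ssc.start()
  Thread.sleep(5000)

  // Stop both the StreamingContext and the SparkContext, waiting for in-flight
  // batches to complete. This is not guaranteed to silence the cleaner-thread
  // warning, which is logged by a Hadoop daemon thread either way.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}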

How to read HDFS file from Scala code

I am new to Scala and HDFS.
I am able to read a local file from Scala code, but I am wondering how to read from HDFS:
import scala.io.Source

object ReadLine {
  def main(args: Array[String]) {
    if (args.length > 0) {
      for (line <- Source.fromFile(args(0)).getLines())
        println(line)
    }
  }
}
As the argument I passed hdfs://localhost:9000/usr/local/log_data/file1, but it gives a FileNotFoundException.
I am definitely missing something; can anyone help me out here?
The scala.io.Source API cannot read from HDFS; Source is used to read from the local file system.
Spark
If you want to read from HDFS, then I would recommend using Spark, where you would use the SparkContext:
val lines = sc.textFile(args(0)) //args(0) should be hdfs:///usr/local/log_data/file1
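A minimal self-contained version of that (the local master and app name here are illustrative; pass the hdfs:// path as the first argument):

import org.apache.spark.{SparkConf, SparkContext}

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative; on a cluster you would normally
    // set the master via spark-submit instead.
    val conf = new SparkConf().setAppName("ReadFromHdfs").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // args(0) should be something like hdfs://localhost:9000/usr/local/log_data/file1
    val lines = sc.textFile(args(0))
    lines.take(10).foreach(println)

    sc.stop()
  }
}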
No Spark
If you don't want to use Spark, then you should go with a BufferedReader or StreamReader, or the Hadoop FileSystem API. For example:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine)).takeWhile(_ != null)
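As a side note, readLine on the raw stream is deprecated; wrapping the HDFS input stream in a BufferedReader is the more idiomatic route (a sketch, reusing the hdfs and path values above):

import java.io.{BufferedReader, InputStreamReader}

// Alternative to the Stream-based helper: read line by line through a BufferedReader
val reader = new BufferedReader(new InputStreamReader(hdfs.open(path)))
Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
reader.close()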

Apache Spark - java.lang.NoSuchMethodError: breeze.linalg.Vector$.scalarOf()Lbreeze/linalg/support/ScalarOf

Here is the error:
Exception in thread "main" java.lang.NoSuchMethodError: breeze.linalg.Vector$.scalarOf()Lbreeze/linalg/support/ScalarOf;
at org.apache.spark.ml.knn.Leaf$$anonfun$4.apply(MetricTree.scala:95)
at org.apache.spark.ml.knn.Leaf$$anonfun$4.apply(MetricTree.scala:93)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
at org.apache.spark.ml.knn.Leaf$.apply(MetricTree.scala:93)
at org.apache.spark.ml.knn.MetricTree$.build(MetricTree.scala:169)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:388)
at org.apache.spark.ml.classification.KNNClassifier.train(KNNClassifier.scala:109)
at org.apache.spark.ml.classification.KNNClassifier.fit(KNNClassifier.scala:117)
at SparkKNN$.main(SparkKNN.scala:23)
at SparkKNN.main(SparkKNN.scala)
Here is the program that is triggering the error:
import org.apache.spark.ml.classification.KNNClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object SparkKNN {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
      .getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // read in raw label and features
    val training = spark.read.format("com.databricks.spark.csv")
      .option("header", true)
      .load("E:/Machine Learning/knn_input.csv")
    var df = training.selectExpr("cast(label as double) label", "cast(feature1 as int) feature1", "cast(feature2 as int) feature2", "cast(feature3 as int) feature3")
    val assembler = new VectorAssembler().setInputCols(Array("feature1", "feature2", "feature3")).setOutputCol("features")
    df = assembler.transform(df)
    //MLUtils.loadLibSVMFile(sc, "C:/Program Files (x86)/spark-2.0.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt").toDF()
    val knn = new KNNClassifier()
      .setTopTreeSize(df.count().toInt / 2)
      .setK(10)
    val splits = df.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
    val knnModel = knn.fit(trainingData)
    val predicted = knnModel.transform(testData)
    predicted.show()
  }
}
I am using Apache Spark 2.0 with Scala version 2.11.8. It looks like a version mismatch issue. Any ideas?
Spark MLLib 2.0 brings in this version of Breeze:
"org.scalanlp" % "breeze_2.11" % "0.11.2"
You must have another library in your classpath that also has a dependency on Breeze but a different version, and that's the one being loaded. As a result, MLLib is operating with a different version of Breeze at runtime than was around at compile-time.
You have multiple options. You can find that undesirable transitive dependency on Breeze and exclude it. You can add a direct dependency on the version of that library that has the same Breeze dependency MLLib does. Or you can add a direct dependency on the Breeze version MLLib needs.
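For instance, with sbt the exclusion and the version pin might look roughly like this (a sketch; "com.example" %% "some-ml-lib" is a hypothetical placeholder for whichever dependency drags in the conflicting Breeze):

// build.sbt -- illustrative only; find the real offender with a dependency report first

// Option A: exclude the unwanted transitive Breeze from that library
libraryDependencies += ("com.example" %% "some-ml-lib" % "1.0.0")
  .exclude("org.scalanlp", "breeze_2.11")

// Option B: force the Breeze version that Spark MLlib 2.0 was built against
dependencyOverrides += "org.scalanlp" % "breeze_2.11" % "0.11.2"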

JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS

Before asking this question, I had already read many articles via Google. Many answers say this is a version mismatch between the client side and the server side. So I decided to copy the jars from the server side to the client side directly, and the result is, as you can guess, the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It works fine when I connect to HiveServer2 through beeline :)
So I thought it would work when I use JDBC too. But, unfortunately, it throws that exception. Below are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
Those Hive jars were copied from the server side.
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

def connect_hive(master: String) {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(url, prop)

  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through JDBC from Spark?
Since beeline works, your program should also be expected to connect correctly.
Print the current project classpath; you can try something like this to check it yourself:
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._

object App {
  def main(args: Array[String]) {
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.
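If it helps, here is a small sketch (not from the original answer) that asks JDBC which hive-jdbc driver and jar actually get resolved for the hive2 URL at runtime; the URL is the one from the question, and the reported version may be -1 if the jar's manifest lacks version info:

import java.sql.DriverManager

object CheckHiveDriverVersion {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:hive2://192.168.40.138:10000"
    // Register the Hive driver, then ask DriverManager which driver serves this URL
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val driver = DriverManager.getDriver(url)
    println(s"Driver class:   ${driver.getClass.getName}")
    println(s"Loaded from:    ${driver.getClass.getProtectionDomain.getCodeSource}")
    println(s"Driver version: ${driver.getMajorVersion}.${driver.getMinorVersion}")
  }
}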

How to apply RDD function on DStream while writing code in scala

I am trying to write some simple Spark code in Scala.
Here I am getting a DStream, which I can print successfully. But when I try any kind of "foreach", "foreachRDD", or "transform" operation on this DStream, the console simply freezes during execution of my program. I don't get any error; the console just becomes non-responsive until I manually terminate the Eclipse console operation. I am attaching the code here. Kindly tell me what I am doing wrong.
My main objective is to apply RDD operations on a DStream, and as far as I know, to do that I need to access the underlying RDDs using the "foreach", "foreachRDD", or "transform" functions.
I have already achieved the same thing using Java, but in Scala I am having this problem. Is anybody else facing the same issue? If not, then kindly help me out. Thanks.
Here is a sample of my code:
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaStreaming {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val splitLines: DStream[String] = lines.flatMap(_.split("\n"))
    val pairAlarm = splitLines.map(x => {
      //Some Code
      val alarmPair = new Tuple2(key, value)
      alarmPair
    })

    //pairAlarm.print
    pairAlarm.foreachRDD(x => {
      println("1 : " + x.first)
      x.collect // When the execution reaches this part it freezes
      println("2: " + x.first)
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
I don't know if this is your problem, but I had a similar one: my program just stopped printing after several iterations. No exceptions or anything; it simply stopped printing after 5-6 prints.
Changing this:
val ssc = new StreamingContext("local", "KafkaWordCount", Seconds(2))
to this:
val ssc = new StreamingContext("local[2]", "KafkaWordCount", Seconds(2))
solved the problem. Spark Streaming requires at least two threads to run locally (one for the receiver and one for processing), and the documentation examples are misleading, as they also use just local.
Hope this helps!