I'm trying to create a StreamingContext, but it keeps throwing an exception on the line that creates the StreamingContext.
Here's my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Minutes, StreamingContext}

val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Minutes(15))
And here's the stack trace:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
at org.apache.spark.streaming.scheduler.JobGenerator.liftedTree1$1(JobGenerator.scala:52)
at org.apache.spark.streaming.scheduler.JobGenerator.<init>(JobGenerator.scala:51)
at org.apache.spark.streaming.scheduler.JobScheduler.<init>(JobScheduler.scala:55)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:184)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:76)
at vn.fpt.fplay.kafka.StreamConsumer$.main(StreamConsumer.scala:19)
at vn.fpt.fplay.kafka.StreamConsumer.main(StreamConsumer.scala)
I've searched everywhere but can't figure out what this error means. Does anybody know? Any help would be much appreciated.
As the stack trace shows, org.apache.spark.streaming.scheduler.JobGenerator (a class in spark-streaming) is trying to call a method on org.apache.spark.util.Utils (a class in spark-core) that does not exist at runtime.
There are two likely reasons for this:
Spark Core is not added to your project dependencies.
There is a version mismatch between the spark-core and spark-streaming libraries.
Check your sbt/Maven build file and either add the missing dependency or align both libraries on the same version.
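For example, in sbt you could pin both modules to a single Spark version (a minimal sketch; the version number here is only an illustration, use whatever your cluster actually runs):

// build.sbt sketch: keep spark-core and spark-streaming on the same version
val sparkVersion = "2.4.8" // illustrative; match your installed Spark

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)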
Related
I've been trying to learn Spark and Scala, and have an environment set up in IntelliJ.
I'd previously been using SparkContext to initialise my Spark instance successfully, using the following code:
import org.apache.spark._
val sc = new SparkContext("local[*]", "SparkTest")
When I tried to start loading .csv data in, most information I found used spark.read.format("csv").load("filename.csv") but this requires initialising a SparkSession object using:
val spark = SparkSession
  .builder()
  .master("local")
  .appName("Test")
  .getOrCreate()
But when I tried to use this, there doesn't seem to be any SparkSession in org.apache.spark._ in my version of Spark 3.x.
As far as I'm aware, using SparkContext directly is the Spark 1.x approach, and SparkSession is the Spark 2.x+ approach, where spark.sql is built into the SparkSession object.
My question is whether I'm incorrectly trying to load SparkSession or if there's a separate way to approach initialising Spark (and loading .csv files) in Spark 3?
Spark version: 3.3.0
Scala version: 2.13.8
If you are using a Maven project, try adding the dependencies to the POM file. Otherwise, for the sake of troubleshooting, create a new Maven project, add the dependencies, and check whether you still have the same issue.
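In this case the relevant dependency is most likely the spark-sql module (an assumption on my part, since the build file isn't shown): SparkSession lives in org.apache.spark.sql, not in org.apache.spark._, so with spark-sql on the classpath the import and the CSV load would look roughly like this:

import org.apache.spark.sql.SparkSession // provided by the spark-sql artifact, not spark-core

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Test")
  .getOrCreate()

// spark.read is available once the session exists
val df = spark.read
  .format("csv")
  .option("header", "true") // assumption: the file has a header row
  .load("filename.csv")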
How can I set a Hadoop version for a Spark application without submitting a jar and defining a specific Hadoop binary? And is it even possible?
I am just not really sure how the Hadoop version can be changed when submitting a Spark application.
Something like this does not work:
val sparkSession = SparkSession
  .builder
  .master("local[*]")
  .appName("SparkJobHDFSApp")
  .getOrCreate()
sparkSession.sparkContext.hadoopConfiguration.set("hadoop.common.configuration.version", "2.7.4")
It can't be done. The Spark master and workers each have their own Hadoop JARs on the classpath, and your own application must be compatible with them; setting a configuration value does not swap those JARs out.
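If it helps, you can at least confirm which Hadoop version is actually on the driver's classpath (a small sketch; it only reports the version, it cannot change it):

import org.apache.hadoop.util.VersionInfo

// Prints the version of the Hadoop JARs Spark was launched with;
// hadoopConfiguration.set(...) has no effect on this.
println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")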
I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The test should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
val spark = SparkSession
  .builder
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
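(For reference, conf here is just an ordinary local test configuration; the values below are placeholders I'm sketching in rather than my exact settings:)

import org.apache.spark.SparkConf

// Hypothetical local test configuration; the warehouse path is a placeholder.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("hive-unit-test")
  .set("spark.sql.warehouse.dir", "target/spark-warehouse")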
The creation of the Spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in Spark.
Is there another way to solve this?
I wouldn't want to keep track of this library and update it with every new Spark version.
I am upgrading from Spark 1.6 to 2.0 in a Play-Scala application and am not quite sure how to set the jar files I want. Previously I would define a SparkConf, and one of the methods I could call was setJars, which allowed me to specify all of the jar files I wanted. Now I am using the SparkSession builder to construct my Spark conf and Spark context, and I don't see any similar method for specifying the jar files. How can I do this?
Here is how I previously created my SparkConf:
val sparkConf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName(sparkAppName)
  .set("spark.yarn.jar", "hdfs:///user/hadoop/spark-assembly-1.6.1-hadoop2.7.2.jar")
  .set("spark.eventLog.dir", "hdfs:///var/log/spark/apps")
  .set("spark.eventLog.enabled", "true")
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/jre-1.8.0-openjdk")
  .setJars(Seq(
    "ALL JAR FILES LISTED HERE"
  ))
What can I do with the SparkSession builder to accomplish the same thing as setJars?
You can use the .config(key, value) method to set spark.jars:
SparkSession.builder
  .appName(sparkAppName)
  .master(sparkMaster)
  .config("spark.jars", commaSeparatedListOfJars)
  .config(/* other stuff */)
  .getOrCreate()
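Here spark.jars is just a comma-separated list of jar paths/URIs, so commaSeparatedListOfJars can be built like this (the paths below are placeholders):

// Placeholder jar locations; replace with your own.
val commaSeparatedListOfJars = Seq(
  "hdfs:///user/hadoop/lib/first.jar",
  "hdfs:///user/hadoop/lib/second.jar"
).mkString(",")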
I am trying to access streaming tweets with Spark Streaming.
This is the software configuration.
Ubuntu 14.04.2 LTS
scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
spark-submit --version
Spark version 1.6.0
Following is the code.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import Utilities._ // setupTwitter() and setupLogging() are defined in Utilities.scala below

object PrintTweets {
  def main(args: Array[String]) {
    // Configure Twitter credentials using twitter.txt
    setupTwitter()

    // Set up a Spark streaming context named "PrintTweets" that runs locally using
    // all CPU cores and one-second batches of data
    val ssc = new StreamingContext("local[*]", "PrintTweets", Seconds(1))

    // Get rid of log spam (should be called after the context is set up)
    setupLogging()

    // Create a DStream from Twitter using our streaming context
    val tweets = TwitterUtils.createStream(ssc, None)

    // Now extract the text of each status update into RDDs using map()
    val statuses = tweets.map(status => status.getText())

    // Print out the first ten
    statuses.print()

    // Kick it all off
    ssc.start()
    ssc.awaitTermination()
  }
}
Utilities.scala
object Utilities {

  /** Makes sure only ERROR messages get logged to avoid log spam. */
  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  /** Configures Twitter service credentials using twitter.txt in the main workspace directory */
  def setupTwitter() = {
    import scala.io.Source
    for (line <- Source.fromFile("./data/twitter.txt").getLines) {
      val fields = line.split(" ")
      if (fields.length == 2) {
        System.setProperty("twitter4j.oauth." + fields(0), fields(1))
      }
    }
  }
}
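For reference, the twitter.txt this helper reads is just one space-separated key/value pair per line, using the standard twitter4j OAuth property names (the values below are placeholders):

consumerKey YOUR_CONSUMER_KEY
consumerSecret YOUR_CONSUMER_SECRET
accessToken YOUR_ACCESS_TOKEN
accessTokenSecret YOUR_ACCESS_TOKEN_SECRET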
Issues:
Since it needs the twitter4j library, I added
twitter4j-core-4.0.4 and twitter4j-stream-4.0.4 to the Eclipse build path as external jars.
Then I ran the program; it didn't throw any error, but no tweets appeared in the console. The output was empty.
So, following some forum posts, I downgraded twitter4j to 3.0.3. I also chose the Scala 2.10 library container in the Eclipse Build Path window.
After that I got a java.lang.NoSuchMethodError at run time.
16/05/14 11:46:01 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StreamListener;)V
at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Please help me resolve this. I initially installed a Spark build that uses Scala 2.11. Is that the problem? Do I need to uninstall everything, install Scala 2.10, and then a Spark package pre-compiled for it?
Or, apart from Scala 2.11, do I need to have Scala 2.10 on my system as well?
The above exception seems to be caused by an incompatibility between Spark 1.6.0 and twitter4j 3.0.3.
The twitter4j.TwitterStream used in the onStart method of org.apache.spark.streaming.twitter.TwitterReceiver is expected to have an addListener method that takes an instance of twitter4j.StreamListener.
twitter4j 3.0.3 has no twitter4j.TwitterStream.addListener(StreamListener) method; it only has a few other addListener methods that take subclasses of StreamListener.
twitter4j 4.0.4 has the desired method, which is why no such error occurs with that library. So switching to twitter4j 3.0.3 will not solve the problem.
The problem is somewhere else.
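If it helps, a consistent set of dependencies (a sketch assuming sbt, Scala 2.11, and Spark 1.6.0 as reported above; the same versions apply if you add them as external jars in Eclipse) would look like this:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "1.6.0",
  "org.apache.spark" %% "spark-streaming"         % "1.6.0",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.6.0",
  // spark-streaming-twitter 1.6.0 already depends on twitter4j 4.0.x,
  // so do not force an older twitter4j onto the classpath
  "org.twitter4j" % "twitter4j-core"   % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)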
In my case, I had a Spark Java project.
I cleaned the POM file and started adding dependencies back in order: first I resolved the Spark-related errors, then spark-launcher, and then onward with the larger libraries.
Note that I was using a CDH 6.2.0 environment.