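I have a problem using the updateStateByKey() function. I have the following simple code (based on the book "Learning Spark: Lightning-Fast Big Data Analysis"):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object hello {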
def updateStateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
Some(runningCount.getOrElse(0) + newValues.size)
}
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[5]").setAppName("AndrzejApp")
val ssc = new StreamingContext(conf, Seconds(4))
ssc.checkpoint("/")
val lines7 = ssc.socketTextStream("localhost", 9997)
val keyValueLine7 = lines7.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
ssc.start()
ssc.awaitTermination()
}
}
My build.sbt is:
name := "stream-correlator-spark"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.3.1" % "provided"
)
When I build it with the sbt assembly command, everything goes fine. When I run it on a Spark cluster in local mode, I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/dstream/DStream$
at hello$.main(helo.scala:25)
...
Line 25 is:
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
I suspect a version compatibility problem, but I don't know what the cause might be or how to resolve it.
I would be really grateful for help!
Marking a dependency as "provided" in sbt means exactly that: the library is supplied by the runtime environment and does not need to be included in the package.
Try removing the "provided" scope from the "spark-streaming" library.
You can add "provided" back when you submit your app to a Spark cluster. The benefit of "provided" is that the resulting fat jar does not include classes from those dependencies, so it is much smaller. In my case, the jar was around 90 MB without "provided" and shrank to a little over 30 MB with it.
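As a minimal sketch of the suggestion above, this is the question's build.sbt with the "provided" scope dropped from spark-streaming only. The versions are copied from the question and nothing else is changed; whether this resolves the error also depends on the Spark version installed on the cluster, which the question does not state.

```scala
// build.sbt (sketch): spark-streaming is no longer "provided",
// so sbt assembly bundles its classes into the fat jar.
name := "stream-correlator-spark"

version := "1.0"

scalaVersion := "2.11.4"

libraryDependencies ++= Seq(
  // still supplied by the Spark installation at runtime
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  // now packaged with the application
  "org.apache.spark" %% "spark-streaming" % "1.3.1"
)
```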
When I run sbt package on the code below, I get the following errors:
object apache is not a member of package org
not found: value SparkSession
My Spark version: 2.4.4
My Scala version: 2.11.12
My build.sbt:
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"
//libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies ++= {
val sparkVersion = "2.4.4"
Seq( "org.apache.spark" %% "spark-core" % sparkVersion)
}
My Scala code:
import org.apache.spark.sql.SparkSession
object demoapp {
def main(args: Array[String]) {
val logfile = "C:/SUPPLENTA_INFORMATICS/demo/hello.txt"
val spark = SparkSession.builder.appName("Simple App in Scala").getOrCreate()
val logData = spark.read.textFile(logfile).cache()
val numAs = logData.filter(line => line.contains("Washington")).count()
println(s"Lines are: $numAs")
spark.stop()
}
}
If you want to use Spark SQL, you also have to add the spark-sql module to the dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
Also, note that you have to reload your project in sbt after changing the build definition, and import the changes in IntelliJ.
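Putting the answer together with the question's build file, a sketch of the complete build.sbt could look like the following. It reuses the question's sparkVersion value so that both Spark modules stay on the same version; the name and versions are copied from the question.

```scala
// build.sbt (sketch): spark-core and spark-sql pinned to one Spark version
name := "simpleApp"

version := "1.0"

scalaVersion := "2.11.12"

val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  // SparkSession lives in the spark-sql module
  "org.apache.spark" %% "spark-sql" % sparkVersion
)
```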
I have the following Scala code and am using sbt to compile and run it. sbt run works as expected.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{StreamingContext, Seconds}
import com.couchbase.spark.streaming._
object StreamingExample {
def main(args: Array[String]): Unit = {
// Create the Spark Config and instruct to use the travel-sample bucket
// with no password.
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("StreamingExample")
.set("com.couchbase.bucket.travel-sample", "")
// Initialize StreamingContext with a Batch interval of 5 seconds
val ssc = new StreamingContext(conf, Seconds(5))
// Consume the DCP Stream from the beginning and never stop.
// This counts the messages per interval and prints their count.
ssc
.couchbaseStream(from = FromBeginning, to = ToInfinity)
.foreachRDD(rdd => {
rdd.foreach(message => {
//println(message.getClass());
message.getClass();
if(message.isInstanceOf[Mutation]) {
val document = message.asInstanceOf[Mutation].key.map(_.toChar).mkString
println("mutated: " + document);
} else if( message.isInstanceOf[Deletion]) {
val document = message.asInstanceOf[Deletion].key.map(_.toChar).mkString
println("deleted: " + document);
}
})
})
// Start the Stream and await termination
ssc.start()
ssc.awaitTermination()
}
}
But it fails when run as a Spark job like this:
spark-submit --class "StreamingExample" --master "local[*]" target/scala-2.11/spark-samples_2.11-1.0.jar
The error is java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key()
Here is my build.sbt:
lazy val root = (project in file(".")).
settings(
name := "spark-samples",
version := "1.0",
scalaVersion := "2.11.12",
mainClass in Compile := Some("StreamingExample")
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
The Spark version running on my machine is 2.4.0, using Scala 2.11.12.
Observations:
I do not see com.couchbase.client_spark-connector_2.11-2.2.0 in my spark jars ( /usr/local/Cellar/apache-spark/2.4.0/libexec/jars ), but the older version com.couchbase.client_spark-connector_2.10-1.2.0.jar exists.
Why is spark-submit not working?
How does sbt manage to run this? Where does it download the dependencies?
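Make sure that the Scala version and the Spark connector version that sbt compiles against match the ones on your Spark installation's classpath. The older spark-connector_2.10-1.2.0 jar you found in the Spark jars directory is a likely culprit: if spark-submit loads it instead of version 2.2.0, a method that exists only in the newer version would fail with NoSuchMethodError.
I ran into a similar problem when trying to run a sample Flink job on my system; it was caused by a version mismatch.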
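One way to check for such a mismatch is to print, at runtime, which jar a class was actually loaded from. The sketch below uses only standard JDK and Scala library calls; the WhichJar object and its locationOf helper are hypothetical names introduced for illustration. In the streaming job you would pass classOf[Mutation] instead of the stand-in class used here.

```scala
// Diagnostic sketch: report the Scala version and the jar a class came from.
object WhichJar {

  // Returns the code source location of a class, or a placeholder when the
  // JVM does not expose one (for example, for JDK bootstrap classes).
  def locationOf(cls: Class[_]): String =
    Option(cls.getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("<bootstrap or unknown>")

  def main(args: Array[String]): Unit = {
    // Scala library version on the runtime classpath
    println(s"Scala version: ${scala.util.Properties.versionNumberString}")
    // Stand-in class; replace with the class named in the NoSuchMethodError
    println(s"scala.Option loaded from: ${locationOf(classOf[Option[_]])}")
  }
}
```

Running this once through sbt run and once through spark-submit should show whether the two launch paths pick up different jars.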
I was building a small demo for Spark Streaming using Twitter. I added the required dependencies as shown at http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build the jar. The project builds successfully; the only problem seems to be that it cannot find the TwitterUtils class.
The build file and Scala code are given below.
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main scala file is
TwitterCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
val words = statuses.flatMap{status => status.split("""\s+""")} // flatMap yields one element per word
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start
ssc.awaitTermination()
}
}
Then I build the project using
sbt package
and submit the generated jar using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing the package name. Your spark-submit command should look like this:
--class com.spark.examples.TwitterCount
I found the solution at last, in this question:
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I had to build the jar using
sbt assembly
but I'm still wondering how that jar differs from the one I get from
sbt package
Does anyone know? Please share.
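On the difference: sbt package puts only the project's own compiled classes into the jar, while sbt assembly (from the sbt-assembly plugin) builds a fat jar that also contains the classes of the library dependencies. TwitterUtils is not part of the Spark distribution, so it has to travel inside the application jar. A minimal sketch of the setup follows; the plugin version is an assumption and should be matched to your sbt release, and the dependency versions are copied unchanged from the question.

```scala
// project/plugins.sbt (assumed plugin version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt (dependency section only)
val sparkVersion = "1.6.1"

libraryDependencies ++= Seq(
  // supplied by spark-submit at runtime, so kept out of the fat jar
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // not shipped with Spark, so these must be bundled
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
```

With the plugin's default naming, the fat jar should end up under target/scala-2.11/ with "-assembly" in its file name, and that is the jar to pass to spark-submit.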
I'm trying to get the DataStax Spark Cassandra connector working. I've created a new sbt project in IntelliJ and added a single class. The class and my sbt file are given below. Creating a Spark context seems to work; however, the moment I uncomment the line where I try to create a cassandraTable, I get the following compilation error:
Error:scalac: bad symbolic reference. A signature in CassandraRow.class refers to term catalyst
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling CassandraRow.class.
Sbt is kind of new to me, and I would appreciate any help in understanding what this error means (and of course, how to resolve it).
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0-alpha2" withSources() withJavadoc()
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
And my class:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
object HelloWorld { def main(args:Array[String]): Unit ={
System.setProperty("spark.cassandra.query.retry.count", "1")
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "cassandra-hostname")
.set("spark.cassandra.username", "cassandra")
.set("spark.cassandra.password", "cassandra")
val sc = new SparkContext("local", "testingCassy", conf)
//val foo = sc.cassandraTable("keyspace name", "table name")
val rdd = sc.parallelize(1 to 100)
val sum = rdd.reduce(_+_)
println(sum) } }
You need to add spark-sql to the list of dependencies:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
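For context, a sketch of the question's dependency section with spark-sql added is shown below. The error message refers to org.apache.spark.sql.catalyst, which is why the connector needs a Spark SQL module on the compile classpath. Writing the connector with %% (so that sbt appends the Scala binary suffix) is an assumption on my part; the question uses a single % there.

```scala
// build.sbt (sketch): spark-sql added next to spark-core, same Spark version
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  // brings in the org.apache.spark.sql.catalyst classes the connector refers to
  "org.apache.spark" %% "spark-sql" % "1.1.0",
  // %% is assumed here; it resolves to spark-cassandra-connector_2.10
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)
```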
Add the library dependency to your project's pom.xml file. It seems the location of the Vector.class dependency changed in the recent refactoring.
I'm trying to get a simple "hello world" server running using spray with Scala 2.11:
import spray.routing.SimpleRoutingApp
import akka.actor.ActorSystem
object SprayTest extends App with SimpleRoutingApp {
implicit val system = ActorSystem("my-system")
startServer(interface = "localhost", port = 8080) {
path("hello") {
get {
complete {
<h1>Say hello to spray</h1>
}
}
}
}
}
However, I receive the following compile errors:
Multiple markers at this line
- not found: value port
- bad symbolic reference to spray.can encountered in class file 'SimpleRoutingApp.class'. Cannot
access term can in package spray. The current classpath may be missing a definition for spray.can, or
SimpleRoutingApp.class may have been compiled against a version that's incompatible with the one
found on the current classpath.
- not found: value interface
Does anyone know what might be the issue? BTW, I'm very new to spray and actors, so I lack a lot of intuition for how spray and actors work (that's why I'm doing this simple tutorial).
Finally found the answer myself. I needed to add the spray-can dependency to my pom file. Leaving this question and answer in case anyone else runs into the same problem.
SBT example:
scalaVersion := "2.10.4"
val akkaVersion = "2.3.6"
val sprayVersion = "1.3.2"
resolvers ++= Seq(
"Spray Repository" at "http://repo.spray.io/"
)
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"io.spray" %% "spray-can" % sprayVersion,
"io.spray" %% "spray-routing" % sprayVersion
)