Compilation errors with spark cassandra connector and SBT - scala

I'm trying to get the DataStax Spark Cassandra connector working. I've created a new SBT project in IntelliJ and added a single class. The class and my sbt file are given below. Creating Spark contexts seems to work; however, the moment I uncomment the line where I try to create a cassandraTable, I get the following compilation error:
Error:scalac: bad symbolic reference. A signature in CassandraRow.class refers to term catalyst
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling CassandraRow.class.
Sbt is kind of new to me, and I would appreciate any help in understanding what this error means (and of course, how to resolve it).
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0-alpha2" withSources() withJavadoc()
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
And my class:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
object HelloWorld {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.cassandra.query.retry.count", "1")
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "cassandra-hostname")
      .set("spark.cassandra.username", "cassandra")
      .set("spark.cassandra.password", "cassandra")
    val sc = new SparkContext("local", "testingCassy", conf)
    //val foo = sc.cassandraTable("keyspace name", "table name")
    val rdd = sc.parallelize(1 to 100)
    val sum = rdd.reduce(_ + _)
    println(sum)
  }
}

You need to add spark-sql to the dependencies list:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
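For context, the error says that CassandraRow.class was compiled against the package org.apache.spark.sql.catalyst, which ships with the Spark SQL artifacts and is therefore missing when only spark-core is on the classpath. A minimal sketch of a consistent build.sbt follows. It assumes the Spark and connector versions from the question; the sparkVersion helper val and the switch to %% for the connector are my own additions, not something the question confirms.

```scala
// build.sbt (sketch): keep all Spark artifacts on one Scala and Spark version
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"

// Hypothetical helper val, introduced here so the versions cannot drift apart.
val sparkVersion = "1.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  // Brings in org.apache.spark.sql.catalyst, which CassandraRow refers to.
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  // %% appends the Scala binary version (_2.10) to the artifact name.
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)
```

Using %% throughout lets sbt pick the artifact matching scalaVersion, instead of hard-coding a suffix such as spark-core_2.10.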

Add the library dependency to your project's pom.xml file. It seems the location of the Vector.class dependency changed in the recent refactoring.

Related

Error while running sbt package: object apache is not a member of package org

When I run sbt package on the code below, I get the following errors:
object apache is not a member of package org
not found: value SparkSession
My Spark version: 2.4.4
My Scala version: 2.11.12
My build.sbt:
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"
//libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies ++= {
val sparkVersion = "2.4.4"
Seq( "org.apache.spark" %% "spark-core" % sparkVersion)
}
My Scala project:
import org.apache.spark.sql.SparkSession
object demoapp {
def main(args: Array[String]) {
val logfile = "C:/SUPPLENTA_INFORMATICS/demo/hello.txt"
val spark = SparkSession.builder.appName("Simple App in Scala").getOrCreate()
val logData = spark.read.textFile(logfile).cache()
val numAs = logData.filter(line => line.contains("Washington")).count()
println(s"Lines are: $numAs")
spark.stop()
}
}
If you want to use Spark SQL, you also have to add the spark-sql module to the dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
Also, note that you have to reload your project in SBT after changing the build definition, and import the changes in IntelliJ.
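Put together, the build definition might look like the sketch below. It reuses the names and versions from the question's build.sbt; the only addition is the spark-sql line, since SparkSession is defined in that artifact rather than in spark-core.

```scala
// build.sbt (sketch based on the build file in the question)
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= {
  val sparkVersion = "2.4.4"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    // spark-sql provides org.apache.spark.sql.SparkSession
    "org.apache.spark" %% "spark-sql" % sparkVersion
  )
}
```

After saving the file, running reload in an open sbt shell (or restarting sbt) picks up the new dependency before the next sbt package.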

Understanding build.sbt with sbt-spark-package plugin

I am new to Scala and SBT build files. According to the introductory tutorials, adding Spark dependencies to a Scala project should be straightforward via the sbt-spark-package plugin, but I am getting the following error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
Please point me to resources on what could be causing this error, as I want to understand the process more thoroughly.
CODE:
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("spark citation graph")
.getOrCreate()
}
val sc = spark.sparkContext
}
import org.apache.spark.graphx.GraphLoader
object Test extends SparkSessionWrapper {
def main(args: Array[String]) {
println("Testing, testing, testing, testing...")
// The path was declared as filePath but used as filepath; Scala names are case-sensitive.
val filePath = "Desktop/citations.txt"
val citeGraph = GraphLoader.edgeListFile(sc, filePath)
println(citeGraph.vertices.take(1))
}
}
plugins.sbt
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
build.sbt -- WORKING. Why does adding libraryDependencies make it run?
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0",
"org.apache.spark" %% "spark-graphx" % "2.2.0"
)
build.sbt -- NOT WORKING. I would expect this to compile and run correctly:
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
Bonus for explanation + links to resources to learn more about SBT build process, jar files, and anything else that can help me get up to speed!
The sbt-spark-package plugin adds the Spark dependencies in provided scope:
sparkComponentSet.map { component =>
"org.apache.spark" %% s"spark-$component" % sparkVersion.value % "provided"
}.toSeq
We can confirm this by running show libraryDependencies from sbt:
[info] * org.scala-lang:scala-library:2.11.12
[info] * org.apache.spark:spark-core:2.2.0:provided
[info] * org.apache.spark:spark-sql:2.2.0:provided
[info] * org.apache.spark:spark-graphx:2.2.0:provided
provided scope means:
The dependency will be part of compilation and test, but excluded from
the runtime.
Thus sbt run throws java.lang.NoClassDefFoundError: org/apache/spark/SparkContext.
If we really want to include provided dependencies on the run classpath, then @douglaz suggests:
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
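On sbt 1.x the same override is usually written with the slash syntax, since the `in` form is deprecated there. A sketch of the equivalent setting, assuming an sbt 1.x build (the question does not state its sbt version):

```scala
// build.sbt (sketch, sbt 1.x slash syntax)
// Run with the Compile classpath, which still contains "provided" dependencies,
// instead of the Runtime classpath, which excludes them.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```

This changes only what sbt run sees; the packaged jar still leaves out the provided Spark classes, which is what you want when the jar is later handed to spark-submit.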

java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterUtils$

I was building a small demo for Spark Streaming using Twitter. I have added the required dependencies as shown at http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build the jars. The project builds successfully; the only problem is that it cannot find the TwitterUtils class at runtime.
The build file and Scala code are given below.
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main Scala file is
TwitterCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
val words = statuses.flatMap{status => status.split("""\s+""")}
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start
ssc.awaitTermination()
}
}
Then I am building the project using
sbt package
and I am submitting the generated jar using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing the package name. Your spark-submit command should look like this:
--class com.spark.examples.TwitterCount
I finally found the solution in this question:
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I have to build the jar using
sbt assembly
but I'm still wondering how that jar differs from the one I get with
sbt package
Does anyone know? Please share.
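On the difference between the two commands: as far as I know, sbt package puts only the project's own compiled classes into the jar, while sbt assembly (a task from the sbt-assembly plugin) also bundles the library dependencies into one fat jar. That would explain why TwitterUtils is missing at runtime with the first and found with the second. A minimal sketch of such a setup follows. The plugin version and the merge strategy are assumptions for illustration, and the library versions are copied from the question without checking that they are compatible with each other.

```scala
// project/plugins.sbt (sketch; the plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt (sketch)
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"

libraryDependencies ++= Seq(
  // Supplied by spark-submit at runtime, so kept out of the fat jar.
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Not part of the Spark distribution, so it must be bundled.
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)

// Resolve duplicate files that appear when many jars are merged into one.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
```

With this in place, sbt assembly writes a jar under target/scala-2.11/ that contains the Twitter connector classes, and that jar is the one to pass to spark-submit.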

updateStateByKey, noClassDefFoundError

I have a problem using the updateStateByKey() function. I have the following simple code (based on the book "Learning Spark: Lightning-Fast Big Data Analysis"):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object hello {
def updateStateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
Some(runningCount.getOrElse(0) + newValues.size)
}
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[5]").setAppName("AndrzejApp")
val ssc = new StreamingContext(conf, Seconds(4))
ssc.checkpoint("/")
val lines7 = ssc.socketTextStream("localhost", 9997)
val keyValueLine7 = lines7.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
ssc.start()
ssc.awaitTermination()
}
}
My build.sbt is:
name := "stream-correlator-spark"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.3.1" % "provided"
)
When I build it with the sbt assembly command, everything goes fine. When I run it on Spark in local mode, I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/dstream/DStream$
at hello$.main(helo.scala:25)
...
line 25 is:
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
I feel this might be some compatibility version problem but I don't know what might be the reason and how to resolve this.
I would be really grateful for help!
Marking a dependency as "provided" in SBT means exactly that: the library is supplied by the runtime environment and does not need to be included in the package.
Try removing the "provided" marker from the "spark-streaming" library.
You can add "provided" back when you submit your app to a Spark cluster. The benefit of "provided" is that the resulting fat jar does not include classes from the provided dependencies, so it is much smaller. In my case, the jar is around 90M without "provided" and shrinks to 30+M with it.
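One way to avoid editing the build file each time is to choose the scope when sbt is started. The sketch below is only an illustration: "spark.scope" is a hypothetical system property name invented for this example, and the versions are taken from the question.

```scala
// build.sbt (sketch): choose the Spark dependency scope per invocation
// "spark.scope" is a hypothetical property name; it defaults to "provided".
val sparkScope = sys.props.getOrElse("spark.scope", "provided")

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.1" % sparkScope,
  "org.apache.spark" %% "spark-streaming" % "1.3.1" % sparkScope
)
```

Under these assumptions, sbt -Dspark.scope=compile run would put Spark on the run classpath for local testing, while a plain sbt assembly keeps the dependencies "provided" and produces the smaller jar for the cluster.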

Error in hello world spray app with scala 2.11

I'm trying to get a simple "hello world" server running using spray with scala 2.11:
import spray.routing.SimpleRoutingApp
import akka.actor.ActorSystem
object SprayTest extends App with SimpleRoutingApp {
implicit val system = ActorSystem("my-system")
startServer(interface = "localhost", port = 8080) {
path("hello") {
get {
complete {
<h1>Say hello to spray</h1>
}
}
}
}
}
However, I receive the following compile errors:
Multiple markers at this line
- not found: value port
- bad symbolic reference to spray.can encountered in class file 'SimpleRoutingApp.class'. Cannot
access term can in package spray. The current classpath may be missing a definition for spray.can, or
SimpleRoutingApp.class may have been compiled against a version that's incompatible with the one
found on the current classpath.
- not found: value interface
Does anyone know what might be the issue? BTW, I'm very new to spray and actors, so I lack a lot of intuition for how spray and actors work (that's why I'm doing this simple tutorial).
Finally found the answer myself. I needed to add the spray-can dependency to my pom file. Leaving this question and answer in case anyone else runs into the same problem.
SBT example:
scalaVersion := "2.10.4"
val akkaVersion = "2.3.6"
val sprayVersion = "1.3.2"
resolvers ++= Seq(
"Spray Repository" at "http://repo.spray.io/"
)
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"io.spray" %% "spray-can" % sprayVersion,
"io.spray" %% "spray-routing" % sprayVersion
)