I am building a Spark application that I launch with a bash script, and I have only the spark-sql and spark-core dependencies in the build.sbt file. Every time I call RDD methods or convert the data to a case class to create a Dataset, I get this error:
Caused by: java.lang.NoClassDefFoundError: scala/Product$class
I suspect that it is a dependency error. So how should I change my dependencies to fix this?
dependencies list:
import sbt._
object Dependencies {
lazy val scalaCsv = "com.github.tototoshi" %% "scala-csv" % "1.3.5"
lazy val sparkSql = "org.apache.spark" %% "spark-sql" % "2.3.3"
lazy val sparkCore = "org.apache.spark" %% "spark-core" % "2.3.3"
}
build.sbt file:
import Dependencies._
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
scalaVersion := "2.11.12",
version := "test"
)),
name := "project",
libraryDependencies ++= Seq(scalaCsv, sparkSql, sparkCore),
mainClass in (Compile, run) := Some("testproject.spark.Main")
)
I launch the Spark app like this, with SPARK_HOME pointing to a Spark 2.3.3 installation:
#!/bin/sh
$SPARK_HOME/bin/spark-submit \
--class "testproject.spark.Main" \
--master local[*] \
target/scala-2.11/test.jar
I am not sure what exactly the problem was, but after I recreated the project and moved the source code into it, the error disappeared.
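For later readers: a commonly reported cause of NoClassDefFoundError: scala/Product$class is a Scala binary version mismatch, where code or a library compiled against Scala 2.11 ends up running on a Scala 2.12 runtime. Whether that was the cause here was never confirmed. The following is a minimal sketch for checking which Scala version is actually on the runtime classpath, using scala.util.Properties from the standard library. The object name ScalaVersionCheck is hypothetical.

```scala
object ScalaVersionCheck {
  // Version of the scala-library jar found on the runtime classpath, e.g. "2.11.12"
  def runtimeScalaVersion: String = scala.util.Properties.versionNumberString

  def main(args: Array[String]): Unit =
    println(s"Scala runtime version: $runtimeScalaVersion")
}
```

Submitting this with the same spark-submit script and comparing the output with scalaVersion in build.sbt (2.11.12 above) shows whether the two agree. spark-submit --version also reports the Scala version the Spark distribution was built with.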
Related
When I run sbt package on the code below, I get the following errors:
object apache is not a member of package org
not found: value SparkSession
My Spark version: 2.4.4
My Scala version: 2.11.12
My build.sbt:
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"
//libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies ++= {
val sparkVersion = "2.4.4"
Seq( "org.apache.spark" %% "spark-core" % sparkVersion)
}
my Scala project
import org.apache.spark.sql.SparkSession
object demoapp {
def main(args: Array[String]) {
val logfile = "C:/SUPPLENTA_INFORMATICS/demo/hello.txt"
val spark = SparkSession.builder.appName("Simple App in Scala").getOrCreate()
val logData = spark.read.textFile(logfile).cache()
val numAs = logData.filter(line => line.contains("Washington")).count()
println(s"Lines are: $numAs")
spark.stop()
}
}
If you want to use Spark SQL, you also have to add the spark-sql module to the dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
Also note that you have to reload your project in sbt after changing the build definition, and import the changes in IntelliJ.
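Put together, the build definition from the question could look roughly like this. It is a sketch that reuses only the names and versions already shown above.

```scala
// build.sbt (sketch)
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"

// One shared version value keeps the Spark modules in sync
val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  // SparkSession lives in the spark-sql module
  "org.apache.spark" %% "spark-sql" % sparkVersion
)
```

Running reload in the sbt shell (or restarting sbt) picks up the change before the next sbt package.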
I am new to Scala and sbt build files. According to the introductory tutorials, adding Spark dependencies to a Scala project should be straightforward via the sbt-spark-package plugin, but I am getting the following error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
Please point me to resources on what could be causing this error, as I want to understand the process more thoroughly.
CODE:
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("spark citation graph")
.getOrCreate()
}
val sc = spark.sparkContext
}
import org.apache.spark.graphx.GraphLoader
object Test extends SparkSessionWrapper {
def main(args: Array[String]) {
println("Testing, testing, testing, testing...")
val filePath = "Desktop/citations.txt"
val citeGraph = GraphLoader.edgeListFile(sc, filePath)
println(citeGraph.vertices.take(1))
}
}
plugins.sbt
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
build.sbt -- WORKING. Why does adding libraryDependencies make it run?
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0",
"org.apache.spark" %% "spark-graphx" % "2.2.0"
)
build.sbt -- NOT WORKING. I would expect this to compile and run correctly.
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
Bonus for explanation + links to resources to learn more about SBT build process, jar files, and anything else that can help me get up to speed!
The sbt-spark-package plugin adds the dependencies in provided scope:
sparkComponentSet.map { component =>
"org.apache.spark" %% s"spark-$component" % sparkVersion.value % "provided"
}.toSeq
We can confirm this by running show libraryDependencies from sbt:
[info] * org.scala-lang:scala-library:2.11.12
[info] * org.apache.spark:spark-core:2.2.0:provided
[info] * org.apache.spark:spark-sql:2.2.0:provided
[info] * org.apache.spark:spark-graphx:2.2.0:provided
provided scope means:
The dependency will be part of compilation and test, but excluded from
the runtime.
Thus sbt run throws java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
If we really want to include provided dependencies on the run classpath, then @douglaz suggests:
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
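As a rough sketch of how these pieces fit together without the plugin: the build below declares the same three Spark modules as provided by hand and then applies the run override quoted above. It assumes the sbt-spark-package plugin is not enabled, so sparkVersion here is an ordinary val, not the plugin's setting key. It uses the same sbt 0.13-style "in" syntax as the answer.

```scala
// build.sbt (sketch, plugin not enabled)
version := "0.1"
scalaVersion := "2.11.12"

// Plain value, not the sbt-spark-package setting key of the same name
val sparkVersion = "2.2.0"

// Same shape as what the plugin generates: compile-time only, absent from `sbt run`
libraryDependencies ++= Seq("core", "sql", "graphx").map { component =>
  "org.apache.spark" %% s"spark-$component" % sparkVersion % "provided"
}

// Put the provided dependencies back on the classpath used by `sbt run`
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```

Keeping the dependencies provided still leaves them out of the packaged jar, which is what is usually wanted when the jar is launched with spark-submit.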
I was building a small demo for Spark Streaming using Twitter. I added the required dependencies as shown at http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build the jar. The project builds successfully; the only problem seems to be that it is not able to find the TwitterUtils class.
The build file and the Scala code are given below.
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main scala file is
TwitterCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
// flatMap yields individual words; map would yield one array of words per status
val words = statuses.flatMap{status => status.split("""\s+""")}
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Then I build the project using
sbt package
and I submit the generated jar using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing the package name. Your spark-submit command should pass the fully qualified class name, like this:
--class com.spark.examples.TwitterCount
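To illustrate the rule: the value of --class is the fully qualified name of the object that defines main. The sketch below assumes the object is declared in the package com.spark.examples, which is the name used in this answer and not something shown in the question. The code in the question declares no package, in which case plain TwitterCount is already the fully qualified name.

```scala
package com.spark.examples {
  // With this package declaration the entry point is com.spark.examples.TwitterCount,
  // so the submit command would use: --class com.spark.examples.TwitterCount
  object TwitterCount {
    def main(args: Array[String]): Unit =
      println("entry point reached")
  }
}
```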
I finally found the solution in this question:
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I have to build the jar using
sbt assembly
but I am still wondering how it differs from the jar that I build using
sbt package
Does anyone know? Please share.
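On the difference: sbt package puts only the project's own compiled classes and resources into the jar, so libraries such as spark-streaming-twitter are missing at runtime unless they are supplied some other way. sbt assembly, which comes from the sbt-assembly plugin, builds a fat jar that also bundles the dependencies. Below is a minimal sketch of such a setup. The plugin version 0.14.10 is an assumption and should be matched to the sbt version in use. Marking the Spark modules as provided is a common convention, not something taken from the question. All other names and versions are copied from the question as-is.

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
  // Already on the classpath when launched via spark-submit, so kept out of the fat jar
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Not shipped with Spark, so these must be bundled into the fat jar
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
```

Depending on the dependencies, sbt-assembly may also need a merge strategy for files that appear in more than one jar.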
I'm trying to get the DataStax Spark Cassandra connector working. I've created a new sbt project in IntelliJ and added a single class. The class and my sbt file are given below. Creating a Spark context seems to work; however, the moment I uncomment the line where I try to create a cassandraTable, I get the following compilation error:
Error:scalac: bad symbolic reference. A signature in CassandraRow.class refers to term catalyst
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling CassandraRow.class.
Sbt is kind of new to me, and I would appreciate any help in understanding what this error means (and of course, how to resolve it).
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0-alpha2" withSources() withJavadoc()
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
And my class:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
object HelloWorld {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.cassandra.query.retry.count", "1")
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "cassandra-hostname")
      .set("spark.cassandra.username", "cassandra")
      .set("spark.cassandra.password", "cassandra")
    val sc = new SparkContext("local", "testingCassy", conf)
    //val foo = sc.cassandraTable("keyspace name", "table name")
    val rdd = sc.parallelize(1 to 100)
    val sum = rdd.reduce(_ + _)
    println(sum)
  }
}
You need to add spark-sql to the dependencies list:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Add library dependency in your project's pom.xml file. It seems they have changed the Vector.class dependencies location in the new refactoring.
I have a project foo with two children, foo-core and foo-cli; foo-cli depends on foo-core
(I come from Java/Maven and tried to transpose the parent module with 2 submodules architecture).
Following https://github.com/harrah/xsbt/wiki/Full-Configuration, I wrote my project/Build.scala this way:
import sbt._
import Keys._
object MyBuild extends Build {
//Dependencies
val slf4s = "com.weiglewilczek.slf4s" %% "slf4s" % "1.0.6"
val slf4j = "org.slf4j" % "slf4j-simple" % "1.5.6"
val grizzled = "org.clapper" %% "grizzled-slf4j" % "0.5"
val junit = "junit" % "junit" % "4.8" % "test"
//End dependencies
lazy val root : Project = Project("root", file(".")) aggregate(cli) settings(
mainClass:= Some("Main")
)
lazy val core : Project = Project("core", file("core"), delegates = root :: Nil) settings(
name := "foo-core",
libraryDependencies ++= Seq(grizzled)
)
lazy val cli: Project = Project("cli", file("cli")) dependsOn(core) settings(
name := "foo-cli",
libraryDependencies ++= Seq(grizzled)
)
}
This configuration does not work: the grizzled library is not downloaded when I run sbt reload; sbt +update (as indicated in http://software.clapper.org/grizzled-slf4j/), and thus the "import grizzli._" fails in my core and cli projects when I run sbt compile.
Since I'm new to Scala/sbt, I imagine I'm doing something awful, but I can't figure out what, since I'm confused by all the conflicting sbt 0.7/sbt 0.10 configurations that were suggested
(like Subproject dependencies in SBT).
Any ideas or hints that could help me?
Thanks in advance
That's grizzled, not grizzli, that you are using as a dependency. The import is:
import grizzled._
This works here from the console in both project cli and project core, with nothing more than the configuration file above.
Are you using SBT 0.10?
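For readers on a newer sbt: from sbt 0.13 onwards a multi-project build can be written directly in build.sbt, and the Build trait used above no longer exists in sbt 1.x. Below is a hedged sketch of the same parent-plus-two-children layout in that style. The dependency coordinates are copied from the question and would likely need a newer grizzled-slf4j release for a current Scala version. The root project name "foo" is an assumption.

```scala
// build.sbt (sketch for sbt 0.13 or later; replaces project/Build.scala)
lazy val grizzled = "org.clapper" %% "grizzled-slf4j" % "0.5"

lazy val core = (project in file("core"))
  .settings(
    name := "foo-core",
    libraryDependencies += grizzled
  )

lazy val cli = (project in file("cli"))
  .dependsOn(core) // cli sees core's classes and core's library dependencies
  .settings(
    name := "foo-cli"
  )

lazy val root = (project in file("."))
  .aggregate(core, cli) // tasks run on root are also run on core and cli
  .settings(
    name := "foo"
  )
```

Because cli depends on core, declaring grizzled once in core is enough here; with either layout the import in the sources is grizzled._.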