Understanding build.sbt with sbt-spark-package plugin - scala

I am new the scala and SBT build files. From the introductory tutorials adding spark dependencies to a scala project should be straight-forward via the sbt-spark-package plugin but I am getting the following error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
Please provide resources to learn more about what could be driving error as I want to understand process more thoroughly.
CODE:
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("spark citation graph")
.getOrCreate()
}
val sc = spark.sparkContext
}
import org.apache.spark.graphx.GraphLoader
object Test extends SparkSessionWrapper {
def main(args: Array[String]) {
println("Testing, testing, testing, testing...")
var filePath = "Desktop/citations.txt"
val citeGraph = GraphLoader.edgeListFile(sc, filepath)
println(citeGraph.vertices.take(1))
}
}
plugins.sbt
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
build.sbt -- WORKING. Why does libraryDependencies run/work ?
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0",
"org.apache.spark" %% "spark-graphx" % "2.2.0"
)
build.sbt -- NOT WORKING. Would expect this to compile & run correctly
spName := "yewno/citation_graph"
version := "0.1"
scalaVersion := "2.11.12"
sparkVersion := "2.2.0"
sparkComponents ++= Seq("core", "sql", "graphx")
Bonus for explanation + links to resources to learn more about SBT build process, jar files, and anything else that can help me get up to speed!

sbt-spark-package plugin provides dependencies in provided scope:
sparkComponentSet.map { component =>
"org.apache.spark" %% s"spark-$component" % sparkVersion.value % "provided"
}.toSeq
We can confirm this by running show libraryDependencies from sbt:
[info] * org.scala-lang:scala-library:2.11.12
[info] * org.apache.spark:spark-core:2.2.0:provided
[info] * org.apache.spark:spark-sql:2.2.0:provided
[info] * org.apache.spark:spark-graphx:2.2.0:provided
provided scope means:
The dependency will be part of compilation and test, but excluded from
the runtime.
Thus sbt run throws java.lang.NoClassDefFoundError: org/apache/spark/SparkContext
If we really want to include provided dependencies on run classpath then #douglaz suggests:
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated

Related

Scala Flink get java.lang.NoClassDefFoundError: scala/Product$class after using case class for customized DeserializationSchema

It work fine when using generic class.
But get java.lang.NoClassDefFoundError: scala/Product$class error after change class to case class.
Not sure is sbt packaging problem or code problem.
When I'm using:
sbt
scala: 2.11.12
java: 8
sbt assembly to package
package example
import java.util.Properties
import java.nio.charset.StandardCharsets
import org.apache.flink.api.scala._
import org.apache.flink.streaming.util.serialization.{DeserializationSchema, SerializationSchema}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.api.common.typeinfo.TypeInformation
import Config._
case class Record(
id: String,
startTime: Long
) {}
class RecordDeSerializer extends DeserializationSchema[Record] with SerializationSchema[Record] {
override def serialize(record: Record): Array[Byte] = {
return "123".getBytes(StandardCharsets.UTF_8)
}
override def deserialize(b: Array[Byte]): Record = {
Record("1", 123)
}
override def isEndOfStream(record: Record): Boolean = false
override def getProducedType: TypeInformation[Record] = {
createTypeInformation[Record]
}
}
object RecordConsumer {
def main(args: Array[String]): Unit = {
val config : Properties = {
var p = new Properties()
p.setProperty("zookeeper.connect", Config.KafkaZookeeperServers)
p.setProperty("bootstrap.servers", Config.KafkaBootstrapServers)
p.setProperty("group.id", Config.KafkaGroupID)
p
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(1000)
var consumer = new FlinkKafkaConsumer[Record](
Config.KafkaTopic,
new RecordDeSerializer(),
config
)
consumer.setStartFromEarliest()
val stream = env.addSource(consumer).print
env.execute("record consumer")
}
}
Error
2020-08-05 04:07:33,963 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 1670 of job 4de8831901fa72790d0a9a973cc17dde.
java.lang.NoClassDefFoundError: scala/Product$class
...
build.SBT
First idea is that maybe version is not right.
But every thing work fine if use normal class
Here is build.sbt
ThisBuild / resolvers ++= Seq(
"Apache Development Snapshot Repository" at "https://repository.apache.org/content/repositories/snapshots/",
Resolver.mavenLocal
)
name := "deedee"
version := "0.1-SNAPSHOT"
organization := "dexterlab"
ThisBuild / scalaVersion := "2.11.8"
val flinkVersion = "1.8.2"
val flinkDependencies = Seq(
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka" % flinkVersion,
)
val thirdPartyDependencies = Seq(
"com.github.nscala-time" %% "nscala-time" % "2.24.0",
"com.typesafe.play" %% "play-json" % "2.6.14",
)
lazy val root = (project in file(".")).
settings(
libraryDependencies ++= flinkDependencies,
libraryDependencies ++= thirdPartyDependencies,
libraryDependencies += "org.scala-lang" % "scala-compiler" % scalaVersion.value,
)
assembly / mainClass := Some("dexterlab.TelecoDataConsumer")
// make run command include the provided dependencies
Compile / run := Defaults.runTask(Compile / fullClasspath,
Compile / run / mainClass,
Compile / run / runner
).evaluated
// stays inside the sbt console when we press "ctrl-c" while a Flink programme executes with "run" or "runMain"
Compile / run / fork := true
Global / cancelable := true
// exclude Scala library from assembly
assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = false)
autoCompilerPlugins := true
Finally success after I add this line in build.sbt
assembly / assemblyOption := (assemblu / assemblyOption).value.copy(includeScala = true)
To include scala library when running sbt assembly

Error while running sbt package: object apache is not a member of package org

When I try sbt package in my below code I get these following errors
object apache is not a member of package org
not found: value SparkSession
MY Spark Version: 2.4.4
My Scala Version: 2.11.12
My build.sbt
name := "simpleApp"
version := "1.0"
scalaVersion := "2.11.12"
//libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies ++= {
val sparkVersion = "2.4.4"
Seq( "org.apache.spark" %% "spark-core" % sparkVersion)
}
my Scala project
import org.apache.spark.sql.SparkSession
object demoapp {
def main(args: Array[String]) {
val logfile = "C:/SUPPLENTA_INFORMATICS/demo/hello.txt"
val spark = SparkSession.builder.appName("Simple App in Scala").getOrCreate()
val logData = spark.read.textFile(logfile).cache()
val numAs = logData.filter(line => line.contains("Washington")).count()
println(s"Lines are: $numAs")
spark.stop()
}
}
If you want to use Spark SQL, you also have to add the spark-sql module to the dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
Also, note that you have to reload your project in SBT after changing the build definition and import the changes in intelliJ.

NoClassDefFoundError: scala/Product$class in spark app

I am building a Spark application with bash script and I have an only spark-sql and core dependencies in the build.sbt file. So every time I call some rdd methods or convert the data to case class for dataset creation I get this error:
Caused by: java.lang.NoClassDefFoundError: scala/Product$class
I suspect that it is a dependency error. So how should I change my dependencies to fix this?
dependencies list:
import sbt._
object Dependencies {
lazy val scalaCsv = "com.github.tototoshi" %% "scala-csv" % "1.3.5"
lazy val sparkSql = "org.apache.spark" %% "spark-sql" % "2.3.3"
lazy val sparkCore = "org.apache.spark" %% "spark-core" % "2.3.3"
}
build.sbt file:
import Dependencies._
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
scalaVersion := "2.11.12",
version := "test"
)),
name := "project",
libraryDependencies ++= Seq(scalaCsv, sparkSql, sparkCore),
mainClass in (Compile, run) := Some("testproject.spark.Main")
)
I launch spark app with spark 2.3.3 as my spark home directory like this:
#!/bin/sh
$SPARK_HOME/bin/spark-submit \
--class "testproject.spark.Main " \
--master local[*] \
target/scala-2.11/test.jar
Not sure what was the problem exactly, however, I have recreated the project and moved the source code there. The error disappeared

"object github is not a member of package com" for WeKanban2 example project in Scala in action book

I'm new in Scala so started from reading "Scala in Action" book 2013 year. But I cannot build plain sbt WeKanban2 project. I use newer version scala and sbt then in example.
I get error when run sbt:
error: object github is not a member of package com import com.github.siasia.WebPlugin._
my build.properties:
//was sbt.version=0.12.0
sbt.version=0.13.7
plugins.sbt
// was libraryDependencies <+= sbtVersion(v => "com.github.siasia" %% "xsbt-eb-plugin" (v+"-0.2.11.1"))
addSbtPlugin("com.earldouglas" % "xsbt-web-plugin" % "0.7.0")
build.sbt
import com.github.siasia.WebPlugin._
name := "weKanban"
organization := "scalainaction"
version := "0.2"
//was: scalaVersion := "2.10.0"
scalaVersion := "2.11.4"
scalacOptions ++= Seq("-unchecked", "-deprecation")
resolvers ++= Seq(
"Scala-Tools Maven2 Releases Repository" at "http://scala-tools.org/repo-releases",
"Scala-Tools Maven2 Snapshots Repository" at "http://scala-tools.org/repo-snapshots"
)
libraryDependencies ++= Seq(
"org.scalaz" %% "scalaz-core" % scalazVersion,
"org.scalaz" %% "scalaz-http" % scalazVersion,
"org.eclipse.jetty" % "jetty-servlet" % jettyVersion % "container",
"org.eclipse.jetty" % "jetty-webapp" % jettyVersion % "test, container",
"org.eclipse.jetty" % "jetty-server" % jettyVersion % "container",
"com.h2database" % "h2" % "1.2.137",
"org.squeryl" % "squeryl_2.10" % "0.9.5-6"
)
seq(webSettings :_*)
build.scala
import sbt._
import Keys._
object H2TaskManager {
var process: Option[Process] = None
lazy val H2 = config("h2") extend(Compile)
val startH2 = TaskKey[Unit]("start", "Starts H2 database")
val startH2Task = startH2 in H2 <<= (fullClasspath in Compile) map { cp =>
startDatabase(cp.map(_.data).map(_.getAbsolutePath()).filter(_.contains("h2database")))}
def startDatabase(paths: Seq[String]) = {
process match {
case None =>
val cp = paths.mkString(System.getProperty("path.seperator"))
val command = "java -cp " + cp + " org.h2.tools.Server"
println("Starting Database with command: " + command)
process = Some(Process(command).run())
println("Database started ! ")
case Some(_) =>
println("H2 Database already started")
}
}
val stopH2 = TaskKey[Unit]("stop", "Stops H2 database")
val stopH2Task = stopH2 in H2 :={
process match {
case None => println("Database already stopped")
case Some(_) =>
println("Stopping database...")
process.foreach{_.destroy()}
process = None
println("Database stopped...")
}
}
}
object MainBuild extends Build {
import H2TaskManager._
lazy val scalazVersion = "6.0.3"
lazy val jettyVersion = "7.3.0.v20110203"
lazy val wekanban = Project(
"wekanban",
file(".")) settings(startH2Task, stopH2Task)}
What should I do to fix this error? What should I do to start app from Idea 14 in debug mode?
Your import
import com.github.siasia.WebPlugin._
failed to find path github and below. In your plugins.sbt you wrote:
// was libraryDependencies <+= sbtVersion(v => "com.github.siasia" %% "xsbt-eb-plugin" (v+"-0.2.11.1"))
addSbtPlugin("com.earldouglas" % "xsbt-web-plugin" % "0.7.0")
The github-path probably existed in the outcommented com.github.siasia.xsbt-eb-plugin. However the com.earldouglas.xsbt-web-plugin have only the path com.earldouglas.xwp. You should check out if any of the classes there (https://github.com/earldouglas/xsbt-web-plugin/tree/master/src/main/scala) resembles what you need to import.
Probably not related exactly to this package but similar. The library was present in sbt and everything looked ok. But I still was unable to build the project so I went to my $HOME/.ivy2/cache located the dependency directory and deleted it, then went back the $PROJECT_HOME/ and executed sbt build

Compilation errors with spark cassandra connector and SBT

I'm trying to get the DataStax spark cassandra connector working. I've created a new SBT project in IntelliJ, and added a single class. The class and my sbt file is given below. Creating spark contexts seem to work, however, the moment I uncomment the line where I try to create a cassandraTable, I get the following compilation error:
Error:scalac: bad symbolic reference. A signature in CassandraRow.class refers to term catalyst
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling CassandraRow.class.
Sbt is kind of new to me, and I would appreciate any help in understanding what this error means (and of course, how to resolve it).
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0-alpha2" withSources() withJavadoc()
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
And my class:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
object HelloWorld { def main(args:Array[String]): Unit ={
System.setProperty("spark.cassandra.query.retry.count", "1")
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "cassandra-hostname")
.set("spark.cassandra.username", "cassandra")
.set("spark.cassandra.password", "cassandra")
val sc = new SparkContext("local", "testingCassy", conf)
> //val foo = sc.cassandraTable("keyspace name", "table name")
val rdd = sc.parallelize(1 to 100)
val sum = rdd.reduce(_+_)
println(sum) } }
You need to add spark-sql to dependencies list
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Add library dependency in your project's pom.xml file. It seems they have changed the Vector.class dependencies location in the new refactoring.