Scala/Spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging

I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched on Stack Overflow but am still stuck on this issue; maybe something is missing or misconfigured.
Running commands:
me#Mycomputer:~/spark-2.1.0$ bin/spark-submit --class ETLApp /home/me/src/etl/target/scala-2.10/etl-assembly-0.1.0.jar
Output:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
build.sbt:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
mainClass := Some("ETLApp")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2";
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.6.0"
libraryDependencies += "org.apache.curator" % "curator-test" % "2.6.0"
libraryDependencies += "args4j" % "args4j" % "2.32"
java -version
java version "1.8.0_101"
scala -version
2.10.5
spark version
2.1.0
Any hints welcome. Thanks.

In that case, your jar must bring all of its dependent classes along when it is submitted to Spark.
In Maven this would be done with the assembly plugin and the jar-with-dependencies descriptor; with sbt, the equivalent is the sbt-assembly plugin: https://github.com/sbt/sbt-assembly

You can change your build.sbt as follows:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
scalacOptions ++= Seq("-deprecation",
"-feature",
"-Xfuture",
"-encoding",
"UTF-8",
"-unchecked",
"-language:postfixOps")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-sql" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming-kafka" % "1.5.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2",
"org.apache.curator" % "curator-recipes" % "2.6.0",
"org.apache.curator" % "curator-test" % "2.6.0",
"args4j" % "args4j" % "2.32")
mainClass in assembly := Some("your.package.name.ETLApp")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "reference.conf" => MergeStrategy.concat
case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
case _ => MergeStrategy.first
}
Add the sbt-assembly plugin to a plugins.sbt file under the project directory in your project's root directory. Running sbt assembly in the terminal (Linux) or CMD (Windows) from the root of your project will download all the dependencies for you and create an uber (fat) JAR.
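For reference, the plugin line in project/plugins.sbt looks like this (0.14.3 is just an example release; pick the sbt-assembly version that matches your sbt):
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
Once sbt assembly has produced the fat jar under target/scala-2.10/, pass that jar to spark-submit instead of the plain package jar.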

Related

Log4j vulnerability while using Scala and Spark with sbt

I am working on a Scala Spark project.
I am using the dependencies below:
libraryDependencies ++=
Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" ,
"org.apache.spark" %% "spark-sql" % "2.2.0" ,
"org.apache.spark" %% "spark-hive" % "2.2.0"
),
with scalaVersion set to :
ThisBuild / scalaVersion := "2.11.8"
and I am getting the error below:
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.logging.log4j#log4j-api;2.11.1: Resolution failed several times for dependency: org.apache.logging.log4j#log4j-api;2.11.1 {compile=[compile(*), master(*)], runtime=[runtime(*)]}::
[error] typesafe-ivy-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.typesafe.com/typesafe/ivy-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
[error] sbt-plugin-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
Our security team has asked us to delete the vulnerable log4j-core jar, after which the projects that use it as a transitive dependency started failing.
Is there a way to upgrade just the log4j version without upgrading the Scala or Spark versions?
There should be a way to force the build to not fetch the vulnerable older log4j-core jar and to use version 2.17.2, which is not vulnerable, in its place.
I have tried:
dependencyOverrides += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
I have also tried the excludeAll option on the Spark dependencies in sbt, but neither solution worked for me.
I made a few updates to my sbt project:
First, I updated build.properties and assembly.sbt respectively to use newer versions:
sbt.version=1.6.2
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")
Then I added the log4j dependencies at the top so that any transitive dependency now resolves to the newer version.
Below is a sample snippet from one of my projects:
name := "project name"
version := "0.1"
scalaVersion := "2.11.8"
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.**" -> "shaded.#1").inAll
)
lazy val root = (project in file(".")).settings(
test in assembly := {}
)
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.17.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0" % "provided"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.0"
The merge strategy below is needed only if there are conflicts between dependencies:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case PathList("org", "slf4j", xs#_*) => MergeStrategy.first
case x => MergeStrategy.first
}
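To check that the override actually took effect, one option (my suggestion, not part of the original answer) is to inspect the resolution report from the sbt shell:
sbt evicted
On sbt 1.4+ you can also add addDependencyTreePlugin to project/plugins.sbt and run sbt dependencyTree to see exactly which log4j artifacts end up on the classpath.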

Why does sbt assembly in Spark project fail with "Please add any Spark dependencies by supplying the sparkVersion and sparkComponents"?

I work on an sbt-managed Spark project with a spark-cloudant dependency. The code is available on GitHub (on the spark-cloudant-compile-issue branch).
I've added the following line to build.sbt:
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
And so build.sbt looks as follows:
name := "Movie Rating"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= {
val sparkVersion = "1.6.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"org.apache.kafka" % "kafka-log4j-appender" % "0.9.0.0",
"org.apache.kafka" % "kafka-clients" % "0.9.0.0",
"org.apache.kafka" %% "kafka" % "0.9.0.0",
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
)
}
assemblyMergeStrategy in assembly := {
case PathList("org", "apache", "spark", xs # _*) => MergeStrategy.first
case PathList("scala", xs # _*) => MergeStrategy.discard
case PathList("META-INF", "maven", "org.slf4j", xs # _* ) => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
unmanagedBase <<= baseDirectory { base => base / "lib" }
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
When I execute sbt assembly I get the following error:
java.lang.RuntimeException: Please add any Spark dependencies by
supplying the sparkVersion and sparkComponents. Please remove:
org.apache.spark:spark-core:1.6.0:provided
Probably related: https://github.com/databricks/spark-csv/issues/150
Can you try adding spIgnoreProvided := true to your build.sbt?
(This might not be the answer; I would have posted it as a comment, but I don't have enough reputation.)
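For context (my note, not part of the original answer): the "Please add any Spark dependencies..." message appears to come from the sbt-spark-package plugin, and spIgnoreProvided is one of its settings, so it can be dropped straight into build.sbt:
spIgnoreProvided := true
With it set, the plugin should stop rejecting Spark dependencies that are marked "provided".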
NOTE I still can't reproduce the issue, but I don't think it really matters.
java.lang.RuntimeException: Please add any Spark dependencies by supplying the sparkVersion and sparkComponents.
In your case, your build.sbt is missing an sbt resolver to find the spark-cloudant dependency. You should add the following line to build.sbt:
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
PROTIP I strongly recommend using spark-shell first, and only when you're comfortable with the package switching to sbt (especially if you're new to sbt and perhaps other libraries/dependencies too). It's too much to digest in one bite. Follow https://spark-packages.org/package/cloudant-labs/spark-cloudant.
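For example, trying the package out in the shell first would look something like this (my sketch of the usual spark-packages invocation, reusing the coordinates from the question):
bin/spark-shell --packages cloudant-labs:spark-cloudant:1.6.4-s_2.10
Once that works interactively, move the same coordinates into build.sbt together with the resolver above.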

build.sbt: how to add spark dependencies

Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter via the build.sbt file below:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.3",
"org.twitter4j" % "twitter4j-stream" % "3.0.3"
)
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
I simply took these libraryDependencies from somewhere online, so I am not sure which versions etc. to use.
Can someone please explain to me how I should fix this .sbt file? I spent a couple of hours trying to figure it out, but none of the suggestions worked. I installed Scala through Homebrew and I am on version 2.11.8.
All of my errors were about:
Modules were resolved with conflicting cross-version suffixes.
The problem is that you are mixing Scala 2.11 and 2.10 artifacts. You have:
scalaVersion := "2.11.8"
And then:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
where the _2.10 artifact is explicitly requested. You are also mixing Spark versions instead of using a single consistent version:
// spark 1.6.1
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
// spark 1.4.1
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
// spark 0.9.0-incubating
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
Here is a build.sbt that fixes both problems:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion
)
You also don't need to manually add twitter4j dependencies since they are added transitively by spark-streaming-twitter.
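As a side note (my illustration, not part of the original answer): %% appends the project's Scala binary version to the artifact name, so with scalaVersion := "2.11.8" these two lines resolve to the same artifact:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
That is why hard-coding _2.10 suffixes under a 2.11 scalaVersion triggers the conflicting cross-version suffixes error.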
It works for me:
name := "spark_local"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.5",
"org.twitter4j" % "twitter4j-stream" % "3.0.5",
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"org.apache.spark" %% "spark-mllib" % "2.0.0",
"org.apache.spark" %% "spark-streaming" % "2.0.0"
)

assemblyMergeStrategy causing scala.MatchError when compiling

I'm new to sbt/assembly. I'm trying to resolve some dependency problems, and it seems the only way to do it is through a custom merge strategy. However, whenever I try to add a merge strategy I get a seemingly random MatchError on compiling:
[error] (*:assembly) scala.MatchError: org/apache/spark/streaming/kafka/KafkaUtilsPythonHelper$$anonfun$13.class (of class java.lang.String)
I'm showing this match error for the kafka library, but if I take out that library altogether, I get a MatchError on another library. If I take out all the libraries, I get a MatchError on my own code. None of this happens if I take out the "assemblyMergeStrategy" block. I'm clearly missing something incredibly basic, but for the life of me I can't find it and I can't find anyone else that has this problem. I've tried the older mergeStrategy syntax, but as far as I can read from the docs and SO, this is the proper way to write it now. Please help?
Here is my project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
And my project.sbt file:
name := "Clerk"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
"org.apache.kafka" %% "kafka" % "0.8.2.1",
"ch.qos.logback" % "logback-classic" % "1.1.7",
"net.logstash.logback" % "logstash-logback-encoder" % "4.6",
"com.typesafe.scala-logging" %% "scala-logging" % "3.1.0",
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.1",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.1").
exclude("org.spark-project.spark", "unused")
)
assemblyMergeStrategy in assembly := {
case PathList("org.slf4j", "impl", xs # _*) => MergeStrategy.first
}
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
You're missing a default case in your merge-strategy pattern match. sbt-assembly applies this function to every entry path in the jars being merged, so any path that doesn't match one of your cases throws a scala.MatchError:
assemblyMergeStrategy in assembly := {
case PathList("org.slf4j", "impl", xs # _*) => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}

Why sbt compile doesn't copy unmanaged resources to classpath?

Could you tell me why sbt compile doesn't copy unmanaged resources to the classpath, while sbt package does? As a result I can't start debugging unless I invoke package manually :(
I'm using SBT 0.12.1
Below is my build.sbt.
import AssemblyKeys._ // put this at the top of the file
net.virtualvoid.sbt.graph.Plugin.graphSettings
assemblySettings
organization := "com.zzz"
version := "0.1"
scalaVersion := "2.10.2"
scalacOptions := Seq("-unchecked", "-language:reflectiveCalls,postfixOps,implicitConversions", "-deprecation", "-feature", "-encoding", "utf8")
unmanagedResourceDirectories in Compile <++= baseDirectory { base =>
Seq( base / "src/main/webapp" )
}
jarName in assembly := "zzz.jar"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case "rootdoc.txt" => MergeStrategy.discard
case x => old(x)
}
}
mainClass in assembly := Some("com.zzz.Boot")
name := "zzz"
// disable using the Scala version in output paths and artifacts
crossPaths := false
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
artifact.name + "." + artifact.extension
}
resolvers ++= Seq(
"spray repo" at "http://repo.spray.io/"
)
libraryDependencies ++= {
val sprayVersion = "1.2-M8"
val akkaVersion = "2.2.0-RC1"
Seq(
"io.spray" % "spray-servlet" % sprayVersion withSources(),
"io.spray" % "spray-can" % sprayVersion withSources(),
"io.spray" % "spray-routing" % sprayVersion withSources(),
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"org.eclipse.jetty" % "jetty-webapp" % "8.1.7.v20120910" % "container",
"org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016" % "container" artifacts Artifact("javax.servlet", "jar", "jar"),
"net.sourceforge.htmlcleaner" % "htmlcleaner" % "2.2",
"org.apache.httpcomponents" % "httpclient" % "4.2.3",
"org.apache.commons" % "commons-lang3" % "3.1",
"org.mongodb" %% "casbah" % "2.6.0",
"com.romix.scala" % "scala-kryo-serialization" % "0.2-SNAPSHOT",
"org.codehaus.jettison" % "jettison" % "1.3.3",
"com.osinka.subset" %% "subset" % "2.0.1",
"io.spray" %% "spray-json" % "1.2.5" intransitive()
)
}
seq(Revolver.settings: _*)
seq(webSettings: _*)
seq(Twirl.settings: _*)
The job of compile is to compile sources, so it won't typically do anything related to processing resources. However, the resources need to be in the class directory for run, package, test, console, and anything else that uses the fullClasspath. This is done by fullClasspath combining exportedProducts, which are the classes and resources generated by the current project, and dependencyClasspath, which are the classpath entries from dependencies.
The appropriate solution depends on what needs the resources. From the command line, run exported-products to do a compile as well as copy-resources. Programmatically, you will typically want to depend on fullClasspath or exportedProducts.
As a side note, you can typically find out what tasks do what using the inspect command.
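To illustrate the programmatic option, here is a minimal sketch in current sbt syntax (not from the original answer; the task name prepareRun is made up, and the question itself is on sbt 0.12, where task names are hyphenated on the command line):
lazy val prepareRun = taskKey[Unit]("Compile and copy resources to the class directory")
prepareRun := {
  (Compile / compile).value       // compile the sources
  (Compile / copyResources).value // copy managed and unmanaged resources next to the classes
}
Depending on Compile / copyResources (or on the full Compile / fullClasspath) in whatever task launches your debug session ensures the webapp resources are in place without running package.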