spark-submit error: ClassNotFoundException - scala

build.sbt
lazy val commonSettings = Seq(
  organization := "com.me",
  version := "0.1.0",
  scalaVersion := "2.11.0"
)

lazy val counter = (project in file("counter")).
  settings(commonSettings: _*)
counter/build.sbt
name := "counter"
mainClass := Some("Counter")
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1";
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.2";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion;
libraryDependencies += "com.github.scopt" %% "scopt" % "3.5.0";
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.1";
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test";
mergeStrategy in assembly := {
case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
case x => (mergeStrategy in assembly).value(x)
}
counter.scala:
object Counter extends SignalHandler {
  var ssc: Option[StreamingContext] = None
  def main(args: Array[String])
Run
./spark-submit --class "Counter" --master spark://10.1.204.67:6066 --deploy-mode cluster file://counter-assembly-0.1.0.jar
Error:
17/06/21 19:00:25 INFO Utils: Successfully started service 'Driver' on port 50140.
17/06/21 19:00:25 INFO WorkerWatcher: Connecting to worker spark://Worker#10.1.204.57:52476
Exception in thread "main" java.lang.ClassNotFoundException: Counter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:56)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Any idea? Thanks
UPDATE
I had the problem described in Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException. Now, I copied the jar into spark-2.1.0-bin-hadoop2.7/bin and then ran ./spark-submit --class "Counter" --master spark://10.1.204.67:6066 --deploy-mode cluster file://Counter-assembly-0.1.0.jar
The Spark cluster is version 2.1.0, but the jar was assembled with Spark 2.1.1 and Scala 2.11.0.

It appears that you've just started developing Spark applications with Scala, so purely to help you and other future Spark developers, I hope this gives you enough steps to get going with the environment.
Project Build Configuration - build.sbt
It appears that you use a multi-project sbt build and that's why you have two build.sbt files. For the purpose of fixing your issue, I'll pretend you don't use this advanced sbt setup.
It appears that you use Spark Streaming, so define it as a dependency (in libraryDependencies). You don't have to declare spark-core explicitly since spark-streaming pulls it in transitively (add spark-sql back only if your code actually uses it).
You should have build.sbt as follows:
organization := "com.me"
version := "0.1.0"
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
Building Deployable Package
With the build.sbt above, you execute sbt package to build a deployable Spark application package that you eventually spark-submit to a Spark cluster.
You don't have to use sbt assembly for that...yet. I can see that you use the Spark Cassandra Connector and other dependencies that could instead be supplied at submit time using --packages or --jars (each of which has its pros and cons).
sbt package
The size of the final target/scala-2.11/counter_2.11-0.1.0.jar is going to be much smaller than counter-assembly-0.1.0.jar you have built using sbt assembly because sbt package does not include the dependencies in a single jar file. That's expected and fine.
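When you do come back to sbt assembly (for example to bundle the Spark Cassandra Connector instead of pulling it in with --packages), a minimal sketch of the extra pieces on top of the build.sbt above, assuming sbt-assembly 0.14.x, is:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt: Spark stays provided (the cluster supplies it), the connector gets bundled
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.2"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
The fat jar then lands in target/scala-2.11/counter-assembly-0.1.0.jar, which is the name you were already passing to spark-submit.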
Submitting Spark Application - spark-submit
After sbt package you should have the deployable package in target/scala-2.11 as counter_2.11-0.1.0.jar.
You should then spark-submit with the required options, which in your case would be:
spark-submit \
  --master spark://10.1.204.67:6066 \
  target/scala-2.11/counter_2.11-0.1.0.jar
That's it.
Please note that:
--deploy-mode cluster is too advanced for the exercise (let's keep it simple and bring it back when needed)
file:// breaks things (or at least is superfluous)
--class "Counter" is taken care of by sbt package when the project has a single main class, so you can safely skip it (if you ever add more main classes, see the sketch below)

Related

Error initializing SparkContext while running job on intelliJ

I've been trying to solve this problem for more than a week. At first the error was:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
at fcr_104_106.JOB_CALCULATOR_104_106$.main(JOB_CALCULATOR_104_106.scala:25)
at fcr_104_106.JOB_CALCULATOR_104_106.main(JOB_CALCULATOR_104_106.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
After editing the configuration and adding the dependencies with "provided", the next error was this:
22/04/20 11:48:56 INFO SparkContext: Running Spark version 2.3.2
22/04/20 11:48:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/20 11:48:56 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
I also tried different versions of Hadoop and JDKs and changed environment variables, but nothing worked. I don't know what else I can try. The code should work without a doubt, but on my PC it doesn't. Here is my build.sbt (I also tried to change "provided" to "compile", but it did nothing):
val sparkVersion = "2.3.2"
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "TSC",
scalaVersion := "2.11.12",
version := "0.1.0-SNAPSHOT"
)),
name := "kantor_fcr",
assemblyOutputPath in assembly := file("lib/kantor_fcr.jar"),
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion %Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion %Provided,
"org.apache.spark" %% "spark-hive" % sparkVersion %Provided,
"org.apache.spark" %% "spark-yarn" % sparkVersion %Provided
)
)
Thanks everyone!

Building Scala JAR file Spark

My build.sbt file has the following (I'm using IntelliJ):
scalaVersion := "2.11.8"
resolvers += "MavenRepository" at "http://central.maven.org/maven2"
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
libraryDependencies ++= {
  val sparkVersion = "2.2.1"
  Seq("org.apache.spark" %% "spark-core" % sparkVersion)
}
I'm trying to build a JAR and deploy it to Spark. I have issued the following commands:
sbt compile
sbt assembly
Compilation was successful, but assembly failed with the following error message:
java.lang.RuntimeException: Please add any Spark dependencies by supplying the sparkVersion and sparkComponents. Please remove: org.apache.spark:spark-core:2.2.1
I tried to add "provided" to keep it out that time compilation itself fails as "provided" key word
doesnt include those JARs
What is the mistake am doing?
You first need to add the plugin and the dependencies for assembly, which will create the jar for you.
In project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
Add this to your build.sbt:
mainClass := Some("your.package.MainClass") // the fully qualified name of your main class, not the jar name
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
You can refer to my GitHub for how to create the jar and deploy it.
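Putting this together with the dependency from the question, a minimal sketch of the complete build.sbt could look like this (the name and main class are placeholders; Spark is marked provided because spark-submit supplies it on the cluster, so use sbt assembly plus spark-submit rather than sbt run):
name := "spark-app" // placeholder
scalaVersion := "2.11.8"
libraryDependencies ++= {
  val sparkVersion = "2.2.1"
  Seq("org.apache.spark" %% "spark-core" % sparkVersion % "provided")
}
mainClass in assembly := Some("com.example.Main") // placeholder: your fully qualified main class
// ...plus the assemblyMergeStrategy block shown above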

Getting UnsupportedClassVersionError while running Scala jar file using spark2-submit in Cloudera VM

I'm trying to run a Scala project from here involving Azure Event Hub in a Cloudera VM installed locally with a single node. I'm using CDH 5.10. I built the jar file using sbt 0.13.15, which uses OpenJDK 1.8.0. Oracle JDK 1.8 is also installed in my VM, which I think is being used by spark2 while running the jar file. The VM didn't have spark2 initially; I upgraded it using Cloudera Manager 5.11.
I'm getting the following error after the project is run:
java.lang.UnsupportedClassVersionError: com/microsoft/azure/eventhubs/EventData : Unsupported major.minor version 52.0
The error is displayed in the console after the jobs are submitted, I think, and then the code kind of hangs.
I enforced the JVM version to be 1.8 while building the jar. My complete build.sbt:
name := "AzureGeoLogProject"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.scala-lang" % "scala-library" % "2.11.8"
libraryDependencies += "com.microsoft.azure" % "spark-streaming-eventhubs_2.11" % "2.0.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.2"
libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.2.5"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
scalacOptions += "-target:jvm-1.8"
I googled the error but got nothing. I don't know how to proceed from here. Any suggestion would be greatly appreciated.
sudo alternatives --config java
When prompted, choose java (JRE) 1.8 and try again. Class file version 52.0 corresponds to Java 8, so the JRE that actually runs the job must be 1.8 as well.
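As an aside (a sketch, not taken from the answer), the build itself can be made to fail fast when it is not running on Java 8, using sbt's initialize setting:
// build.sbt: abort early if the build runs on a JDK other than 1.8
initialize := {
  val _ = initialize.value
  val current = sys.props("java.specification.version")
  require(current == "1.8", s"Java 1.8 is required for this project; found $current")
}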

Cannot run jar file created from Scala file

This is the code that I have written in Scala.
object Main extends App {
  println("Hello World from Scala!")
}
This is my build.sbt.
name := "hello-world"
version := "1.0"
scalaVersion := "2.11.5"
mainClass := Some("Main")
This is the command that I have run to create the jar file.
sbt package
My problem is that a jar file named hello-world_2.11-1.0.jar has been created at target/scala-2.11. But I cannot run the file. It is giving me an error saying NoClassDefFoundError.
What am I doing wrong?
It also says what class is not found. Most likely you aren't including scala-library.jar. You can run scala target/scala-2.11/hello-world_2.11-1.0.jar if you have Scala 2.11 available from the command line or java -cp "<path to scala-library.jar>:target/scala-2.11/hello-world_2.11-1.0.jar" Main (use ; instead of : on Windows).
The procedure described is valid up to the way the jar file is executed. From target/scala-2.11, try running it with
scala hello-world_2.11-1.0.jar
Also check whether it runs from the project root folder with sbt run.
To run a jar file (containing Scala code) with multiple main classes, use the following approach:
scala -cp "<jar-file>.jar;<other-dependencies>.jar" com.xyz.abc.TestApp
This command takes care of including scala-library.jar on the classpath and identifies TestApp as the main class if it has a def main(args: Array[String]) method. Please note that multiple jar files should be separated by a semicolon (";") on Windows and a colon (":") on Linux/macOS.
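For reference, a minimal sketch of such an entry point (the package and object names are the ones used in the command above):
package com.xyz.abc

object TestApp {
  // Recognised as a main class because of this exact signature
  def main(args: Array[String]): Unit = {
    println("Hello from TestApp: " + args.mkString(", "))
  }
}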
We can use sbt-assembly to package and run the application.
First, create or add the plugin to project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
A sample build.sbt looks like the following:
name := "coursera"
version := "0.1"
scalaVersion := "2.12.10"
mainClass := Some("Main")
val sparkVersion = "3.0.0-preview2"
val playVersion="2.8.1"
val jacksonVersion="2.10.1"
libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library" % scalaVersion.value,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.typesafe.play" %% "play-json" % playVersion,
  // https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  // https://mvnrepository.com/artifact/org.mongodb/casbah
  "org.mongodb" %% "casbah" % "3.1.1" pomOnly(),
  // https://mvnrepository.com/artifact/com.typesafe/config
  "com.typesafe" % "config" % "1.2.1"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
From the console, we can run sbt assembly and the jar file gets created under the target/scala-2.12/ path.
sbt assembly will create a fat jar. Here is an excerpt from the documentation:
sbt-assembly is a sbt plugin originally ported from codahale's assembly-sbt, which I'm guessing was inspired by Maven's assembly plugin. The goal is simple: Create a fat JAR of your project with all of its dependencies.
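Since the build above points mainClass at "Main", here is a minimal sketch of what such an entry point could look like (hypothetical and not from the original answer; shown only to make the example self-contained):
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // The master URL comes from spark-submit; add .master("local[*]") only for IDE or sbt runs
    val spark = SparkSession.builder()
      .appName("coursera")
      .getOrCreate()
    spark.range(10).show()
    spark.stop()
  }
}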

How to deploy TypeSafe Activator based application to an Apache Spark cluster?

My application uses Apache Spark for background data processing and Play Framework for the front end interface.
The best method to use the Play Framework in a Scala application is to use it with TypeSafe Activator.
Now, the problem is that I want to deploy this application to a Spark cluster.
There is good documentation on how to deploy an sbt application to a cluster using spark-submit, but what should one do with an activator-based application?
Please note that I understand how to use Spark with activator using this link, my question is specifically about deploying the application on a cluster such as EC2 etc.
The application, by the way, is written in Scala.
I'm open to suggestions such as decoupling the two applications and allowing them to interact, except I don't know how to do that, so if you suggest that, a reference would be very much appreciated.
Update:
I have tried adding dependencies to the build.sbt file in an activator project and I get the following error:
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[error] impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1
Here is how I added the dependencies in the build.sbt file:
// All the apache spark dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion % "provided" withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion % "provided" withSources(),
  "org.apache.spark" % "spark-streaming_2.10" % sparkVersion % "provided" withSources(),
  "org.apache.spark" % "spark-mllib_2.10" % sparkVersion % "provided" withSources()
)
and the resolvers:
// All the Apache Spark resolvers
resolvers ++= Seq(
  "Apache repo" at "https://repository.apache.org/content/repositories/releases",
  "Local Repo" at Path.userHome.asFile.toURI.toURL + "/.m2/repository", // Added local repository
  Resolver.mavenLocal
)
Any workaround?
Activator is just sbt with three changes:
a "new" command to create projects from templates
a "ui" command to open a tutorial UI
it tries to guess whether to open the UI if you type "activator" by itself; to force the command line, use "activator shell"
So everything you read about sbt applies. You can also use sbt with your project if you like, but it's the same thing unless you are using "new" or "ui".
The short answer to your question is probably to use the sbt-native-packager plugin and its "stage" task; the Play docs have a deployment section that describes this.
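In a Play project the Play sbt plugin already wires in sbt-native-packager, so sbt stage should work out of the box; for a plain sbt/activator project, a minimal sketch (the plugin version here is an assumption, check for the latest) is:
// project/plugins.sbt
addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.3.4")

// build.sbt
enablePlugins(JavaAppPackaging)
Running sbt stage then produces a self-contained launcher script under target/universal/stage/bin.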
It turns out that the one problem between the Play Framework and Apache Spark is a dependency conflict, which can easily be resolved by excluding the conflicting dependency (org.slf4j) from the Spark dependency list.
// All the apache spark dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion % "provided" withSources() excludeAll(
    ExclusionRule(organization = "org.slf4j")
  ),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion % "provided" withSources(),
  "org.apache.spark" % "spark-streaming_2.10" % sparkVersion % "provided" withSources(),
  "org.apache.spark" % "spark-mllib_2.10" % sparkVersion % "provided" withSources()
)
Also, to use Spark from the sbt console, one can add the following to the build.sbt file in order to have the basic Spark packages imported directly.
/// console
// define the statements initially evaluated when entering 'console', 'consoleQuick', or 'consoleProject'
// but still keep the console settings in the sbt-spark-package plugin
// If you want to use yarn-client for spark cluster mode, override the environment variable
// SPARK_MODE=yarn-client <cmd>
val sparkMode = sys.env.getOrElse("SPARK_MODE", "local[2]")
initialCommands in console :=
s"""
|import org.apache.spark.SparkConf
|import org.apache.spark.SparkContext
|import org.apache.spark.SparkContext._
|
|@transient val sc = new SparkContext(
| new SparkConf()
| .setMaster("$sparkMode")
| .setAppName("Console test"))
|implicit def sparkContext = sc
|import sc._
|
|@transient val sqlc = new org.apache.spark.sql.SQLContext(sc)
|implicit def sqlContext = sqlc
|import sqlc._
|
|def time[T](f: => T): T = {
| import System.{currentTimeMillis => now}
| val start = now
| try { f } finally { println("Elapsed: " + (now - start)/1000.0 + " s") }
|}
|
|""".stripMargin
cleanupCommands in console :=
s"""
|sc.stop()
""".stripMargin
Now, the major issue is deployment of the application. With the Play Framework, launching the application on multiple nodes of a cluster is troublesome since the HTTP request handler must live at one specific URL.
This problem can be solved by starting the Play Framework instance on the master node and pointing the URL at its IP.