Error initializing SparkContext while running job on intelliJ - scala

I've been trying to solve this problem for more than a week. At first the error was:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
at fcr_104_106.JOB_CALCULATOR_104_106$.main(JOB_CALCULATOR_104_106.scala:25)
at fcr_104_106.JOB_CALCULATOR_104_106.main(JOB_CALCULATOR_104_106.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
After editing the run configuration and adding the dependencies as "provided", the next error was this:
22/04/20 11:48:56 INFO SparkContext: Running Spark version 2.3.2
22/04/20 11:48:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/20 11:48:56 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
I also tried different versions of Hadoop and different JDKs, and changed environment variables, but nothing worked. I don't know what else I can try. The code should work without a doubt, but on my PC it doesn't. Also, here is my build.sbt (I also tried changing "provided" to "compile", but it did nothing).
val sparkVersion = "2.3.2"
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "TSC",
scalaVersion := "2.11.12",
version := "0.1.0-SNAPSHOT"
)),
name := "kantor_fcr",
assemblyOutputPath in assembly := file("lib/kantor_fcr.jar"),
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion %Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion %Provided,
"org.apache.spark" %% "spark-hive" % sparkVersion %Provided,
"org.apache.spark" %% "spark-yarn" % sparkVersion %Provided
)
)
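For reference, the "A master URL must be set" error usually means the SparkSession/SparkContext is created without a master when the job is launched straight from the IDE rather than through spark-submit. A minimal sketch of a local run (the object and app name here are illustrative, not taken from the original project):

import org.apache.spark.sql.SparkSession

object LocalRunExample {
  def main(args: Array[String]): Unit = {
    // Set the master explicitly for IDE/local runs; on a cluster,
    // spark-submit normally supplies it via --master instead.
    val spark = SparkSession.builder()
      .appName("kantor_fcr-local")
      .master("local[*]")
      .getOrCreate()

    spark.range(10).show()
    spark.stop()
  }
}

For this to compile and run from IntelliJ, the Spark dependencies above also need to be on the compile classpath (for example by enabling "Include dependencies with 'Provided' scope" in the run configuration, or by keeping them at the default compile scope locally).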
Thanks everyone!

Related

build.spark: add spark dependencies

I was trying to add the spark-core and spark-sql dependencies in the build.sbt file:
name := "spark Test App"
version := "0.1"
organization := "sura.organization"
scalaVersion := "2.11.8"
val sparkVersion := "2.3.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion
)
When I ran sbt package, I got an error:
/build.sbt]:7: '=' expected.
I was not able to find the error, please help me.
The problem is in this line
val sparkVersion := "2.3.1"
It should be
val sparkVersion = "2.3.1"
Also, you should mark these dependencies as Provided, because you only need them for compilation and local execution (e.g. tests). But in production you will deploy your jar to a Spark cluster, which (obviously) already includes them.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
)
PS: Make sure you use the same Spark and Scala versions as your deploy cluster.
BTW, if you need to include other dependencies (e.g. the Mongo Spark connector), you should take a look at sbt-assembly; be aware that you will need to exclude the Scala standard library from the assembly jar.
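For illustration, a minimal sbt-assembly setup along those lines might look like this (a sketch assuming sbt-assembly 0.14.x; the connector version is only an example, and the setting syntax differs in newer plugin versions):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.3.1"

// Leave the Scala standard library out of the fat jar, since the cluster already provides it.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Running sbt assembly then produces a single jar under target/scala-2.11/ that bundles the non-provided dependencies.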

not able to import spark mllib in IntelliJ

I am not able to import the Spark MLlib libraries in IntelliJ for a Spark Scala project. I am getting a resolution exception.
Below is my build.sbt:
name := "ML_Spark"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.1" % "runtime"
I tried to copy/paste the same build.sbt file you provided and I got the following error:
[error] [/Users/pc/testsbt/build.sbt]:3: ';' expected but string literal found.
Actually, the build.sbt is invalid:
(screenshot: IntelliJ error)
Putting the version and the Scala version on different lines solved the problem for me:
name := "ML_Spark"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.1" % "runtime"
I am not sure that this is the problem you're facing (can you please share the exception you had?); it might be a problem with the repositories you specified under the .sbt folder in your home directory.
I have run into the same problem before. To solve it, I just used the compile scope for mllib instead of the runtime one. Here is my configuration:
name := "SparkDemo"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.0"
I had a similar issue, but I found a workaround. Namely, you have to add the spark-mllib jar file to your project manually. Indeed, even though my build.sbt file was
name := "example_project"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql" % "3.0.0",
  "org.apache.spark" %% "spark-mllib" % "3.0.0" % "runtime"
)
I wasn't able to import the spark library with
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.ml._
The solution that worked for me was to add the jar file manually. Specifically,
Download the jar file of the ml library you need (e.g. for spark 3 use https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.12/3.0.0 ).
Follow this link to add the jar file to your intelliJ project: Correct way to add external jars (lib/*.jar) to an IntelliJ IDEA project
Also add the mllib-local jar (https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local).
If, for some reason, you recompile from the build.sbt again, you will need to re-import the jar files again.
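An alternative to adding the jars by hand, in line with the earlier answer above, is to drop the "runtime" qualifier so spark-mllib ends up on the compile classpath; a sketch for the same Spark 3.0.0 / Scala 2.12 setup:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql" % "3.0.0",
  // Default (compile) scope, so the compiler and the IDE can resolve org.apache.spark.ml._
  "org.apache.spark" %% "spark-mllib" % "3.0.0"
)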

spark-submit error: ClassNotFoundException

build.sbt
lazy val commonSettings = Seq(
  organization := "com.me",
  version := "0.1.0",
  scalaVersion := "2.11.0"
)

lazy val counter = (project in file("counter")).
  settings(commonSettings: _*)
counter/build.sbt
name := "counter"
mainClass := Some("Counter")
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1";
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.2";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion;
libraryDependencies += "com.github.scopt" %% "scopt" % "3.5.0";
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.1";
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test";
mergeStrategy in assembly := {
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
  case x => (mergeStrategy in assembly).value(x)
}
counter.scala:
object Counter extends SignalHandler
{
var ssc : Option[StreamingContext] = None;
def main( args: Array[String])
Run
./spark-submit --class "Counter" --master spark://10.1.204.67:6066 --deploy-mode cluster file://counter-assembly-0.1.0.jar
Error:
17/06/21 19:00:25 INFO Utils: Successfully started service 'Driver' on port 50140.
17/06/21 19:00:25 INFO WorkerWatcher: Connecting to worker spark://Worker#10.1.204.57:52476
Exception in thread "main" java.lang.ClassNotFoundException: Counter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:56)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Any idea? Thanks
UPDATE
I had the problem described here: Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException. Now I have copied the jar into spark-2.1.0-bin-hadoop2.7/bin and run ./spark-submit --class "Counter" --master spark://10.1.204.67:6066 --deploy-mode cluster file://Counter-assembly-0.1.0.jar
The Spark cluster is version 2.1.0, but the jar was assembled against Spark 2.1.1 and Scala 2.11.0.
It appears that you've just started developing Spark applications with Scala, so purely to help you and other future Spark developers, I hope to give you enough steps to get going with the environment.
Project Build Configuration - build.sbt
It appears that you use a multi-project sbt build and that's why you have two build.sbt files. For the purpose of fixing your issue, I'll pretend you don't use this advanced sbt setup.
It appears that you use Spark Streaming, so define it as a dependency (in libraryDependencies). You don't have to define the other Spark dependencies (like spark-core or spark-sql).
You should have build.sbt as follows:
organization := "com.me"
version := "0.1.0"
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
Building Deployable Package
With the build.sbt above, you execute sbt package to build a deployable Spark application package that you eventually spark-submit to a Spark cluster.
You don't have to use sbt assembly for that...yet. I can see that you use the Spark Cassandra Connector and other dependencies that could also be defined using --packages or --jars instead (which by themselves have their pros and cons); a sketch of that is at the end of this answer.
sbt package
The size of the final target/scala-2.11/counter_2.11-0.1.0.jar is going to be much smaller than counter-assembly-0.1.0.jar you have built using sbt assembly because sbt package does not include the dependencies in a single jar file. That's expected and fine.
Submitting Spark Application - spark-submit
After sbt package you should have the deployable package in target/scala-2.11 as counter_2.11-0.1.0.jar.
You should just spark-submit with required options which in your case would be:
spark-submit \
  --master spark://10.1.204.67:6066 \
  target/scala-2.11/counter_2.11-0.1.0.jar
That's it.
Please note that:
--deploy-mode cluster is too advanced for the exercise (let's keep it simple and bring it back when needed)
file:// makes things broken (or at least is superfluous)
--class "Counter" is taken care of by sbt package when you have a single Scala application in a project where you execute it. You can safely skip it.

How do you link both spark and kafka in sbt?

I need to use symbols from the Kafka project in Spark (to use the DefaultDecoder rather than the StringDecoder). Since these symbols are in Kafka, I need to link both Kafka and Spark in my sbt project. Here is a reduced sbt file that isolates my exact problem:
name := """spark-kafka"""
version := "1.0"
scalaVersion := "2.10.4"
lazy val root = (project in file("."))
libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka_2.10" % "0.8.2.0",
  "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
  "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
)
If I try and build this with sbt compile, I get this error:
> compile
[info] Updating {file:/home/rick/go/src/defend7/sparksprint/tools/consumer/}root...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[error] impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1
[error] Total time: 8 s, completed Feb 21, 2015 1:37:05 PM
This is the same error I get in my larger, less isolated sbt project, which is why I think this smaller sbt file isolates exactly the problem I am facing.
I have tried to understand what 'data' sbt is talking about in 'impossible to get artifacts when data has not been loaded' and have also tried some of the common remedies (such as explicitly including slf4j-api 1.6.1 in my libraryDependencies) but this has gotten me nowhere so far.
I'm really stuck and would be very grateful for any help. Thanks!
It looks like a conflict resolution problem somewhere deep in Ivy. It might be fixed by manually excluding the slf4j dependency from Kafka and explicitly adding a dependency on the latest version:
libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka_2.10" % "0.8.2.0" excludeAll(
    ExclusionRule(organization = "org.slf4j")
  ),
  "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
  "org.slf4j" % "slf4j-api" % "1.7.10",
  "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
)
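If excluding the whole org.slf4j organization feels too broad, a narrower variant (a sketch, using sbt's per-artifact exclude) drops only the conflicting artifact:

libraryDependencies += "org.apache.kafka" % "kafka_2.10" % "0.8.2.0" exclude("org.slf4j", "slf4j-api")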

Mllib dependency error

I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program:
Object Mllib is not a member of package org.apache.spark
Then I realized that I have to add MLlib as a dependency, as follows:
version := "1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)
But here I got an error that says:
unresolved dependency spark-core_2.10.4;1.1.1 : not found
so I had to modify it to
"org.apache.spark" % "spark-core_2.10" % "1.1.1",
But there is still an error that says:
unresolved dependency spark-mllib;1.1.1 : not found
Does anyone know how to add the MLlib dependency in the .sbt file?
As #lmm pointed out, you can instead include the libraries as:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.0",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.0"
)
In sbt, %% appends the Scala version to the artifact name, and you are building with Scala version 2.10.4, whereas the Spark artifacts are published against 2.10 in general.
It should be noted that if you are going to make an assembly jar to deploy your application, you may wish to mark spark-core as provided, e.g.:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.0"
)
Since the spark-core package will be on the classpath on the executors anyway.
Here is another way to add the dependency to your build.sbt file if you're using the Databricks sbt-spark-package plugin:
sparkComponents ++= Seq("sql", "hive", "mllib")
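For completeness, a minimal sketch of how that plugin is usually wired up (the plugin version and repository URL are assumptions; check the sbt-spark-package documentation for current values):

// project/plugins.sbt
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")

// build.sbt
sparkVersion := "1.1.0"
sparkComponents ++= Seq("sql", "hive", "mllib")

The plugin then adds the matching org.apache.spark artifacts as Provided dependencies for you.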