Failed to load class in Spark-submit - scala

I'm running a jar file with Spark-submit, but I keep getting this error:
Error: Failed to load class antarctic.DataQuality.
This is the command:
spark-submit --class antarctic.DataQuality --master local[*] --deploy-mode client --jars "path/to/file.jar" "arg 1" "arg 2" "arg 3" "arg 4"
This is the structure of the Scala project:
[screenshot of the Scala project structure]
The command is being run in target/scala-2.11
I also have these environment variables defined:
JAVA_HOME: C:\Program Files\Java\jdk1.8.0_251
HADOOP_HOME: C:\winutils
SPARK_HOME: C:\Users\Spark
SPARK_USER: C:\Users\Spark
build.sbt:
name := "DataQuality"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion
)
// https://github.com/awslabs/deequ
libraryDependencies += "com.amazon.deequ" % "deequ" % "1.0.2-antarctic" from "file:///C:/Users/AKAINIX ANALYTICS/Documents/Lucas/Antarctic/Bitbucket/plataforma-dataquality/Backend/deequ/target/deequ-1.0.2-antarctic.jar"
//https://github.com/spray/spray-json
//libraryDependencies += "io.spray" %% "spray-json" % "1.3.5"
//https://github.com/scopt/scopt
libraryDependencies += "com.github.scopt" %% "scopt" % "4.0.0-RC2"
//https://circe.github.io/circe/
val circeVersion = "0.7.0"
libraryDependencies ++= Seq(
"io.circe" %% "circe-core" % circeVersion,
"io.circe" %% "circe-generic" % circeVersion,
"io.circe" %% "circe-parser" % circeVersion
)
// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.19"
// https://mvnrepository.com/artifact/org.postgresql/postgresql
libraryDependencies += "org.postgresql" % "postgresql" % "42.2.5"
//https://github.com/lightbend/config
libraryDependencies += "com.typesafe" % "config" % "1.4.0"
// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.3.0"

Related

Unable to write files after scala and spark upgrade

My project was previously using Scala 2.11.12, which I have upgraded to 2.12.10, and the Spark version has been upgraded from 2.4.0 to 3.1.2. See the build.sbt file below for the rest of the project dependencies and versions:
scalaVersion := "2.12.10"
val sparkVersion = "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.4"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.8"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.8" % "test, it"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "3.1.2_1.1.0" % "test, it"
libraryDependencies += "com.github.pureconfig" %% "pureconfig" % "0.12.1"
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.pegdown" % "pegdown" % "1.1.0" % "test, it"
libraryDependencies += "com.github.scopt" %% "scopt" % "3.7.1"
libraryDependencies += "com.github.pathikrit" %% "better-files" % "3.8.0"
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.9.2"
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1" excludeAll (
ExclusionRule(organization = "org.apache.spark")
)
libraryDependencies += "net.liftweb" %% "lift-json" % "3.4.0"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.1"
The app builds fine after the upgrade, but it is unable to write files to the filesystem, which was working fine before the upgrade. I haven't made any code changes to the write logic.
The relevant portion of code that writes to the files is shown below.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val inputStream = getClass.getResourceAsStream(resourcePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
// Copy the classpath resource to outputPath; the final `true` closes both streams.
val output = fs.create(new Path(outputPath))
IOUtils.copyBytes(inputStream, output.getWrappedStream, conf, true)
I am wondering whether IOUtils is incompatible with the new Scala/Spark versions?
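Purely as a diagnostic sketch (reusing the spark and outputPath values from the snippet above), it can help to print which FileSystem implementation the path actually resolves to, since after an upgrade the write behaviour depends on fs.defaultFS and the Hadoop jars on the classpath:
import org.apache.hadoop.fs.{FileSystem, Path}

// Diagnostic only: show which FileSystem implementation the output path resolves to.
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
val defaultFs = conf.get("fs.defaultFS")
println(s"fs.defaultFS = $defaultFs")
println(s"FileSystem impl = ${fs.getClass.getName}, uri = ${fs.getUri}")
println(s"output path already exists? ${fs.exists(new Path(outputPath))}")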

Unable to start Spark application with Bahir

I am trying to run a Spark application in Scala to connect to ActiveMQ. I am using Bahir for this purpose: format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). When I use Bahir 2.2 in my build.sbt the application runs fine, but on changing it to Bahir 3.0 or Bahir 4.0 the application does not start and gives this error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
How can I fix this? Is there an alternative to Bahir which I can use in my Spark Structured Streaming application to connect to ActiveMQ topics?
EDIT:
my build.sbt
//For spark
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0" ,
"org.apache.spark" %% "spark-mllib" % "2.4.0" ,
"org.apache.spark" %% "spark-sql" % "2.4.0" ,
"org.apache.spark" %% "spark-hive" % "2.4.0" ,
"org.apache.spark" %% "spark-streaming" % "2.4.0" ,
"org.apache.spark" %% "spark-graphx" % "2.4.0",
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.4.0"
Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling both of them back to version 2.3.
Here is my build.sbt
name := "sparkTest"
version := "0.1"
scalaVersion := "2.11.11"
//For spark
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0" ,
"org.apache.spark" %% "spark-mllib" % "2.3.0" ,
"org.apache.spark" %% "spark-sql" % "2.3.0" ,
"org.apache.spark" %% "spark-hive" % "2.3.0" ,
"org.apache.spark" %% "spark-streaming" % "2.3.0" ,
"org.apache.spark" %% "spark-graphx" % "2.3.0",
// "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.3.0"

How to exclude test dependencies with sbt-assembly

I have an sbt project that I am trying to build into a jar with the sbt-assembly plugin.
build.sbt:
name := "project-name"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
"org.scalatest" %% "scalatest" % "3.0.5" % "test",
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test",
// spark-hive dependencies for DataFrameSuiteBase. https://github.com/holdenk/spark-testing-base/issues/143
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.amazonaws" % "aws-java-sdk" % "1.11.513" % "provided",
"com.amazonaws" % "aws-java-sdk-sqs" % "1.11.513" % "provided",
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.513" % "provided",
//"org.apache.hadoop" % "hadoop-aws" % "3.1.1"
"org.json" % "json" % "20180813"
)
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
test in assembly := {}
// https://github.com/holdenk/spark-testing-base
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
parallelExecution in Test := false
When I build the project with sbt assembly, the resulting jar contains /org/junit/... and /org/opentest4j/... files.
Is there any way to not include these test-related files in the final jar?
I have tried replacing the line:
"org.scalatest" %% "scalatest" % "3.0.5" % "test"
with:
"org.scalatest" %% "scalatest" % "3.0.5" % "provided"
I am also wondering how these files end up in the jar, since JUnit is not referenced in build.sbt (there are, however, JUnit tests in the project)?
Updated:
name := "project-name"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion = "2.4.0"
val excludeJUnitBinding = ExclusionRule(organization = "junit")
libraryDependencies ++= Seq(
// Provided
"org.apache.spark" %% "spark-core" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.amazonaws" % "aws-java-sdk" % "1.11.513" % "provided",
"com.amazonaws" % "aws-java-sdk-sqs" % "1.11.513" % "provided",
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.513" % "provided",
// Test
"org.scalatest" %% "scalatest" % "3.0.5" % "test",
// Necessary
"org.json" % "json" % "20180813"
)
excludeDependencies += excludeJUnitBinding
// https://stackoverflow.com/questions/25144484/sbt-assembly-deduplication-found-error
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
// https://github.com/holdenk/spark-testing-base
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
parallelExecution in Test := false
To exclude certain transitive dependencies of a dependency, use the excludeAll or exclude methods.
The exclude method should be used when a pom will be published for the project. It requires the organization and module name to exclude.
For example:
libraryDependencies +=
"log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms")
The excludeAll method is more flexible, but because it cannot be represented in a pom.xml, it should only be used when a pom doesn’t need to be generated.
For example,
libraryDependencies +=
"log4j" % "log4j" % "1.2.15" excludeAll(
ExclusionRule(organization = "com.sun.jdmk"),
ExclusionRule(organization = "com.sun.jmx"),
ExclusionRule(organization = "javax.jms")
)
In certain cases a transitive dependency should be excluded from all dependencies. This can be achieved by setting up ExclusionRules in excludeDependencies (for sbt 0.13.8 and above).
excludeDependencies ++= Seq(
ExclusionRule("commons-logging", "commons-logging")
)
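Applied to the question's case, the same mechanism can drop the test-only artifacts project-wide; a sketch assuming the offending organizations are junit and org.opentest4j (the groups behind the /org/junit and /org/opentest4j entries seen in the jar):
// Drop JUnit and opentest4j everywhere, no matter which dependency pulls them in.
excludeDependencies ++= Seq(
  ExclusionRule(organization = "junit"),
  ExclusionRule(organization = "org.opentest4j")
)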
The JUnit jar is downloaded as a transitive dependency of the dependencies below.
"org.apache.spark" %% "spark-core" % sparkVersion % "provided" //(junit)
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided"// (junit)
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test" //(org.junit)
To exclude the JUnit jar, please update your dependencies as below.
val excludeJUnitBinding = ExclusionRule(organization = "junit")
"org.scalatest" %% "scalatest" % "3.0.5" % "test",
"org.apache.spark" %% "spark-core" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test" excludeAll(excludeJUnitBinding)
Update:
Please update your build.sbt as below.
resolvers += Resolver.url("bintray-sbt-plugins",
url("https://dl.bintray.com/eed3si9n/sbt-plugins/"))(Resolver.ivyStylePatterns)
val excludeJUnitBinding = ExclusionRule(organization = "junit")
libraryDependencies ++= Seq(
// Provided
"org.apache.spark" %% "spark-core" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "provided" excludeAll(excludeJUnitBinding),
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
//"com.amazonaws" % "aws-java-sdk" % "1.11.513" % "provided",
//"com.amazonaws" % "aws-java-sdk-sqs" % "1.11.513" % "provided",
//"com.amazonaws" % "aws-java-sdk-s3" % "1.11.513" % "provided",
// Test
"org.scalatest" %% "scalatest" % "3.0.5" % "test",
// Necessary
"org.json" % "json" % "20180813"
)
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
parallelExecution in Test := false
plugin.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
I have tried this and it no longer pulls in the JUnit jar.

Spark dependencies configuration for streaming from Twitter

I am trying to run a Spark application with Twitter streaming. However, I keep running into dependency problems.
When I use org.apache.bahir spark-streaming-twitter dependency I get such an error:
module not found: org.apache.bahir#spark-streaming-twitter;2.0.0
Here is the corresponding build.sbt file:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6"
)
But when I use the older streaming dependency I get a ClassNotFoundException: org.apache.spark.Logging error.
Here is the corresponding build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6",
"org.apache.spark" %% "spark-streaming-twitter" % "1.6.3"
)
In order to run my application, I run the sbt clean and package commands.
So what dependencies should I use and how to configure them to run my application?
The Twitter backend was removed from Spark in the 2.0 release, and the Bahir version you declared doesn't match your Spark version. Also, Bahir's Twitter module already comes with a twitter4j-stream dependency (4.0.4 at the moment). Use:
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion
)
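Once the dependencies line up, a minimal sketch of opening the Twitter DStream could look like the following; the OAuth values are placeholders, and the package name is the one Bahir carried over from the pre-2.0 Spark module, so adjust if your Bahir version differs:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// twitter4j reads OAuth credentials from system properties; the values are placeholders.
System.setProperty("twitter4j.oauth.consumerKey", "<consumer-key>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumer-secret>")
System.setProperty("twitter4j.oauth.accessToken", "<access-token>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<access-token-secret>")

val conf = new SparkConf().setMaster("local[*]").setAppName("TwitterStreamDemo")
val ssc = new StreamingContext(conf, Seconds(10))

// None = build the Twitter4j auth from the system properties above.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()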

spark streaming with sbt + cassandra connector dependency issue

Folks,
I am trying to integrate Cassandra with Spark Streaming. Below is the sbt file:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.0").
exclude("org.spark-project.spark", "unused")
)
I added the line below (the error line is pointed out in the code) for Cassandra integration:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

val lines = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
// Getting errors once I add the line below to the program
lines.saveToCassandra("test", "test", SomeColumns("key", "value"))
lines.print()
Once I add the above line, I see an error in the IDE (screenshot not included in the post), and a similar error if I try to package the project from the command prompt.
For your reference, I am using the versions below:
scala - 2.11
kafka - kafka_2.11-0.8.2.1
java - 8
cassandra - datastax-community-64bit_2.2.8
Please help to resolve the issue.
As expected, it was a dependency issue, which was resolved by updating the sbt file as below:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-RC1",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.0").
exclude("org.spark-project.spark", "unused")
)
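For completeness, the connector also needs to know where Cassandra is running, which is configured on the SparkConf; a minimal sketch assuming a local single-node Cassandra at 127.0.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark.cassandra.connection.host tells the connector which node(s) to contact;
// 127.0.0.1 is an assumption for a local install.
val conf = new SparkConf()
  .setAppName("KafkaToCassandra")
  .setMaster("local[*]")
  .set("spark.cassandra.connection.host", "127.0.0.1")

val ssc = new StreamingContext(conf, Seconds(5))
// Build the Kafka direct stream and call saveToCassandra as in the question above.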