I am trying to unit test my Spark Streaming code with spark-testing-base, and I am having trouble running their sample code.
Here is the code snippet I copied:
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite
class SampleTest extends FunSuite with SharedSparkContext {
  test("test initializing spark context") {
    val list = List(1, 2, 3, 4)
    val rdd = sc.parallelize(list)
    assert(rdd.count === list.length)
  }
}
And here is the stack trace:
18/10/19 02:08:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:718)
at com.myproject.SampleTest$$anonfun$1.apply(DStreamTransformSpec.scala:11)
at com.myproject.analytic.SampleTest$$anonfun$1.apply(DStreamTransformSpec.scala:9)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
I am including my build.sbt, not sure if this helps.
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubsub" % "2.2.0"
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.6.7"
libraryDependencies += "com.google.cloud" % "google-cloud-datastore" % "1.40.0"
// For test
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.10.0" % "test"
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.#1").inAll,
ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.#1").inAll
)
Also, are there any other tools recommended for testing with DStreams?
You might have forgotten this setting? I don't see it in your build.sbt:
parallelExecution in Test := false
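For context, the spark-testing-base README recommends running tests forked, serially, and with extra memory. Combined with the settings already in this build.sbt, the test-related section would look roughly like this (a sketch, not a verified fix for the error above):

// run Spark tests in a forked JVM, one suite at a time, with extra memory
fork in Test := true
parallelExecution in Test := false
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

Serial execution matters because several suites creating SparkContexts in the same JVM at once can fail with initialization errors like the one above. For DStream code specifically, spark-testing-base also ships a StreamingSuiteBase trait; see the project README for its exact API.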
Related
I am working on a Scala Spark project, using the dependencies below:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "org.apache.spark" %% "spark-hive" % "2.2.0"
),
with scalaVersion set to:
ThisBuild / scalaVersion := "2.11.8"
and I am getting the error below:
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.logging.log4j#log4j-api;2.11.1: Resolution failed several times for dependency: org.apache.logging.log4j#log4j-api;2.11.1 {compile=[compile(*), master(*)], runtime=[runtime(*)]}::
[error] typesafe-ivy-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.typesafe.com/typesafe/ivy-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
[error] sbt-plugin-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
Our security team has asked us to delete the vulnerable log4j-core jar, after which the projects that use it as a transitive dependency started failing.
Is there a way to upgrade just the log4j version without upgrading the Scala or Spark versions?
There should be a way to force the build to skip the older, vulnerable log4j-core jar and use version 2.17.2, which is not vulnerable, in its place.
I have tried:
dependencyOverrides += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
I have also tried the excludeAll option on the Spark dependencies in sbt (sketched below), but neither solution worked for me.
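For reference, the excludeAll attempt usually looks like the snippet below; spark-sql is only an example artifact here, so apply it to whichever dependency actually pulls in the old log4j jars:

// example only: exclude transitive log4j 2.x artifacts from one dependency
libraryDependencies += ("org.apache.spark" %% "spark-sql" % "2.2.0")
  .excludeAll(ExclusionRule(organization = "org.apache.logging.log4j"))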
I just made a few updates:
I added the settings below to my sbt project.
First, I updated build.properties and assembly.sbt (both under the project/ directory) to use newer versions:
sbt.version=1.6.2
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")
Then I added the log4j dependencies at the top so that any transitive dependency now resolves to the newer version.
Below is a sample snippet from one of my projects:
name := "project name"
version := "0.1"
scalaVersion := "2.11.8"
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.**" -> "shaded.#1").inAll
)
lazy val root = (project in file(".")).settings(
test in assembly := {}
)
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.17.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0" % "provided"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.0"
The merge strategy below should be needed only if there are conflicts between dependencies:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case PathList("org", "slf4j", xs @ _*) => MergeStrategy.first
  case x => MergeStrategy.first
}
I'm using Scala 2.11 and Spark 2.4.3 for our AWS Glue jobs. Recently, I got the error message below in our build pipeline.
Cause: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.8.11 requires Jackson Databind version >= 2.8.0 and < 2.9.0
I've tried:
Changing the jackson-module-scala version from 2.8.11 to 2.12.0. This fixed the build pipeline, but I get a different error message in the Glue job:
Exception in User Class: java.lang.ExceptionInInitializerError.
Changing the jackson-module-scala version from 2.8.11 to 2.13.1. I refactored the code to get the unit tests passing and fix the build pipeline, but in the Glue job I get the error message below:
Exception in User Class: java.lang.VerifyError : Bad return type
Adding dependencyOverrides with jackson-module-scala version 2.12.0. The build pipeline works, but the Glue job still fails.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
Any ideas what I'm doing wrong or how I can fix this?
See below for the build.sbt file:
name in ThisBuild := "etl"
organization in ThisBuild := "XXXXXXXXXX"
scalaVersion in ThisBuild := "2.11.12"
version in ThisBuild := "0.2"
addCommandAlias("sanity", ";clean ;compile ;test ;scalafmtAll ;scalastyle ;assembly")
lazy val framework = project.settings(settings, libraryDependencies ++= commonDependencies)
lazy val scripts = project
.settings(settings, libraryDependencies ++= commonDependencies)
.dependsOn(framework % "compile->compile;test->test")
lazy val settings = Seq(
test in assembly := {},
scalacOptions ++= Seq(),
resolvers ++= Seq(
Resolver.sonatypeRepo("releases"),
"aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"
),
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "io.netty.versions.properties", xs @ _*) => MergeStrategy.singleOrError
  case "module-info.class" => MergeStrategy.discard
  case x: String if x.contains("UnusedStubClass") => MergeStrategy.first
  case y =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(y)
}
)
lazy val commonDependencies = Seq(
"com.amazonaws" % "AWSGlueETL" % "1.0.0" % Provided,
"com.databricks" %% "spark-xml" % "0.8.0",
"org.scalatest" %% "scalatest" % "3.1.1" % Test,
"org.scalamock" %% "scalamock" % "4.4.0" % Test,
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.11",
"io.netty" % "netty-all" % "4.1.17.Final" % Test,
"org.apache.spark" %% "spark-avro" % "2.4.3",
"com.crealytics" %% "spark-excel" % "0.13.6"
)
fork in ThisBuild := true
parallelExecution in ThisBuild := true
testForkedParallel in ThisBuild := false
logBuffered in ThisBuild := false
testOptions in ThisBuild += Tests.Argument(TestFrameworks.ScalaTest, "-oDFG")
javaOptions ++= Seq(
"-XX:+CMSClassUnloadingEnabled",
"-XX:MaxMetaspaceSize=512M",
"-XX:MetaspaceSize=256M",
"-Xms512M",
"-Xmx2G",
"-XX:MaxPermSize=2048M"
)
The function below uses jackson-module-scala 2.8.11 (up to 2.12.0):
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
object JsonUtils {
  private val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)

  def fromJson[T](json: String)(implicit m: Manifest[T]): T = {
    mapper.readValue[T](json)
  }
}
You need to override it, like this:
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % versions("jackson"),
  "com.fasterxml.jackson.core" % "jackson-core" % versions("jackson")
)
I've resolved my issue.
In the function, I changed the library import from
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
to
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
In build.sbt, I added the following lines:
libraryDependencies += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.12.0"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.6.7.1"
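For completeness, the resulting JsonUtils with the new (non-experimental) import would look like this; the logic is unchanged from the original, only the import path differs:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper // moved out of .experimental in jackson-module-scala 2.12.x

object JsonUtils {
  private val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)

  def fromJson[T](json: String)(implicit m: Manifest[T]): T =
    mapper.readValue[T](json)
}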
The Spark documentation at https://spark.apache.org/docs/latest/cloud-integration.html suggests using spark-hadoop-cloud to read/write from S3.
However, there is no Apache Spark-published artifact for spark-hadoop-cloud. When trying to use the Cloudera-published module instead, the following exception occurs:
Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
This seems like a classpath conflict, so it appears it is not possible to use spark-hadoop-cloud to read from S3 with the vanilla Apache Spark 3.1.2 jars.
version := "0.0.1"
scalaVersion := "2.12.12"
lazy val app = (project in file("app")).settings(
assemblyPackageScala / assembleArtifact := false,
assembly / assemblyJarName := "uber.jar",
assembly / mainClass := Some("com.example.Main"),
// more settings here ...
)
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.1.7.2.7.0-184"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local")
      //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
      //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
      .appName("spark session").getOrCreate

    val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
    val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")

    jsonDF.show()
    csvDF.show()
  }
}
To read and write to S3 from Spark you only need these 2 dependencies:
"org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion
Make sure hadoopVersion matches the Hadoop version used by your worker nodes, and make sure your worker nodes also have these dependencies available (a full sketch of the dependency block is shown below). The rest of your code looks correct.
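Concretely, the sbt dependency block might look like the sketch below. The Hadoop version shown is an assumption; match it to whatever your Spark distribution and worker nodes actually run, and keep the Spark artifacts provided as in the question's build:

// assumption: Hadoop 3.2.x on the cluster -- adjust hadoopVersion to match your workers
val hadoopVersion = "3.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"     % "3.1.1" % "provided",
  "org.apache.hadoop" %  "hadoop-aws"    % hadoopVersion,
  "org.apache.hadoop" %  "hadoop-common" % hadoopVersion
)

hadoop-aws pulls in a matching aws-java-sdk-bundle transitively, so pinning the SDK by hand is usually unnecessary.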
I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched Stack Overflow but am still stuck on this issue; maybe something is missing or misconfigured.
Running the command:
me@Mycomputer:~/spark-2.1.0$ bin/spark-submit --class ETLApp /home/me/src/etl/target/scala-2.10/etl-assembly-0.1.0.jar
Output:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
build.sbt:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
mainClass := Some("ETLApp")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2";
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.6.0"
libraryDependencies += "org.apache.curator" % "curator-test" % "2.6.0"
libraryDependencies += "args4j" % "args4j" % "2.32"
java -version
java version "1.8.0_101"
scala -version
2.10.5
spark version
2.1.0
Any hints welcome. Thanks.
In that case, your jar must bring all dependent classes along when being submitted to Spark.
In Maven this would be possible with the assembly plugin and the jar-with-dependencies descriptor. With sbt, a quick Google search finds this: https://github.com/sbt/sbt-assembly
You can change your build.sbt as follows:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
scalacOptions ++= Seq("-deprecation",
"-feature",
"-Xfuture",
"-encoding",
"UTF-8",
"-unchecked",
"-language:postfixOps")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-sql" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming-kafka" % "1.5.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2",
"org.apache.curator" % "curator-recipes" % "2.6.0",
"org.apache.curator" % "curator-test" % "2.6.0",
"args4j" % "args4j" % "2.32")
mainClass in assembly := Some("your.package.name.ETLApp")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
  case _ => MergeStrategy.first
}
Add the sbt-assembly plugin to your plugins.sbt file under the project directory in your project's root. Running sbt assembly in the terminal (Linux) or CMD (Windows) from the root directory of your project will download all the dependencies for you and create an uber JAR.
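For reference, the plugin declaration is a single line in project/plugins.sbt; the version below is only an example, so pick a release that matches your sbt version:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

After that, sbt assembly writes the fat jar under target/scala-2.10/, which is the path already used in the spark-submit command above.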
I work on an sbt-managed Spark project with a spark-cloudant dependency. The code is available on GitHub (on the spark-cloudant-compile-issue branch).
I've added the following line to build.sbt:
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
And so build.sbt looks as follows:
name := "Movie Rating"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= {
val sparkVersion = "1.6.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"org.apache.kafka" % "kafka-log4j-appender" % "0.9.0.0",
"org.apache.kafka" % "kafka-clients" % "0.9.0.0",
"org.apache.kafka" %% "kafka" % "0.9.0.0",
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
)
}
assemblyMergeStrategy in assembly := {
  case PathList("org", "apache", "spark", xs @ _*) => MergeStrategy.first
  case PathList("scala", xs @ _*) => MergeStrategy.discard
  case PathList("META-INF", "maven", "org.slf4j", xs @ _*) => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
unmanagedBase <<= baseDirectory { base => base / "lib" }
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
When I execute sbt assembly I get the following error:
java.lang.RuntimeException: Please add any Spark dependencies by
supplying the sparkVersion and sparkComponents. Please remove:
org.apache.spark:spark-core:1.6.0:provided
Probably related: https://github.com/databricks/spark-csv/issues/150
Can you try adding spIgnoreProvided := true to your build.sbt?
(This might not be the answer and I could have just posted a comment but I don't have enough reputation)
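If that key comes from the sbt-spark-package plugin (the plugin whose error message mentions sparkVersion and sparkComponents), the related settings would look roughly like this; a sketch only, not tested against this project:

// sketch: sbt-spark-package style settings (the components listed are examples)
sparkVersion := "1.6.0"
sparkComponents ++= Seq("core", "sql", "streaming", "mllib")
spIgnoreProvided := true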
NOTE: I still can't reproduce the issue, but I think it does not really matter.
java.lang.RuntimeException: Please add any Spark dependencies by supplying the sparkVersion and sparkComponents.
In your case, your build.sbt is missing an sbt resolver to find the spark-cloudant dependency. You should add the following line to build.sbt:
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
PROTIP: I strongly recommend using spark-shell first, and only switching to sbt once you're comfortable with the package (especially if you're new to sbt and perhaps other libraries/dependencies too). It's too much to digest in one bite. Follow https://spark-packages.org/package/cloudant-labs/spark-cloudant.