assemblyMergeStrategy causing scala.MatchError when compiling - scala

I'm new to sbt/assembly. I'm trying to resolve some dependency problems, and it seems the only way to do it is through a custom merge strategy. However, whenever I try to add a merge strategy I get a seemingly random MatchError on compiling:
[error] (*:assembly) scala.MatchError: org/apache/spark/streaming/kafka/KafkaUtilsPythonHelper$$anonfun$13.class (of class java.lang.String)
I'm showing this match error for the kafka library, but if I take out that library altogether, I get a MatchError on another library. If I take out all the libraries, I get a MatchError on my own code. None of this happens if I take out the "assemblyMergeStrategy" block. I'm clearly missing something incredibly basic, but for the life of me I can't find it, and I can't find anyone else who has this problem. I've tried the older mergeStrategy syntax, but as far as I can tell from the docs and SO, this is the proper way to write it now. Please help?
Here is my project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
And my project.sbt file:
name := "Clerk"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
  "org.apache.kafka" %% "kafka" % "0.8.2.1",
  "ch.qos.logback" % "logback-classic" % "1.1.7",
  "net.logstash.logback" % "logstash-logback-encoder" % "4.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.1.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1",
  ("org.apache.spark" %% "spark-streaming-kafka" % "1.6.1").
    exclude("org.spark-project.spark", "unused")
)
assemblyMergeStrategy in assembly := {
  case PathList("org.slf4j", "impl", xs @ _*) => MergeStrategy.first
}
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

You're missing a default case for your merge strategy pattern match. sbt-assembly applies this function to every entry going into the assembled jar, so the first path that doesn't match one of your cases throws a scala.MatchError, which is also why the error appears to move around as you remove libraries:
assemblyMergeStrategy in assembly := {
  case PathList("org.slf4j", "impl", xs @ _*) => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
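Note also that PathList matches individual path segments inside the jar (the path is split on "/"), so a rule aimed at the slf4j binding would normally be spelled out segment by segment rather than as a dotted package name. A minimal sketch of the full block under that assumption; which entries you actually need to merge depends on what conflicts in your build:
assemblyMergeStrategy in assembly := {
  // entries under org/slf4j/impl/: keep the first one encountered
  case PathList("org", "slf4j", "impl", xs @ _*) => MergeStrategy.first
  // everything else: delegate to sbt-assembly's default strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}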

Related

Log4j vulnerability while using Scala and Spark with sbt

I am working on a Scala Spark project.
I am using the dependencies below:
libraryDependencies ++=
  Seq(
    "org.apache.spark" %% "spark-core" % "2.2.0",
    "org.apache.spark" %% "spark-sql" % "2.2.0",
    "org.apache.spark" %% "spark-hive" % "2.2.0"
  ),
with scalaVersion set to:
ThisBuild / scalaVersion := "2.11.8"
and I am getting the error below:
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.logging.log4j#log4j-api;2.11.1: Resolution failed several times for dependency: org.apache.logging.log4j#log4j-api;2.11.1 {compile=[compile(*), master(*)], runtime=[runtime(*)]}::
[error] typesafe-ivy-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.typesafe.com/typesafe/ivy-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
[error] sbt-plugin-releases: unable to get resource for org.apache.logging.log4j#log4j-api;2.11.1: res=https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.apache.logging.log4j/log4j-api/2.11.1/ivys/ivy.xml: java.io.IOException: Unexpected response code for CONNECT: 403
Our security team has asked us to delete the vulnerable log4j-core jar, after which the projects that use it as a transitive dependency started failing.
Is there a way to upgrade just the log4j version without upgrading the Scala or Spark versions?
There should be a way to force the build not to fetch the older, vulnerable log4j-core jar and to use the non-vulnerable 2.17.2 version in its place.
I have tried:
dependencyOverrides += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
I have also tried the excludeAll option in sbt on the Spark dependencies, but neither solution worked for me.
I just made a few updates to my sbt project:
Updated the settings below, in build.properties and assembly.sbt respectively, to newer versions:
sbt.version=1.6.2
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")
Added the log4j dependencies at the top so that any transitive dependency now resolves to the newer version.
Given below is a sample snippet from one of my projects:
name := "project name"
version := "0.1"
scalaVersion := "2.11.8"
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.**" -> "shaded.@1").inAll
)
lazy val root = (project in file(".")).settings(
  test in assembly := {}
)
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.17.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.17.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0" % "provided"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.0"
The following should be added only if there are conflicts between dependencies:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case PathList("org", "slf4j", xs @ _*) => MergeStrategy.first
  case x => MergeStrategy.first
}
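If you want to confirm that the newer log4j version is actually the one being resolved, sbt's built-in dependency-tree tasks can show where each artifact comes from. A minimal sketch, assuming sbt 1.4 or newer (the 1.6.2 used above qualifies); substitute whatever coordinates your own tree reports:
// project/plugins.sbt: enable the dependencyTree / whatDependsOn tasks
addDependencyTreePlugin
Then, from the project root:
sbt dependencyTree
sbt "whatDependsOn org.apache.logging.log4j log4j-core 2.11.1"
The second command lists which dependencies, if any, are still pulling in the old 2.11.1 artifact.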

ScalaSpark: unable to create dataframe with scala-client dependency

I need to support Couchbase version 6 with Spark 2.3 or 2.4; the Scala version is 2.11.12. I am facing an issue while creating a DataFrame.
SBT code snippet
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.couchbase.client" %% "spark-connector" % "2.3.0",
  "com.couchbase.client" %% "scala-client" % "1.0.0-alpha.3")
Code
val spark = SparkSession
  .builder()
  .appName("Example")
  .master("local[*]")
  .config("spark.couchbase.nodes", "10.12.12.88")      // connect to the Couchbase Server node
  .config("spark.couchbase.username", "abcd")          // with the given credentials
  .config("spark.couchbase.password", "abcd")
  .config("spark.couchbase.bucket.beer-sample", "")    // open the beer-sample bucket
  .getOrCreate()
val sc = spark.sparkContext
import com.couchbase.spark.sql._
val sql = spark.sqlContext
val dataframe = sql.read.couchbase()
val result = dataframe.collect()
Exception
Caused by: java.lang.ClassNotFoundException: com.couchbase.client.core.message.CouchbaseRequest
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
Tried:
As per a suggestion, I added the dependency
"com.couchbase.client" % "core-io" % "1.7.6",
Without the scala-client dependency I am able to get the DataFrame, but with scala-client I am unable to fix it. Please suggest a solution for this problem.
I have made the changes to your build.sbt file and have added settings for the sbt-assembly plugin.
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "com.couchbase.client" %% "spark-connector" % "2.3.0")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
  case _ => MergeStrategy.first
}
You need to create a file called plugins.sbt in the directory called project and add the following line to it:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Once done, run the command sbt clean compile assembly in the project's root directory. It should build your jar.
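Once the jar is built, it can be handed to spark-submit. A hedged example; the main class is a placeholder for your own entry point, and the jar name follows the assemblyJarName pattern set above:
spark-submit \
  --class com.example.Main \
  --master local[*] \
  target/scala-2.11/<name>-<version>.jar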

scala.MatchError: org\apache\commons\io\IOCase.class (of class java.lang.String) in sbt+assembly

When I run sbt assembly, it prints an error like this:
[error] (*:assembly) scala.MatchError: org\apache\commons\io\IOCase.class (of class java.lang.String)
and these are my configurations:
1. assembly.sbt:
import AssemblyKeys._
assemblySettings
mergeStrategy in assembly := {
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
}
2. build.sbt:
import AssemblyKeys._
lazy val root = (project in file(".")).
  settings(
    name := "DmpRealtimeFlow",
    version := "1.0",
    scalaVersion := "2.11.8",
    libraryDependencies += "com.jd.ads.index" % "ad_index_dmp_common" % "0.0.4-SNAPSHOT",
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
    libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0" % "provided",
    libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0" % "provided",
    libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.8",
    libraryDependencies += "org.springframework" % "spring-beans" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-context" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-core" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-orm" % "3.1.0.RELEASE",
    libraryDependencies += "org.mybatis" % "mybatis" % "3.2.1" % "compile",
    libraryDependencies += "org.mybatis" % "mybatis-spring" % "1.2.2",
    libraryDependencies += "c3p0" % "c3p0" % "0.9.1.2"
  )
3. project tools:
sbt: 0.13.5
assembly: 0.11.2
java: 1.7
scala: 2.11.8
Any help?
The problem is likely the missing default case in the mergeStrategy in assembly block:
case x =>
  val oldStrategy = (assemblyMergeStrategy in assembly).value
  oldStrategy(x)
Also, mergeStrategy is deprecated and assemblyMergeStrategy should be used instead.
Basically,
{
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
}
is a partial function String => MergeStrategy defined for only one kind of input, i.e. for classes with the package prefix "org\springframework". However, it is applied to every class file in the project, and the first one that doesn't match that prefix (org\apache\commons\io\IOCase.class) causes the MatchError.
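Putting the two points together, a corrected assembly.sbt could look like the sketch below: keep the springframework rule and delegate every other path to sbt-assembly's default behaviour, using the non-deprecated assemblyMergeStrategy key:
assemblyMergeStrategy in assembly := {
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
  // fall back to the default strategy for everything else
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}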

Scala/Spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging

I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched on Stack Overflow, but I am still stuck on this issue. Maybe something is missing or misconfigured.
Running commands:
me#Mycomputer:~/spark-2.1.0$ bin/spark-submit --class ETLApp /home/me/src/etl/target/scala-2.10/etl-assembly-0.1.0.jar
Output:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
build.sbt:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
mainClass := Some("ETLApp")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2";
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.6.0"
libraryDependencies += "org.apache.curator" % "curator-test" % "2.6.0"
libraryDependencies += "args4j" % "args4j" % "2.32"
java -version
java version "1.8.0_101"
scala -version
2.10.5
spark version
2.1.0
Any hints welcomed. Thanks
In that case, your jar must bring all dependent classes along when it is submitted to Spark.
In Maven this would be possible with the assembly plugin and the jar-with-dependencies descriptor. With sbt, a quick Google search found this: https://github.com/sbt/sbt-assembly
you can change your build.sbt as follows:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
scalacOptions ++= Seq("-deprecation",
  "-feature",
  "-Xfuture",
  "-encoding",
  "UTF-8",
  "-unchecked",
  "-language:postfixOps")
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % Provided,
  "org.apache.spark" %% "spark-sql" % "1.5.2" % Provided,
  "org.apache.spark" %% "spark-streaming" % "1.5.2" % Provided,
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2",
  "org.apache.curator" % "curator-recipes" % "2.6.0",
  "org.apache.curator" % "curator-test" % "2.6.0",
  "args4j" % "args4j" % "2.32")
mainClass in assembly := Some("your.package.name.ETLApp")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
  case _ => MergeStrategy.first
}
Add the sbt-assembly plugin to your plugins.sbt file under the project directory in your project's root directory. Running sbt assembly in the terminal (Linux) or CMD (Windows) in the root directory of your project will download all the dependencies for you and create an uber JAR containing your code plus all non-provided dependencies.
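For example, project/plugins.sbt could contain a single line like the one below; the version shown is only an example from the sbt 0.13 era this build is using:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")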

Why does sbt assembly in Spark project fail with "Please add any Spark dependencies by supplying the sparkVersion and sparkComponents"?

I work on an sbt-managed Spark project with a spark-cloudant dependency. The code is available on GitHub (on the spark-cloudant-compile-issue branch).
I've added the following line to build.sbt:
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
And so build.sbt looks as follows:
name := "Movie Rating"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= {
  val sparkVersion = "1.6.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
    "org.apache.kafka" % "kafka-log4j-appender" % "0.9.0.0",
    "org.apache.kafka" % "kafka-clients" % "0.9.0.0",
    "org.apache.kafka" %% "kafka" % "0.9.0.0",
    "cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
  )
}
assemblyMergeStrategy in assembly := {
  case PathList("org", "apache", "spark", xs @ _*) => MergeStrategy.first
  case PathList("scala", xs @ _*) => MergeStrategy.discard
  case PathList("META-INF", "maven", "org.slf4j", xs @ _*) => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
unmanagedBase <<= baseDirectory { base => base / "lib" }
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
When I execute sbt assembly I get the following error:
java.lang.RuntimeException: Please add any Spark dependencies by
supplying the sparkVersion and sparkComponents. Please remove:
org.apache.spark:spark-core:1.6.0:provided
Probably related: https://github.com/databricks/spark-csv/issues/150
Can you try adding spIgnoreProvided := true to your build.sbt?
(This might not be the answer and I could have just posted a comment but I don't have enough reputation)
NOTE I still can't reproduce the issue, but I think it does not really matter.
java.lang.RuntimeException: Please add any Spark dependencies by supplying the sparkVersion and sparkComponents.
In your case, your build.sbt is missing an sbt resolver to find the spark-cloudant dependency. You should add the following line to build.sbt:
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
PROTIP I strongly recommend using spark-shell first and switching to sbt only once you're comfortable with the package (especially if you're new to sbt and perhaps to other libraries/dependencies too). It's too much to digest in one bite. Follow https://spark-packages.org/package/cloudant-labs/spark-cloudant.
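For reference, the spark-shell route would look something like the line below; the coordinates come from the spark-packages page linked above, and if resolution fails you can point --repositories at the same spark-packages URL shown in the resolver earlier:
spark-shell --packages cloudant-labs:spark-cloudant:1.6.4-s_2.10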