scala.MatchError when creating fat jar using sbt assembly

I am trying to create a jar file for my project, using the sbt assembly command to generate one. But I get an error when it starts merging files:
scala.MatchError:
spray\http\parser\ProtocolParameterRules$$anonfun$DeltaSeconds$1.class
(of class java.lang.String)
My build.sbt looks like this:
lazy val commonSettings = Seq(
  name := "SampleSpray",
  version := "1.0",
  scalaVersion := "2.11.7",
  organization := "com.test"
)

mainClass in assembly := Some("com.example.Boot")

lazy val root = (project in file(".")).
  settings(commonSettings: _*).
  settings(
    name := "test",
    resolvers += "spray repo" at "http://repo.spray.io",
    libraryDependencies ++= {
      val akkaV = "2.3.9"
      val sprayV = "1.3.3"
      Seq(
        "io.spray" %% "spray-can" % sprayV,
        "io.spray" %% "spray-routing" % sprayV,
        "io.spray" %% "spray-json" % "1.3.2",
        "io.spray" %% "spray-testkit" % sprayV % "test",
        "com.typesafe.akka" %% "akka-actor" % akkaV,
        "com.typesafe.akka" %% "akka-testkit" % akkaV % "test",
        "org.specs2" %% "specs2-core" % "2.3.11" % "test",
        "com.sksamuel.elastic4s" %% "elastic4s-core" % "2.1.0",
        "com.sksamuel.elastic4s" %% "elastic4s-jackson" % "2.1.0",
        "net.liftweb" %% "lift-json" % "2.6+"
      )
    }
  )

assemblyOption in assembly := (assemblyOption in assembly).value.copy(cacheUnzip = false)

assemblyMergeStrategy in assembly := {
  case "BaseDateTime.class" => MergeStrategy.first
}
I don't know why this error occurs.

The setting assemblyMergeStrategy in assembly has the type String => MergeStrategy.
In your sbt file you are using the partial function
{
  case "BaseDateTime.class" => MergeStrategy.first
}
which is syntactic sugar for
(s: String) => {
  s match {
    case "BaseDateTime.class" => MergeStrategy.first
  }
}
This representation shows that the given function does not match all possible strings exhaustively. In your case sbt-assembly tried to merge the file spray\http\parser\ProtocolParameterRules$$anonfun$DeltaSeconds$1.class into the fat jar, but could not find a matching merge strategy, hence the MatchError. You also need a "default" case:
(s: String) => {
  s match {
    case "BaseDateTime.class" => MergeStrategy.first
    case x =>
      val oldStrategy = (assemblyMergeStrategy in assembly).value
      oldStrategy(x)
  }
}
Or written as partial function:
{
  case "BaseDateTime.class" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
I also ran into the same issue when sbt-assembly's assembly task failed to create a fat jar due to a name conflict between elasticsearch and its transitive joda-time dependency. Elasticsearch redefines the class org.joda.time.base.BaseDateTime, which is already implemented in the joda-time library. I followed your approach to tell sbt-assembly how to resolve this conflict, using the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
  case "org/joda/time/base/BaseDateTime.class" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

Related

Can't shade jars with SBT's shade plugin

I get many "Deduplicate found..." errors when building the project with SBT:
[error] Deduplicate found different file contents in the following:
[error] Jar name = netty-all-4.1.68.Final.jar, jar org = io.netty, entry target = io/netty/handler/ssl/SslProvider.class
[error] Jar name = netty-handler-4.1.50.Final.jar, jar org = io.netty, entry target = io/netty/handler/ssl/SslProvider.class
...
For now I am considering the option of shading all libraries (as here):
libraryDependencies ++= Seq(
  "com.rometools" % "rome" % "1.18.0",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.9.5", // log
  "ch.qos.logback" % "logback-classic" % "1.4.5", // log
  "com.lihaoyi" %% "upickle" % "1.6.0", // file-io
  "net.liftweb" %% "lift-json" % "3.5.0", // json
  "org.apache.spark" %% "spark-sql" % "3.2.2", // spark
  "org.apache.spark" %% "spark-core" % "3.2.2" % "provided", // spark
  "org.postgresql" % "postgresql" % "42.5.1", // spark + postgresql
)
So I added the following shade rules:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.lihaoyi.**" -> "crdaa.@1")
    .inLibrary("com.lihaoyi" %% "upickle" % "1.6.0")
    .inProject,
  ShadeRule.rename("ch.qos.logback.**" -> "crdbb.@1")
    .inLibrary("ch.qos.logback" % "logback-classic" % "1.4.5")
    .inProject,
  ShadeRule.rename("com.typesafe.**" -> "crdcc.@1")
    .inLibrary("com.typesafe.scala-logging" %% "scala-logging" % "3.9.5")
    .inProject,
  ShadeRule.rename("org.apache.spark.spark-sql.**" -> "crddd.@1")
    .inLibrary("org.apache.spark" %% "spark-sql" % "3.2.2")
    .inProject,
  ShadeRule.rename("org.apache.spark.spark-core.**" -> "crdee.@1")
    .inLibrary("org.apache.spark" %% "spark-core" % "3.2.2")
    .inProject,
  ShadeRule.rename("com.rometools.**" -> "crdff.@1")
    .inLibrary("com.rometools" % "rome" % "1.18.0")
    .inProject,
  ShadeRule.rename("org.postgresql.postgresql.**" -> "crdgg.@1")
    .inLibrary("org.postgresql" % "postgresql" % "42.5.1")
    .inProject,
  ShadeRule.rename("net.liftweb.**" -> "crdhh.@1")
    .inLibrary("net.liftweb" %% "lift-json" % "3.5.0")
    .inProject,
)
But after reloading SBT, when I run assembly I get the same duplicate errors.
What can be the problem here?
PS:
ThisBuild / scalaVersion := "2.13.10"
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.0")
Update
Finally I ditched the renaming in favor of unmanagedJars plus not including the Spark dependencies (most of the errors were caused by them) by marking them as provided.
After that, only the Deduplicate errors with module-info.class remained, but their solution (changing the merge strategy) is described in the sbt-assembly docs.
That is, I downloaded Spark separately, copied its jars into the ./jarlib directory (!!! not into the ./lib directory), and changed the following in the build config:
libraryDependencies ++= Seq(
  //...
  "org.apache.spark" %% "spark-sql" % "3.2.3" % "provided",
  "org.apache.spark" %% "spark-core" % "3.2.3" % "provided",
)

unmanagedJars in Compile += file("./jarlib")

ThisBuild / assemblyMergeStrategy := {
  case PathList("module-info.class") => MergeStrategy.discard
  case x if x.endsWith("/module-info.class") => MergeStrategy.discard
  case x =>
    val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
    oldStrategy(x)
}
The Spark jars ended up being included in the final jar.
Update 2
As noted in the comments, unmanagedJars is useless in that case, so I removed the unmanagedJars line from build.sbt.
Note that the Spark jars which aren't included in the final jar file have to be on the classpath when you start the jar.
In my case I copied the Spark jars plus the final jar to the folder ./app and started the jar with:
java -cp "./app/*" main.Main
...where main.Main is the main class.
Something like this (put it in your build.sbt) is typically how you remove the deduplication errors that come up when your libraries have overlapping dependencies of their own:
assemblyMergeStrategy in assembly := {
  case PathList("javax", "activation", _*) => MergeStrategy.first
  case PathList("com", "sun", _*) => MergeStrategy.first
  case "META-INF/io.netty.versions.properties" => MergeStrategy.first
  case "META-INF/mime.types" => MergeStrategy.first
  case "META-INF/mailcap.default" => MergeStrategy.first
  case "META-INF/mimetypes.default" => MergeStrategy.first
  case d if d.endsWith(".jar:module-info.class") => MergeStrategy.first
  case d if d.endsWith("module-info.class") => MergeStrategy.first
  case d if d.endsWith("/MatchersBinder.class") => MergeStrategy.discard
  case d if d.endsWith("/ArgumentsProcessor.class") => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

No TypeTag available for String

I'm trying to run my fat jar using scala -classpath "target/scala-2.13/Capstone-assembly-0.1.0-SNAPSHOT.jar" src/main/scala/project/Main.scala, but I get the error No TypeTag available for String, caused by the .toString call in:
val generateUUID: UserDefinedFunction = udf((str: String) => nameUUIDFromBytes(str.getBytes).toString)
When I run from the IDE everything works, but not from the jar.
My build.sbt:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.13.8"

lazy val root = (project in file("."))
  .settings(
    name := "Capstone"
  )

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.0",
  "org.apache.spark" %% "spark-sql" % "3.2.0",
  "org.scalatest" %% "scalatest" % "3.2.12" % "test",
  "org.rogach" %% "scallop" % "4.1.0"
)

compileOrder := CompileOrder.JavaThenScala

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
If I delete .toString I get the error Schema for type java.util.UUID is not supported.
I tried to change String to java.lang.String or scala.Predef.String, but this didn't work.

scala.MatchError: org\apache\commons\io\IOCase.class (of class java.lang.String) in sbt+assembly

When I run sbt assembly, it prints an error like this:
[error] (*:assembly) scala.MatchError: org\apache\commons\io\IOCase.class (of class java.lang.String)
and these are my configurations:
1. assembly.sbt:
import AssemblyKeys._

assemblySettings

mergeStrategy in assembly := {
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
}
2. build.sbt:
import AssemblyKeys._

lazy val root = (project in file(".")).
  settings(
    name := "DmpRealtimeFlow",
    version := "1.0",
    scalaVersion := "2.11.8",
    libraryDependencies += "com.jd.ads.index" % "ad_index_dmp_common" % "0.0.4-SNAPSHOT",
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
    libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0" % "provided",
    libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0" % "provided",
    libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.8",
    libraryDependencies += "org.springframework" % "spring-beans" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-context" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-core" % "3.1.0.RELEASE",
    libraryDependencies += "org.springframework" % "spring-orm" % "3.1.0.RELEASE",
    libraryDependencies += "org.mybatis" % "mybatis" % "3.2.1" % "compile",
    libraryDependencies += "org.mybatis" % "mybatis-spring" % "1.2.2",
    libraryDependencies += "c3p0" % "c3p0" % "0.9.1.2"
  )
3. project tools:
sbt: 0.13.5
assembly: 0.11.2
java: 1.7
scala: 2.11.8
Any help?
The problem may be the missing default case in the mergeStrategy in assembly block:
case x =>
  val oldStrategy = (assemblyMergeStrategy in assembly).value
  oldStrategy(x)
Also, mergeStrategy is deprecated and assemblyMergeStrategy should be used instead.
Basically, the
{
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
}
is a partial function String => MergeStrategy defined for only one kind of input, i.e. for classes with the package prefix "org\springframework". However, it is applied to all class files in the project, and the first file that doesn't match that prefix (org\apache\commons\io\IOCase.class) causes the MatchError.
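Putting this together, a complete merge block might look like the following sketch (assuming an sbt-assembly version recent enough to provide assemblyMergeStrategy; the Spring case is kept from the question, and the default case delegates to the plugin's previous strategy):
assemblyMergeStrategy in assembly := {
  // keep the last copy of duplicated Spring classes, as in the question
  case PathList("org", "springframework", xs @ _*) => MergeStrategy.last
  // fall back to the plugin's default strategy for everything else
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}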

Writing a file to Amazon S3

I am trying the following:
import awscala._, s3._
implicit val s3 = S3()
val bucket = s3.createBucket("acme-datascience-lab")
bucket.put("sample.txt", new java.io.File("sample.txt"))
I get the following error:
Exception in thread "main" java.lang.NoSuchFieldError: EU_CENTRAL_1
at awscala.Region0$.<init>(Region0.scala:27)
at awscala.Region0$.<clinit>(Region0.scala)
at awscala.package$.<init>(package.scala:3)
at awscala.package$.<clinit>(package.scala)
at awscala.s3.S3$.apply$default$2(S3.scala:18)
at com.acme.spark.FlightDelays.HistoricalFlightDelayOutput$.main(HistoricalFlightDelayOutput.scala:164)
at com.acme.spark.FlightDelays.HistoricalFlightDelayOutput.main(HistoricalFlightDelayOutput.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It occurs on this line of code:
implicit val s3 = S3()
Here are the contents of my build.sbt file:
import AssemblyKeys._

assemblySettings

name := "acme-get-flight-delays"

version := "0.0.1"

scalaVersion := "2.10.5"

// additional libraries
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.0",
  "org.apache.spark" %% "spark-hive" % "1.6.0",
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2",
  "net.liftweb" %% "lift-json" % "2.5+",
  "org.apache.hadoop" % "hadoop-client" % "2.6.0",
  "org.apache.hadoop" % "hadoop-aws" % "2.6.0",
  "com.amazonaws" % "aws-java-sdk" % "1.0.002",
  "com.github.seratch" %% "awscala" % "0.5.+"
)

resolvers ++= Seq(
  "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven",
  "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
  "Spray Repository" at "http://repo.spray.cc/",
  "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Twitter4J Repository" at "http://twitter4j.org/maven2/",
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Twitter Maven Repo" at "http://maven.twttr.com/",
  "scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
  "Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
  Resolver.sonatypeRepo("public")
)

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.startsWith("META-INF") => MergeStrategy.discard
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
    case "about.html" => MergeStrategy.rename
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  }
}

// Configure JAR used with the assembly plug-in
jarName in assembly := "acme-get-flight-delays.jar"

// A special option to exclude Scala itself from our assembly JAR, since Spark
// already bundles in Scala.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
You can use the AWS SDK for Java directly, or a wrapper like AWScala that lets you do something like:
import awscala._, s3._
implicit val s3 = S3()
val bucket = s3.bucket("your-bucket").get
bucket.put("sample.txt", new java.io.File("sample.txt"))
Add the following to build.sbt:
libraryDependencies += "com.github.seratch" %% "awscala" % "0.3.+"
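For comparison, a minimal sketch of the direct AWS SDK for Java (v1) route; it assumes a reasonably recent aws-java-sdk on the classpath and credentials available through the default provider chain:
import com.amazonaws.services.s3.AmazonS3Client
import java.io.File

// Picks up credentials from the default chain (env vars, ~/.aws/credentials, instance profile, ...)
val s3Client = new AmazonS3Client()
s3Client.putObject("acme-datascience-lab", "sample.txt", new File("sample.txt"))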
The following code will do the job:
implicit val s3 = S3()
s3.setRegion(com.amazonaws.regions.Region.getRegion(com.amazonaws.regions.Regions.US_EAST_1))
val bucket = s3.bucket("acme-datascience-lab")
bucket.get.put("sample.txt", new java.io.File("sample.txt"))

SBT Assembly - Deduplicate

I got the following SBT files:
.
-- root
-- plugins.sbt
-- build.sbt
With plugins.sbt containing the following:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
And build.sbt containing the following:
import sbt.Keys._

resolvers in ThisBuild ++= Seq(
  "Apache Development Snapshot Repository" at "https://repository.apache.org/content/repositories/snapshots/",
  Resolver.sonatypeRepo("public")
)

name := "flink-experiment"

lazy val commonSettings = Seq(
  organization := "my.organisation",
  version := "0.1.0-SNAPSHOT"
)

val flinkVersion = "1.1.0"
val sparkVersion = "2.0.0"
val kafkaVersion = "0.8.2.1"

val hadoopDependencies = Seq(
  "org.apache.avro" % "avro" % "1.7.7" % "provided",
  "org.apache.avro" % "avro-mapred" % "1.7.7" % "provided"
)

val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-connector-kafka-0.8" % flinkVersion exclude("org.apache.kafka", "kafka_${scala.binary.version}")
)

val sparkDependencies = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion exclude("org.apache.kafka", "kafka_${scala.binary.version}")
)

val kafkaDependencies = Seq(
  "org.apache.kafka" %% "kafka" % "0.8.2.1"
)

val toolDependencies = Seq(
  "com.github.scopt" %% "scopt" % "3.5.0"
)

val testDependencies = Seq(
  "org.scalactic" %% "scalactic" % "2.2.6",
  "org.scalatest" %% "scalatest" % "2.2.6" % "test"
)

lazy val root = (project in file(".")).
  settings(commonSettings: _*).
  settings(
    libraryDependencies ++= hadoopDependencies,
    libraryDependencies ++= flinkDependencies,
    libraryDependencies ++= sparkDependencies,
    libraryDependencies ++= kafkaDependencies,
    libraryDependencies ++= toolDependencies,
    libraryDependencies ++= testDependencies
  ).
  enablePlugins(AssemblyPlugin)

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

mainClass in assembly := Some("my.organization.experiment.Experiment")

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
Now sbt clean assembly sadly gives the following exception:
[error] (root/*:assembly) deduplicate: different file contents found in the following:
[error] /home/kevin/.ivy2/cache/org.apache.spark/spark-streaming-kafka-0-8_2.10/jars/spark-streaming-kafka-0-8_2.10-2.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/kevin/.ivy2/cache/org.apache.spark/spark-tags_2.10/jars/spark-tags_2.10-2.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/kevin/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
How can I fix this?
https://github.com/sbt/sbt-assembly#excluding-jars-and-files
You can define an assemblyMergeStrategy and probably discard any file that you listed, as they are all in the 'unused' package.
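For example, a minimal sketch that discards the conflicting stub class (the path is taken from the error output above):
assemblyMergeStrategy in assembly := {
  // all three jars ship the same stub class, so it is safe to drop it
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.discard
  // delegate everything else to the previous strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}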
You can override the default strategy for conflicts:
val defaultMergeStrategy: String => MergeStrategy = {
  case x if Assembly.isConfigFile(x) =>
    MergeStrategy.concat
  case PathList(ps @ _*) if Assembly.isReadme(ps.last) || Assembly.isLicenseFile(ps.last) =>
    MergeStrategy.rename
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case "plexus" :: xs =>
        MergeStrategy.discard
      case "services" :: xs =>
        MergeStrategy.filterDistinctLines
      case ("spring.schemas" :: Nil) | ("spring.handlers" :: Nil) =>
        MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.deduplicate
    }
  case _ => MergeStrategy.deduplicate
}
As you can see, the default assembly strategy is MergeStrategy.deduplicate; you can add a new case that matches UnusedStubClass and uses MergeStrategy.first, as sketched below.
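A minimal sketch of that override, assuming the default case should fall back to the previous strategy:
assemblyMergeStrategy in assembly := {
  // keep the first copy of the (identical) stub class
  case PathList(ps @ _*) if ps.last == "UnusedStubClass.class" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}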