How to shade packages inside a fat jar dependency - scala

I have an SBT project that depends on
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.2"
which brings in a recent version of google-api-services-storage.
I also have a dependency on Sparkling Water, a fat jar that appears to bundle a different version of google-api-services-storage alongside the rest of that library's classes.
Now whenever I try to use gcs-connector it fails with this error:
Caused by: java.lang.NoSuchMethodError: com.google.api.services.storage.Storage$Objects$List.setIncludeTrailingDelimiter(Ljava/lang/Boolean;)Lcom/google/api/services/storage/Storage$Objects$List;
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createListRequest(GoogleCloudStorageImpl.java:1401)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listStorageObjectsAndPrefixes(GoogleCloudStorageImpl.java:1272)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:1443)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.lambda$getFileInfoInternal$8(GoogleCloudStorageFileSystem.java:1059)
It seems the class loader is picking up the classes from inside that fat jar.
In this scenario, how can I shade the classes inside the fat jar? I tried the following, without success (see https://github.com/h2oai/sparkling-water/issues/2643):
settings = Seq(
  ...,
  libraryDependencies ++= Seq(...),
  assemblyShadeRules in assembly ++= Seq(
    ShadeRule.rename("com.google.api.services.storage.**" -> "shade.@1")
      .inLibrary("the fat jar group" % "the fat jar library" % "the fat jar version")
  )
)
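To see whether a rule was applied at all, the assembled jar can be inspected for the relocated prefix (a quick sanity check; the assembly path below is just an example, adjust it to your build):
jar tf target/scala-2.12/myapp-assembly.jar | grep shade/
If nothing under shade/ shows up, the rename never matched any entries in the fat jar.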
Update: after shading the Google dependency in the original fat jar, I ran into another dependency issue:
Caused by: java.lang.NoSuchMethodError: io.opencensus.trace.Span.addMessageEvent(Lio/opencensus/trace/MessageEvent;)V
at com.google.api.client.http.OpenCensusUtils.recordMessageEvent(OpenCensusUtils.java:222)
at com.google.api.client.http.OpenCensusUtils.recordSentMessageEvent(OpenCensusUtils.java:190)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1010)
at com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:304)
at com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:324)
at com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:170)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:470)
at com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:201)
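Presumably the fat jar also bundles an incompatible copy of io.opencensus, so the same kind of rule would have to cover it as well. An untested sketch following the same pattern (the coordinates are still the fat-jar placeholders from above, not a confirmed fix):
assemblyShadeRules in assembly ++= Seq(
  // relocate the opencensus classes bundled inside the fat jar so the
  // version required by google-api-client wins on the classpath
  ShadeRule.rename("io.opencensus.**" -> "shade.opencensus.@1")
    .inLibrary("the fat jar group" % "the fat jar library" % "the fat jar version")
)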

Related

Library Dependencies bundled together to create jar file is not present in the jar file using sbt-assembly

I have multiple library dependencies in my build.sbt file. I am creating the final jar file with sbt-assembly so that it includes all the dependent libraries.
But when I run jar tvf jarname.jar, I am not able to find all the libraries there.
I need this in order to bundle all the libraries into one jar, provide it to spark-shell with spark-shell --jars jarpath, and then import the libraries there.
I am doing this because it is not possible for me to pull the packages into spark-shell directly
with the spark-shell --packages command.
Expected:
Adding the jar file to spark-shell and then importing all the libraries there, which should be present in the jar.
Found the solution:
Some of my dependencies carried the "provided" scope, and so they were not getting included in the fat jar:
libraryDependencies += "org.apache.flink" %% "flink-table-planner" % flinkVersion % "provided"
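So the fix is simply to drop the scope on anything that must ship in the fat jar (a minimal sketch, reusing the same flink-table-planner dependency):
// without "provided", sbt-assembly bundles the library into the fat jar
libraryDependencies += "org.apache.flink" %% "flink-table-planner" % flinkVersion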

Including a Spark Package JAR file in a SBT generated fat JAR

The spark-daria project is uploaded to Spark Packages and I'm accessing spark-daria code in another SBT project with the sbt-spark-package plugin.
I can include spark-daria in the fat JAR file generated by sbt assembly with the following code in the build.sbt file.
spDependencies += "mrpowers/spark-daria:0.3.0"
val requiredJars = List("spark-daria-0.3.0.jar")
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  cp filter { f =>
    !requiredJars.contains(f.data.getName)
  }
}
This code feels like a hack. Is there a better way to include spark-daria in the fat JAR file?
N.B. I want to build a semi-fat JAR file here. I want spark-daria to be included in the JAR file, but I don't want all of Spark in the JAR file!
The README for version 0.2.6 states the following:
In any case where you really can't specify Spark dependencies using sparkComponents (e.g. you have exclusion rules) and configure them as provided (e.g. standalone jar for a demo), you may use spIgnoreProvided := true to properly use the assembly plugin.
You should then use this flag in your build definition and set your Spark dependencies as provided, as I do with spark-sql:2.2.0 in the following example:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
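Put together, the relevant build.sbt lines would look something like this (a sketch based on the README quote above; spIgnoreProvided comes from the sbt-spark-package plugin):
// let sbt-assembly work properly while Spark stays out of the semi-fat JAR
spIgnoreProvided := true
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"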
Please note that by setting this, your IDE may no longer have the necessary dependency references to compile and run your code locally, which means you would have to add the necessary JARs to the classpath by hand. I do this often in IntelliJ: I keep a Spark distribution on my machine and add its jars directory to the IntelliJ project definition (this question may help you with that, should you need it).

Shading over third party classes

I'm currently facing a problem with deploying an uber JAR to a Spark Streaming application, where conflicting versions of the same JARs cause Spark to throw run-time exceptions. The library in question is Typesafe Config.
After attempting many things, my solution was to resort to shading the provided dependency so it won't clash with the JAR provided by Spark at run-time.
Hence, I went to the documentation for sbt-assembly, and under shading I saw the following example:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.commons.io.**" -> "shadeio.@1")
    .inLibrary("commons-io" % "commons-io" % "2.4", ...).inProject
)
Attempting to shade over com.typesafe.config, I tried applying the following solution to my build.sbt:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.@1").inProject
)
I assumed it was supposed to rename any reference to Typesafe Config in my project. But this doesn't work: it matches multiple classes in my project and causes them to be removed from the uber JAR. I see this when trying to run sbt assembly:
Fully-qualified classname does not match jar entry:
jar entry: ***/Identifier.class
class name: **/Identifier.class
Omitting ***/OtherIdentifier.class.
Fully-qualified classname does not match jar entry:
jar entry: ***\SparkBaseJobRunner$$anonfun$1.class
class name: ***/SparkBaseJobRunner$$anonfun$1.class
I also attempted using:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.@1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
)
This did finish the assembly of the uber JAR, but didn't have the desired run-time effect.
I'm not sure I fully comprehend the effect shading has on my build process with sbt.
How can I shade over references to com.typesafe.config in my project so when I invoke the library at run-time Spark will load my shaded library and avoid the clash caused by versioning?
I'm running sbt-assembly v0.14.1
Turns out this was a bug in sbt-assembly where shading was completely broken on Windows. It caused class files to be removed from the uber JAR, and tests to fail because those classes were unavailable.
I created a pull request to fix this. Starting with version 0.14.3 of sbt-assembly, the shading feature works properly. All you need to do is update to the relevant version in plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
In order to shade a specific JAR in your project, you do the following:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "my_conf.@1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
    .inProject
)
This will rename the com.typesafe.config classes so they are packaged under my_conf. You can then verify this with jar -tf on your assembly (irrelevant entries omitted for brevity):
***> jar -tf myassembly.jar
my_conf/
my_conf/impl/
my_conf/parser/
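As a quick runtime check that the relocated copy is the one being loaded (a sketch; my_conf.Config is simply com.typesafe.config.Config after the rename above):
// resolves against the renamed classes packaged in the uber JAR
val shaded = Class.forName("my_conf.Config")
println(shaded.getName) // prints my_conf.Config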
Edit
I wrote a blog post describing the issue and the process that led to it for anyone interested in a more in-depth explanation.

Adding an SBT plugin which does not specify an SBT version in its URL

Specs2 does not define the SBT version in its URL:
https://oss.sonatype.org/content/repositories/releases/org/specs2/specs2_2.9.2/1.12.3/
This is causing problems for SBT when trying to resolve it...
[warn] ==== sonatype-snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/org/specs2/specs2_2.9.2_0.12/1.12.3/specs2-1.12.3.pom
[warn] ==== sonatype-releases: tried
[warn] https://oss.sonatype.org/content/repositories/releases/org/specs2/specs2_2.9.2_0.12/1.12.3/specs2-1.12.3.pom
How do I get SBT to resolve the correct URL?
specs2 is not an sbt plugin; it's a Scala library for writing executable software specifications, so it should be declared as a library dependency rather than with addSbtPlugin.
There are two levels of sbt projects: your own projects (for now, call them "apps") and the build project definition itself (call it "the build").
library dependencies
When apps use other libraries during compilation or test, these are called library dependencies (or "deps" for short). Deps are declared in build.sbt (or *.sbt, or project/*.scala) as follows:
libraryDependencies += "org.specs2" %% "specs2" % "2.2" % "test"
By using %%, you tell sbt to append the Scala binary version suffix (such as _2.10 on Maven) to the artifact name, matching how artifacts published with sbt are named. This is because (unlike Java) not all Scala releases are binary compatible with each other: Scala 2.9.1 and 2.9.2 are not compatible, so they get the distinct suffixes _2.9.1 and _2.9.2, while all Scala 2.10.x releases are compatible within the series, so they share _2.10.
Unfortunately, different Scala versions require different versions of Specs2, so you might have to do something more like:
libraryDependencies <+= scalaVersion({
  case "2.9.2" => "org.specs2" %% "specs2" % "1.12.3" % "test"
  case x if x startsWith "2.10" => "org.specs2" %% "specs2" % "2.2" % "test"
})
For more details, check out the Getting Started guide.
sbt plugins
There is a special type of library that the build can depend on to extend its abilities: sbt plugins. These are declared in project/plugins.sbt (or project/*.sbt) as follows:
addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.2.5")
Since sbt plugins depend on both the sbt version and the Scala version that the build uses, both pieces of information are encoded into the published artifact path. With Ivy they are expressed as folder names, while with Maven they are expressed as suffixes:
http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-buildinfo/scala_2.9.2/sbt_0.12/0.2.5/
https://oss.sonatype.org/content/repositories/public/org/scalaxb/sbt-scalaxb_2.10_0.13/1.1.2/

Including Hyperic Sigar library within jar while using sbt assembly for Scala project

I'm building a Scala project with sbt and creating a fat jar with the sbt-assembly plugin. I'm able to add unmanaged jars (such as the Sigar jar) by adding the following to build.sbt.
unmanagedJars in Compile +=
  file("lib/hyperic-sigar-1.6.4/sigar-bin/lib/sigar.jar")
However, when I try running this, I get the following error because the *.so libraries are not included in the jar.
no libsigar-amd64-linux.so in java.library.path
org.hyperic.sigar.SigarException: no libsigar-amd64-linux.so in java.library.path
at org.hyperic.sigar.Sigar.loadLibrary(Sigar.java:172)
at org.hyperic.sigar.Sigar.<clinit>(Sigar.java:100)
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.hyperic.sigar.ptql.SigarProcessQuery.create(Ljava/lang/String;)V
at org.hyperic.sigar.ptql.SigarProcessQuery.create(Native Method)
at org.hyperic.sigar.ptql.ProcessQueryFactory.getQuery(ProcessQueryFactory.java:66)
at org.hyperic.sigar.ptql.ProcessFinder.findSingleProcess(ProcessFinder.java:44)
The libraries I want to include are in lib/hyperic-sigar-1.6.4/sigar-bin/lib/*.so, and they need to end up in a specific directory on the classpath within the jar. The only way I know of to do a mapping like this is:
resourceDirectory in Compile <<=
  baseDirectory{ _ / "lib/hyperic-sigar-1.6.4/sigar-bin/lib" }
This causes the *.so libraries to be added to the root of the jar, but not to a specific directory. How can I specify a resource mapping from lib/hyperic-sigar-1.6.4/sigar-bin/lib/*.so to a directory on the classpath in my jar? What is the terminology for what I'm trying to do?
Assuming that Sigar is indeed capable of loading native libs from the classpath, this should do the trick:
libraryDependencies += "org.fusesource" % "sigar" % "1.6.4" classifier("native") classifier("")
Otherwise, you need to unpack them from the jar manually and provide a proper java.library.path.
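If Sigar cannot load from the classpath, the .so files can at least be placed under a chosen directory inside the jar with an explicit resource mapping, which is what the question was after (a sketch; the directory name sigar-native is an arbitrary choice here):
mappings in (Compile, packageBin) ++= {
  val libDir = baseDirectory.value / "lib" / "hyperic-sigar-1.6.4" / "sigar-bin" / "lib"
  // map each native library to sigar-native/<file name> inside the jar
  (libDir * "*.so").get.map(f => f -> s"sigar-native/${f.getName}")
}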