sbt assembly package dependencies in multiple artifacts - scala

I am trying to generate several jars from a single project with sbt-assembly, each containing some of the dependencies.
So far I have found only this Q&A that is close to what I am looking for. However, I don't need separate configs: when I run assembly, I just want to generate all the different jars.
To be more concrete, I want to generate:
One jar with my code and some general dependencies
One jar with the Hadoop dependencies <- this is the problem, as I don't know how to tell sbt to generate another jar containing only those dependencies.
One jar with the Scala library

Without going deep into complex sbt configurations, you could try another approach. Since the Hadoop dependencies are standard, you can mark them as provided in your build to exclude them from the fat jar:
"org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
For Scala, the library jar is also standard and can be downloaded separately by your "user". To remove it from the fat jar, use the following setting (sbt-assembly 0.13.0):
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
The user of your fat jar is then asked to provide both the Scala and Hadoop libraries on the classpath.
For example, when using Spark this is the correct approach, as both libraries are provided by the Spark runtime environment. The same logic applies to the Hadoop MapReduce environment.
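Putting the two pieces together, a minimal build.sbt sketch might look like this (versions are the ones quoted above; adjust them for your project):
// Hadoop is supplied by the cluster, so keep it out of the fat jar
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
// leave the Scala library out of the fat jar as well
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)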

Related

How to configure SBT to produce separate jars for dependencies and application code?

I'm using SBT as a build tool for Spark projects. I'm able to create a fat jar of my dependencies using the sbt-assembly plugin.
However, this produces a ~120M jar, mostly dependencies, which I need to keep uploading to S3 to run my code -- this takes 3-5 minutes. Not a lot of time, but fairly annoying.
What would improve things a lot would be to have SBT produce a jar of the dependencies (which changes rarely), and a small jar of my application code which I should be able to upload in a few seconds.
Is this possible? I'm pretty new to SBT.
sbt-assembly supports splitting the application jar from the dependency jar out of the box. To produce just the dependency jar, execute
sbt assemblyPackageDependency
To produce a jar with just your application code, define assemblyOption as follows
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false, includeDependency = false)
and execute sbt assembly as usual.
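If you upload both jars regularly, it can help to give them distinct, predictable names. A small sketch using sbt-assembly's assemblyJarName key (the naming pattern here is just an example):
// hypothetical naming scheme for the two artifacts
assemblyJarName in assembly := s"${name.value}-app-${version.value}.jar"
assemblyJarName in assemblyPackageDependency := s"${name.value}-deps-${version.value}.jar"
The dependency jar then only needs re-uploading when the dependencies change, while the small application jar can be uploaded on every code change.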

SBT plugin to forbid use of auto-imported dependencies

I have this line of code in my build.sbt file:
libraryDependencies ++= Seq("com.foo" %% "lib" % "1.2.3")
Imagine that this library depends on the "com.bar.lib" lib. Now in my code I can import com.bar.lib._ and it'll work. But I don't want this to compile, so maybe there is an SBT plugin out there just for this purpose?
One of the libraries I'm using depends on an old cats version. I spent a really long time trying to understand why the mapN method didn't work... I had simply never imported a newer version of cats in the subproject.
SBT offers the intransitive and exclude features to deal with issues like this, as @earldouglas points out. See: https://www.scala-sbt.org/1.x/docs/Library-Management.html
You replied:
I tried to do so, but intransitive() doesn't import transitive dependencies (so I need to import all of them by hand to make it compile).
Yes, that is what it is for.
What I want is something that will warn me about using libraries not directly imported in SBT file.
So you want transitive dependencies on your classpath, but you want the compiler to reject uses of transitive classes in your project code while allowing them in library code?
That is not a sensible approach: at runtime, these transitive dependencies will be on the classpath. The JVM classpath does not distinguish between different kinds of dependencies; such distinction only exists in SBT at build time.
You would be much better served by either
including a newer version of the cats library, overriding the transitive dependency, or
excluding the transitively included cats library, if it is broken (both options are sketched below).
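Hedged sketches of both options (the cats coordinates and version here are examples, not taken from your build):
// option 1: force a newer cats, overriding what "com.foo" pulls in transitively
dependencyOverrides += "org.typelevel" %% "cats-core" % "2.0.0"
// option 2: drop the broken transitive cats entirely
libraryDependencies += "com.foo" %% "lib" % "1.2.3" exclude("org.typelevel", "cats-core")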
However, I think you probably could achieve what you want by setting different dependencies at different build stages:
at the Compile stage, include the dependency with intransitive(). Your code should compile against your direct dependencies, but fail if you reference any transitive dependencies
at the Runtime stage, include the dependency with its transitive deps
the SBT code might look like this (untested):
(libraryDependencies in Compile) ++= Seq("com.foo" %% "lib" % "1.2.3" intransitive())
(libraryDependencies in Runtime) ++= Seq("com.foo" %% "lib" % "1.2.3")
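With this split, compilation sees only your direct dependencies and should fail on references to transitive classes, while the runtime classpath still carries the full dependency graph for run.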

Including a Spark Package JAR file in a SBT generated fat JAR

The spark-daria project is uploaded to Spark Packages and I'm accessing spark-daria code in another SBT project with the sbt-spark-package plugin.
I can include spark-daria in the fat JAR file generated by sbt assembly with the following code in the build.sbt file.
spDependencies += "mrpowers/spark-daria:0.3.0"
val requiredJars = List("spark-daria-0.3.0.jar")
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  cp filter { f =>
    !requiredJars.contains(f.data.getName)
  }
}
This code feels like a hack. Is there a better way to include spark-daria in the fat JAR file?
N.B. I want to build a semi-fat JAR file here. I want spark-daria to be included in the JAR file, but I don't want all of Spark in the JAR file!
The README for version 0.2.6 states the following:
In any case where you really can't specify Spark dependencies using sparkComponents (e.g. you have exclusion rules) and configure them as provided (e.g. standalone jar for a demo), you may use spIgnoreProvided := true to properly use the assembly plugin.
You should then use this flag in your build definition and set your Spark dependencies as provided, as I do with spark-sql 2.2.0 in the following example:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
Please note that by setting this, your IDE may no longer have the necessary dependency references to compile and run your code locally, which means you may have to add the necessary JARs to the classpath by hand. I do this often in IntelliJ: I keep a Spark distribution on my machine and add its jars directory to the IntelliJ project definition (this question may help you with that, should you need it).
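Putting the answer together, the relevant build.sbt lines might look like this (a sketch assembled from the snippets above):
spDependencies += "mrpowers/spark-daria:0.3.0"
spIgnoreProvided := true
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"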

Shading over third party classes

I'm currently facing a problem with deploying an uber JAR to a Spark Streaming application, where the same JAR is present in different versions, causing Spark to throw run-time exceptions. The library in question is Typesafe Config.
After attempting many things, my solution was to resort to shading the provided dependency so it won't clash with the JAR provided by Spark at run time.
Hence, I went to the documentation for sbt-assembly and under shading, I saw the following example:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.commons.io.**" -> "shadeio.#1")
    .inLibrary("commons-io" % "commons-io" % "2.4", ...).inProject
)
Attempting to shade over com.typesafe.config, I tried applying the following solution to my build.sbt:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.#1").inProject
)
I assumed it was supposed to rename any reference to Typesafe Config in my project. But this doesn't work: it matches multiple classes in my project and causes them to be removed from the uber JAR. I see this when running sbt assembly:
Fully-qualified classname does not match jar entry:
jar entry: ***/Identifier.class
class name: **/Identifier.class
Omitting ***/OtherIdentifier.class.
Fully-qualified classname does not match jar entry:
jar entry: ***\SparkBaseJobRunner$$anonfun$1.class
class name: ***/SparkBaseJobRunner$$anonfun$1.class
I also attempted using:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.#1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
)
This did finish the assembly process of the uber JAR, but didn't have the desired run-time effect.
I'm not sure I fully comprehend the effect shading has on my build process with sbt.
How can I shade over references to com.typesafe.config in my project so when I invoke the library at run-time Spark will load my shaded library and avoid the clash caused by versioning?
I'm running sbt-assembly v0.14.1
Turns out this was a bug in sbt-assembly where shading was completely broken on Windows. This caused class files to be removed from the uber JAR and tests to fail, as the classes in question were unavailable.
I created a pull request to fix this. Starting with version 0.14.3 of sbt-assembly, the shading feature works properly. All you need to do is update to the relevant version in plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
In order to shade a specific JAR in your project, you do the following:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "my_conf.#1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
    .inProject
)
This will rename the com.typesafe.config package so that it is bundled inside my_conf. You can then verify this using jar -tf on your assembly (irrelevant parts omitted for brevity):
***> jar -tf myassembly.jar
my_conf/
my_conf/impl/
my_conf/parser/
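To confirm that references inside your own classes were rewritten as well, you could disassemble one of them and look for the new package prefix (com.example.MyJob is a hypothetical class name):
***> javap -c -cp myassembly.jar com.example.MyJob | grep my_conf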
Edit
I wrote a blog post describing the issue and the process that led to it for anyone interested in a more in-depth explanation.

How to resolve a non-jar (dll/jnilib) library dependencies in sbt?

In an SBT build.sbt project file, is it possible to retrieve library dependencies which are not bundled as a jar?
In my case, I am trying to use QTSampledSP which requires .dll and .jnilib libraries.
To download the artifact, you need to make Ivy (and hence sbt) explicitly aware of the DLL artifact. Add the following to build.sbt in your project.
lazy val QtSampledJniLibArt = Artifact("qtsampledsp-osx", "jnilib", "jnilib")
libraryDependencies += "com.tagtraum" % "qtsampledsp-osx" % "0.9.6" artifacts(QtSampledJniLibArt)
resolvers += "beatunes" at "http://www.beatunes.com/repo/maven2"
Then you need to tell sbt to pay attention to these artifacts (again build.sbt):
classpathTypes ++= Set("jnilib", "dll")
By default, sbt will only add a few types into the classpath (and jnilib and dll are not amongst them).
[sbt-0-13-1]> help classpathTypes
Artifact types that are included on the classpath.
[sbt-0-13-1]> show classpathTypes
[info] Set(eclipse-plugin, bundle, hk2-jar, orbit, jar)
Since these DLL/jnilib files are needed on the classpath to run correctly, the classpathTypes setting above, where you add the additional types, corrects things, as you can see below (don't forget to reload when in the sbt console).
[sbt-0-13-1]> show classpathTypes
[info] Set(eclipse-plugin, bundle, hk2-jar, jnilib, orbit, jar, dll)
If you need to look at these files in more detail, check out the update report (from the update task), where you can inspect all configurations/modules/artifacts. Run show update in the sbt console and look at the files in target/resolution-cache/reports.
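If you also need these native files extracted somewhere that java.library.path can reach, a small custom task can copy them out of the update report. An untested sketch (the task name and output directory are arbitrary choices):
// copy resolved .jnilib/.dll artifacts into target/native
lazy val copyNativeLibs = taskKey[Seq[File]]("Copy native library artifacts")
copyNativeLibs := {
  val report = update.value
  // select only the native artifact types registered in classpathTypes above
  val natives = report.select(artifact = artifactFilter(`type` = "jnilib")) ++
    report.select(artifact = artifactFilter(`type` = "dll"))
  val dir = target.value / "native"
  IO.createDirectory(dir)
  natives.map { f =>
    val dest = dir / f.getName
    IO.copyFile(f, dest)
    dest
  }
}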