sbt-assembly: Generate a minimal JAR file

I've been using sbt-assembly to generate a standalone JAR file for my Scala project. However, I would like to reduce the size of my JAR file (it's currently around 150 MB, and there's definitely room for improvement there).
I used the following command to list the contents of the JAR file that's produced:
jar tf <JAR file>
This revealed that there are lots of classes in the generated JAR file that are not used in the project. I believe these classes get included as part of third-party JARs.
Questions
(a) Is there an option that I can use to instruct sbt-assembly to generate a minimal JAR file that does not include the third-party classes that are not used in my project?
(b) I could use AssemblyStrategy to manually specify which files need to be excluded. Is this a sound strategy? I'm a bit concerned that with this approach the JAR file might end up throwing unexpected ClassNotFoundException errors at runtime.
Thanks in advance.

It's not easy to say what's used in your project and what is not. If you include a dependency in a project, it might bring in a few other ones. Those child dependencies might also require their own dependencies, and so on.
By default, if you include some dependency in your project, you intend to use it. The author of a dependency usually does the same thing. Thus, there is usually not much you can throw away; it's there for a reason. There are a couple of cases where this is not true:
The dependency's author includes additional dependencies that are only used in some setups, and those setups don't apply to your project
You are using a mega-dependency when you actually need only one of its libraries/features.
There are counterexamples to this as well: ScalaTest does not ship pegdown for generating HTML test reports because you usually don't need it, but it is needed if you use the -h flag to generate HTML.
Imagine the case where you use Apache Tika for PDF parsing. It wraps PDFBox to do the actual parsing, and in that case you don't need the bloat of all the other libraries that parse MS Office documents. The best thing to do is not to exclude files manually via sbt's exclude or sbt-assembly rules, because there is a risk you get it wrong and hit a runtime class-loading exception. Instead, use the right dependency, like PDFBox directly. Unfortunately, in many cases it is a lot of manual work to figure out all the dependencies you need, so it's your choice: easy and fat JAR, or painful and lean.
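For example, if PDF parsing is all you use Tika for, the swap could look roughly like this in a build.sbt (the coordinates are real, but the versions are just examples):
// instead of the all-in-one parser bundle:
// libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.24"
// depend directly on the library that does the work you actually need:
libraryDependencies += "org.apache.pdfbox" % "pdfbox" % "2.0.24"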
There are two ways to exclude dependencies (a sketch of both follows this list):
Exclude transitive dependencies with exclude. See the docs here.
Don't use the top level dependency and manually add its subdependencies as you need them.
OK, one more, less fun option: use provided and make sure the libraries are copied to your target environment and are on the classpath. If you have many JARs using the same libraries, this helps to share them.
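A rough sketch of the two routes above, plus the provided option, in a build.sbt (the coordinates here are made-up placeholders):
// 1) keep the dependency but exclude one of its transitive pieces:
libraryDependencies += ("com.example" %% "mega-lib" % "1.0").exclude("com.example", "unneeded-module")
// 2) skip the umbrella artifact and list only the sub-modules you actually use:
libraryDependencies += "com.example" %% "mega-lib-core" % "1.0"
// 3) compile against a library but expect it on the target classpath at runtime:
libraryDependencies += "com.example" %% "shared-lib" % "1.0" % "provided"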
You can visualize your dependency tree with this plugin: https://github.com/jrudolph/sbt-dependency-graph. It's very helpful when trying to figure out what you are using and what you can remove. There are some tools like Tattletale and loosejar that people suggest, but I haven't tried them. If anyone has experience with those, please share.
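Wiring that plugin in is roughly the following (the version is only an example; check the plugin's README for a current one):
// project/plugins.sbt
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.2")
// then run dependencyTree from the sbt shell to print the resolved tree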

What you might want to look at are tree shakers.
For Java there's the following (I have not tried/used it):
http://proguard.sourceforge.net/

Related

How to isolate libraries in an unmanaged dependency .jar file so they don't conflict with others

I need to add a .jar as an unmanaged dependency to an sbt Scala project (it is the java-stellar-sdk). Everything works well as long as I don't run sbt test. There seems to be a Mockito version in the .jar file that conflicts with the one I am using in the project. I get a lot of errors that certain Mockito matchers are not found, but everything works fine without the .jar in the lib folder.
Is there a way to tell sbt that it should ignore certain libraries in the .jar or that managed dependencies take precedence? I also found this related question but obviously it didn't help me.
An alternative workaround would also help a lot. Is it possible to isolate the libraries in the jar in a way that allows me to just make a certain package visible to the outside?
Update: The .jar contains Mockito 2 but my project uses Mockito 1, so this is a very simple and obvious conflict that I can solve by upgrading to Mockito 2 (which I tried, and it works). However, the question remains: is there another reasonable way to isolate the Mockito dependency in the .jar so it does not interfere with my project, in case I can't or don't want to resolve the conflict by switching to a newer version of the library in question? Maybe altering the .jar to rename the conflicting packages? I don't know. Something like that.
I know that this is a very general question that has likely been discussed somewhere else in depth. However, I didn't find anything that really satisfied me. Links to relevant discussions of the topic are of course appreciated as well.
I can think of 3 ways for you to do it (ordered from simple to difficult):
Delete Mockito 2 manually from the jar file.
Since the jar is just a zip file, you can extract it, delete all the conflicting files, and pack it again.
Compile that jar from source yourself, and set Mockito as a test dependency (as it should be). If you do that, consider opening a PR with your change to fix the problem for the community.
Shade the Mockito files in the jar.
Shading is the process of renaming the classes in a jar file according to certain rules. You can do it either with Jar Jar Links or with the sbt-assembly plugin. See this answer to get you started with sbt-assembly: https://stackoverflow.com/a/47974750/245024
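A minimal sketch of option 3 with sbt-assembly, assuming you want the bundled Mockito classes moved under a shaded package name (the shaded prefix here is arbitrary):
// build.sbt, with the sbt-assembly plugin enabled
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.mockito.**" -> "my_shaded.org.mockito.@1").inAll
)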
You should be able to arrange for your Mockito 1 classes to appear before the Mockito 2 classes on the classpath. That will cause your classes to win any conflicts.

Generate a JAR from one Scala source file

I have no Scala experience, but I need to create a JAR to include on a project's classpath from a single Scala source file.
I'm thinking there is a relatively straightforward way to do this, but I can't seem to figure it out.
The Scala file is here: http://pastebin.com/MYqjNkac
The JAR doesn't need to be executable, it just needs to be able to be referenced from another program.
The most convenient way is to use a build tool like sbt or Maven. For Maven there is the maven-scala-plugin, and for sbt there is a tutorial here.
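For the sbt route, a minimal build.sbt for a one-file project could look roughly like this; the Spark dependency is only a guess based on the package name in the pastebin, so adjust it to whatever the file actually imports:
name := "python-converters"
scalaVersion := "2.11.12"
// provided: needed for compilation, but not packaged into the jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"
With the source under src/main/scala, running sbt package then produces a plain (non-executable) jar under target/scala-2.11/.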
If you don't want to use any build tool, you can compile the code with scalac and then create the jar file manually, either with the jar tool or by zipping the resulting class files and renaming the archive to .jar. But you have to preserve the directory structure. In your pastebin you use the package org.apache.spark.examples.pythonconverters, so make sure the directories match.
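The manual route, as a sketch (the source file name is hypothetical, and any libraries the file imports must be put on the compiler classpath):
mkdir classes
scalac -classpath <dependency JARs> -d classes SomeConverters.scala
jar cf converters.jar -C classes .
The -C flag adds the class files relative to the classes directory, which keeps the package directory structure intact inside the jar.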
By the way, if you just want to integrate this piece of code with your Java project and you are using Maven, you can keep the Scala code in the same project as well (in src/main/scala). Just use the maven-scala-plugin and hook it into the compile phase, or an earlier phase if your Java code depends on it. However, I don't recommend mixing multiple languages in one project; I would split it into two separate ones.

What are best practices for using Hibernate's hbm2java?

I am using Hibernate, Maven, and Eclipse (STS build) to build a project. I'm using hbm.xml files to specify my schema. I want to use Hibernate's hbm2java to generate my model classes. I have it working well and generating the kind of code I want.
It runs perfectly from the command line, generating the model code and then building and testing as expected.
However, Eclipse seems unable to handle it. It will periodically "lose its mind" and be unable to resolve very simple imports and classes referenced in my DAO classes, which are hand-coded. The things it can't find are classes like HibernateUtil. Ironically, it appears to not have any trouble finding the model classes.
The unresolved classes are in target/classes/blah-blah folder at the end of the run. So they're apparently getting copied to the right place.
In a "continuous integration" environment, is it best to generate the sources once, commit them to my version control, and then disable code gen? Or is it possible to have the code generated each time, thus ensuring I pick up any database changes without human intervention?
IMHO, entities should be the core of your application, and should be designed, implemented and documented with care. They're supposed to be objects, with methods encapsulating behavior. Having them autogenerated is an absurdity, IMO.
Generating them at the very beginning might be an option to get you started, but once they've been generated, hand-craft them and don't generate them again. Add necessary properties and methods as the schema changes, and refactor existing code.
BTW, I really prefer using annotations for the mapping, because it's less verbose, less error-prone, and all the information is in a single place.
Try this:
From the command line, go to your project directory (where the project's pom.xml is located) and run:
mvn eclipse:clean eclipse:eclipse
If it says it is unable to find the eclipse plugin, then try:
mvn eclipse:install-plugin
first, and then try the command above again.
This way, all the Maven and project dependencies will be resolved at the Eclipse level as well.
Let me know if this is not what you were looking for.

Selectively include dependencies in JAR

I have a library that I wrote in Scala that uses Bouncy Castle and has a whole bunch of dependencies. When I roll a jar, I can either roll a "fat" jar that has all the dependencies (including Scala), which weighs in at around 19 MB, or I can roll a skinny jar, which doesn't have dependencies but is only a few hundred KB.
The problem is that I need to include the Bouncy Castle classes/jar with my library, because if it's not on the classpath at runtime, all kinds of exceptions get thrown.
So, I think the ideal situation would be if there were some way I could get either Maven or sbt to include some but not all dependencies in the jar that gets rolled. Some dependencies are needed at compile time but not at runtime, such as the Scala standard libraries. Is there some way to get that to happen?
Thanks!
I would try out the sbt proguard plugin from https://github.com/nuttycom/sbt-proguard-plugin . It should be able to weed out the classes that are not in use.
If it is sufficient to explicitly define which dependencies should be added (on the artifact level, i.e., single JARs), you can define an assembly (in the case of a single project) or an additional assembly project (in the case of a multi-module project). Assembly descriptors can explicitly exclude/include artifacts from the dependencies.
Here is some good documentation on this topic (section 8.5.4), and here is the official documentation.
Note that you can include all artifacts that belong to one group by using the wildcard notation in dependencySets, e.g. hibernate:*:jar would include all JAR files belonging to the hibernate group.
Covering Maven...
Because you declare your project to be dependent upon Bouncy Castle in your Maven POM, anybody using Maven to depend upon your library will by default pull in Bouncy Castle as a transitive dependency.
You should set the appropriate scope on your dependencies, e.g. compile for stuff needed at compile time and runtime, test for dependencies only needed in testing, and provided for stuff you expect to be provided by the environment.
Whether your library's dependencies are packaged into dependent projects when they are built is a question of how those projects are configured, and setting the scopes will influence the default behaviour.
For example, jar-type packaging by default does not include dependencies, whereas war will include those in compile scope (but not test or provided). The design aim here was to have packaging plugins behave in the most commonly required way without needing configuration, but of course packaging plugins in Maven can be configured to have different behaviour if needed. The plugins themselves which do packaging are well documented at the Apache Maven site.
If users of your library are unlikely to be using Maven to build their projects, an option is to use the Shade plugin, which allows you to produce an "uber-jar" containing all the dependencies you wish. You can configure particular includes or excludes.
This can be a problematic way to deliver, for example where your library includes dependencies whose versions clash with the direct dependencies of the projects using it, i.e. they use a different version of the same libraries than yours does.
However, if you can, it is best to leave this to Maven to manage, so that projects using your library can decide whether they want your dependencies or want to specify particular versions, which gives them more flexibility. This is the idiomatic approach.
For more information on dependencies and scopes in Maven, see the reference guide published by Sonatype.
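For the sbt side of the same question, the scoping idea translates roughly to the following; the coordinates and versions are examples, and the last line uses the (older-style) sbt-assembly option for keeping the Scala standard library out of the fat jar:
// build.sbt
libraryDependencies ++= Seq(
  "org.bouncycastle" % "bcprov-jdk15on" % "1.64",             // needed at runtime, so ship it
  "org.scalatest" %% "scalatest" % "3.0.8" % Test,            // test code only
  "com.example" %% "host-environment-lib" % "1.0" % Provided  // assumed present on the target classpath
)
// with sbt-assembly, compile against Scala but leave scala-library out of the jar:
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)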
I'm not a scala guy, but I have played around with assembling stuff in Java + Maven.
Have you tried looking into creating your own assembly descriptor for the assembly plugin? https://maven.apache.org/plugins/maven-assembly-plugin/assembly.html
You can copy/paste the jar-with-dependencies descriptor and then just add some excludes to your <dependencySet>. I'm not a Maven expert, but you should be able to configure it so that different profiles kick off different assembly builds.

eclipse, one classpath for compiling, another for launching

example:
For logging, my code uses log4j, but other jars my code depends on use slf4j instead, so both jars must be on the build path. Unfortunately, it's now possible for my code to directly use (depend on) slf4j, either via content assist or some other developer's changes. I would like any use of slf4j to show up as an error, but my application (and tests) will still need it on the classpath when running.
explanation:
I'd like to find out if this is possible in Eclipse. This scenario happens often for me. I'll have a large project that uses a lot of third-party libraries, and of course those third-party jars have their own dependencies as well. So I have to include all the dependencies in the classpath ("build path" in Eclipse) for the application and its tests to compile and run (from within Eclipse).
But I don't want my code to use all of those jars, just the few direct dependencies I've decided upon myself. So if my code accidentally uses a dependency of a dependency, I want it to show up as a compilation error. Ideally, as class not found, but any error would do.
I know I can manually configure the classpath when running outside of Eclipse, and even within Eclipse I can modify the classpath for a specific class I'm running (in the run configurations), but that's not manageable if you run a lot of individual test cases or have a lot of main() classes.
It sounds like your project has enough dependency relationships that you might consider structuring it with OSGi bundles (plug-ins). Each bundle gets its own classloader and gets to specify what bundles (and optionally what version ranges, etc.) it depends on, what packages it exports, whether it re-exports stuff from its dependencies, etc.
Eclipse itself is structured out of Eclipse plug-ins and fragments, which are just OSGi bundles with an optional tiny bit of additional Eclipse wiring (plugin.xml, which is used to declare Eclipse "extension points" and "extensions") attached. Eclipse thus has fairly good tooling for creating and managing bundles built-in (via the Plug-in Development Environment). Much of what you find out there may lead you to conflate "OSGi bundle" with "plug-in that extends the Eclipse IDE", but the two concepts are quite separable.
The Eclipse tooling does distinguish rather clearly (and sometimes annoyingly, but in the "helpful medicine" way) between the bundles in your build environment vs. the bundles that a particular run configuration includes.
After a few years of living in OSGi land, the default Java "flat classpath" feels weird and even kind of broken to me, largely because (as you've experienced) it throws all JARs into one giant arena and hopes they can sort of work things out. The OSGi environment gives me a lot more control over dependency relationships, and as a "side effect" also naturally demands clarification of those relationships. Between these clear declarations and the tooling's enforcement of them, the project's structure is more obvious to everyone on the team.
if my code accidentally uses a dependency of a dependency, I want it to show up as a compilation error. Ideally, as class not found, but any error would do.
Put your code in one plug-in, your direct dependencies in other plug-ins, their dependencies in other plug-ins, etc. and declare each plug-in's dependencies. Eclipse will immediately do exactly what you want. You won't be offered dependencies' dependencies' contents in autocompletes; you'll get red squiggles and build errors; etc.
Why not use access rules to keep your code clean?
It looks like this would be better managed with Maven, integrated into Eclipse with m2eclipse.
That way, you can execute only part of the Maven build lifecycle, and you can manage a separate set of dependencies per build step.
In my experience it helps to be more restrictive: I made the team fill out (paper) forms stating why each jar was needed and under what license...
and they would rather type in a few lines of code than drag along 20 jars just to open a file with one line of code, or for some other fancy 'feature'.
Using Maven could help for a while, but the first time you spot jars with names like nightly-build or snapshot, you will know you're in jar hell.
Conclusion: choose dependencies well.
Would using the slf4j-log4j12 binding jar be useful? That allows the other libraries to use slf4j while the actual logging goes to log4j.