Including a Spark Package JAR file in an SBT-generated fat JAR - scala

The spark-daria project is uploaded to Spark Packages and I'm accessing spark-daria code in another SBT project with the sbt-spark-package plugin.
I can include spark-daria in the fat JAR file generated by sbt assembly with the following code in the build.sbt file.
spDependencies += "mrpowers/spark-daria:0.3.0"
val requiredJars = List("spark-daria-0.3.0.jar")
assemblyExcludedJars in assembly := {
val cp = (fullClasspath in assembly).value
cp filter { f =>
!requiredJars.contains(f.data.getName)
}
}
This code feels like a hack. Is there a better way to include spark-daria in the fat JAR file?
N.B. I want to build a semi-fat JAR file here. I want spark-daria to be included in the JAR file, but I don't want all of Spark in the JAR file!

The sbt-spark-package README for version 0.2.6 states the following:
In any case where you really can't specify Spark dependencies using sparkComponents (e.g. you have exclusion rules) and configure them as provided (e.g. standalone jar for a demo), you may use spIgnoreProvided := true to properly use the assembly plugin.
You should then set this flag in your build definition and mark your Spark dependencies as provided, as done with spark-sql:2.2.0 in the following example:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
Please note that by setting this flag, your IDE may no longer have the dependency references it needs to compile and run your code locally, which means you would have to add the necessary JARs to the classpath by hand. I do this often in IntelliJ: I keep a Spark distribution on my machine and add its jars directory to the IntelliJ project definition (this question may help you with that, should you need it).
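Putting the pieces together, here is a minimal build.sbt sketch of that suggestion, assuming the question's spDependencies line is kept and that sbt-spark-package and sbt-assembly are both enabled (the version numbers are just the ones used above):

// needed to properly use the assembly plugin, per the README quote above
spIgnoreProvided := true

// spark-daria comes in through sbt-spark-package and ends up in the fat JAR
spDependencies += "mrpowers/spark-daria:0.3.0"

// Spark itself stays out of the fat JAR
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"

With Spark marked as provided and spIgnoreProvided set, sbt assembly should bundle spark-daria while leaving Spark itself out of the semi-fat JAR.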

Related

Library Dependencies bundled together to create jar file is not present in the jar file using sbt-assembly

I have multiple library dependencies in my build.sbt file. I am creating the final jar file using sbt-assembly so that it includes all the dependent libraries in the jar itself.
But when I run jar tvf jarname.jar, I am not able to find all the libraries there.
I need this in order to bundle all the libraries into one jar and provide it to spark-shell with spark-shell --jars jarpath, and then import the libraries there.
This is done because it is not possible for me to import packages directly into spark-shell using the spark-shell --packages command.
Expected:
Adding the jar file to spark-shell and then importing all the libraries there, which should be present in the jar.
Found the solution here:
Some of my dependencies included the "provided" tag and thus were not getting included in the fat jar.
libraryDependencies += "org.apache.flink" %% "flink-table-planner" % flinkVersion % "provided"

sbt assembly, including my jar

I want to build a 'fat' jar of my code. I mostly understand how to do this, but all the examples I have found assume the jar is not local, and I am not sure how to include another JAR that I built myself, and that my Scala code uses, into my assembled jar. For instance, what folder does this JAR have to reside in?
Normally when I run my current code as a test using spark-shell it looks like this:
spark-shell --jars magellan_2.11-1.0.6-SNAPSHOT.jar -i st_magellan_abby2.scala
(the jar file is right in the same path as the .scala file)
So now I want to write a build.sbt file that does the same thing and includes that SNAPSHOT.jar file.
name := "PSGApp"
version := "1.0"
scalaVersion := "2.11.8"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
//provided means don't included it is there. already on cluster?
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
//add magellan here somehow?
)
So where would I put the jar in the SBT project folder structure so it gets picked up when I run sbt assembly? Is that the main/resources folder, which the reference manual says is where 'files to include in the main jar' go?
What would I put in libraryDependencies here so that it knows to add that specific jar and not go out to the internet to get it?
One last thing: I was also doing some imports in my test code that don't seem to work now that I have put this code in an object with a def main attached to it.
I had things like import sqlContext.implicits._ right in the code, just above where it was about to be used, like so:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf

val distance = udf { (a: Point, b: Point) =>
  a.withinCircle(b, .001f) // current radius set to .0001
}
I am not sure whether I can just keep these imports inside def main or whether I have to move them elsewhere somehow. (Still learning Scala and wrangling the scoping, I guess.)
One way is to build your fat jar using the assembly plugin (https://github.com/sbt/sbt-assembly) locally and use publishLocal to store the resulting jar in your local ivy2 cache.
This will make it available for inclusion in your other project, based on the build.sbt settings in this project, e.g.:
name := "My Project"
organization := "org.me"
version := "0.1-SNAPSHOT"
This will be locally available as "org.me" %% "my-project" % "0.1-SNAPSHOT".
SBT will search the local cache before trying to download from an external repo.
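For illustration, the downstream project would then reference the locally published artifact like any other dependency (the coordinates are the example ones from above):

libraryDependencies += "org.me" %% "my-project" % "0.1-SNAPSHOT"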
However, this is considered bad practice, because only the final project should ever be a fat jar. You should never include one as a dependency (many headaches).
There is no reason to make the magellan project a fat jar if the library is included in PSGApp. Just publishLocal without assembly.
Another way is to make the projects depend on each other as code, not as a library:
lazy val projMagellan = RootProject("../magellan")
lazy val projPSGApp = project.in(file(".")).dependsOn(projMagellan)
This makes compilation in projPSGApp trigger compilation in projMagellan.
It depends on your use case though.
Just don't get into a situation where you have to manage your .jar files manually.
As for the other question:
import sqlContext.implicits._ should be in scope wherever DataFrame conversions are required, so you shouldn't put that import with the other ones at the top of the file.
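As a minimal sketch, reusing the snippets from the question (the SparkContext construction is just illustrative), keeping the import inside def main right after the SQLContext is created looks like this:

object PSGApp {
  def main(args: Array[String]): Unit = {
    val sc = new org.apache.spark.SparkContext(new org.apache.spark.SparkConf().setAppName("PSGApp"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // in scope where the DataFrame conversions are actually used
    import org.apache.spark.sql.functions.udf
    // ... DataFrame and udf code from the question goes here
  }
}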
Update
Based on the discussion in the comments, my advice would be:
Get the magellan repo
git clone git@github.com:harsha2010/magellan.git
Create a branch to work on, e.g.:
git checkout -b new-stuff
Change the code you want
Then update the version number, e.g.:
version := "1.0.7-SNAPSHOT"
Publish locally
sbt publishLocal
You'll see something like (after a while):
[info] published ivy to /Users/tomlous/.ivy2/local/harsha2010/magellan_2.11/1.0.7-SNAPSHOT/ivys/ivy.xml
Go to your other project
Change build.sbt to include
"harsha2010" %% "magellan" % "1.0.7-SNAPSHOT" in your libraryDependencies
Now you have a good (temp) reference to your library.
Your PSGApp should then be built as a fat jar assembly to pass to Spark:
sbt clean assembly
This will pull in the custom-built jar.
If the change in the magellan project is useful for the rest of the world, you should push your changes and create a pull request, so that in the future you can just include the latest build of this library.
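For reference, the dependency line from the "Change build.sbt" step above would look like this in the PSGApp build (organization and version taken from the publishLocal output):

libraryDependencies += "harsha2010" %% "magellan" % "1.0.7-SNAPSHOT"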

Shading over third party classes

I'm currently facing a problem deploying an uber-jar for a Spark Streaming application, where the same library is present in different versions, causing Spark to throw run-time exceptions. The library in question is TypeSafe Config.
After attempting many things, my solution was to resort to shading the provided dependency so it won't clash with the JAR provided by Spark at run-time.
Hence, I went to the documentation for sbt-assembly and under shading, I saw the following example:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.commons.io.**" -> "shadeio.#1")
    .inLibrary("commons-io" % "commons-io" % "2.4", ...).inProject
)
Attempting to shade over com.typesafe.config, I tried applying the following solution to my build.sbt:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.#1").inProject
)
I assumed it was supposed to rename any reference to TypeSafe Config in my project. But this doesn't work: it matches multiple classes in my project and causes them to be removed from the uber jar. I see this when running sbt assembly:
Fully-qualified classname does not match jar entry:
jar entry: ***/Identifier.class
class name: **/Identifier.class
Omitting ***/OtherIdentifier.class.
Fully-qualified classname does not match jar entry:
jar entry: ***\SparkBaseJobRunner$$anonfun$1.class
class name: ***/SparkBaseJobRunner$$anonfun$1.class
I also attempted using:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadeio.#1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
)
This did finish the assembly process of the uber JAR, but didn't have the desired run-time effect.
I'm not sure I fully comprehend the effect shading has on my build process with sbt.
How can I shade over references to com.typesafe.config in my project so when I invoke the library at run-time Spark will load my shaded library and avoid the clash caused by versioning?
I'm running sbt-assembly v0.14.1
It turns out this was a bug in sbt-assembly where shading was completely broken on Windows. This caused class files to be removed from the uber JAR and tests to fail, as the classes in question were unavailable.
I created a pull request to fix this. Starting with version 0.14.3 of sbt-assembly, the shading feature works properly. All you need to do is update to the relevant version in plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
In order to shade a specific JAR in your project, you do the following:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "my_conf.#1")
    .inLibrary("com.typesafe" % "config" % "1.3.0")
    .inProject
)
This will rename the com.typesafe.config classes so that they are packaged under my_conf. You can then verify this using jar -tf on your assembly (irrelevant parts omitted for brevity):
***> jar -tf myassembly.jar
my_conf/
my_conf/impl/
my_conf/parser/
Edit
I wrote a blog post describing the issue and the process of tracking it down, for anyone interested in a more in-depth explanation.

sbt assembly package dependencies in multiple artifacts

I am trying to generate several jars from a single project with sbt assembly, each containing some of the dependencies.
So far I have found only this Q&A, which is close to what I am looking for. However, I don't need separate configs; when I run assembly, I just want to generate all the different jars.
To be more concrete, I want to generate:
One jar with my code and some general dependencies
One jar with the Hadoop dependencies <- this is the problem, as I don't know how to say: generate another jar that has only those dependencies
One jar with Scala
Without going deep into complex sbt configurations, you could try another approach. Since the Hadoop dependencies are standard, you can mark them as provided in your build to exclude them:
"org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
For Scala, the library jar is also standard and can be downloaded separately by your "user". To remove it from the fat jar, use the following setting (sbt-assembly 0.13.0):
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
The user of your fat jar is then asked to provide both the Scala and Hadoop libraries on the classpath.
For example, when using Spark this is the correct approach as these two libraries are both provided by the Spark running environment. The same logic applies for the Hadoop MapReduce environment.
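Put together, a minimal build.sbt sketch of this approach (the coordinates and the assemblyOption syntax are the ones quoted above, so they assume sbt-assembly 0.13.x) could look like:

// Hadoop stays out of the fat jar; the cluster provides it
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"

// don't bundle the Scala library either; the runtime (e.g. Spark) provides it
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)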

File of one of the sbt plugin's dependencies

I need to get hold of the File reference to a specific artifact during the setup phase of my sbt plugin.
I've tried:
obtaining the ivy home directory, but that basically means assuming where ivy will place the files (they could even be in a local maven)
parsing System.getProperty("java.class.path"), but it only contains the sbt-launch jar
obtaining the resolved sbt jars from the update.value setting, but it doesn't have any of the plugin's jars in the list! (only the jars for the application being compiled)
Short of invoking the Ivy API manually, is there any way to get the File to the plugin's jar dependency?
NOTE: This is a very specific part of how to write an sbt plugin to launch the app with an agent factored out into a separate question.
Got it! Adding the dependency explicitly within the plugin's source reveals its resolved path:
override val projectSettings = Seq(
  libraryDependencies += "com.github.fommil.lion" %% "agent" % "1.0-SNAPSHOT",
  javaOptions ++= Seq(s"-Dhack=${update.value}")
)
The update report printed there now has a reference to the agent artifact in it!
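As a hypothetical follow-up sketch (the organization/name filters simply mirror the artifact added above, and the -javaagent flag is just an example use), the resolved File can then be pulled out of the update report with a module filter:

override val projectSettings = Seq(
  libraryDependencies += "com.github.fommil.lion" %% "agent" % "1.0-SNAPSHOT",
  javaOptions ++= {
    // look up the resolved jar for the agent artifact in the update report
    val agentJars: Seq[File] =
      update.value.matching(moduleFilter(organization = "com.github.fommil.lion", name = "agent*"))
    agentJars.headOption.toSeq.map(jar => s"-javaagent:${jar.getAbsolutePath}")
  }
)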