sbt assembly, including my jar - scala

I want to build a 'fat' jar of my code. I understand how to do this mostly but all the examples I have use the idea that the jar is not local and I am not sure how to include into my assembled jar another JAR that I built that the scala code uses. Like what folder does this JAR I have to include reside in?
Normally when I run my current code as a test using spark-shell it looks like this:
spark-shell --jars magellan_2.11-1.0.6-SNAPSHOT.jar -i st_magellan_abby2.scala
(the jar file is right in the same path as the .scala file)
So now I want to build a build.sbt file that does the same and includes that SNAPSHOT.jar file?
name := "PSGApp"
version := "1.0"
scalaVersion := "2.11.8"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
//provided means don't included it is there. already on cluster?
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
//add magellan here somehow?
)
So where would I put the jar in the SBT project folder structure so it gets picked up when I run sbt assembly? Is that in the main/resources folder? Which the reference manual says is where 'files to include in the main jar' go?
What would I put in the libraryDependencies here so it knows to add that specific jar and not go out into the internet to get it?
One last thing, I was also doing some imports in my test code that doesn't seem to fly now that I put this code in an object with a def main attached to it.
I had things like:
import sqlContext.implicits._ which was right in the code above where it was about to be used like so:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
val distance =udf {(a: Point, b: Point) =>
a.withinCircle(b, .001f); //current radius set to .0001
}
I am not sure can I just keep these imports inside the def main? or do I have to move them elsewhere somehow? (Still learning scala and wrangling the scoping I guess).

One way is to build your fat jar using the assembly plugin (https://github.com/sbt/sbt-assembly) locally and publishLocal to store the resulting jar into your local ivy2 cache
This will make it available for inclusion in your other project based on build.sbt settings in this project, eg:
name := "My Project"
organization := "org.me"
version := "0.1-SNAPSHOT"
Will be locally available as "org.me" %% "my-project" % "0.1-SNAPSHOT"
SBT will search local cache before trying to download from external repo.
However, this is considered bad practise, because only final project should ever be a fat-jar. You should never include one as dependency (many headaches).
There is no reason to make project magellan a fat-jar if library is included in PGapp. Just publishLocal without assembly
Another way is to make projects dependant on each other as code, not library.
lazy val projMagellan = RootProject("../magellan")
lazy val projPSGApp = project.in(file(".")).dependsOn(projMagellan)
This makes compilation in projPSGApp tigger compilation in projMagellan.
It depends on your use case though.
Just don't get in a situation where you have to manage your .jar manually
The other question:
import sqlContext.implicits._ should always be included in the scope where dataframe actions are required, so you shouldn't put that import near the other ones in the header
Update
Based on discussion in comments, my advise would be:
Get the magellan repo
git clone git#github.com:harsha2010/magellan.git
Create a branch to work on, eg.
git checkout -b new-stuff
Change the code you want
Then update the versioning number, eg.
version := "1.0.7-SNAPSHOT"
Publish locally
sbt publishLocal
You'll see something like (after a while):
[info] published ivy to /Users/tomlous/.ivy2/local/harsha2010/magellan_2.11/1.0.7-SNAPSHOT/ivys/ivy.xml
Go to your other project
Change build.sbt to include
"harsha2010" %% "magellan" % "1.0.7-SNAPSHOT" in your libraryDependencies
Now you have a good (temp) reference to your library.
Your PSGApp should be build as an fat jar assembly to pass to Spark
sbt clean assembly
This will pull in the custom build jar
If the change in the magellan project is usefull for the rest of the world, you should push your changes and create a pull request, so that in the future you can just include the latest build of this library

Related

Setting up Scala project

Is there a standard in place for setting up a Scala project where the build.sbt is contained in a subdirectory?
I've cloned https://github.com/lightbend/cloudflow and opened it in IntelliJ, here is the structure:
Can see core contains build.sbt.
If I open the project core in a new project window then IntelliJ will recognise the Scala project.
How to compile the Scala project core while keeping the other folders available within the IntelliJ window?
EDIT:
If you do want to play around with the project, it should suffice to either import an SBT project and select core as the root. Intellij should also detect the build.sbt if you open core as the root.
Here is the SBT Reference Manual
Traditionally, build.sbt will be at the root of the project.
If you are looking to use their libraries, you should import them in your sbt file, you shouldn't clone the repo unless you intend to modify or fork their repo.
For importing libraries into your project take a look at the Maven Repository for Cloudflow, select the project(s), click on the version you want, and select the SBT tab. Just copy and paste those dependencies into your build.sbt. Once you build the project with SBT, you should have all those packages available to you.
So in [ProjectRoot]/build.sbt something along the lines of
val scalaVersion = "2.13.4"
lazy val root = (project in file("."))
.settings(
name := "Your_Project_Name",
scalaVersion := scalaVersion,
libraryDependencies += "com.lightbend.cloudflow" %% "cloudflow-streamlets" % "2.0.26-RC12",
// Either sequentially
libraryDependencies += "com.lightbend.cloudflow" %% "cloudflow-akka" % "2.0.26-RC12",
// Or as a sequence (note that you can have a trailing comma for these)
libraryDependencies ++= Seq(
"com.lightbend.cloudflow" %% "cloudflow-blueprint" % "2.0.26-RC12",
"com.lightbend.cloudflow" %% "cloudflow-flink" % "2.0.26-RC12", // No next elemenet
)
// Or combo using both like above, just don't forget the commas in between
)

Including a Spark Package JAR file in a SBT generated fat JAR

The spark-daria project is uploaded to Spark Packages and I'm accessing spark-daria code in another SBT project with the sbt-spark-package plugin.
I can include spark-daria in the fat JAR file generated by sbt assembly with the following code in the build.sbt file.
spDependencies += "mrpowers/spark-daria:0.3.0"
val requiredJars = List("spark-daria-0.3.0.jar")
assemblyExcludedJars in assembly := {
val cp = (fullClasspath in assembly).value
cp filter { f =>
!requiredJars.contains(f.data.getName)
}
}
This code feels like a hack. Is there a better way to include spark-daria in the fat JAR file?
N.B. I want to build a semi-fat JAR file here. I want spark-daria to be included in the JAR file, but I don't want all of Spark in the JAR file!
The README for version 0.2.6 states the following:
In any case where you really can't specify Spark dependencies using sparkComponents (e.g. you have exclusion rules) and configure them as provided (e.g. standalone jar for a demo), you may use spIgnoreProvided := true to properly use the assembly plugin.
You should then use this flag on your build definition and set your Spark dependencies as provided as I do with spark-sql:2.2.0 in the following example:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
Please note that by setting this your IDE may no longer have the necessary dependencies references to compile and run your code locally, which would mean that you would have to add the necessary JARs to the classpath by hand. I do this often on IntelliJ, what I do is having a Spark distribution on my machine and adding its jars directory to the IntelliJ project definition (this question may help you with that, should you need it).

read application.conf from build.sbt

I found a lot of similar questions, but some of them are unanswered and others don't work for me.
My task is to read application.conf from build.sbt, to keep all configuration in a single place. But my build.sbt with code
import com.typesafe.config._
val conf = com.typesafe.config.ConfigFactory.parseFile(new File("conf/application.conf")).resolve()
version := conf.getString("app.version")
don't compile with error
error: object typesafe is not a member of package com
How to fix this problem?
To make this work it is important to know that sbt projects have a recursive structure, so when you run sbt you actually start a project that has as root folder the project folder within your root project.
Given that, to fix your problem you need to add typesafe-config as a library dependency also for you sbt project; in order to do that add a build.sbt in the project folder of your root project and specify the dependency in there as you would normally do (i.e. libraryDependencies += ...).
Use this in your code
import com.typesafe.config.{Config, ConfigFactory}
private val config = ConfigFactory.load()
and in your build.sbt file
libraryDependencies += "com.typesafe" % "config" % "1.3.0"

How do I resolve my own test artifacts in SBT?

One of my projects will provide a jar package supposed to be used for unit testing in several other projects. So far I managed to make sbt produce a objects-commons_2.10-0.1-SNAPSHOT-test.jar and have it published in my repository.
However, I can't find a way to tell sbt to use that artifact with the testing scope in other projects.
Adding the following dependencies in my build.scala will not get the test artifact loaded.
"com.company" %% "objects-commons" % "0.1-SNAPSHOT",
"com.company" %% "objects-commons" % "0.1-SNAPSHOT-test" % "test",
What I need is to use the default .jar file as compile and runtime dependency and the -test.jar as dependency in my test scope. But somehow sbt never tries to resolve the test jar.
How to use test artifacts
To enable publishing the test artifact when the main artifact is published you need to add to your build.sbt of the library:
publishArtifact in (Test, packageBin) := true
Publish your artifact. There should be at least two JARs: objects-commons_2.10.jar and objects-commons_2.10-test.jar.
To use the library at runtime and the test library at test scope add the following lines to build.sbt of the main application:
libraryDependencies ++= Seq("com.company" % "objects-commons_2.10" % "0.1-SNAPSHOT"
, "com.company" % "objects-commons_2.10" % "0.1-SNAPSHOT" % "test" classifier "tests" //for SBT 12: classifier test (not tests with s)
)
The first entry loads the the runtime libraries and the second entry forces that the "tests" artifact is only available in the test scope.
I created an example project:
git clone git#github.com:schleichardt/stackoverflow-answers.git --branch so15290881-how-do-i-resolve-my-own-test-artifacts-in-sbt
Or you can view the example directly in github.
Your problem is that sbt thinks that your two jars are the same artifact, but with different versions. It takes the "latest", which is 0.1-SNAPSHOT, and ignores the 0.1-SNAPSHOT-test. This is the same behaviour as you would see if, for instance you have 0.1-SNAPSHOT and 0.2-SNAPSHOT.
I don't know what is in these two jars, but if you want them both to be on the classpath, which is what you seem to want to do, then you'll need to change the name of the test artifact to objects-commons-test, as Kazuhiro suggested. It seems that this should be easy enough for you, since you're already putting it in the repo yourself.
It will work fine if you change the name like this.
"com.company" %% "objects-commons" % "0.1-SNAPSHOT",
"com.company" %% "objects-commons-test" % "0.1-SNAPSHOT" % "test",

When does a SBT package get downloaded/built?

I want to use http://dispatch.databinder.net/Dispatch.html .
The site indicates I must add this to project/plugins.sbt:
libraryDependencies += "net.databinder.dispatch" %% "core" % "0.9.1"
which I did. I then restarted the play console and compiled.
Importing doesnt work:
import dispatch._
Guess I have been silly, but then I never used a build system when using Java.
How must I trigger the process that downloads/builds the package? Where are the jars (or equivalent) stored; can I reuse them? When is the package available for use by the Play application?
It doesn't say you should add it to project/plugins.sbt. That is the wrong place. It says to add to the build.sbt file, on the root of your project. Being a Play project, project/Build.scala might be more appropriate -- I don't know if it will pick up settings from build.sbt or not.
To add the dependency in your Build.scala:
val appDependencies = Seq(
"net.databinder.dispatch" %% "core" % "0.9.1"
)
You probably need to run sbt update.
From the sbt Command Line Reference:
update Resolves and retrieves external dependencies as described in library dependencies.