Spark build.sbt file versioning - scala

I am having a hard time understanding the multiple version numbers that go into the build.sbt file for Spark programs.
1. version
2. scalaVersion
3. Spark version
4. revision number
There are also compatibility constraints between these versions.
Can you please explain how to choose these versions for my project?

I hope the following SBT lines and their comments will be sufficient to answer your question.
// The version of your project itself.
// You can change this value whenever you want,
// e.g. every time you make a production release.
version := "0.1.0"
// The Scala version your project uses to compile.
// If you use Spark, you can only use a 2.11.x version.
// Also, because Spark includes its own Scala at runtime,
// I recommend you use the same one;
// you can check which one your Spark instance uses in the spark-shell
// (see the snippet after this listing).
scalaVersion := "2.11.12"
// The Spark version the project uses to compile.
// Because you won't generate an uber jar with Spark included,
// but deploy your jar to a Spark cluster instance,
// this version must match the remote one, unless you want weird bugs...
val SparkVersion = "2.3.1"
// Note that I use a val for the Spark version
// to make it easier to include several Spark modules in my project;
// this way, if I want/have to change the Spark version,
// I only have to modify one line,
// and I avoid strange errors caused by changing some versions but not others.
// Also note the 'Provided' modifier at the end:
// it tells SBT not to include the Spark bits in the generated jar,
// neither in the package nor in the assembly task.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % SparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % SparkVersion % Provided
)
// Exclude Scala from the assembly jar, because Spark already includes it.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
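You can check both versions from the spark-shell of the target cluster; a quick sanity check (the values in the comments are just examples):
scala> spark.version // Spark version, e.g. 2.3.1
scala> util.Properties.versionString // Scala version bundled with Spark, e.g. "version 2.11.12"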
You should also take care of the SBT version, that is, the version of SBT used by your project. You set it in the project/build.properties file:
sbt.version=1.2.3
Note:
I use the sbt-assembly plugin to generate a jar with all dependencies included except Spark and Scala. This is useful if you use other libraries like the MongoSparkConnector, for example.
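sbt-assembly is enabled in project/plugins.sbt; a minimal sketch (the plugin version here is only an example, pick one compatible with your sbt release):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")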

Related

sbt dependencies ignoring version

In my build.sbt file I'm stating that I want to use version 18.9 of a library:
val finagleVersion = "18.9.0"
<zip>
lazy val commonDependencies = Seq(
<zip>,
"com.twitter" %% "finagle-core" % finagleVersion,
but this seems to be ignored when I run sbt with
scalacOptions ++= (compilerOptions :+ "-Ylog-classpath"),
which outputs all the jars used at compile time. And there I see that for every finagle dependency including core the 19.3 version is used:
C:\Users\<me>\.coursier\cache\v1\https\<me>%40<company repo>\artifactory\Central-cache\com\twitter\finagle-core_2.12\19.3.0\finagle-core_2.12-19.3.0.jar
Where is this "preference" for the latest versions coming from?
After using evicted and seeing which library overrides the version you want, you can opt to use dependencyOverrides. For example:
dependencyOverrides += "com.twitter" %% "finagle-core" % "18.9.0"
You do have to be careful though: the library that also depends on Finagle may require the newer version and break if you force the older one. That is why you should first check which library is evicting the old version and validate whether it is OK to override it.
Also important: this is an sbt-only feature, so the override won't be present in the published pom.xml!
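If several Finagle modules are involved, one way to keep the override consistent is to pin them all from a single val; a sketch, assuming sbt 1.x (where dependencyOverrides is a Seq) and with finagle-http only as an illustrative second module:
val finagleVersion = "18.9.0"
dependencyOverrides ++= Seq(
  "com.twitter" %% "finagle-core" % finagleVersion,
  "com.twitter" %% "finagle-http" % finagleVersion
)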

Listing the dependencies of a configuration with a custom Scala library

I have a CLI app which compiles only against Scala 2.11 (because of some internal dependency).
I want to package this app as an sbt plugin. This plugin runs the app by forking the JVM, so it runs separately with its own classpath to avoid Scala library conflicts.
Obviously I need to download the Scala 2.11 app with all its dependencies, and I am using a custom Configuration for it. My issue is that when I try to list the dependencies, it comes back with the Scala library configured by the project.
The specific code is here: https://github.com/thibaultdelor/CliAppSbtPlugin/blob/master/plugin/src/main/scala/com/thibaultdelor/MyWrapperPlugin.scala#L33
autoScalaLibrary in CliConfig := false,
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % "2.11.12" % CliConfig,
"com.thibaultdelor" % "mycli_2.11" % "0.0.1" % CliConfig
)
val dependencies = (update in CliConfig).value.select(configurationFilter(CliConfig.name))
Here, if the project uses Scala 2.12, dependencies will contain scala-library 2.12 instead of 2.11 as I would like.
Any help welcome, I am stuck. The sample project is on GitHub and has a failing test case for this.
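For context, such a custom configuration is usually declared roughly like this; a sketch only, the exact names in the linked project may differ:
// build.sbt of the plugin
val CliConfig = config("cliConfig").hide
ivyConfigurations += CliConfig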

Explanation of SBT build file

Question
Is the .sbt file written in Scala or in an sbt proprietary language? Please help me decipher the sbt build definition.
lazy val root = <--- Instance of the SBT Project object? Why "lazy"? Is "root" the reserved keyword for sbt to identify the project in the build.sbt?
(project in file(".")) <--- Is this defining a Project object regarding the current directory having the SBT expected project structure?
.settings( <--- Is this a scala function call of def settings in the Project object?
name := "NQueen",
version := "1.0",
scalaVersion := "2.11.8",
mainClass in Compile := Some("NQueen")
)
libraryDependencies ++= Seq( <--- Is libraryDependencies a reserved keyword of type scala.collection.Seq, using which sbt downloads dependencies and packages them as part of the output jar?
"org.apache.spark" %% "spark-core" % "2.3.0", <--- Concatenating and creating the full library name including the version. I suppose I need to look into the respective documentation to find out what to specify.
"org.apache.spark" %% "spark-mllib" % "2.3.0"
)
// <--- Please explain what this block does and where I can find the explanations.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Resources
Please suggest good resources to understand the design and mechanism of how .sbt works. I looked into the SBT getting-started guide and documentation, but like the Scala definition itself, it is difficult to understand. With make, ant, or maven, how things get pieced together and the design/mechanism are much clearer; I need to find good documentation or tutorials for SBT.
References
I looked into the references below trying to understand.
SBT: How to get started using the Build.scala file (instead of build.sbt)
What is the difference between build.sbt and build.scala?
SBT - Build definition
SBT Project object
scala.collection.Seq
SBT Library dependencies
Spark 2.3 Quick Start
sbt can be really difficult for first-time users, and it's ok not to fully understand all of the definitions. It will become clearer over time.
Let me first simplify your build.sbt. It contains some unnecessary parts and will be easier to explain without them:
name := "NQueen"
version := "1.0"
scalaVersion := "2.11.8"
mainClass in Compile := Some("NQueen")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" %% "spark-mllib" % "2.3.0"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
and for your questions:
Is the .sbt file written in Scala or in an sbt proprietary language?
Well, it's both. You can do most Scala operations in an .sbt file: you can import and use external dependencies, write custom code, etc., but some things you can't do (define classes, for example).
It might also look like a dedicated language, but in reality it's just a DSL written in Scala (:=, in, %%, and % are all functions written in Scala).
libraryDependencies is a reserved keyword of type scala.collection.Seq using which sbt downloads and packages them as part of the output jar?
libraryDependencies is not a reserved keyword; you can think of it as a way to configure your project.
By writing libraryDependencies := Seq(..) you are basically setting the value of libraryDependencies.
But you are right about the meaning: it is the list of dependencies that should be downloaded.
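As a small illustration (these lines are not part of the original build.sbt):
// := replaces the whole value of the setting
libraryDependencies := Seq("org.apache.spark" %% "spark-core" % "2.3.0")
// += and ++= append to it instead
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.0"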
Concatenating and creating the full library name including version. I suppose I need to look into respective documentations to find out what to specify.
Keep in mind that %% and % are functions. You use those functions to specify which modules should be downloaded and added to the classpath.
You can find many dependencies (and their versions) on mvnrepository.
For example, for Spark: https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.3.0
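For instance, with scalaVersion set to 2.11.x the following two lines resolve the same artifact (the one the mvnrepository link above points to):
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.3.0"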
Please explain what this block does and where I can find the explanations.
assemblyMergeStrategy is a setting that comes from the sbt-assembly plugin.
That plugin allows you to pack your application into a single jar with all its dependencies.
You can read about the merge strategy here: https://github.com/sbt/sbt-assembly#merge-strategy
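A variant shown in the sbt-assembly documentation keeps the plugin's default strategy for anything that isn't matched explicitly; a sketch, adapt the cases to your project:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}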

sbt assembly, including my jar

I want to build a 'fat' jar of my code. I mostly understand how to do this, but all the examples I have assume the jar is not local, and I am not sure how to include another JAR that I built, and that my Scala code uses, into my assembled jar. What folder does this JAR have to reside in?
Normally when I run my current code as a test using spark-shell it looks like this:
spark-shell --jars magellan_2.11-1.0.6-SNAPSHOT.jar -i st_magellan_abby2.scala
(the jar file is right in the same path as the .scala file)
So now I want to build a build.sbt file that does the same and includes that SNAPSHOT.jar file?
name := "PSGApp"
version := "1.0"
scalaVersion := "2.11.8"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
// provided means don't include it, it's already on the cluster?
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
//add magellan here somehow?
)
So where would I put the jar in the SBT project folder structure so it gets picked up when I run sbt assembly? Is that the main/resources folder, which the reference manual says is where 'files to include in the main jar' go?
What would I put in the libraryDependencies here so it knows to add that specific jar and not go out into the internet to get it?
One last thing: I was also doing some imports in my test code that don't seem to fly now that I put the code in an object with a def main attached to it.
I had things like import sqlContext.implicits._ right in the code, just above where it was about to be used, like so:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
val distance = udf { (a: Point, b: Point) =>
  a.withinCircle(b, .001f) // current radius set to .0001
}
I am not sure: can I just keep these imports inside def main, or do I have to move them elsewhere somehow? (Still learning Scala and wrangling with the scoping, I guess.)
One way is to build your fat jar using the assembly plugin (https://github.com/sbt/sbt-assembly) locally and run publishLocal to store the resulting jar in your local ivy2 cache.
This will make it available for inclusion in your other project, based on the build.sbt settings in this project, e.g.:
name := "My Project"
organization := "org.me"
version := "0.1-SNAPSHOT"
Will be locally available as "org.me" %% "my-project" % "0.1-SNAPSHOT"
SBT will search local cache before trying to download from external repo.
However, this is considered bad practice, because only the final project should ever be a fat jar. You should never include one as a dependency (many headaches).
There is no reason to make the magellan project a fat jar if the library is included in PSGApp. Just publishLocal without assembly.
Another way is to make the projects depend on each other as code, not as a library:
lazy val projMagellan = RootProject("../magellan")
lazy val projPSGApp = project.in(file(".")).dependsOn(projMagellan)
This makes compilation in projPSGApp trigger compilation in projMagellan.
It depends on your use case though.
Just don't get into a situation where you have to manage your .jar manually.
The other question:
import sqlContext.implicits._ should always be included in the scope where DataFrame actions are required, so you shouldn't put that import with the other ones in the header.
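A minimal sketch of what that looks like (the object and app names are taken from the question; the rest is assumed):
object PSGApp {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("PSGApp")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // the implicits belong to this sqlContext instance,
    // so the import lives here, after the instance exists
    import sqlContext.implicits._
    // ... DataFrame / udf code that needs the implicits goes below ...
  }
}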
Update
Based on the discussion in the comments, my advice would be:
Get the magellan repo
git clone git@github.com:harsha2010/magellan.git
Create a branch to work on, e.g.
git checkout -b new-stuff
Change the code you want
Then update the version number, e.g.
version := "1.0.7-SNAPSHOT"
Publish locally
sbt publishLocal
You'll see something like (after a while):
[info] published ivy to /Users/tomlous/.ivy2/local/harsha2010/magellan_2.11/1.0.7-SNAPSHOT/ivys/ivy.xml
Go to your other project
Change build.sbt to include
"harsha2010" %% "magellan" % "1.0.7-SNAPSHOT" in your libraryDependencies
Now you have a good (temp) reference to your library.
Your PSGApp should then be built as a fat jar assembly to pass to Spark:
sbt clean assembly
This will pull in the custom-built jar.
If the change in the magellan project is useful for the rest of the world, you should push your changes and create a pull request, so that in the future you can just include the latest build of this library.

Akka migration from 2.0 to 2.1

I have started working with Actors, and was following a simple example as mentioned in the Getting Started Guide.
Specs:
Scala Version: 2.9.2
Akka Version: 2.0
I ran the example and it ran well. Then I changed my sbt build script to:
name := "PracAkka"
scalaVersion := "2.9.2"
resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"
libraryDependencies += "com.typesafe.akka" % "akka-actor_2.10" % "2.1.2"
I.e., I started using Akka 2.1.2. There were small changes, and as per the migration guide I made the respective changes. But I am still getting the error below:
class file needed by Props is missing. reference type ClassTag of package reflect refers to nonexisting symbol.
What do I need to change?
The documentation is quite clear I'd say: http://doc.akka.io/docs/akka/2.1.2/project/migration-guide-2.0.x-2.1.x.html
(I.e. Akka 2.1.x is for Scala 2.10)
From the sbt documentation:
If you use groupID %% artifactID % revision rather than groupID % artifactID % revision (the difference is the double %% after the groupID), sbt will add your project's Scala version to the artifact name. This is just a shortcut. You could write this without the %%:
libraryDependencies += "org.scala-tools" % "scala-stm_2.9.1" % "0.3"
Assuming the scalaVersion for your build is 2.9.1, the following is identical:
libraryDependencies += "org.scala-tools" %% "scala-stm" % "0.3"
The idea is that many dependencies are compiled for multiple Scala versions, and you'd like to get the one that matches your project.
Your script is pulling in Akka built for Scala 2.10 ("akka-actor_2.10"); try changing it to match your Scala version.
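Putting the two answers together, a sketch of what the dependency could look like if you move to Scala 2.10 (which Akka 2.1.x requires) and let %% pick the suffix for you; the versions here are examples:
scalaVersion := "2.10.1"
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.1.2"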