Building jars properly with sbt - scala

I have a MapReduce .scala file like this:
import org.apache.spark._

object WordCount {
  def main(args: Array[String]) {
    val inputDir = args(0)
    //val inputDir = "/Users/eksi/Desktop/sherlock.txt"
    val outputDir = args(1)
    //val outputDir = "/Users/eksi/Desktop/out.txt"

    val cnf = new SparkConf().setAppName("Example MapReduce Spark Job")
    val sc = new SparkContext(cnf)

    val textFile = sc.textFile(inputDir)
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(outputDir)

    sc.stop()
  }
}
When I run my code with setMaster("local[1]"), it works fine.
I want to put this code in a .jar, upload it to S3, and run it on AWS EMR. Therefore, I use the following build.sbt:
name := "word-count"
version := "0.0.1"
scalaVersion := "2.11.7"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.0.2"
)
It generates a jar file; however, none of my Scala code is in there. All I see when I extract the .jar is a manifest file.
When I run sbt package this is what I get:
[myMacBook-Pro] > sbt package
[info] Loading project definition from /Users/lele/bigdata/wordcount/project
[info] Set current project to word-count (in build file:/Users/lele/bigdata/wordcount/)
[info] Packaging /Users/lele/bigdata/wordcount/target/scala-2.11/word-count_2.11-0.0.1.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Jul 27, 2016 10:33:26 PM
What should I do to create a proper jar file that works like
WordCount.jar WordCount

Ref: "It generates a jar file; however, none of my Scala code is in there. All I see when I extract the .jar is a manifest file."
Make sure your WordCount.scala is in the root or in src/main/scala
From http://www.scala-sbt.org/1.0/docs/Directories.html
Source code can be placed in the project’s base directory as with hello/hw.scala. However, most people don’t do this for real projects; too much clutter.
sbt uses the same directory structure as Maven for source files by default (all paths are relative to the base directory):
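For reference, a sketch of how this project could be arranged to match that standard Maven-style layout (relative to the project base directory):

word-count/
├── build.sbt
└── src
    └── main
        └── scala
            └── WordCount.scala

With WordCount.scala under src/main/scala (or, less commonly, directly in the base directory), sbt package compiles it and includes the classes in the jar. As an aside, the spark-core_2.10 suffix in build.sbt does not match scalaVersion := "2.11.7"; that mismatch is not what causes the empty jar, but aligning the two (either a 2.10.x scalaVersion, or a Spark release that publishes a _2.11 artifact, ideally pulled in with "org.apache.spark" %% "spark-core") avoids binary-compatibility trouble when the job runs on EMR.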

Related

sbt: generating shared sources in cross-platform project

Building my Scala project with sbt, I want to have a task that runs prior to the actual Scala compilation and generates a Version.scala file with project version information. Here's a task I've come up with:
lazy val generateVersionTask = Def.task {
  // Generate contents of Version.scala
  val contents = s"""package io.kaitai.struct
                    |
                    |object Version {
                    |  val name = "${name.value}"
                    |  val version = "${version.value}"
                    |}
                    |""".stripMargin

  // Update Version.scala file, if needed
  val file = (sourceManaged in Compile).value / "version" / "Version.scala"
  println(s"Version file generated: $file")
  IO.write(file, contents)
  Seq(file)
}
This task seems to work, but the problem is how to plug it in, given that it's a cross project, targeting Scala/JVM, Scala/JS, etc.
This is how build.sbt looked before I started touching it:
lazy val root = project.in(file(".")).
  aggregate(fooJS, fooJVM).
  settings(
    publish := {},
    publishLocal := {}
  )

lazy val foo = crossProject.in(file(".")).
  settings(
    name := "foo",
    version := sys.env.getOrElse("CI_VERSION", "0.1"),
    // ...
  ).
  jvmSettings(/* JVM-specific settings */).
  jsSettings(/* JS-specific settings */)

lazy val fooJVM = foo.jvm
lazy val fooJS = foo.js
and, on the filesystem, I have:
shared/ — cross-platform code shared between JS/JVM builds
jvm/ — JVM-specific code
js/ — JS-specific code
The best I've come up with so far is adding this task to the foo crossProject:
lazy val foo = crossProject.in(file(".")).
  settings(
    name := "foo",
    version := sys.env.getOrElse("CI_VERSION", "0.1"),
    sourceGenerators in Compile += generateVersionTask.taskValue, // <== !
    // ...
  ).
  jvmSettings(/* JVM-specific settings */).
  jsSettings(/* JS-specific settings */)
This works, but in a very awkward way that is not really compatible with the "shared" codebase: it generates two distinct Version.scala files, one for JS and one for JVM:
sbt:root> compile
Version file generated: /foo/js/target/scala-2.12/src_managed/main/version/Version.scala
Version file generated: /foo/jvm/target/scala-2.12/src_managed/main/version/Version.scala
Naturally, it's impossible to access the contents of these files from shared, and that is exactly where I want to use them.
So far, I've come up with a very sloppy workaround:
There is a var declared in a singleton object in shared;
in both the JVM and JS main entry points, the very first thing I do is assign that variable to match the constants defined in Version.scala.
Also, I've tried the same trick with the sbt-buildinfo plugin — the result is exactly the same: it generates a per-platform BuildInfo.scala, which I can't use directly from the shared sources.
Are there any better solutions available?
Consider pointing sourceManaged to the shared/src/main/scala/src_managed directory and scoping generateVersionTask to the root project, like so:
val sharedSourceManaged = Def.setting(
  baseDirectory.value / "shared" / "src" / "main" / "scala" / "src_managed"
)

lazy val root = project.in(file(".")).
  aggregate(fooJS, fooJVM).
  settings(
    publish := {},
    publishLocal := {},
    sourceManaged := sharedSourceManaged.value,
    sourceGenerators in Compile += generateVersionTask.taskValue,
    cleanFiles += sharedSourceManaged.value
  )
Now sbt compile should output something like
Version file generated: /Users/mario/IdeaProjects/scalajs-cross-compile-example/shared/src/main/scala/src_managed/version/Version.scala
...
[info] Compiling 3 Scala sources to /Users/mario/IdeaProjects/scalajs-cross-compile-example/js/target/scala-2.12/classes ...
[info] Compiling 1 Scala source to /Users/mario/IdeaProjects/scalajs-cross-compile-example/target/scala-2.12/classes ...
[info] Compiling 3 Scala sources to /Users/mario/IdeaProjects/scalajs-cross-compile-example/jvm/target/scala-2.12/classes ...
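Once the file is generated under shared/, code in the shared sources can refer to it directly. A minimal sketch (the file name here is hypothetical; the package matches the task above):

// shared/src/main/scala/io/kaitai/struct/SomeSharedCode.scala
package io.kaitai.struct

object SomeSharedCode {
  // Compiles for both the JVM and JS targets, since Version.scala now lives under the shared sources
  def banner: String = s"${Version.name} ${Version.version}"
}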

sbt: finding correct path to files/folders under resources directory

I have a simple project structure:
WordCount
├── project
│   └── assembly.sbt
├── resources
│   └── Message.txt
├── src
│   └── main
│       └── scala
│           └── org
│               └── apache
│                   └── spark
│                       └── Counter.scala
└── build.sbt
Here's how Counter.scala looks:
package org.apache.spark

object Counter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    val path: String = getClass.getClassLoader.getResource("Message.txt").getPath
    println(s"path = $path")

    // val lines = sc.textFile(path)
    // val wordsCount = lines
    //   .flatMap(line => line.split("\\s", 2))
    //   .map(word => (word, 1))
    //   .reduceByKey(_ + _)
    //
    // wordsCount.foreach(println)
  }
}
Notice that the commented lines are actually correct, but the path variable is not. After building the fat jar with sbt assembly and running it with spark-submit to see the value of path, I get:
path = file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt
You can see that path points to the jar location, mysteriously followed by !/ and then the file name Message.txt!
On the other hand, when I'm inside the WordCount folder and run the REPL with sbt console and then type:
scala> getClass.getClassLoader.getResource("Message.txt").getPath
I get the correct path (without the file:/ prefix)
res1: String = /home/me/WordCount/target/scala-2.11/classes/Message.txt
Question:
1 - Why are there two different outputs from the same command? (i.e. getClass.getClassLoader.getResource("...").getPath)
2 - How can I use the correct path, the one that appears in the console, inside my source file Counter.scala?
For anyone who wants to try it, here's my build.sbt:
name := "Counter"
version := "0.1"
scalaVersion := "2.11.8"
resourceDirectory in Compile := baseDirectory.value / "resources"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at "http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at "https://mvnrepository.com/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
and the spark-submit command is:
spark-submit --master local --deploy-mode client --class org.apache.spark.Counter /home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar
1 - why is there two different outputs from the same command?
By "command", I am assuming you mean getClass.getClassLoader.getResource("Message.txt").getPath, so I would rephrase the question as: why does the same getResource(...) call on the classloader return two different results depending on sbt console vs spark-submit?
The answer is that they use different classloaders, each with a different classpath. sbt console uses your class directories as the classpath, while spark-submit uses the fat JAR, which has the resources packaged inside it. When a resource is found inside a JAR, the classloader returns a JAR URL, which looks like jar:file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt.
The whole point of using Apache Spark is to distribute some work across multiple computers, so I don't think you want to see your machine's local path in production.
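As a side note (not part of the original answer, just the usual pattern): a resource packaged inside a JAR is not a plain file on disk, so reading it through an InputStream works both from sbt console and from the fat JAR, whereas getPath is only usable when the resource is an actual file:

import scala.io.Source

object MessageReader {
  // Reads Message.txt from the classpath, whether it sits under target/.../classes or inside the fat JAR
  def readMessage(): String = {
    val stream = getClass.getClassLoader.getResourceAsStream("Message.txt")
    require(stream != null, "Message.txt not found on the classpath")
    try Source.fromInputStream(stream).mkString
    finally stream.close()
  }
}

sc.textFile, on the other hand, needs a path that Hadoop can open, so for that route the file has to live outside the JAR (for example shipped alongside the job with spark-submit --files) rather than be read from the classpath.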

Manually invoke ScalaPB compiler in an SBT task

I'm using ScalaPB to synthesize Scala classes for converting my data to and from Protobuf representation. By default, the SBT setup hooks into sbt compile to generate the files under the target folder.
Because I expect my .proto files to change very infrequently, I would rather manually invoke the ScalaPB process when they do, and keep the generated files under version control. This is the same approach I use for Slick's code generation functionality.
I can do something like:
lazy val genProto = TaskKey[Unit]("gen-proto", "Generate Scala classes from a proto file")

genProto := {
  val protoSources = ...
  val outputDirectory = ...
  // ? run the same process
}
But I'm not sure how to invoke the process from SBT with custom inputs and outputs.
My latest attempt:
ScalaPbPlugin.runProtoc in ScalaPbPlugin.protobufConfig := (args =>
  com.github.os72.protocjar.Protoc.runProtoc("-v261" +: args.toArray))

lazy val genProto = TaskKey[Unit]("gen-proto", "Generate Scala classes from a proto file")

genProto := {
  val protoSourceDirectory = sourceDirectory.value / "main" / "protobuf"
  val outputDirectory = (scalaSource in Compile).value / outputProtoDirectory
  val schemas = (protoSourceDirectory ** "*.proto").get.map(_.getAbsoluteFile)
  val includeOption = Seq(s"-I$protoSourceDirectory")
  val outputOption = Seq(s"--scala_out=${outputDirectory.absolutePath}")
  val options = schemas.map(_.absolutePath) ++ includeOption ++ outputOption
  (ScalaPbPlugin.runProtoc in ScalaPbPlugin.protobufConfig).value(options)
  (outputDirectory ** "*.scala").get.toSet
}
I get the following error:
> genProto
protoc-jar: protoc version: 261, detected platform: mac os x/x86_64
protoc-jar: executing: [/var/folders/lj/_85rbyf5525d3ktt666yjztr0000gn/T/protoc2879794465962204787.exe, /Users/alan/projects/causality/src/main/protobuf/lotEventStoreModel.proto, -I/Users/alan/projects/causality/src/main/protobuf, --scala_out=/Users/alan/projects/causality/src/main/scala/net/artsy/auction/protobuf]
protoc-gen-scala: program not found or is not executable
--scala_out: protoc-gen-scala: Plugin failed with status code 1.
[success] Total time: 0 s, completed Apr 25, 2016 9:39:09 AM
import sbt._
import Keys._

lazy val genProto = TaskKey[Unit]("gen-proto", "Generate Scala classes from a proto file")

genProto := {
  Seq("/path/to/scalapbc-0.5.24/bin/scalapbc",
      "src/main/protobuf/test.proto",
      "--scala_out=src/main/scala/") !
}
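A brief note on why this works, plus a hedged generalization: scalapbc bundles protoc together with ScalaPB's code generator, so the external protoc-gen-scala plugin that plain protoc failed to find is not needed. A sketch that runs it over every .proto file under the standard directory (it assumes scalapbc is on the PATH and accepts protoc-style arguments; adjust the command and directories to your setup):

lazy val genProto = TaskKey[Unit]("gen-proto", "Generate Scala classes from proto files")

genProto := {
  val protoDir = (sourceDirectory in Compile).value / "protobuf"
  val outDir   = (scalaSource in Compile).value
  val schemas  = (protoDir ** "*.proto").get.map(_.getAbsolutePath)
  // Pass the .proto files, include path, and --scala_out to scalapbc, protoc-style
  val exitCode = (Seq("scalapbc") ++ schemas ++ Seq(s"-I$protoDir", s"--scala_out=$outDir")).!
  if (exitCode != 0) sys.error(s"scalapbc failed with exit code $exitCode")
}

Running genProto from the sbt shell whenever a .proto file changes then regenerates the sources in place, and the results can be committed to version control as intended.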

How to include file in production mode for Play framework

An overview of my environment:
Mac OS X Yosemite, Play Framework 2.3.7, sbt 0.13.7, IntelliJ IDEA 14, Java 1.8.0_25
I tried to run a simple Spark program in the Play framework, so I just created a Play 2 project in IntelliJ and changed some files as follows:
app/Controllers/Application.scala:
package controllers

import play.api._
import play.api.libs.iteratee.Enumerator
import play.api.mvc._

object Application extends Controller {

  def index = Action {
    Ok(views.html.index("Your new application is ready."))
  }

  def trySpark = Action {
    Ok.chunked(Enumerator(utils.TrySpark.runSpark))
  }

}
app/utils/TrySpark.scala:
package utils

import org.apache.spark.{SparkContext, SparkConf}

object TrySpark {
  def runSpark: String = {
    val conf = new SparkConf().setAppName("trySpark").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("public/data/array.txt")
    val array = data.map(line => line.split(' ').map(_.toDouble))
    val sum = array.first().reduce((a, b) => a + b)
    return sum.toString
  }
}
public/data/array.txt:
1 2 3 4 5 6 7
conf/routes:
GET / controllers.Application.index
GET /spark controllers.Application.trySpark
GET /assets/*file controllers.Assets.at(path="/public", file)
build.sbt:
name := "trySpark"
version := "1.0"
lazy val `tryspark` = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(jdbc, anorm, cache, ws,
  "org.apache.spark" % "spark-core_2.10" % "1.2.0")

unmanagedResourceDirectories in Test <+= baseDirectory(_ / "target/web/public/test")
I type activator run to run this app in development mode, then open localhost:9000/spark in the browser, and it shows the result 28 as expected. However, when I type activator start to run this app in production mode, it shows the following error message:
[info] play - Application started (Prod)
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
[error] application -
! #6kik15fee - Internal server error, for (GET) [/spark] ->
play.api.Application$$anon$1: Execution exception[[InvalidInputException: Input path does not exist: file:/Path/to/my/project/target/universal/stage/public/data/array.txt]]
at play.api.Application$class.handleError(Application.scala:296) ~[com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.api.DefaultApplication.handleError(Application.scala:402) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:205) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:202) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [org.scala-lang.scala-library-2.10.4.jar:na]
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Path/to/my/project/target/universal/stage/public/data/array.txt
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.2.0.jar:na]
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.2.0.jar:na]
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
It seems that my array.txt file is not loaded in production mode. How can I solve this problem?
The problem here is that the public directory will not be available in your root project dir when you run in production. It is packaged as a jar (usually in STAGE_DIR/lib/PROJ_NAME-VERSION-assets.jar), so you will not be able to access the files this way.
I can see two solutions here:
1) Place the file in the conf directory. This will work, but seems very dirty, especially if you intend to use more data files;
2) Place those files in some directory and tell sbt to package it as well. You can keep using the public directory, although it seems better to use a different dir, especially if you want to have many more files.
Supposing array.txt is placed in a dir named datafiles in your project root, you can add this to build.sbt:
mappings in Universal ++=
  (baseDirectory.value / "datafiles" * "*" get) map
    (x => x -> ("datafiles/" + x.getName))
Don't forget to change the paths in your app code:
// (...)
val data = sc.textFile("datafiles/array.txt")
Then just do a clean, and when you run start, stage, or dist, those files will be available.

How to get list of dependency jars from an sbt 0.10.0 project

I have an sbt 0.10.0 project that declares a few dependencies, somewhat like:
object MyBuild extends Build {
  val commonDeps = Seq("commons-httpclient" % "commons-httpclient" % "3.1",
                       "commons-lang" % "commons-lang" % "2.6")

  val buildSettings = Defaults.defaultSettings ++ Seq(organization := "org")

  lazy val proj = Project("proj", file("src"),
    settings = buildSettings ++ Seq(
      name := "projname",
      libraryDependencies := commonDeps, ...)
  ...
}
I wish to create a build rule to gather all the jar dependencies of "proj", so that I can symlink them to a single directory.
Thanks.
Example SBT task to print full runtime classpath
Below is roughly what I'm using. The "get-jars" task is executable from the SBT prompt.
import sbt._
import Keys._

object MyBuild extends Build {
  // ...

  val getJars = TaskKey[Unit]("get-jars")

  val getJarsTask = getJars <<= (target, fullClasspath in Runtime) map { (target, cp) =>
    println("Target path is: " + target)
    println("Full classpath is: " + cp.map(_.data).mkString(":"))
  }

  lazy val project = Project(
    "project",
    file("."),
    settings = Defaults.defaultSettings ++ Seq(getJarsTask)
  )
}
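Since the original goal was to symlink the dependency jars into a single directory, the same pair of keys can drive that as well. A sketch in the same Build style (it assumes Java 7+ for the NIO symlink call and uses a lib-symlinks directory under target; add linkJarsTask to the project settings just like getJarsTask):

import java.nio.file.Files

val linkJars = TaskKey[Unit]("link-jars")

val linkJarsTask = linkJars <<= (target, fullClasspath in Runtime) map { (target, cp) =>
  val linkDir = target / "lib-symlinks"
  IO.createDirectory(linkDir)
  // Keep only real jar files, skipping compiled-classes directories on the classpath
  cp.map(_.data).filter(_.getName.endsWith(".jar")).foreach { jar =>
    val link = (linkDir / jar.getName).toPath
    if (!Files.exists(link)) Files.createSymbolicLink(link, jar.toPath)
  }
  println("Symlinked jars into: " + linkDir)
}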
Other resources
Unofficial guide to sbt 0.10.
Keys.scala defines predefined keys. For example, you might want to replace fullClasspath with managedClasspath.
This plugin defines a simple command to generate an .ensime file, and may be a useful reference.
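For example, following that suggestion, swapping fullClasspath for managedClasspath in the task above narrows the listing to the library dependencies only (your own compiled classes directory drops out):

val getManagedJarsTask = getJars <<= (target, managedClasspath in Runtime) map { (target, cp) =>
  println("Managed jars are: " + cp.map(_.data).mkString(":"))
}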