sbt: finding correct path to files/folders under resources directory - scala

I've a simple project structure:
WordCount
|
|------------ project
|----------------|---assembly.sbt
|
|------------ resources
|------------------|------ Message.txt
|
|------------ src
|--------------|---main
|--------------------|---scala
|--------------------------|---org
|-------------------------------|---apache
|----------------------------------------|---spark
|----------------------------------------------|---Counter.scala
|
|------------ build.sbt
here's how Counter.scala looks:
package org.apache.spark

object Counter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    val path: String = getClass.getClassLoader.getResource("Message.txt").getPath
    println(s"path = $path")
    // val lines = sc.textFile(path)
    // val wordsCount = lines
    //   .flatMap(line => line.split("\\s", 2))
    //   .map(word => (word, 1))
    //   .reduceByKey(_ + _)
    //
    // wordsCount.foreach(println)
  }
}
Notice that the commented lines are actually correct; it's the path variable that is not. After building the fat JAR with sbt assembly and running it with spark-submit to see the value of path, I get:
path = file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt
You can see that path points at the JAR location and is, mysteriously, followed by !/ and then the file name Message.txt!
On the other hand, when I'm inside the WordCount folder and I start the REPL with sbt console and then write
scala> getClass.getClassLoader.getResource("Message.txt").getPath
I get the correct path (without the file:/ prefix):
res1: String = /home/me/WordCount/target/scala-2.11/classes/Message.txt
Questions:
1 - Why are there two different outputs from the same command (i.e. getClass.getClassLoader.getResource("...").getPath)?
2 - How can I use the correct path, the one that appears in the console, inside my source file Counter.scala?
For anyone who wants to try it, here's my build.sbt:
name := "Counter"
version := "0.1"
scalaVersion := "2.11.8"
resourceDirectory in Compile := baseDirectory.value / "resources"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at "http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at "https://mvnrepository.com/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
and the spark-submit command is:
spark-submit --master local --deploy-mode client --class org.apache.spark.Counter /home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar

1 - Why are there two different outputs from the same command?
By "command", I am assuming you mean getClass.getClassLoader.getResource("Message.txt").getPath, so I would rephrase the question as: why does the same call to the classloader's getResource(...) return two different results depending on whether it runs under sbt console or spark-submit?
The answer is that they use different classloaders, each with a different classpath. sbt console uses your project directories as the classpath, while spark-submit uses the fat JAR, which includes your resources. When a resource is found inside a JAR, the classloader returns a JAR URL, which looks like jar:file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt.
The whole point of using Apache Spark is to distribute some work across multiple computers, so I don't think you want to see your machine's local path in production.
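As for question 2: if you only need the file's contents rather than a filesystem path, reading the resource as a stream works both from the classes directory and from inside the assembled JAR. A minimal sketch along those lines, reusing sc from the question's code and assuming Message.txt is small enough to read on the driver:

import scala.io.Source

// Open the bundled resource as a stream; this works whether the resource
// lives under target/scala-2.11/classes or inside the fat JAR.
val stream = getClass.getClassLoader.getResourceAsStream("Message.txt")
val lines = Source.fromInputStream(stream).getLines().toVector

// Hand the lines to Spark instead of pointing sc.textFile at a path.
val wordsCount = sc.parallelize(lines)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordsCount.foreach(println)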

Related

Building jars properly with sbt

I have a map reduce .scala file like this:
import org.apache.spark._

object WordCount {
  def main(args: Array[String]) {
    val inputDir = args(0)
    //val inputDir = "/Users/eksi/Desktop/sherlock.txt"
    val outputDir = args(1)
    //val outputDir = "/Users/eksi/Desktop/out.txt"
    val cnf = new SparkConf().setAppName("Example MapReduce Spark Job")
    val sc = new SparkContext(cnf)
    val textFile = sc.textFile(inputDir)
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(outputDir)
    sc.stop()
  }
}
When I run my code with the setMaster("local[1]") parameter, it works fine.
I want to put this code in a .jar and upload it to S3 to run on AWS EMR, so I use the following build.sbt to do so.
name := "word-count"
version := "0.0.1"
scalaVersion := "2.11.7"
// additional libraries
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.0.2"
)
It generates a JAR file; however, none of my Scala code is in there. All I see when I extract the .jar is a manifest file.
When I run sbt package this is what I get:
[myMacBook-Pro] > sbt package
[info] Loading project definition from /Users/lele/bigdata/wordcount/project
[info] Set current project to word-count (in build file:/Users/lele/bigdata/wordcount/)
[info] Packaging /Users/lele/bigdata/wordcount/target/scala-2.11/word-count_2.11-0.0.1.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Jul 27, 2016 10:33:26 PM
What should I do to create a proper jar file that works like
WordCount.jar WordCount
Ref: "It generates a jar file, however none of my scala code is in there. What I see is just a manifest file when I extract the .jar"
Make sure your WordCount.scala is in the root or in src/main/scala
From http://www.scala-sbt.org/1.0/docs/Directories.html
Source code can be placed in the project’s base directory as with hello/hw.scala. However, most people don’t do this for real projects; too much clutter.
sbt uses the same directory structure as Maven for source files by default (all paths are relative to the base directory):
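For reference, the layout that the quoted page goes on to list looks like this (only the parts relevant here):

src/
  main/
    resources/   <- files to include in the main jar
    scala/       <- main Scala sources
    java/        <- main Java sources
  test/
    resources/   <- files to include in the test jar
    scala/       <- test Scala sources
    java/        <- test Java sources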

Using Specs2 in a Typesafe activator play application

I have used specs2 successfully many times in vanilla sbt projects. Now I am starting to learn the Typesafe Activator platform.
I did the following steps:
activator new Shop just-play-scala
This is my build.sbt file:
name := """Shop"""
version := "1.0-SNAPSHOT"
// Read here for optional jars and dependencies
libraryDependencies ++= Seq("org.specs2" %% "specs2-core" % "3.6.1" % "test")
resolvers += "scalaz-bintray" at "http://dl.bintray.com/scalaz/releases"
scalacOptions in Test ++= Seq("-Yrangepos")
lazy val root = project.in(file(".")).enablePlugins(PlayScala)
I created a file Shop/app/test/models/ShopSpec.scala
import org.specs2.mutable.Specification

class ShopSpec extends Specification {

  def foo = s2"""
    | This is a specification to check the 'Hello world' string
    | The 'Hello world' string should
    |   contain 11 characters $e1
    |   start with 'Hello' $e2
    |   end with 'world' $e3
    | """.stripMargin

  def e1 = "Hello world" must haveSize(11)
  def e2 = "Hello world" must startWith("Hello")
  def e3 = "Hello world" must endWith("world")
}
When I run activator test I get an error
[success] Total time: 0 s, completed Jun 24, 2015 12:21:32 AM
Mohitas-MBP:Shop abhi$ activator test
[info] Loading project definition from /Users/abhi/ScalaProjects/Shop/project
[info] Set current project to Shop (in build file:/Users/abhi/ScalaProjects/Shop/)
cannot create a JUnit XML printer. Please check that specs2-junit.jar is on the classpath
org.specs2.reporter.JUnitXmlPrinter$
java.net.URLClassLoader.findClass(URLClassLoader.java:381)
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.jav
I have previously written specs2 test cases successfully in plain sbt projects; it is only when I use Typesafe Activator that I get this issue with test cases.
I even changed the code of my test to something as simple as
import org.specs2.mutable.Specification

class ShopSpec extends Specification {
  "A shop " should {
    "create item" in {
      failure
    }
  }
}
But still the same problem.
Wait... I think I resolved it.
The Activator Play platform already includes specs2, so there is no need for me to tweak the build.sbt file for specs2.
So I removed everything I had added to the build.sbt file and left it as:
name := """Shop"""
version := "1.0-SNAPSHOT"
lazy val root = project.in(file(".")).enablePlugins(PlayScala)
Now it works fine. So basically, I don't need to add anything to an Activator project for specs2.
I could have deleted the question... but leaving it here so that it can be of help to someone.
What worked for me was adding the following to build.sbt:
libraryDependencies ++= Seq("org.specs2" %% "specs2-core" % "3.6.2" % "test",
"org.specs2" %% "specs2-junit" % "3.6.2" % "test")

How to include file in production mode for Play framework

An overview of my environment:
Mac OS Yosemite, Play framework 2.3.7, sbt 0.13.7, Intellij Idea 14, java 1.8.0_25
I tried to run a simple Spark program in the Play framework, so I just created a Play 2 project in IntelliJ and changed some files as follows:
app/controllers/Application.scala:
package controllers

import play.api._
import play.api.libs.iteratee.Enumerator
import play.api.mvc._

object Application extends Controller {

  def index = Action {
    Ok(views.html.index("Your new application is ready."))
  }

  def trySpark = Action {
    Ok.chunked(Enumerator(utils.TrySpark.runSpark))
  }

}
app/utils/TrySpark.scala:
package utils

import org.apache.spark.{SparkContext, SparkConf}

object TrySpark {
  def runSpark: String = {
    val conf = new SparkConf().setAppName("trySpark").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("public/data/array.txt")
    val array = data.map(line => line.split(' ').map(_.toDouble))
    val sum = array.first().reduce((a, b) => a + b)
    sum.toString
  }
}
public/data/array.txt:
1 2 3 4 5 6 7
conf/routes:
GET / controllers.Application.index
GET /spark controllers.Application.trySpark
GET /assets/*file controllers.Assets.at(path="/public", file)
build.sbt:
name := "trySpark"
version := "1.0"
lazy val `tryspark` = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.10.4"
libraryDependencies ++= Seq( jdbc , anorm , cache , ws,
"org.apache.spark" % "spark-core_2.10" % "1.2.0")
unmanagedResourceDirectories in Test <+= baseDirectory ( _ /"target/web/public/test" )
I type activator run to run this app in development mode and then open localhost:9000/spark in the browser; it shows the result 28 as expected. However, when I type activator start to run this app in production mode, it shows the following error message:
[info] play - Application started (Prod)
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
[error] application -
! #6kik15fee - Internal server error, for (GET) [/spark] ->
play.api.Application$$anon$1: Execution exception[[InvalidInputException: Input path does not exist: file:/Path/to/my/project/target/universal/stage/public/data/array.txt]]
at play.api.Application$class.handleError(Application.scala:296) ~[com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.api.DefaultApplication.handleError(Application.scala:402) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:205) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:202) [com.typesafe.play.play_2.10-2.3.7.jar:2.3.7]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [org.scala-lang.scala-library-2.10.4.jar:na]
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Path/to/my/project/target/universal/stage/public/data/array.txt
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.2.0.jar:na]
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.2.0.jar:na]
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) ~[org.apache.spark.spark-core_2.10-1.2.0.jar:1.2.0]
It seems that my array.txt file is not found in production mode. How can I solve this problem?
The problem here is that the public directory will not be available in your root project dir when you run in production. It is packaged as a jar (usually in STAGE_DIR/lib/PROJ_NAME-VERSION-assets.jar), so you will not be able to access the files this way.
I can see two solutions here:
1) Place the file in the conf directory. This will work, but seems very dirty, especially if you intend to use more data files;
2) Place those files in some directory and tell sbt to package them as well, as shown below. You can keep using the public directory, although it seems better to use a different dir, especially if you want to have many more files.
Supposing array.txt is placed in a dir named datafiles in your project root, you can add this to build.sbt:
mappings in Universal ++=
  (baseDirectory.value / "datafiles" * "*" get) map
    (x => x -> ("datafiles/" + x.getName))
Don't forget to change the paths in your app code:
// (...)
val data = sc.textFile("datafiles/array.txt")
Then just run clean, and when you run start, stage, or dist those files will be available.
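For instance, after running stage with the mapping above, the data file should end up next to the start script rather than inside the assets jar, so the relative path resolves from the app's working directory. A sketch of the expected layout, assuming the default Universal packaging:

target/universal/stage/
  bin/          <- start scripts
  lib/          <- project and dependency jars, including the assets jar
  datafiles/
    array.txt   <- readable as "datafiles/array.txt" relative to the working directory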

sbt test:doc Could not find any member to link

I'm attempting to run sbt test:doc and I'm seeing a number of warnings similar to the one below:
[warn] /Users/tleese/code/my/stuff/src/test/scala/com/my/stuff/common/tests/util/NumberExtractorsSpecs.scala:9: Could not find any member to link for "com.my.stuff.common.util.IntExtractor".
The problem appears to be that Scaladoc references from test sources to main sources are not able to link correctly. Any idea what I might be doing wrong or need to configure?
Below are the relevant sections of my Build.scala:
val docScalacOptions = Seq("-groups", "-implicits", "-external-urls:[urls]")
scalacOptions in (Compile, doc) ++= docScalacOptions
scalacOptions in (Test, doc) ++= docScalacOptions
autoAPIMappings := true
Not sure if this is a satisfactory solution, but...
Scaladoc currently expects pairs of a JAR and a URL to get external linking to work. You can force sbt to link internal dependencies as JARs by setting exportJars. Compare the value of
$ show test:fullClasspath
before and after setting exportJars. Next, grab the name of the JAR that's being used and link it to the URL you'll be uploading it to.
scalaVersion := "2.11.0"
autoAPIMappings := true
exportJars := true
scalacOptions in (Test, doc) ++= Opts.doc.externalAPI((
  file(s"${(packageBin in Compile).value}") -> url("http://example.com/")) :: Nil)
Now test:doc generates a Scaladoc with links to http://example.com/index.html#foo.IntExtractor from my foo.IntExtractor.
Using ideas from Eugene's answer, I made the following snippet.
It uses the apiMappings sbt setting, as advised in the sbt manual.
Unfortunately the manual doesn't say how to deal with managed dependencies, even though the subsection title suggests it does.
// External documentation

/* You can print the computed classpath with `show compile:fullClasspath`.
 * From that list you can check the jar name (which is not so obvious with play dependencies etc).
 */
val documentationSettings = Seq(
  autoAPIMappings := true,
  apiMappings ++= {
    // Look up the path to the jar (it's probably somewhere under ~/.ivy2/cache) from the computed classpath
    val classpath = (fullClasspath in Compile).value
    def findJar(name: String): File = {
      val regex = ("/" + name + "[^/]*.jar$").r
      classpath.find { jar => regex.findFirstIn(jar.data.toString).nonEmpty }.get.data // fail hard if not found
    }
    // Define external documentation paths
    Map(
      findJar("scala-library") -> url("http://scala-lang.org/api/" + currentScalaVersion + "/"),
      findJar("play-json") -> url("https://playframework.com/documentation/2.3.x/api/scala/index.html")
    )
  }
)
This is a modification of the answer by @phadej. Unfortunately, that answer only works on Unix/Linux because it assumes that the path separator is /. On Windows, the path separator is \.
The following works on all platforms, and is slightly more idiomatic IMHO:
/* You can print the classpath with `show compile:fullClasspath` in the sbt REPL.
 * From that list you can find the name of the jar for the managed dependency.
 */
lazy val documentationSettings = Seq(
  autoAPIMappings := true,
  apiMappings ++= {
    // Look up the path to the jar from the classpath
    val classpath = (fullClasspath in Compile).value
    def findJar(nameBeginsWith: String): File = {
      classpath.find { attributed: Attributed[java.io.File] =>
        (attributed.data ** s"$nameBeginsWith*.jar").get.nonEmpty
      }.get.data // fail hard if not found
    }
    // Define external documentation paths
    Map(
      findJar("scala-library") -> url("http://scala-lang.org/api/" + currentScalaVersion + "/"),
      findJar("play-json") -> url("https://playframework.com/documentation/2.3.x/api/scala/index.html")
    )
  }
)

SBT - get path to managed jars

I want to use some dependencies to perform code generation in Scala.
Example:
libraryDependencies += "org.jooq" % "jooq" % "2.4.0"
val jooqTask = jooq := {
val classpath = "jooq-2.4.0.jar;jooq-meta-2.4.0.jar;jooq-codegen-2.4.0.jar;."
val main = "org.jooq.util.GenerationTool"
"java -classpath %s %s /project/jooq-configuration.xml".format(classpath, main) !
}
However, I want to get the classpath of the dependencies, so I can actually run the Java process.
You can grab the classpath of your compile dependencies like this:
val jooqTask = jooq <<= managedClasspath in Compile map { cp =>
  val classpath = Path.makeString(cp.files)
  val main = "org.jooq.util.GenerationTool"
  "java -classpath %s %s /project/jooq-configuration.xml".format(classpath, main) !
}
Note that the classpath does not include "." (aka current directory), though.
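On newer sbt versions that have dropped the <<= syntax, roughly the same thing can be written with .value. A sketch under that assumption (the jooq task key itself is assumed to be defined elsewhere in the build):

import scala.sys.process._

jooq := {
  // Collect the resolved dependency jars and join them with the platform's path separator
  val cp = (managedClasspath in Compile).value
  val classpath = cp.map(_.data.getAbsolutePath).mkString(java.io.File.pathSeparator)
  val main = "org.jooq.util.GenerationTool"
  s"java -classpath $classpath $main /project/jooq-configuration.xml".!
}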