How to access static resources in a jar (corresponding to the src/main/resources folder)?

I have a Spark Streaming application built with Maven (as jar) and deployed with the spark-submit script. The application project layout follows the standard directory layout:
myApp
  src
    main
      scala
        com.mycompany.package
          MyApp.scala
          DoSomething.scala
          ...
      resources
        aPerlScript.pl
        ...
    test
      scala
        com.mycompany.package
          MyAppTest.scala
          ...
  target
    ...
  pom.xml
In the DoSomething.scala object I have a method (let's call it doSomething()) that tries to execute a Perl script, aPerlScript.pl (from the resources folder), using scala.sys.process.Process and passing two arguments to the script: the first is the absolute path to a binary file used as input, the second is the path/name of the produced output file. I then call DoSomething.doSomething().
The issue is that I was not able to access the script: not with absolute paths, not with relative paths, not with getClass.getClassLoader.getResource or getClass.getResource, even though I specified the resources folder in my pom.xml. None of my attempts succeeded; I don't know how to find the stuff I put in src/main/resources.
I will appreciate any help.
SIDE NOTES:
I use an external Process instead of a Spark pipe because, at this step of my workflow, I must handle binary files as input and output.
I'm using Spark Streaming 1.1.0, Scala 2.10.4 and Java 7. I build the jar with "Maven install" from within Eclipse (Kepler).
When I use the "standard" getClass.getClassLoader.getResource method to access resources, I find that the actual classpath is the spark-submit script's one.

There are a few solutions. The simplest is to use Scala's process infrastructure:
import scala.sys.process._

object RunScript {
  val arg = "some argument"
  val stream = RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl")
  val ret: Int = (s"/usr/bin/perl - $arg" #< stream).!
}
In this case, ret is the return code for the process and any output from the process is directed to stdout.
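If you need to capture the script's output instead of letting it go to stdout, here is a minimal sketch using scala.sys.process.ProcessLogger (reusing arg and the resource name from the snippet above):

import scala.sys.process._

object RunScriptCaptured {
  val arg = "some argument"
  val out = new StringBuilder
  val err = new StringBuilder
  // route the process's stdout and stderr lines into the builders
  val logger = ProcessLogger(line => out.append(line).append('\n'),
                             line => err.append(line).append('\n'))
  val stream = RunScriptCaptured.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl")
  val ret: Int = (s"/usr/bin/perl - $arg" #< stream).!(logger)
}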
A second (longer) solution is to copy the file aPerlScript.pl from the jar file to some temporary location and execute it from there. This code snippet should have most of what you need.
import java.io.{File, FileOutputStream}
import java.nio.channels.Channels

object RunScript {
  // Set up the copy destination from the Java temporary directory. This is /tmp on Linux.
  val destDir = System.getProperty("java.io.tmpdir") + "/"
  // Get a stream to the script in the resources dir
  val source = Channels.newChannel(RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl"))
  val fileOut = new File(destDir, "aPerlScript.pl")
  val dest = new FileOutputStream(fileOut)
  // Copy the file to the temporary directory
  dest.getChannel.transferFrom(source, 0, Long.MaxValue)
  source.close()
  dest.close()

  // Schedule the file for deletion when the JVM quits
  sys.addShutdownHook {
    new File(destDir, "aPerlScript.pl").delete
  }
  // Now you can execute the script from destDir.
}
This approach allows you to bundle native libraries in JAR files. Copying them out allows the libraries to be loaded at runtime for whatever JNI mischief you have planned.
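For the JNI case, a hedged sketch (assuming a hypothetical native library libfoo.so bundled under src/main/resources and copied out with the same pattern as above):

import java.io.File

// hypothetical library name; copy it out of the jar first, as in RunScript above
val libFile = new File(System.getProperty("java.io.tmpdir"), "libfoo.so")
// System.load (unlike System.loadLibrary) takes an absolute filesystem path,
// which is why the library must live outside the jar at load time
System.load(libFile.getAbsolutePath)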

Related

Merging configurations for spark using typesafe library and extraJavaOptions

I'm trying to merge 2 config files (or create a config file based on a single reference file) using:

lazy val finalConfig: Config =
  Option(System.getProperty("user.resource"))
    .map(ConfigFactory.load)
    .map(_.withFallback(ConfigFactory.load(System.getProperty("config.resource"))).resolve())
    .getOrElse(ConfigFactory.load(System.getProperty("config.resource")))
I'm defining my Java properties in Spark with spark-submit ....... --conf spark.driver.extraJavaOptions=-Dconfig.resource=./reference.conf,-Duser.resource=./user.conf ...
My goal is to be able to point at a file that is not inside my jar, to be used by System.getProperty("..") in my code. I changed the folder for testing (cd ..) and keep getting the same error, so I guess Spark doesn't care about my Java arguments..?
Is there a way to point to a file (or even 2 files in my case) so that they can be merged?
I also tried to include the reference.conf file but not the user.conf file: it recognizes reference.conf but not the user.conf that I gave with --conf spark.driver.extraJavaOptions=-Duser.resource=./user.conf.
Is there a way to do that? Thanks if you can help.
I don't see you doing ConfigFactory.parseFile to load a file containing properties.
Typesafe Config automatically reads any .properties file on the classpath and all -D parameters passed to the JVM, and then merges them.
I am reading an external property file which is not part of the jar as follows. The file "application.conf" is placed in the same directory where the jar is kept.
import java.io.File
import scala.util.Try
import com.typesafe.config.ConfigFactory

val applicationRootPath = System.getProperty("user.dir")
val config = Try {
  ConfigFactory.parseFile(new File(applicationRootPath + "/" + "application.conf"))
}.getOrElse(ConfigFactory.empty())
val appConfig = config.withFallback(ConfigFactory.load()).resolve
ConfigFactory.load() already contains all the properties present in the .properties files on the classpath and the -D parameters. I am giving priority to my external "application.conf" and falling back on default values: for matching keys, "application.conf" takes precedence over the other sources.
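A minimal sketch of the withFallback precedence, with hypothetical keys:

import com.typesafe.config.ConfigFactory

// the left-hand config wins wherever the same key appears in both
val external = ConfigFactory.parseString("db.host = prod-host")
val defaults = ConfigFactory.parseString("db.host = localhost, db.port = 5432")
val merged = external.withFallback(defaults).resolve()
merged.getString("db.host") // "prod-host" -- the external config wins
merged.getInt("db.port")    // 5432 -- falls back to the defaults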

Read input file from jar while running application from spark-submit

I have an input file that is custom delimited and is passed to newAPIHadoopFile to convert as RDD[String]. The file resides under the project resource directory. The following code works well when run from the Eclipse IDE.
val path = this.getClass()
  .getClassLoader()
  .getResource(fileName)
  .toURI().toString()
val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)
return sc.newAPIHadoopFile(
    path,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text],
    conf)
  .map(_._2.toString)
However, when I run it via spark-submit (with an uber jar) as follows
spark-submit /Users/anon/Documents/myUber.jar
I get the below error.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/Users/anon/Documents/myUber.jar!/myhome-data.json
Any inputs please?
If the file is for sc.newAPIHadoopFile, which requires a path rather than an input stream, I'd recommend using the --files option of spark-submit.
--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
See SparkFiles.get method:
Get the absolute path of a file added through SparkContext.addFile().
With that, you should use spark-submit as follows:
spark-submit --files fileNameHere /Users/anon/Documents/myUber.jar
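Inside the application, the shipped file can then be resolved with SparkFiles.get (a sketch reusing sc and recordDelimiter from the snippet above; fileNameHere stands in for your actual file name):

import org.apache.spark.SparkFiles

// resolves to the absolute local path of the file shipped with --files
val path = SparkFiles.get("fileNameHere")
val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)
val rdd = sc.newAPIHadoopFile(
    path,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text],
    conf)
  .map(_._2.toString)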
In a general case, if a file is inside a jar file, you should use InputStream to access the file (not as a File directly).
The code could look as follows:
val content = scala.io.Source.fromInputStream(
  classOf[yourObject].getClassLoader.getResourceAsStream(yourFileNameHere))
See Scala's Source object and Java's ClassLoader.getResourceAsStream method.
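For example, pulling the whole resource into a String (using the myhome-data.json name from the error above; mkString materializes the stream):

val content = scala.io.Source.fromInputStream(
  getClass.getClassLoader.getResourceAsStream("myhome-data.json")).mkString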

How to do File creation and manipulation in functional style?

I need to write a program where I run a set of instructions and create a file in a directory. Once the file is created, running the same code block again should not repeat those instructions, since they have already been executed; here the file is used as a guard.
var Directory: String = "Dir1"
var dir: File = new File(Directory)
dir.mkdir()
var FileName: String = Directory + File.separator + "samplefile" + ".log"
val FileObj: File = new File(FileName)
if (!FileObj.exists())
  // blahblah
else {
  // set of instructions to create the file
}
When the program runs initially the file won't be present, so it should run the instructions in the else branch and also create the file; on the second run it should exit, since the file exists.
The problem is that I do not understand new File and when the file is actually created. Should I use file.createNewFile? Also, how do I write this in a functional style, using case?
It's important to understand that a java.io.File is not a physical file on the file system, but a representation of a pathname -- per the javadoc: "An abstract representation of file and directory pathnames". So new File(...) has nothing to do with creating an actual file - you are just defining a pathname, which may or may not correspond to an existing file.
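A quick illustration of that point:

val f = new java.io.File("no/such/file.txt")
f.exists() // false -- constructing a File touches nothing on disk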
To create an empty file, you can use:
val file = new File("filepath/filename")
file.createNewFile();
If running on JRE 7 or higher, you can use the new java.nio.file API:
import java.nio.file.{Files, Paths}

val path = Paths.get("filepath/filename")
Files.createFile(path)
If you're not happy with the default IO APIs, you can consider a number of alternatives. Scala-specific ones that I know of are:
scala-io
rapture.io
Or you can use libraries from the Java world, such as Google Guava or Apache Commons IO.
Edit: one thing I did not consider initially: I understood "creating a file" as "creating an empty file", but if you intend to write something to the file immediately, you generally don't need to create an empty file first.
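For instance, with java.nio.file the write itself creates the file (a sketch with a hypothetical path and content):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// creates the file if it doesn't exist, then writes the bytes
Files.write(Paths.get("filepath/filename"), "some content".getBytes(StandardCharsets.UTF_8))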

Can't load a file in Play (always not found)

I can't load a file in Play:
val filePath1 = "/views/layouts/mylayout.scala.html"
val filePath2 = "views/layouts/mylayout.scala.html"
Play.getExistingFile(filePath1)
Play.getExistingFile(filePath2)
Play.resourceAsStream(filePath1)
Play.resourceAsStream(filePath2)
None of these works; they all return None.
You are essentially trying to read a source file at runtime, which is not something you should usually do. If you want to read a file at runtime, I'd recommend putting it somewhere that will end up on the classpath and then using Play.resourceAsStream to read it. The files in the conf directory and non-compiled files in the app dir should end up on the classpath.
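A sketch of that approach (assuming Play 2.x and a hypothetical data/mylayout.html placed under the conf directory):

import play.api.Play
import play.api.Play.current

// conf/ ends up on the classpath, so the file is visible to resourceAsStream
val content: Option[String] =
  Play.resourceAsStream("data/mylayout.html")
    .map(scala.io.Source.fromInputStream(_).mkString)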

SBT task to extract the jar created by package-src somewhere

How would I create a task that depends on another task that outputs a jar file (e.g. package-src), and then extracts the resulting jar somewhere?
Note: I'm not interested in the library/method used to perform the extraction, only how I would define a task that invokes such a library or method.
The relevant documentation pages are Tasks and TaskInputs. For unzipping, you can use sbt.IO.unzip(...).
First, we need to define the task's key (in a .scala build definition). The task is going to return the set of unzipped files.
val unzipPackage = TaskKey[Set[File]]("unzip-package", "unzips the JAR generated by package-src")
Then, we add a setting like this one:
unzipPackage <<= (packageSrc, target in unzipPackage, streams) map { (zipFile, dir, out) =>
  IO createDirectory dir
  val unzippedFiles = IO unzip (zipFile, dir, AllPassFilter)
  out.log.info("Unzipped %d files of %s to %s" format (unzippedFiles.size, zipFile, dir))
  unzippedFiles
}
This lets us define the output directory as a setting, too:
target in unzipPackage <<= target / "unzippedPackage"
Hope this helps.