Read input file from jar while running application from spark-submit - scala

I have a custom-delimited input file that is passed to newAPIHadoopFile to be converted into an RDD[String]. The file resides under the project's resources directory. The following code works well when run from the Eclipse IDE.
val path = this.getClass()
  .getClassLoader()
  .getResource(fileName)
  .toURI().toString()
val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)
return sc.newAPIHadoopFile(
    path,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text],
    conf)
  .map(_._2.toString)
However, when I run it via spark-submit (with an uber jar) as follows
spark-submit /Users/anon/Documents/myUber.jar
I get the below error.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/Users/anon/Documents/myUber.jar!/myhome-data.json
Any inputs please?

Since the file is for sc.newAPIHadoopFile, which requires a path rather than an input stream, I'd recommend using the --files option of spark-submit.
--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
See SparkFiles.get method:
Get the absolute path of a file added through SparkContext.addFile().
With that, you should use spark-submit as follows:
spark-submit --files fileNameHere /Users/anon/Documents/myUber.jar
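Putting it together, a minimal sketch (assuming sc and recordDelimiter are defined as in the question, and using myhome-data.json, the file name from the error above, as the shipped file):
import org.apache.spark.SparkFiles

// SparkFiles.get resolves the absolute local path of a file shipped with --files.
val path = "file://" + SparkFiles.get("myhome-data.json")
val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)
val rdd = sc.newAPIHadoopFile(
    path,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text],
    conf)
  .map(_._2.toString)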
In the general case, if a file is inside a jar, you should access it as an InputStream (not as a File directly).
The code could look as follows:
val content = scala.io.Source.fromInputStream(
  classOf[YourClass].getClassLoader.getResourceAsStream(yourFileNameHere)
).mkString
See Scala's Source object and Java's ClassLoader.getResourceAsStream method.

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How to access and read local file data in Spark executing in Yarn Cluster Mode.
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read csv:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above sample spark-submit fails with a local file-not-found error (/home/test_dir/test_file.csv).
Spark checks for the file on hdfs:// by default, but my file is local; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using file:// prefix will pull files from the YARN nodemanager filesystem, not the system from where you submitted the code.
To access your --files use csv("#test_file.csv")
should not be copied into hdfs
Using --files will copy the files into a temporary location that's mounted by the YARN executor and you can see them from the YARN UI
Below solution worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed with spark-submit:
import scala.io.Source

val lines = Source.fromFile("test_file.csv").getLines.mkString("\n")
Instead of specifying the complete path, specify only the name of the file we want to read. Since Spark already copies the file to each node's working directory, we can access the file's data with just the file name.
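Alternatively, a minimal sketch that resolves the shipped file's absolute path with SparkFiles.get (assuming the same --files flag as above):
import org.apache.spark.SparkFiles
import scala.io.Source

// Resolve the absolute local path of the file shipped with --files.
val path = SparkFiles.get("test_file.csv")
val lines = Source.fromFile(path).getLines().toList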

Merging configurations for spark using typesafe library and extraJavaOptions

I'm trying to merge two config files (or create a config file based on a single reference file) using
lazy val finalConfig =
  Option(System.getProperty("user.resource"))
    .map(ConfigFactory.load)
    .map(_.withFallback(ConfigFactory.load(System.getProperty("config.resource"))).resolve())
    .getOrElse(ConfigFactory.load(System.getProperty("config.resource")))
I'm defining my Java properties in Spark using spark-submit ....... --conf spark.driver.extraJavaOptions=-Dconfig.resource=./reference.conf,-Duser.resource=./user.conf ...
My goal is to be able to point to a file that is not inside my jar so it can be read via System.getProperty("..") in my code. I changed the folder for testing (cd ..) and kept getting the same error, so I guess Spark doesn't care about my Java arguments..?
Is there a way to point to a file (or even two files, in my case) so that they can be merged?
I also tried to include the reference.conf file but not the user.conf file: it recognizes the reference.conf but not the user.conf that I gave with --conf spark.driver.extraJavaOptions=-Duser.resource=./user.conf .
Is there a way to do that? Thanks if you can help.
I don't see you calling ConfigFactory.parseFile to load a file containing properties.
Typesafe Config automatically reads any .properties file on the class path and all -D parameters passed to the JVM, and then merges them.
I am reading an external property file which is not part of the jar as following. The file "application.conf" is placed on the same directory where the jar is kept.
import java.io.File
import scala.util.Try
import com.typesafe.config.ConfigFactory

val applicationRootPath = System.getProperty("user.dir")
val config = Try {
  ConfigFactory.parseFile(new File(applicationRootPath + "/" + "application.conf"))
}.getOrElse(ConfigFactory.empty())
val appConfig = config.withFallback(ConfigFactory.load()).resolve
ConfigFactory.load() already contains all the properties present in the properties files on the class path and the -D parameters. I am giving priority to my external application.conf and falling back on default values. For matching keys, application.conf takes precedence over the other sources.
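To illustrate that precedence (the key app.timeout and its values are hypothetical):
// Suppose the bundled reference.conf inside the jar contains app.timeout = 30,
// while the external application.conf next to the jar contains app.timeout = 60.
// Because the external file is primary and ConfigFactory.load() is only the fallback:
val timeout = appConfig.getInt("app.timeout") // yields 60; the external value wins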

Spark Shell unable to read file at valid path

I am trying to read a file in the Spark shell that comes with the CentOS distribution of Cloudera on my local machine. The following are the commands I have entered in the Spark shell.
spark-shell
val fileData = sc.textFile("hdfs://user/home/cloudera/cm_api.py");
fileData.count
I also tried this statement for reading the file:
val fileData = sc.textFile("user/home/cloudera/cm_api.py");
However I am getting
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py
I haven't changed any settings or configurations. What am I doing wrong?
You are missing the leading slash in your url, so the path is relative. To make it absolute, use
val fileData = sc.textFile("hdfs:///user/home/cloudera/cm_api.py")
or
val fileData = sc.textFile("/user/home/cloudera/cm_api.py")
I think you need to put the file in HDFS first (hadoop fs -put), then check the file (hadoop fs -ls), then start spark-shell and run val fileData = sc.textFile("cm_api.py").
In "hdfs://user/home/cloudera/cm_api.py", you are missing the hostname of the URI. You should have pass something like "hdfs://<host>:<port>/user/home/cloudera/cm_api.py", where <host> is Hadoop NameNode host and the <port> is, well, port number of Hadoop NameNode, which is 50070 by default.
The error message says hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py does not exist. The path looks suspicious! The file you mean is probably at hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py.
If it is, you can access it by using that full path. Or, if the default file system is configured as hdfs://quickstart.cloudera:8020/user/cloudera/, you can use simply cm_api.py.
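To check which default file system relative paths are resolved against, a quick sketch from the spark-shell:
// Print the default file system from Spark's Hadoop configuration,
// e.g. hdfs://quickstart.cloudera:8020
println(sc.hadoopConfiguration.get("fs.defaultFS"))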
You may be confused between HDFS file paths and local file paths. By specifying
hdfs://quickstart.cloudera:8020/user/home/cloudera/cm_api.py
you are saying two things:
1) there is a computer by the name "quickstart.cloudera" reachable via the network (try ping to ensure that is the case), and it is running HDFS.
2) the HDFS file system contains a file at /user/home/cloudera/cm_api.py (try hdfs dfs -ls /user/home/cloudera/ to verify this).
If you are trying to access a file on the local file system you have to use a different URI:
file:///user/home/cloudera/cm_api.py
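A minimal sketch contrasting the two schemes, using the paths from this question:
// Fully qualified HDFS path: the authority names the NameNode.
val fromHdfs = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py")
// Local file system path; on a multi-node cluster the file must exist
// at this path on every worker that runs a task.
val fromLocal = sc.textFile("file:///user/home/cloudera/cm_api.py")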

How to access static resources in jar (that correspond to src/main/resources folder)?

I have a Spark Streaming application built with Maven (as jar) and deployed with the spark-submit script. The application project layout follows the standard directory layout:
myApp
  src
    main
      scala
        com.mycompany.package
          MyApp.scala
          DoSomething.scala
          ...
      resources
        aPerlScript.pl
        ...
    test
      scala
        com.mycompany.package
          MyAppTest.scala
          ...
  target
    ...
  pom.xml
In the DoSomething.scala object I have a method (let's call it doSomething()) that tries to execute a Perl script -- aPerlScript.pl (from the resources folder) -- using scala.sys.process.Process, passing two arguments to the script (the first one is the absolute path to a binary file used as input, the second one is the path/name of the produced output file). I then call DoSomething.doSomething().
The issue is that I was not able to access the script: not with absolute paths, not with relative paths, not with getClass.getClassLoader.getResource or getClass.getResource, even though I have specified the resources folder in my pom.xml. None of my attempts succeeded. I don't know how to find the stuff I put in src/main/resources.
I will appreciate any help.
SIDE NOTES:
I use an external Process instead of a Spark pipe because, at this step of my workflow, I must handle binary files as input and output.
I'm using Spark-streaming 1.1.0, Scala 2.10.4 and Java 7. I build the jar with "Maven install" from within Eclipse (Kepler)
When I use the getClass.getClassLoader.getResource "standard" method to access resources I find that the actual classpath is the spark-submit script's one.
There are a few solutions. The simplest is to use Scala's process infrastructure:
import scala.sys.process._

object RunScript {
  val arg = "some argument"
  val stream = RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl")
  val ret: Int = (s"/usr/bin/perl - $arg" #< stream).!
}
In this case, ret is the return code for the process and any output from the process is directed to stdout.
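The - argument tells perl to read the program itself from standard input, which is exactly what the #< operator supplies: it wires the resource stream into the process's stdin, so the script never has to exist on disk.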
A second (longer) solution is to copy the file aPerlScript.pl from the jar file to some temporary location and execute it from there. This code snippet should have most of what you need.
import java.io.{File, FileOutputStream}
import java.nio.channels.Channels

object RunScript {
  // Set up copy destination from the Java temporary directory. This is /tmp on Linux.
  val destDir = System.getProperty("java.io.tmpdir") + "/"
  // Get a stream to the script in the resources dir
  val source = Channels.newChannel(RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl"))
  val fileOut = new File(destDir, "aPerlScript.pl")
  val dest = new FileOutputStream(fileOut)
  // Copy the file to the temporary directory
  dest.getChannel.transferFrom(source, 0, Long.MaxValue)
  source.close()
  dest.close()

  // Schedule the file for deletion for when the JVM quits
  sys.addShutdownHook {
    new File(destDir, "aPerlScript.pl").delete
  }
  // Now you can execute the script.
}
This approach allows you to bundle native libraries in JAR files. Copying them out allows the libraries to be loaded at runtime for whatever JNI mischief you have planned.
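As a sketch of that native-library variant (the name libmystuff.so is hypothetical; the copy step is the same as in the snippet above):
// After copying libmystuff.so from the jar to destDir as shown above,
// load it by absolute path; System.load requires one.
System.load(destDir + "libmystuff.so")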

Loading files from JAR in Scala

I have the following code structure:
Projects/
  classes/
    performance/AcPerformance.class
  resources/
    Aircraft/
      allAircraft.txt
I have the contents of the classes folder in a JAR, and my AcPerformance Scala code is trying to read the text files in the Aircraft folder. My code:
val AircraftPerf = getClass.getResource("resources/Aircraft").getFile
val dataDir = new File(AircraftPerf)
val acFile = new File(dataDir, "allAircraft.txt")

for (line <- linesFromResource(acFile)) {
  // read in lines
}
When I try to run the code I get the following error:
Caused by: java.io.FileNotFoundException: C:\Projects\file:\C:\Projects\libraries\aircraft.jar!\Aircraft\allAircraft.txt (The filename, directory name, or volume label syntax is incorrect)
Is this the correct way to read the contents of a JAR? Thanks!
No, URL's getFile isn't going to do what you want here—the path it gives you isn't a file system path that you could use in a File constructor. You'd be best off using getResourceAsStream and the full path to the resource:
val in = getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
Note that you also need to preface the path with / to make it absolute—in your current version you're looking for a resources directory under performance.
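A minimal sketch of reading that stream line by line (same resource path as above):
import scala.io.Source

// getResourceAsStream works whether the resource sits on disk or inside a jar.
val in = getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
for (line <- Source.fromInputStream(in).getLines()) {
  // process each line
}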