I am very new to Spark. Please help me with the query below:
Suppose I have different file paths in dev and prod, and I do not want to hardcode them in my Spark Scala code; instead I want to use variables. My doubt is: where do I define these variables for dev and prod, and how do I access them in my code?
Here are a few options:
Pass variables as command line parameters to your Spark application
Use a property/config file to keep those variables
Create environment variables for them
For implementation, follow these articles; a minimal sketch of the first two options follows the links.
Command Line Argument in Scala
Passing command line arguments to spark-shell
Passing command line arguments to spark-submit
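For illustration, here is a minimal sketch that combines the first two options: the environment name is passed as a command line argument, and a matching properties file supplies the environment-specific path (the class name, file names and keys here are hypothetical):
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession

object MyJob {
  def main(args: Array[String]): Unit = {
    // which environment to run against, e.g. "dev" or "prod"
    val env = args.headOption.getOrElse("dev")

    // conf/dev.properties or conf/prod.properties, containing e.g. input.path=/data/dev/input
    val props = new Properties()
    props.load(new FileInputStream(s"conf/$env.properties"))
    val inputPath = props.getProperty("input.path")

    val spark = SparkSession.builder().appName("MyJob").getOrCreate()
    spark.read.parquet(inputPath).show()
    spark.stop()
  }
}
Submitted, for example, as: spark-submit --class MyJob myjob.jar prod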
Related
I am unable to access the environment variables set in ~/.bashrc or ~/.profile from Scala. How do I access environment variables from a Scala Process? Also, I am unable to update the PATH like this:
Process("myProgram", None, "PATH"-> ".:/path/to/myProgram").!!
However, this works fine:
Process("/path/to/myProgram", None).!!
But when myProgram depends on some environment variables being set, this no longer works.
How do I change the PATH variable from a Scala program?
And even better, how can I get Scala to access the environment variables from .bashrc or .profile? Currently none of them are available.
Thanks for your time and help
How do I access environment variables from a Scala Process?
The util.Properties object offers three different methods for inspecting the environment variables that the Scala process/program has inherited. Here's an example:
util.Properties.envOrNone("LC_COLLATE")
//res0: Option[String] = Some(POSIX)
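The other two variants, envOrElse and envOrSome, work the same way; assuming the same environment as above:
util.Properties.envOrElse("LC_COLLATE", "C")
//res1: String = POSIX
util.Properties.envOrSome("LC_COLLATE", Some("C"))
//res2: Option[String] = Some(POSIX)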
How do I change the PATH variable from a Scala program?
A running process is not allowed to alter its own environment, but it can launch a new process with a modified environment. There are a couple of different ways to go about this.
One is to launch the shell of your choice and use shell syntax to make the modifications before invoking the target command.
import sys.process._
Seq("sh", "-c", "PATH=$PATH:$HOME/Progs myProg").!!
// the entire "PATH=$PATH:$HOME/Progs myProg" string is a single argument to sh -c
Or you can supply the environment mods as an argument to one of the many overloaded Process.apply() methods.
import scala.sys.process._
import scala.util.Properties

Process("./myProg",
  new java.io.File("/full/path/to"),
  "PATH" -> s"${Properties.envOrElse("PATH", ".")}:/full/path/to"
).!!
...can I get Scala to access the environment variables from .bashrc or .profile?
If your Scala program is launched from a shell with the proper environment then every process launched from your program should inherit the same. If, for whatever reason, your program has not inherited a fully equipped environment then the easiest thing to do is to launch a fully equipped shell to launch the target command.
import scala.sys.process._
Seq("sh", "-c" , ". $HOME/.bashrc && myProg").!!
The only two ways I know to run Scala-based Spark code are to either compile a Scala program into a jar file and run it with spark-submit, or to run a Scala script by using :load inside the spark-shell. My question is: is it possible to run a Scala file directly from the command line, without first going inside spark-shell and then issuing :load?
You can simply use stdin redirection with spark-shell:
spark-shell < YourSparkCode.scala
This command starts a spark-shell, interprets YourSparkCode.scala line by line, and quits at the end.
Another option is to use the -I <file> option of the spark-shell command:
spark-shell -I YourSparkCode.scala
The only difference is that the latter command leaves you inside the shell, and you must issue the :quit command to close the session.
[UPD] Passing parameters
Since spark-shell does not execute your source as an application but just interprets your source file line by line, you cannot pass any parameters directly as application arguments.
Fortunately, there are plenty of ways to achieve the same thing (e.g., externalizing the parameters in another file and reading it at the very beginning of your script).
But I personally find Spark configuration the cleanest and most convenient way.
You pass your parameters via the --conf option:
spark-shell --conf spark.myscript.arg1=val1 --conf spark.yourspace.arg2=val2 < YourSparkCode.scala
(please note that the spark. prefix in the property name is mandatory; otherwise Spark will discard your property as invalid)
And read these arguments in your Spark code as below:
val arg1: String = spark.conf.get("spark.myscript.arg1")
val arg2: String = spark.conf.get("spark.myscript.arg2")
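If a parameter might be missing, spark.conf.get also accepts a default value, which avoids an exception for an unset key:
val arg1: String = spark.conf.get("spark.myscript.arg1", "defaultValue")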
It is possible via spark-submit.
https://spark.apache.org/docs/latest/submitting-applications.html
You can even put the call in a bash script, or create an sbt task
https://www.scala-sbt.org/1.x/docs/Tasks.html
to run your code.
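As a rough sketch, an sbt task along these lines in build.sbt (the task name, main class and jar path are made up) could shell out to spark-submit:
lazy val runOnSpark = taskKey[Unit]("Submits the assembled jar to Spark")
runOnSpark := {
  import scala.sys.process._
  // assumes the jar has already been built, e.g. by sbt assembly
  val exitCode = "spark-submit --class com.example.MyJob target/scala-2.12/myjob.jar arg1 arg2".!
  if (exitCode != 0) sys.error(s"spark-submit failed with exit code $exitCode")
}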
I'm looking for a simple way to pass parameters to tests (environment variables and additional files are not suitable; I need to pass values via the command line).
Currently I have the following solution:
Passing parameters via SBT_OPTS:
SBT_OPTS="-DparamName=value" sbt moduleName/test
And retrieving the value in a test:
Option(System.getProperty("paramName")).getOrElse("defaultValue")
Unfortunately this solution no longer fits. Are there any simple solutions like this, but without using SBT_OPTS?
Thanks.
Command:
sbt -Dparam=value module/test
Retrieving value:
sys.props.getOrElse("param", DEFAULT_VALUE)
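For completeness, a minimal sketch of a ScalaTest suite reading that property (assuming ScalaTest 3.1+; the suite and property names are illustrative):
import org.scalatest.funsuite.AnyFunSuite

class ParamSpec extends AnyFunSuite {
  // value passed on the command line as: sbt -Dparam=value module/test
  private val param = sys.props.getOrElse("param", "defaultValue")

  test("picks up the command line parameter") {
    assert(param.nonEmpty)
  }
}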
Currently, I use export JAVA_OPTS ... on the command line, but there seem to be other possibilities, using the build.sbt or an external property file.
I have found several relevant GitHub issues here, here, and here, but the many options are confusing. Is there a recommended approach?
The approach you take to setting JVM options depends mainly on your use case:
Inject options every time
If you want to be able to specify the options every time you run your service, the two mechanisms are environment variables, and command line parameters. Which you use is mostly a matter of taste or convenience (but command line parameters will override environment variable settings).
Environment variables
You can inject values using the JAVA_OPTS environment variable. This is specified as a sequence of parameters passed directly to the java binary, with each parameter separated by whitespace.
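For example, for a service packaged with sbt-native-packager, something like this (the application name and values are illustrative):
JAVA_OPTS="-Xmx2G -Dconfig.file=/etc/myapp/prod.conf" ./bin/my-app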
Command line parameters
You can inject values by adding command line parameters in either of two formats:
-Dkey=val
Passes a Java system property to the java binary.
-J-X
Passes any flag -X to the java binary, stripping the leading -J.
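For example (again, the application name and values are made up):
./bin/my-app -Dconfig.file=/etc/myapp/prod.conf -J-Xmx2G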
Inject options from a file which can be modified
If you want to end up with a file on the filesystem which can be modified after install time, you will want to use sbt-native-packager's ability to read from a .ini file to initialise a default value for Java options. The details of this can be seen at http://www.scala-sbt.org/sbt-native-packager/archetypes/cheatsheet.html#file-application-ini-or-etc-default
Following the instructions, and depending on the archetype you are using, you will end up with a file at either /etc/default, application.ini, or another custom name, which will be read by the startup script to add settings.
Each line of this file is treated as an extra startup parameter, so the same rules as mentioned earlier still apply; e.g. -X flags need to be written as if they were -J-X.
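For example, an application.ini along these lines (values are illustrative) would set the heap size and a system property at startup:
-J-Xmx2G
-J-Xms512M
-Dconfig.file=/etc/myapp/prod.conf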
Inject options & code which never need to be changed
You can hardcode changes directly into the shell script which is run to start your binary, by using the SBT setting bashScriptExtraDefines, and following the details at http://www.scala-sbt.org/sbt-native-packager/archetypes/cheatsheet.html#extra-defines
This is the most flexible option in terms of what is possible (you can write any valid bash code, and this is added to the start script). But it is also less flexible in that it is not modifiable afterwards; any optional calculations have to be described in terms of the bash scripting language.
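As an illustration, a single line in build.sbt such as the following (the config path is made up) appends a system property to the generated start script via the addJava helper that the bash template provides:
bashScriptExtraDefines += """addJava "-Dconfig.file=${app_home}/../conf/prod.conf""""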
I'd like to be able to modify some build tasks in response to command line parameters. How do I (can I?) access command line parameters from the Build.scala file?
I don't think it's possible; you should instead resort to input tasks, or to JVM system properties passed with -D.
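A minimal sketch of each approach (the key and property names are made up), written in build.sbt but equally expressible in a Build.scala definition:
// read a JVM system property passed as: sbt -Dbuild.env=prod
val buildEnv: String = sys.props.getOrElse("build.env", "dev")

// or declare an input task and parse arguments typed after the task name: sbt "printArgs foo bar"
import sbt.complete.DefaultParsers._
lazy val printArgs = inputKey[Unit]("Prints the arguments passed on the sbt command line")
printArgs := {
  val args: Seq[String] = spaceDelimited("<arg>").parsed
  println(s"args = $args")
}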