I am using CDH 5.2 and I am able to run commands in spark-shell.
How can I run a file (file.spark) that contains Spark commands?
Is there any way to run/compile Scala programs in CDH 5.2 without sbt?
On the command line, you can use
spark-shell -i file.scala
to run the code written in file.scala.
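For illustration, here is a minimal sketch of what such a file.scala might contain, using the SparkContext sc that spark-shell already provides (the input path is made up):
// file.scala (hypothetical): uses the `sc` context created by spark-shell
val counts = sc.textFile("/tmp/input.txt")   // assumed input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)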
To load an external file from spark-shell, simply do
:load PATH_TO_FILE
This will evaluate everything in your file.
I don't have a solution for your sbt question though, sorry :-)
You can use either sbt or Maven to compile Spark programs. The Apache Spark artifacts are published to Maven Central, so with Maven you do not need an extra <repository> entry (note that the repository at sparkjava.com belongs to the unrelated Spark Java web framework). Simply add Spark as a dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.2.0</version>
</dependency>
In terms of running a file with Spark commands: you can simply do this:
echo '
import org.apache.spark.sql._
val ssc = new SQLContext(sc)
ssc.sql("select * from mytable").collect
' > spark.input
Now run the commands script:
cat spark.input | spark-shell
Just to give more perspective to the answers:
spark-shell is a Scala REPL.
You can type :help to see the list of operations that are possible inside the Scala shell:
scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line> edit history
:help [command] print this summary or command-specific help
:history [num] show the history (optional num is commands to show)
:h? <string> search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v] show the implicits in scope
:javap <path|class> disassemble a file or class name
:line <id>|<line> place line(s) at the end of history
:load <path> interpret lines in a file
:paste [-raw] [path] enter paste mode or paste a file
:power enable power user mode
:quit exit the interpreter
:replay [options] reset the repl and replay all previous commands
:require <path> add a jar to the classpath
:reset [options] reset the repl to its initial state, forgetting all session entries
:save <path> save replayable session to a file
:sh <command line> run a shell command (result is implicitly => List[String])
:settings <options> update compiler options, if possible; see reset
:silent disable/enable automatic printing of results
:type [-v] <expr> display the type of an expression without evaluating it
:kind [-v] <expr> display the kind of expression's type
:warnings show the suppressed warnings from the most recent line which had any
The relevant command here is :load <path>, which interprets the lines in a file.
Tested on both spark-shell version 1.6.3 and spark2-shell version 2.3.0.2.6.5.179-4: you can directly pipe to the shell's stdin, like
spark-shell <<< "1+1"
or in your use case,
spark-shell < file.spark
You can run it the same way you run a shell script.
This example runs from the command-line environment:
example
./bin/spark-shell :- this is the path of your spark-shell under bin
/home/fold1/spark_program.py :- this is the path where your Python program is located
So:
./bin/spark-shell /home/fold1/spark_program.py
Related
The only two ways I know to run Scala-based Spark code are to either compile a Scala program into a jar file and run it with spark-submit, or run a Scala script by using :load inside the spark-shell. My question is: is it possible to run a Scala file directly on the command line, without first going inside spark-shell and then issuing :load?
You can simply use the stdin redirection with spark-shell:
spark-shell < YourSparkCode.scala
This command starts a spark-shell, interprets your YourSparkCode.scala line by line and quits at the end.
Another option is to use the -I <file> option of the spark-shell command:
spark-shell -I YourSparkCode.scala
The only difference is that the latter command leaves you inside the shell and you must issue the :quit command to close the session.
[UPD]
Passing parameters
Since spark-shell does not execute your source as an application but just interprets your source file line by line, you cannot pass any parameters directly as application arguments.
Fortunately, there are many ways to approach the same thing (e.g., externalizing the parameters in another file and reading it at the very beginning of your script).
But I personally find the Spark configuration the most clean and convenient way.
You pass your parameters via the --conf option:
spark-shell --conf spark.myscript.arg1=val1 --conf spark.yourspace.arg2=val2 < YourSparkCode.scala
(please note that the spark. prefix in your property name is mandatory; otherwise Spark will discard your property as invalid)
And read these arguments in your Spark code as below:
val arg1: String = spark.conf.get("spark.myscript.arg1")
val arg2: String = spark.conf.get("spark.myscript.arg2")
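As an illustration only, a YourSparkCode.scala along these lines would pick up those properties; the second argument to conf.get is a fallback default, and the property names and paths are assumptions:
// YourSparkCode.scala (sketch): read parameters passed via --conf
val inputPath  = spark.conf.get("spark.myscript.arg1", "/tmp/in")    // default if not set
val outputPath = spark.conf.get("spark.myscript.arg2", "/tmp/out")
spark.read.textFile(inputPath)
  .write.mode("overwrite")
  .text(outputPath)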
It is possible via spark-submit.
https://spark.apache.org/docs/latest/submitting-applications.html
You can even put it in a bash script or create an sbt task
https://www.scala-sbt.org/1.x/docs/Tasks.html
to run your code.
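If you go the sbt-task route, a sketch along these lines could work in build.sbt (the task name, main class, and jar handling are assumptions, not anything the docs above prescribe):
// build.sbt (sketch): a custom task that shells out to spark-submit
lazy val runSpark = taskKey[Unit]("Submit the packaged jar with spark-submit")
runSpark := {
  import scala.sys.process._
  val jar = (Compile / packageBin).value   // package the jar first
  s"spark-submit --class com.example.Main ${jar.getAbsolutePath}".!
}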
We have:
spark-shell -i path/to/script.scala
to run a Scala script. Is it possible to add something like this to the spark-defaults.conf file so that it always loads the Scala script on start-up of the spark-shell and thus does not have to be added to the command line?
I would like to use this to store import _, credentials and user-defined functions that I use regularly, so that I don't have to enter the commands every time I start spark-shell.
Thanks,
Shane
You can go to the Spark /bin directory, create a file spark-shell-new.cmd, and paste
spark-shell -i path/to/script.scala
into it, then run spark-shell-new in cmd like the default spark-shell.
You can do something like this:
:load <path_to_script>
Write all the required lines of code in that script.
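As an example of what such a startup script might hold for the imports/helpers use case, here is a sketch (assuming Spark 2.x where the spark session object is available; all names are made up):
// init.scala (hypothetical): loaded with :load or spark-shell -i
import org.apache.spark.sql.functions._
import java.time.LocalDate
// a user-defined function registered for use in SQL queries
spark.udf.register("to_upper", (s: String) => if (s == null) null else s.toUpperCase)
// a plain helper available in the session
def today(): String = LocalDate.now.toString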
I need to execute a Scala script through spark-shell in silent mode. When I use spark-shell -i "file.scala", after the execution I end up in the Scala interactive mode, and I don't want to get into it.
I have tried to execute spark-shell -i "file.scala", but I don't know how to run the script in silent mode.
spark-shell -i "file.scala"
after execution, I get into
scala>
I don't want to get into the scala> mode
Updating (October 2019) for a script that terminates
This question is also about running a script that terminates, that is, a "Scala script" run by spark-shell -i script.scala > output.txt that stops by itself (an internal System.exit(0) instruction terminates the script). See this question for a good example.
It also needs a "silent mode": it is expected not to pollute output.txt.
Suppose Spark v2.2+.
PS: there are a lot of cases (typically small tools and module/algorithm tests) where the Spark interpreter can be better than the compiler... Please, "let's compile!" is not an answer here.
spark-shell -i file.scala keeps the interpreter open at the end, so System.exit(0) is required at the end of your script. The most appropriate solution is to place your code in a try {} block and put System.exit(0) in the finally {} section.
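A minimal sketch of that structure (the body is just an example; it assumes Spark 2.x where spark is available in the shell):
// file.scala (sketch): the work goes in try {}, the exit in finally {}
try {
  val df = spark.range(10)
  df.show()
} finally {
  System.exit(0)   // ensures spark-shell -i does not drop into the interactive prompt
}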
If logging is required, you can use something like this:
spark-shell < file.scala > test.log 2>&1 &
If you have limitations on editing the file and you can't add System.exit(0), use:
echo :quit | spark-shell -i file.scala
UPD
If you want to suppress everything in the output except printlns, you have to turn off logging for spark-shell. A sample of the configs is here. Disabling any kind of logging in $SPARK_HOME/conf/log4j.properties should allow you to see only printlns. But I would not follow this approach with printlns: general logging with log4j should be used instead. You can configure it to obtain the same results as with printlns; it boils down to configuring a pattern. This answer provides an example of a pattern that solves your issue.
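For instance, a sketch of logging through log4j from inside the script instead of println could look like this (the logger name and levels are assumptions; log4j 1.x is what Spark ships with in this version range):
// inside your script (sketch): log through log4j rather than println
import org.apache.log4j.{Level, Logger}
val log = Logger.getLogger("myscript")          // assumed logger name
Logger.getLogger("org").setLevel(Level.ERROR)   // quiet Spark's own logging
log.info("row count: " + spark.range(100).count())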
The best way is definitely to compile your Scala code to a jar and use spark-submit, but if you're simply looking for a quick iteration loop, you can issue a :quit after parsing your Scala code:
echo :quit | spark-shell -i yourfile.scala
Adding onto #rluta's answer. You can place the call to spark-shell command inside a shell script. Say the below in a shell script:
spark-shell < yourfile.scala
But this would require each statement to fit on a single line, since statements split across multiple lines may not be interpreted correctly when piped in this way.
OR
echo :quit | spark-shell -i yourfile.scala
This should work as well.
From the Scala REPL I am running a program thus:
:load foo.scala
I want to pass foo a parameter so I try:
:load foo.scala 1.0
which gives:
usage: :load -v file
Is there any way of running a program from the REPL with parameters like this?
If it is a single source file, I would recommend using the :paste command, with which you can copy the whole source file into the REPL and then press Ctrl+D to exit paste mode.
You can then call your program from the REPL by passing in the parameters that you need!
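One way to make this concrete: structure foo.scala so the logic lives in a function, then call it with the argument after pasting or loading (the object and method names here are made up):
// foo.scala (sketch): expose the logic as a function instead of a top-level script
object Foo {
  def run(factor: Double): Unit = {
    val result = Seq(1.0, 2.0, 3.0).map(_ * factor).sum
    println(s"result = $result")
  }
}
// then, in the REPL after :paste or :load:
// scala> Foo.run(1.0)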
I've been trying to find some sort of dotfile to put Scala REPL settings and custom functions in.
In particular I'm interested in passing it flags like -Dscala.color (enables syntax highlighting), as well as overriding settings like result string truncation:
scala> :power
scala> vals.isettings.maxPrintString = 10000
It would be nice to have these settings apply to both the simple Scala REPL sessions as well as sbt console sessions.
Does such a central configuration place exist for Scala?
Maybe you can use a modernized Scala REPL:
https://lihaoyi.github.io/Ammonite/
Poor man's solution: Set yourself an alias
alias myScala='scala -Dscala.repl.maxprintstring=10000'
As mentioned here, ~/.sbt/0.13/global.sbt is the global configuration file for sbt. You can change your global settings there; this is probably not going to affect the plain REPL, but it should work with the sbt console.
You mainly asked about property settings; this goes a little beyond that to consider loading a definitions file as well (and isn't much help for Windows), but I thought I'd share in case it's useful:
I've resorted to using a wrapper script saved as ~/bin/scala, to set config properties and load some utility functions:
#!/bin/sh
# The scala REPL doesn't have any config file, so this wrapper serves to set
# some property values and load an init file of utilities when run without
# arguments to enter REPL mode.
#
# If there are arguments, just assume we're running a .scala file in script
# mode, a class or jar, etc., and execute normally.
SCALA=${SCALA:-/usr/local/bin/scala}
if [ "$#" -eq 0 ] && [ -r ~/.config/scala/replinit.scala ]; then
exec "$SCALA" -i ~/.config/scala/replinit.scala -Dscala.color
else
exec "$SCALA" "$#"
fi
If you sometimes use Ammonite REPL, as another answer suggests, the utility definitions can be shared by loading them from ~/.ammonite/predef.scala:
try load.exec(ammonite.ops.home/".config"/'scala/"replinit.scala")
catch { case _: Exception => println("=== replrc not loaded! ===") }
I'm not sure about a way to load the init file for sbt console automatically, though—Seth Tisue's comment about the initialize setting is helpful for properties, but using a :load command in a value for initialCommands in console doesn't appear to work.
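One workaround (an assumption on my part, not something confirmed above) is to inline the definitions directly into initialCommands instead of using :load, e.g. in build.sbt or global.sbt:
// build.sbt (sketch): inline the REPL prelude rather than :load-ing a file
initialCommands in console := """
  import java.time.LocalDate
  def hello(name: String): String = s"hello, $name"
"""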