I am relatively new to Spark and Scala.
I have a Scala application that runs in local mode both on my Windows box and on a CentOS cluster.
As long as Spark is on my classpath (i.e., declared in pom.xml), Spark runs in unit tests without needing SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of Spark running locally, my unit-test application seems to ignore it when in local mode. I do not see any output on the Spark console suggesting it is using the Spark instance I started from the command line (via the spark-shell command). Am I mistaken? If not, how do I get my Scala application to use that instance?
EDITED to include useful info from comments as well
spark-shell is just an interactive shell; it stands alone and is not an "instance" that other processes should connect to. When you run your Spark application through spark-submit (or just run your Spark code), it starts its own instance of Spark. If you need to set any properties, they can be passed as system properties or through the spark-submit --conf parameters.
spark-submit requires that you first use the Maven assembly plugin to build a jar containing your application and its dependencies.
This jar should then be deployed to the SPARK_HOME directory.
Then use the submit script, which also lives in SPARK_HOME.
The spark-submit script looks like this:
./bin/spark-submit --class xxx.ml.PipelineStart \
  --master local[*] \
  ./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, then just use SparkConf.set. It takes a key and a value, and the configurable properties are all found here.
If you're curious about what a property is set to, then you can also use SparkConf.get to check that out.
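For example, a quick sketch of both styles (the property values, like 2g, are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; adjust to whatever your test needs.
val conf = new SparkConf()
  .setMaster("local[*]")                  // explicit setter for the master URL
  .setAppName("unit-test")
  .set("spark.driver.memory", "2g")       // generic key/value setter
  .set("spark.ui.enabled", "false")

println(conf.get("spark.driver.memory")) // prints "2g"

val sc = new SparkContext(conf)

One caveat: in local mode the driver JVM is already running by the time the SparkConf is created, so spark.driver.memory set this way won't enlarge the heap of an already-running unit test; for that you would typically raise the test JVM's -Xmx (or use --driver-memory with spark-submit) instead.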
Related
I am observing odd behavior of Spark closures in local mode on my machine versus a 3-node cluster (Spark 2.4.5).
The following is the piece of code:
object Example {
  val num = 5
  def myfunc = {
    // `sc` here is the ambient SparkContext (e.g. the one provided by spark-shell)
    sc.parallelize(1 to 4).map(_ + num).foreach(println)
  }
}
I expected this to fail in every case, since num is needed in the closure, so the Example object would need to be serialized; but it cannot be, since it does not extend the Serializable interface.
When I run the same piece of code from spark-shell on the same local machine, it fails with an error, consistent with the rationale above:
When I run the same piece of code in yarn mode on a 3-node EMR cluster, it fails with exactly the same error as in the screenshot above, for the same reason.
When I run the same piece of code in local mode on that same cluster (i.e., on the master node), it also fails. The same rationale still holds.
However, when I run this from an sbt project (not a Spark installation or anything; I just added the Spark libraries to my sbt project and used conf.master(local[..])) in local mode, it runs fine and gives me an output of 6, 7, 8, 9:
This means it fails everywhere except when you run it by adding the Spark dependencies to an sbt project. The question is: what explains the different local-mode behavior when you run your Spark code by simply adding the Spark libraries to an sbt project?
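Not an answer to why the environments differ, but for reference: the usual way to make this code independent of Example's serializability is to copy the value into a local val, so the closure captures only an Int instead of the enclosing object. A minimal sketch (sc is passed in explicitly here just for clarity):

import org.apache.spark.SparkContext

object Example {
  val num = 5

  def myfunc(sc: SparkContext): Unit = {
    // Copy the field into a local val so the closure captures only an Int,
    // not the Example object itself.
    val localNum = num
    sc.parallelize(1 to 4).map(_ + localNum).foreach(println)
  }
}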
I installed Spark with sbt as a project dependency. I want to change the Spark environment's variables without doing it in my code with .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I have an error: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver#my-mbp.domain_not_set.invalid:50487, even after trying to change my hostname. Thus, I would like to go deep into the Spark library and try some things.
I tried pretty much everything in this SO post: Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile)
SBT was not able to pick up the hostname even when I ran the export command just before running sbt. I had to put it in the profile to get the right context.
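If you do end up configuring the session in code after all, the same workaround can be expressed on the SparkConf itself; a sketch (the values are just the usual localhost pinning, nothing specific to this setup):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: pins the driver to localhost, the same effect as
// exporting SPARK_LOCAL_HOSTNAME=localhost before starting the JVM.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("local-hostname-workaround")
  .set("spark.driver.host", "localhost")
  .set("spark.driver.bindAddress", "127.0.0.1")

val sc = new SparkContext(conf)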
I am working on a Spark project using Scala and Maven, and sometimes I feel it would be very helpful if I could run the project in an interactive mode.
My question is whether it is possible (and how) to bring up a Spark environment in the terminal that is the same as the environment running in an IntelliJ project?
Or even better (if it is possible): start a REPL environment, under the IntelliJ debugger, while the code is paused at a breakpoint, so we can continue to play with all the variables and instances created so far.
Yes, it is possible, though not very straightforward. I first build a fat jar using the sbt-assembly plugin (https://github.com/sbt/sbt-assembly) and then use a debug configuration like the one below to start it in the debugger. Note that org.apache.spark.deploy.SparkSubmit is used as the main class, not your application's main class. Your app's main class is specified in the --class parameter instead.
It is a bit tedious to have to create the app jar file before starting each debug session (if sources have changed). I couldn't get SparkSubmit to work directly with the class files compiled by IntelliJ. I'd be happy to hear about alternative ways of doing this.
*Main class:*
org.apache.spark.deploy.SparkSubmit
*VM Options:*
-cp <SPARK_DIR>/conf/:<SPARK_DIR>/jars/* -Xmx6g -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/tmp
*Program arguments:*
--master
local[*]
--class
com.example.YourSparkApp
<PROJECT_DIR>/target/scala-2.11/YourSparkAppFat.jar
<APP_ARGS>
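For completeness, the fat jar itself comes from the sbt-assembly setup mentioned above; a rough sketch (plugin version, app name, and main class are placeholders; check the sbt-assembly README for current coordinates):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")   // version is illustrative

// build.sbt
name := "your-spark-app"
scalaVersion := "2.11.12"

// Mark Spark as "provided" so it is not bundled into the fat jar;
// the Spark distribution on the classpath supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"

mainClass in assembly := Some("com.example.YourSparkApp")
assemblyJarName in assembly := "YourSparkAppFat.jar"

// Build with: sbt assembly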
If you don't care much about initialization, or can insert a loop in the code where the app waits for a keystroke or any other kind of signal before continuing, then you can start your app as usual and simply attach IntelliJ to the app process (Run > Attach to Local Process...).
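For the "wait for a keystroke" variant, something like this at the top of main is enough (a sketch; the object name is hypothetical):

import java.lang.management.ManagementFactory

object YourSparkApp {
  def main(args: Array[String]): Unit = {
    // Pause so there is time for Run > Attach to Local Process... in IntelliJ
    // and to set breakpoints before any Spark initialization happens.
    println(s"JVM: ${ManagementFactory.getRuntimeMXBean.getName}. Press ENTER to continue...")
    scala.io.StdIn.readLine()

    // ... build the SparkConf / SparkContext and run the job as usual ...
  }
}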
I'm using Scala 2.11.8 and Spark 2.1.0. I'm totally new to Scala.
Is there a simple way to add a single line breakpoint, similar to Python:
import pdb; pdb.set_trace()
where I'll be dropped into a Scala shell and I can inspect what's going on at that line of execution in the script? (I'd settle for just the end of the script, too...)
I'm currently starting my scripts like so:
$SPARK_HOME/bin/spark-submit --class "MyClassName" --master local target/scala-2.11/my-class-name_2.11-1.0.jar
Is there a way to do this? Would help debugging immensely.
EDIT: The solutions in this other SO post were not very helpful / required lots of boilerplate + didn't work.
I would recommend one of the following two options:
Remote debugging & IntelliJ Idea's "evaluate expression"
The basic idea here is that you debug your app as you would if it were just an ordinary piece of code debugged from within your IDE. The Run->Evaluate expression function allows you to prototype code, and you can use most of the debugger's usual functionality: variable displays, stepping over, and so on. However, since you're not running the application from within your IDE, you need to:
Setup the IDE for remote debugging, and
Supply the application with the correct Java options for remote debugging.
For 1, go to Run->Edit configurations, hit the + button in the top right hand corner, select remote, and copy the content of the text field under Command line arguments for running remote JVM (official help).
For 2, you can use the SPARK_SUBMIT_OPTS environment variable to pass those JVM options, e.g.:
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
$SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
./path/to/foo-assembly-1.0.0.jar
Now you can hit the debug button, and set breakpoints etc.
Apache Zeppelin
If you're writing more script-style Scala, you may find it helpful to write it in a Zeppelin Spark Scala interpreter. While it's more like Jupyter/IPython notebooks/the ipython shell than (i)pdb, this does allow you to inspect what's going on at runtime. This will also allow you to graph your data etc. I'd start with these docs.
Caveat
I think the above will only allow debugging code running on the driver node, not on the worker nodes (which run your actual map, reduce, etc. functions). If, for example, you set a breakpoint inside an anonymous function inside myDataFrame.map{ ... }, it probably won't be hit, since that's executed on some worker node. However, with e.g. myDataFrame.head and the evaluate expression functionality I've been able to fulfil most of my debugging needs. Having said that, I've not tried to specifically pass Java options to executors, so perhaps it's possible (but probably tedious) to get it to work.
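To make that concrete, the pattern is to pull a small sample back to the driver and break on that line rather than inside the lambda. A runnable sketch (names and data are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("debug-demo").getOrCreate()
import spark.implicits._

val ds = Seq(1, 2, 3, 4).toDS()

// On a cluster this lambda executes on a worker, so a breakpoint inside it
// won't be hit by a debugger attached to the driver (in local[*] it happens
// to run in the same JVM):
val doubled = ds.map(_ * 2)

// Materialize a small sample on the driver and break on the next line instead;
// `sample` can then be inspected with Evaluate Expression.
val sample = doubled.take(5)
sample.foreach(println)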
I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought to be the easy task of connecting to a linux machine that has Spark on it, and running a simple code.
When I create a simple Scala code, build a jar from it, place it in the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build, and copy the jar to the machine and then manually run it every time I change it?
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from other code through my IDE?
Taking it a bit further: if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two ways of running your code: either submit your job to a cluster, or run in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark in the same JVM as your application.
If you have already built your jars, you can use them by specifying the Spark master's URL and adding the required jars; this way you can submit the job to a remote cluster.
val conf = new SparkConf()
.setMaster("spark://cyborg:7077")
.setAppName("SubmitJobToCluster Example")
.setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using this SparkConf you can initialize the SparkContext in your application and use it in either a local or cluster setup.
val sc = new SparkContext(conf)
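Putting the pieces together, a minimal self-contained local-mode program you can run straight from the IDE looks roughly like this (names and data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("wordCount Example")

    val sc = new SparkContext(conf)

    // A tiny word count, needing no cluster at all.
    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}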
This is an old project, spark-examples, which has sample programs you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build, and copy the jar to the machine and then manually run it every time I change it?
NO
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from other code through my IDE?
Yes you can. The above example does it.
Taking it a bit further: if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies, and you can specify the class when submitting the job to Spark. When doing it programmatically you have complete control over it.
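If by "from a different code" you mean literally triggering the submission from another program, Spark also ships a launcher API (org.apache.spark.launcher.SparkLauncher) that wraps spark-submit. A sketch with placeholder paths and class names:

import org.apache.spark.launcher.SparkLauncher

object SubmitFromCode {
  def main(args: Array[String]): Unit = {
    // Launches spark-submit as a child process for a jar that is already built.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                          // where Spark is installed
      .setAppResource("/path/to/myApp-jar-with-dependencies.jar")
      .setMainClass("com.example.MyTask")
      .setMaster("local[*]")                               // or spark://host:7077
      .addAppArgs("100")
      .launch()

    process.waitFor()                                      // block until the job finishes
  }
}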