On Mac, how to start a spark-shell in the same environment as the one running in an IntelliJ project? - scala

I am working on a Spark project using Scala and Maven, and sometimes I feel it would be very helpful if I could run the project in an interactive mode.
My question is whether it is possible (and how) to bring up a Spark environment in the terminal that is the same as the environment running in an IntelliJ project.
Or even better (if it is possible) -- start a REPL environment, under the IntelliJ debug mode, while the code is paused at a breakpoint, so we can continue to play with all the variables and instances created so far.

Yes, it is possible, though not very straightforward. I first build a fat jar using the sbt-assembly plugin (https://github.com/sbt/sbt-assembly) and then use a debug configuration like the one below to start it in the debugger. Note that org.apache.spark.deploy.SparkSubmit is used as the main class, not your application's main class. Your app's main class is specified in the --class parameter instead.
It is a bit tedious to have to create the app jar file before starting each debug session (if sources were changed). I couldn't get SparkSubmit to work with the class files compiled by IntelliJ directly. I'd be happy to hear about alternative ways of doing this.
Main class:
org.apache.spark.deploy.SparkSubmit
VM Options:
-cp <SPARK_DIR>/conf/:<SPARK_DIR>/jars/* -Xmx6g -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/tmp
Program arguments:
--master
local[*]
--class
com.example.YourSparkApp
<PROJECT_DIR>/target/scala-2.11/YourSparkAppFat.jar
<APP_ARGS>
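For reference, the sbt-assembly setup this relies on only needs a couple of lines; the plugin version, Spark version, and main class below are assumptions to adapt to your own build.
project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
build.sbt:
// Spark is supplied by SparkSubmit at runtime, so mark it "provided" to keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
mainClass in assembly := Some("com.example.YourSparkApp")
Running sbt assembly then produces the fat jar referenced in the program arguments above.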
If you don't care much about initialization, or can insert a loop in the code where the app waits for a keystroke or any other kind of signal before continuing, then you can start your app as usual and simply attach IntelliJ to the app process (Run > Attach to Local Process...).
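A minimal sketch of such a wait-before-continuing hook (the object name is just a placeholder) might look like:
object YourSparkApp {
  def main(args: Array[String]): Unit = {
    // Pause so there is time to attach the IntelliJ debugger
    // (Run > Attach to Local Process...) before the Spark job begins.
    println("Press ENTER once the debugger is attached...")
    scala.io.StdIn.readLine()
    // ... create the SparkContext/SparkSession and run the job as usual ...
  }
}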

Related

Drop into a Scala interpreter in Spark script?

I'm using Scala 2.11.8 and Spark 2.1.0. I'm totally new to Scala.
Is there a simple way to add a single line breakpoint, similar to Python:
import pdb; pdb.set_trace()
where I'll be dropped into a Scala shell and I can inspect what's going on at that line of execution in the script? (I'd settle for just the end of the script, too...)
I'm currently starting my scripts like so:
$SPARK_HOME/bin/spark-submit --class "MyClassName" --master local target/scala-2.11/my-class-name_2.11-1.0.jar
Is there a way to do this? Would help debugging immensely.
EDIT: The solutions in this other SO post were not very helpful / required lots of boilerplate + didn't work.
I would recommend one of the following two options:
Remote debugging & IntelliJ Idea's "evaluate expression"
The basic idea here is that you debug your app like you would if it were just an ordinary piece of code debugged from within your IDE. The Run->Evaluate expression function allows you to prototype code, and you can use most of the debugger's usual functionality: variable displays, step over, and so on. However, since you're not running the application from within your IDE, you need to:
Setup the IDE for remote debugging, and
Supply the application with the correct Java options for remote debugging.
For 1, go to Run->Edit configurations, hit the + button in the top right hand corner, select remote, and copy the content of the text field under Command line arguments for running remote JVM (official help).
For 2, you can use the SPARK_SUBMIT_OPTS environment variable to pass those JVM options, e.g.:
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
$SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
./path/to/foo-assembly-1.0.0.jar
Now you can hit the debug button, and set breakpoints etc.
Apache Zeppelin
If you're writing more script-style Scala, you may find it helpful to write it in a Zeppelin Spark Scala interpreter. While it's more like Jupyter/IPython notebooks/the ipython shell than (i)pdb, this does allow you to inspect what's going on at runtime. This will also allow you to graph your data etc. I'd start with these docs.
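As a rough illustration (the input path and column name are made up), a paragraph in Zeppelin's Spark Scala interpreter might look like this, using z (Zeppelin's context) to render results:
val df = spark.read.json("/path/to/events.json")    // assumed input
df.printSchema()                                     // inspect the schema at runtime
z.show(df.groupBy("eventType").count())              // render as a table/chart in the notebook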
Caveat
I think the above will only allow debugging code running on the Driver node, not on the Worker nodes (which run your actual map, reduce, etc. functions). If you, for example, set a breakpoint inside an anonymous function inside myDataFrame.map{ ... }, it probably won't be hit, since that's executed on some worker node. However, with e.g. myDataFrame.head and the evaluate expression functionality I've been able to fulfil most of my debugging needs. Having said that, I've not tried to specifically pass Java options to executors, so perhaps it's possible (but probably tedious) to get it to work.
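In practice that usually means pulling a small sample back to the driver and inspecting it there, where breakpoints and Evaluate Expression do work; a quick sketch:
// A breakpoint inside the closure passed to map { ... } runs on an executor and may never be hit.
// Materialize a small sample on the driver and inspect that instead:
val sample = myDataFrame.limit(5).collect()   // executed on the cluster, result returned to the driver
sample.foreach(println)                       // a breakpoint here is driver-side and will be hit
val first = myDataFrame.head                  // same idea for a single row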

Is there any way to fork the SBT console into a new JVM?

For all the reasons listed here:
http://www.scala-sbt.org/0.13/docs/Running-Project-Code.html
it's sometimes necessary to run your Scala code in a separate JVM from the one in which SBT is running. That's also true of the REPL, which you access from the console or test:console commands.
Unfortunately, it doesn't appear that SBT supports running the console in its own JVM (and I'm posting this question here, as requested in the message):
https://groups.google.com/forum/#!topic/simple-build-tool/W0q62PfSIMo
Can someone confirm that this isn't possible and suggest a possible workaround? I'm trying to play with a ScalaFX app in the console, and I have to quit SBT completely each time I run it. It'd be nice to just have to quit the console and keep SBT running.
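One workaround that is sometimes suggested (a sketch only; sbt 0.13-style keys and the Scala 2.11 REPL API are assumed here) is to fork the run task, which sbt does support, and start a REPL from a small main class instead of using console:
build.sbt:
fork in run := true                      // run in a separate JVM
connectInput in run := true              // wire the forked JVM's stdin to sbt's stdin
outputStrategy := Some(StdoutOutput)
libraryDependencies += "org.scala-lang" % "scala-compiler" % scalaVersion.value
ForkedRepl.scala:
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.ILoop

object ForkedRepl {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    settings.usejavacp.value = true      // expose the forked JVM's classpath to the REPL
    new ILoop().process(settings)
  }
}
Then sbt "runMain ForkedRepl" gives a REPL in its own JVM that can be quit without leaving sbt.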

Attaching a Remote Debug session to Spark from Eclipse Scala IDE

I've been wracking my brain over this for the last two days trying to get it to work. I have a local Spark installation on my Mac that I'm trying to attach a debugger to. I set:
SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Then I submit my job to spark-submit and launch my debug configuration in eclipse which is configured as a Socket Attach Remote Debugging session. The debugger attaches, my job resumes and executes, but none of my breakpoints are ever hit, no matter what I do.
The only way I can get it to hit a breakpoint is by attaching to a spark-shell, creating a Java Exception breakpoint and issuing
throw new java.lang.Exception()
The debugger will not stop at normal breakpoints for me.
I created a standalone Hello World Scala app and was able to attach to it and have it stop at a regular breakpoint without any issues.
Environment: Mac OS, latest Eclipse, latest Scala IDE, Spark 1.3.1, Scala 2.10.5
Thanks in advance.
I had a similar issue and there were two things that fixed my problem:
1. The .jar file and the source had gotten a little out of sync for me, so I had to recompile and redeploy.
2. In the Java options I had suspend=n, so the JVM was not waiting for the debugger; it needed to be suspend=y.
After correcting these two it worked for me.
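If SPARK_JAVA_OPTS is not being picked up (later Spark releases treat it as deprecated), the same JDWP agent string can also be passed to the driver explicitly via spark-submit; the class and jar names below are placeholders:
$SPARK_HOME/bin/spark-submit \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  --class MyMainClass target/scala-2.10/my-app_2.10-1.0.jar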

xsbt-web-plugin: Running the web servlet container outside of sbt?

I'm using the xsbt-web-plugin to host my servlet. It works fine, using container:start.
I now need it to run in the background, like a daemon, even if I hang up, and ideally, even if the machine reboots. I'd rather not have to invoke sbt.
I know that the plugin can package a WAR file, but I'm not running tomcat or anything like that. I just want to do what container:start does, but in a more robust (read: noninteractive) way.
(My goal is a dev demo: I'd hate for my ssh session to drop sbt, or something like that, while people are using the demo. But we're not ready for production yet and have no servlet infrastructure.)
xsbt-web-plugin is really not meant to act as a production server (with features like automatic restarting, fault recovery, running on boot, etc.), however I understand the utility of using it this way for small-scale development purposes.
You have a couple of options:
First approach
Run sbt in a screen session, which you can (dis)connect at will without interrupting sbt.
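For example (assuming GNU screen is installed):
screen -S sbt-container      # start a named screen session
sbt container:start          # launch the container inside it
# detach with Ctrl-a d; reattach later with: screen -r sbt-container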
Second approach
Override the shutdown function that triggers on sbt's exit hook, so that the container keeps running after sbt stops.
For this approach, add the following setting to your sbt configuration:
build.sbt:
onLoad in Global := { state => state }
Note that this will override the onLoad setting entirely, so in the (unlikely) case that you have it configured to do other important things, they won't happen.
Now you can launch your container either by running container:start from sbt and then exiting sbt, or simply by running sbt container:start from the command line, which will return after forking the container JVM. Give it a few seconds, then you should be able to make requests to localhost:8080.

Deploying a Scala app built using IDEA

I developed a simple Scala app that uses Casbah to query the DB for the command-line arguments passed to it. For example
$ querydb.scala execution 10
will run a Casbah query to find 10 records matching execution in Mongo. Now I have two questions.
1) How do I test this locally? If I click Run in IntelliJ it just runs the program; I am not able to pass command-line arguments to it.
2) How do I deploy it to run on my server? It is just going to be used as a console app on my Ubuntu server, but I'm not sure how I should deploy it: which files I should put on the server, how to execute it there, and so on.
Any pointers would be useful for me.
Or try to use sbt; IDEA has an sbt plugin, and its wiki has an explanation of how to use it.
I usually use sbt directly in the Terminal instead of running in the IDE.
1) First you need to find the "Select Run/Debug Configuration" button at the top of your screen.
Click on it and choose Edit.
Create a new configuration if you haven't got one yet.
Your program parameters should be written in the "Program parameters" field.
2) Compile your .scala files with scalac and you'll get .class files.
Then deploy them as you usually would with Java code. Hence you don't need to install Scala on the target machine; all you need is a JDK (with the Scala library jar on the app's classpath).
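Tying the two questions together, a minimal sketch of such a console app (the object name, database, collection, and field are made up, and the Casbah calls are only illustrative):
import com.mongodb.casbah.Imports._

object QueryDb {
  def main(args: Array[String]): Unit = {
    // invoked as: querydb execution 10, or via the "Program parameters" field in IntelliJ
    val keyword = args(0)
    val limit   = args(1).toInt
    val coll = MongoClient()("mydb")("records")   // assumed database and collection names
    coll.find(MongoDBObject("name" -> keyword)).take(limit).foreach(println)
  }
}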