Drop into a Scala interpreter in Spark script? - scala

I'm using Scala 2.11.8 and Spark 2.1.0. I'm totally new to Scala.
Is there a simple way to add a single line breakpoint, similar to Python:
import pdb; pdb.set_trace()
where I'll be dropped into a Scala shell and I can inspect what's going on at that line of execution in the script? (I'd settle for just the end of the script, too...)
I'm currently starting my scripts like so:
$SPARK_HOME/bin/spark-submit --class "MyClassName" --master local target/scala-2.11/my-class-name_2.11-1.0.jar
Is there a way to do this? It would help with debugging immensely.
EDIT: The solutions in this other SO post were not very helpful: they required lots of boilerplate and didn't work.

I would recommend one of the following two options:
Remote debugging & IntelliJ IDEA's "Evaluate expression"
The basic idea is that you debug your app the same way you would an ordinary piece of code run from within your IDE. The Run->Evaluate expression function lets you prototype code, and you can use most of the debugger's usual functionality: variable display, stepping over, and so on. However, since you're not running the application from within your IDE, you need to:
Setup the IDE for remote debugging, and
Supply the application with the correct Java options for remote debugging.
For 1, go to Run->Edit configurations, hit the + button in the top right-hand corner, select Remote, and copy the content of the text field under Command line arguments for running remote JVM (see the official help).
For 2, you can use the SPARK_SUBMIT_OPTS environment variable to pass those JVM options, e.g.:
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
$SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
./path/to/foo-assembly-1.0.0.jar
Now you can hit the debug button, and set breakpoints etc.
Apache Zeppelin
If you're writing more script-style Scala, you may find it helpful to write it in a Zeppelin Spark Scala interpreter. While it's closer to Jupyter/IPython notebooks or the IPython shell than to (i)pdb, it does let you inspect what's going on at runtime. It will also let you graph your data, etc. I'd start with these docs.
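As a rough sketch of the workflow (the data is made up; spark, sc and z, the ZeppelinContext, are provided by Zeppelin's Spark interpreter):
// In a %spark paragraph; `spark`, `sc` and `z` are injected by Zeppelin.
val df = spark.range(0, 100).selectExpr("id", "id * id AS squared")
df.printSchema()   // inspect the schema interactively
z.show(df)         // render the DataFrame as a table/chart inside the notebook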
Caveat
I think the above will only allow debugging code running on the driver node, not on the worker nodes (which run your actual map, reduce, etc. functions). If, for example, you set a breakpoint inside an anonymous function inside myDataFrame.map{ ... }, it probably won't be hit, since that is executed on some worker node. However, with e.g. myDataFrame.head and the Evaluate expression functionality I've been able to fulfil most of my debugging needs. Having said that, I've not tried to specifically pass Java options to executors, so perhaps it's possible (but probably tedious) to get it to work.
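For illustration, a minimal, self-contained sketch of the kind of driver-side inspection I mean; the DataFrame and column names are made up:
import org.apache.spark.sql.SparkSession

object InspectOnDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-demo").getOrCreate()
    import spark.implicits._

    // Stand-in for myDataFrame.
    val myDataFrame = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // head/take materialise a small sample on the driver, so a breakpoint on the
    // next line (or Evaluate expression on `sample`) works in the remote debugger.
    val sample = myDataFrame.head(2)
    println(sample.mkString(", "))

    spark.stop()
  }
}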

Related

On Mac how to start a spark-shell in same environment as the environment running in an Intellij project?

I am working on a Spark project using Scala and Maven, and sometimes I feel it would be very helpful if I could run the project in an interactive mode.
My question is whether it is possible (and how) to bring up a Spark environment in the terminal that is the same as the environment running in an IntelliJ project?
Or even better (if it is possible) -- start a REPL environment, under the IntelliJ debug mode, while the code is stopped at a breakpoint, so we can continue to play with all the variables and instances created so far.
Yes, it is possible, though not very straightforward. I first build a fat JAR using the sbt-assembly plugin (https://github.com/sbt/sbt-assembly) and then use a debug configuration like the one below to start it in the debugger. Note that org.apache.spark.deploy.SparkSubmit is used as the main class, not your application's main class. Your app's main class is specified in the --class parameter instead.
It is a bit tedious to have to create the app JAR before starting each debug session (if the sources were changed). I couldn't get SparkSubmit to work directly with the class files compiled by IntelliJ. I'd be happy to hear about alternative ways of doing this.
Main class:
org.apache.spark.deploy.SparkSubmit
VM Options:
-cp <SPARK_DIR>/conf/:<SPARK_DIR>/jars/* -Xmx6g -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/tmp
Program arguments:
--master
local[*]
--class
com.example.YourSparkApp
<PROJECT_DIR>/target/scala-2.11/YourSparkAppFat.jar
<APP_ARGS>
If you don't care much about initialization, or can insert a loop in the code where the app waits for a keystroke or any other kind of signal before continuing, then you can start your app as usual and simply attach IntelliJ to the app process (Run > Attach to Local Process...).
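For the "wait for a keystroke" variant, a minimal sketch (the pause exists only to give you time to attach the debugger):
object Main {
  def main(args: Array[String]): Unit = {
    // Print something identifying the process, then block until ENTER is pressed,
    // giving you time to use Run > Attach to Local Process... in IntelliJ.
    println(s"Running as: ${java.lang.management.ManagementFactory.getRuntimeMXBean.getName}")
    scala.io.StdIn.readLine("Attach the debugger, then press ENTER to continue...")

    // ... real Spark initialisation and job logic would follow here ...
  }
}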

Running simple Scala in Intellij

I'm trying to run a very simple hello-world Scala program in IntelliJ IDEA on Mac without using the Scala Console configuration. I followed these steps to get started, but I didn't set up the debugger outlined there. There isn't a default run configuration enabled, but I can right-click on my source file and select "Scala Console".
Is there a way to select or edit my configurations to make it so I don't have to use the console? Below are the available configurations.
I simply want there to be a way to run my Scala code and see the generated output in the provided console, which Scala Console isn't doing. Thanks for your time.
Lab6 should be an object, not a class.
This will allow you to run its main method directly.
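For example (assuming the file defines a top-level Lab6), either of these gives IntelliJ a runnable entry point:
// Option 1: an explicit main method
object Lab6 {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

// Option 2: extending App also produces a runnable object
// object Lab6 extends App {
//   println("Hello, world!")
// }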

how to auto-reload changed scala classes into SBT REPL

I am new to Scala and to using an emacs + ensime + sbt setup for my Scala development.
This setup is quite nice and light, but there is one thing which drives me nuts: the inability to auto-compile / reload changes into a Scala console started from sbt.
I use the REPL a lot and would like to be able to start the REPL from sbt with the console command and test my changes to Scala classes from the REPL without having to close it and reload every time I make a change.
I come from an Erlang environment; this way of development is easy with Erlang but seems to be difficult with SBT. I have the JRebel plug-in installed, but it doesn't seem to work for the situation I described.
Has anybody been able to make something similar work and would be willing to share the configuration steps?
Much appreciated in advance.
There are two things possible in sbt:
Causing automatic recompilation of the project sources, triggered by a file change, by prefixing a command with ~ (tilde). The console, console-quick, or console-project commands can be prefixed too, but you have to exit the REPL for the recompilation to happen (just hit Ctrl+D and wait).
Causing automatic execution of REPL commands right after the console starts. These can be defined as properties (e.g. in build.sbt):
initialCommands in console := """
import some.library._
def someFun = println("Hello")
"""
It's not necessary to define the property separately for consoleQuick, because it defaults to the one defined for console; but if you would like to use the console-project command, you have to define it separately.
On a final note: remember to leave an empty line between properties in an *.sbt file. The blank lines are necessary for the properties to be parsed correctly. In the example above there are no blank lines in between, which means everything goes into the initialCommands property (and that's what we want).
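As a small sketch (the imported library and helper are placeholders, as in the answer above), a build.sbt that also customises consoleQuick would look like this, with a blank line between the two settings:
initialCommands in console := """
import some.library._
def someFun = println("Hello")
"""

initialCommands in consoleQuick := """println("quick console: project sources are not compiled first")"""

From the sbt prompt, running ~console then combines this with the recompile-on-exit workflow from the first point above.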

Executing Scala via TextMate

I am going through the Scala book I bought and am trying to get TextMate to cooperate.
I got to the point where (in the snippets shown in the original post) one example fails, one works, and one fails (understandably, since main is missing).
What is the difference between:
Run Script
Run as script
And what are options 4, 5, and 6?
I don't know why it gives you two "run script" options, but the Scala REPL is an interactive Scala console: you can execute any Scala statements on the fly with it. Option 4 will just start a console; option 5, I suppose, will preload everything you have in your file, so you don't have to import everything yourself; option 6 will start a console and let you execute whatever you have selected.
Here's more about REPL: http://www.scala-lang.org/node/2097
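A quick taste of what a REPL session looks like:
scala> val xs = List(1, 2, 3)
xs: List[Int] = List(1, 2, 3)

scala> xs.map(_ * 2)
res0: List[Int] = List(2, 4, 6)

scala> def square(n: Int) = n * n
square: (n: Int)Int

scala> square(7)
res1: Int = 49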

deploying a scala app built using IDEA

I developed a simple Scala app that uses Casbah to query the DB for the command-line argument passed to it. For example:
$ querydb.scala execution 10
it will run a Casbah query to find 10 records matching "execution" in Mongo. Now I have two questions.
1) How do I test this locally? If I click Execute in IntelliJ it just runs the program; I am not able to pass command-line arguments to it.
2) How do I deploy it to run on my server? It is just going to be used as a console app on my Ubuntu server, but I'm not sure how I should deploy it, which files I should put on the server, how I execute it there, and so on.
Any pointers would be useful for me.
Or try to use sbt; IDEA has an sbt plugin, and its wiki has an explanation of how to use it.
I usually use sbt directly in the terminal instead of running things in the IDE.
1) First you need to find the "Select Run/Debug Configuration" button at the top of your screen.
Click on it and choose Edit Configurations.
Create a new one if you haven't got one yet.
Your program parameters should be written in the "Program parameters" field.
2) Compile your .scala files with scalac and you'll get .class files.
Then deploy them as you usually would with Java code, so you don't need to install Scala on the target machine; all you need is a JDK (plus the Scala library JAR on the classpath).
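As a sketch of how those command-line arguments reach your code (the object name is made up and the actual Casbah call is omitted):
// Hypothetical entry point showing how the two arguments
// ("execution" and "10" in the question) arrive in the program.
object QueryDb {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("usage: QueryDb <term> <limit>")
      sys.exit(1)
    }
    val term  = args(0)        // e.g. "execution"
    val limit = args(1).toInt  // e.g. 10

    // ... run the Casbah query here using `term` and `limit` ...
    println(s"Would search for '$term', returning at most $limit records")
  }
}

In IntelliJ those same two tokens go into the run configuration's parameters field; on the server you pass them after the class or JAR name when you launch the app.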