I am currently setting up my development environment in IntelliJ IDEA. I followed the Spark quick start guide (http://spark.apache.org/docs/latest/quick-start.html) exactly.
build.sbt file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
Sample Program File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object MySpark {
  def main(args: Array[String]) {
    val logFile = "/IdeaProjects/hello/testfile.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
If I use the command line:
sbt package
and then
spark-submit --class "MySpark" --master local[4] target/scala-2.11/myspark_2.11-1.0.jar
I am able to generate the jar package and Spark runs well.
However, I want to use IntelliJ IDEA to debug the program inside the IDE. How can I set up the configuration so that when I click "Debug", it automatically builds the jar package and launches the task by executing the "spark-submit" command line?
I just want everything to be as simple as one click on the Debug button in IntelliJ IDEA.
Thanks.
First, define an environment variable as below:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7777
Then create the debug configuration in IntelliJ IDEA as follows:
Run -> Edit Configurations -> click "+" in the top left corner -> Remote -> set the port and name
After the above configuration, run the Spark application with spark-submit or sbt run, and then start the Remote debug configuration you just created. Add breakpoints where you want to debug.
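Using the jar and class from the question, the whole sequence would look roughly like this (suspend=y makes the driver JVM wait until the debugger attaches):
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7777
spark-submit --class "MySpark" --master local[4] target/scala-2.11/myspark_2.11-1.0.jar
# spark-submit now waits on port 7777; start the Remote configuration in IDEA to attach and hit your breakpoints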
If you're using the scala plugin and have your project configured as an sbt project, it should basically work out of the box.
Go to Run->Edit Configurations... and add your run configuration normally.
Since you have a main class, you probably want to add a new Application configuration.
You can also just click on the blue square icon, to the left of your main code.
Once your run configuration is set up, you can use the Debug feature.
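One thing to keep in mind: when you run or debug the main class directly from the IDE, nothing passes --master for you the way spark-submit does, so you typically need to set it in code (or via -Dspark.master=local[*] in the run configuration's VM options), for example:
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]") // only needed for runs launched from the IDE; spark-submit supplies the master otherwise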
I've run into this when I switch between 2.10 and 2.11. sbt expects the primary object to be in src/main/scala-2.10 or src/main/scala-2.11, depending on your Scala version.
It is similar to the solution provided here: Debugging Spark Applications.
You create a Remote debug run configuration in IDEA and pass Java debug parameters to the spark-submit command.
The only catch is that you need to start the Remote debug configuration in IDEA after triggering the spark-submit command. I read somewhere that a Thread.sleep just before your debug point gives you enough time to do this, and I was able to use that suggestion successfully too.
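A minimal sketch of that trick, using the example program from the question (the sleep length is arbitrary; it just has to give you enough time to start the Remote configuration):
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
Thread.sleep(10000) // pause so the Remote debug configuration in IDEA can attach
val logData = sc.textFile(logFile, 2).cache() // put a breakpoint on or after this line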
Related
The default way of getting the Spark shell seems to be to download the distribution from the website. Yet this Spark issue mentions that it can be installed via sbt. I could not find documentation on this. In an sbt project that uses spark-sql and spark-core, no spark-shell binary was found.
How do you run spark-shell from sbt?
From the following URL:
https://bzhangusc.wordpress.com/2015/11/20/use-sbt-console-as-spark-shell/
If you are already using sbt for your project, it's very simple to set up the sbt console to replace the spark-shell command.
Let's start with the basic case. When you set up the project with sbt, you can simply start the console with sbt console.
Within the console, you just need to initialize a SparkContext and SQLContext to make it behave like the Spark shell:
scala> val sc = new org.apache.spark.SparkContext("local[*]", "sbt-console")
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
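If you do this often, one option (a sketch, assuming a plain sbt build; adjust the master and app name to taste) is to put the setup into initialCommands in build.sbt, so the console starts with sc and sqlContext already defined:
// run automatically whenever the sbt console starts
initialCommands in console := """
  val sc = new org.apache.spark.SparkContext("local[*]", "sbt-console")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
"""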
The below tess4j JARs are part of my Scala SBT project in IntelliJ IDEA and are also added as module dependencies:
However, I get a "java.lang.RuntimeException: Need to install JAI Image I/O package. https://java.net/projects/jai-imageio/" exception when trying to run the following code in a Scala worksheet:
import java.io.File
import net.sourceforge.tess4j._
val imageFile = new File("LinkToJPGFile")
val instance = new Tesseract()
instance.setDatapath("MyTessdataFolder")
val result = instance.doOCR(imageFile)
print(result)
even though jai-imageio-core-1.3.1.jar is properly included in the project.
Instead of trying to add the JARs individually, add the below line to your build.sbt:
// https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j
libraryDependencies += "net.sourceforge.tess4j" % "tess4j" % "3.3.1"
Or whichever version you are using, as found at https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j
I can't get the embedded sbt plugin (with auto import enabled) in IntelliJ (13.1) to recognize custom sbt configurations. I have the following setup in my sbt build file:
lazy val EndToEndTest = config("e2e") extend (Test)

private lazy val e2eSettings =
  inConfig(EndToEndTest)(Defaults.testSettings)

lazy val root: Project = Project(
  id = "root",
  base = file(".")
)
  .configs(EndToEndTest)
  .settings(e2eSettings)
The code works as expected in the sbt console. E.g. I can run:
sbt e2e:test (and it will execute the tests located in src/e2e/scala)
The issue is that the directory src/e2e/scala won't get registered as a source directory in IntelliJ. This makes it hard to use IntelliJ to manage the tests. I can manually mark the directory as a source root, but it gets reverted every time I:
- update my sbt files (auto import)
- do a manual update through the sbt tool window
Related:
Using the preconfigured IntegrationTest configuration works as expected, but custom ones don't.
According to the sbt-idea documentation, this can be done in your case by adding
ideaExtraTestConfigurations := Seq(EndToEndTest)
to your project settings.
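With the build definition from the question, that would look roughly like this (assuming the sbt-idea plugin is available in the build):
lazy val root: Project = Project(
  id = "root",
  base = file(".")
)
  .configs(EndToEndTest)
  .settings(e2eSettings)
  .settings(ideaExtraTestConfigurations := Seq(EndToEndTest)) // tells sbt-idea to expose the e2e sources to IDEA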
I am trying to develop a spark application on Eclipse, and then debug it by stepping through it.
I downloaded the Spark source code and I have added some of the Spark sub-projects (such as spark-core) to Eclipse. Now I am trying to develop a Spark application using Eclipse. I have already installed ScalaIDE in Eclipse. I created a simple application based on the example given on the Spark website.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
To my project, I added the spark-core project as a dependent project (right click -> build path -> add project). Now I am trying to build my application and run it. However, my project shows that it has errors, but I don't see any errors listed in the Problems view within Eclipse, nor do I see any lines highlighted in red, so I am not sure what the problem is. My assumption is that I need to add external jars to my project, but I am not sure which jars those would be.
The error is caused by val conf = new SparkConf().setAppName("Simple Application") and the subsequent lines. When I removed those lines, the error went away. I would appreciate any help and guidance, thanks!
It seems you are not using any package/library manager (e.g. sbt or Maven), which would take care of the versioning issues for you. It can be challenging to line up compatible versions of Java, Scala, Spark and all of their transitive dependencies on your own.
I strongly recommend converting your project to Maven:
Convert Existing Eclipse Project to Maven Project
Personally, I have had very good experiences with sbt on IntelliJ IDEA (https://confluence.jetbrains.com/display/IntelliJIDEA/Getting+Started+with+SBT), which is easy to set up and maintain.
I've just created a Maven archetype for Spark the other day.
It sets up a new Spark 1.3.0 project in Eclipse/Idea with Scala 2.10.4.
Just follow the instructions here.
You'll just have to change the Scala version after the project is generated:
Right click on the generated project and select:
Scala > Set the Scala Installation > Fixed 2.10.5 (bundled)
The default version that comes with ScalaIDE (currently 2.11.6) is automatically added to the project by ScalaIDE when it detects scala-maven-plugin in the pom.
I'd appreciate feedback if someone knows how to set the Scala library container version from Maven while it bootstraps a new project. Where does ScalaIDE look up the Scala version, if anywhere?
BTW, just make sure you download the sources (Project right-click > Maven > Download sources) before stepping into Spark code in the debugger.
If you want to use (IMHO the very best) Eclipse goodies (References, Type hierarchy, Call hierarchy) you'll have to build Spark yourself, so that all the sources are on your build path (as Maven Scala dependencies are not processed by EclipseIDE/JDT, even though they are, of course, on the build path).
Have fun debugging, I can tell you that it helped me tremendously to get deeper into Spark and really understand how it works :)
You could try to add the spark-assembly.jar instead.
As others have noted, the better way is to use sbt (or Maven) to manage your dependencies; spark-core has many dependencies itself, and adding just that one jar won't be enough.
You haven't specified the master in your Spark code, and you're running it on your local machine. Replace the following line
val conf = new SparkConf().setAppName("Simple Application")
with
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
Here "local[2]" means 2 threads will be used.
Lately I started learning Spark and Cassandra. I know that we can use Spark with Python, Scala and Java, and I've read the docs at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md. The thing is, after I create a program named testfile.scala with the code the document shows (I don't know if I'm right to use .scala), I don't know how to compile it. Can anyone guide me on what to do with it?
Here is testfile.scala:
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ssc.actorStream[String](Props[SimpleStreamingActor], actorName, StorageLevel.MEMORY_AND_DISK)
val wc = stream.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
Scala projects are compiled by scalac, but it's quite low level: you have to set up build paths and manage all dependencies yourself, so most people fall back to a build tool such as sbt, which manages a lot of that for you. The other two commonly used build tools are Maven, which is favored by Java old-schoolers, and Gradle, which is more down to earth.
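With sbt in place, the day-to-day workflow boils down to a few commands run from the project root, for example:
sbt compile   # compile the sources
sbt run       # run the main class
sbt package   # build a jar that can be handed to spark-submit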
> how to import spark-cassandra-connector
I've set up an example project. Basically, you define all of your dependencies in build.sbt or its analog; here is how the dependency on spark-cassandra-connector is defined (line #12).
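As a rough sketch of what that looks like (the versions here are only examples and have to match your Spark and Cassandra setup), the relevant build.sbt lines are something like:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-streaming" % "1.6.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0" // brings in cassandraTable / saveToCassandra
)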
> And, is it a rule that we have to code with class or object
Yes and no. If you build your code with sbt, all of your code files have to be wrapped in an object (or class), but sbt also lets you type code in its console, and code entered there does not need to be wrapped (the same rules as in the ordinary Scala REPL). Also, both IDEA and Eclipse have worksheet capabilities, so you can create a test.sc and draft your code there.
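For instance, the batch part of the code from the question, wrapped up so that sbt can compile and run it, could look roughly like this (a sketch; the streaming part is left out and the host and table names are the ones from the question):
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object TestApp {
  def main(args: Array[String]): Unit = {
    // point the connector at the local Cassandra node, as in the question
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)

    val rdd = sc.cassandraTable("test", "kv") // read the test.kv table
    println(rdd.count)
    println(rdd.first)
    println(rdd.map(_.getInt("value")).sum)

    sc.stop()
  }
}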