Running a Spark Application from Eclipse

I am trying to develop a Spark application in Eclipse and then debug it by stepping through it.
I downloaded the Spark source code and added some of the Spark sub-projects (such as spark-core) to Eclipse. Now I am trying to develop a Spark application using Eclipse. I have already installed the Scala IDE in Eclipse. I created a simple application based on the example given on the Spark website.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
To my project, I added the spark-core project as a dependent project (right click -> build path -> add project). Now I am trying to build and run my application. However, Eclipse shows that my project has errors, but no errors are listed in the Problems view and no lines are highlighted in red, so I am not sure what the problem is. My assumption is that I need to add external jars to my project, but I am not sure which jars those would be. The error is caused by val conf = new SparkConf().setAppName("Simple Application") and the subsequent lines; when I remove those lines, the error goes away. I would appreciate any help and guidance, thanks!

It seems you are not using a package/dependency manager (e.g. sbt or Maven), which would eliminate most versioning issues. It can be challenging to get the correct versions of Java, Scala, Spark and all their transitive dependencies right on your own.
I strongly recommend converting your project into a Maven project:
Convert Existing Eclipse Project to Maven Project
Personally, I have had very good experiences with sbt in IntelliJ IDEA (https://confluence.jetbrains.com/display/IntelliJIDEA/Getting+Started+with+SBT), which is easy to set up and maintain.
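If you do go the sbt route, a minimal build.sbt for the SimpleApp above could look like the sketch below (the Scala and Spark versions are assumptions; pick the ones that match your cluster):
// build.sbt - minimal sketch; the versions below are assumptions, adjust them to your environment
name := "simple-app"
version := "1.0"
scalaVersion := "2.11.12"
// spark-core pulls in all of Spark's transitive dependencies, so there is no hunting for individual jars
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
The build tool then resolves the whole dependency tree, which is exactly what adding single jars by hand cannot do.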

I've just created a Maven archetype for Spark the other day.
It sets up a new Spark 1.3.0 project in Eclipse/Idea with Scala 2.10.4.
Just follow the instructions here.
You'll just have to change the Scala version after the project is generated:
Right click on the generated project and select:
Scala > Set the Scala Installation > Fixed 2.10.5 (bundled)
The default version that comes with ScalaIDE (currently 2.11.6) is automatically added to the project by ScalaIDE when it detects scala-maven-plugin in the pom.
I'd appreciate feedback if someone knows how to set the Scala library container version from Maven while it bootstraps a new project. Where does ScalaIDE look up the Scala version, if anywhere?
BTW, just make sure you download the sources (Project right-click > Maven > Download sources) before stepping into Spark code in the debugger.
If you want to use the (IMHO very best) Eclipse goodies (References, Type hierarchy, Call hierarchy), you'll have to build Spark yourself, so that all the sources are on your build path (Maven Scala dependencies are not processed by EclipseIDE/JDT, even though they are, of course, on the build path).
Have fun debugging, I can tell you that it helped me tremendously to get deeper into Spark and really understand how it works :)

You could try to add the spark-assembly.jar instead.
As others have noted, the better way is to use sbt (or Maven) to manage your dependencies. spark-core has many dependencies itself, and adding just that one jar won't be enough.

You haven't specified the master in your Spark code. Since you're running it on your local machine, replace the following line
val conf = new SparkConf().setAppName("Simple Application")
with
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
Here "local[2]" means 2 threads will be used.

Related

How to avoid jar conflicts in a databricks workspace with multiple clusters and developers working in parallel?

We are working in an environment where multiple developers upload jars to a Databricks cluster with the following configuration:
DBR: 7.3 LTS
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
Scala: 2.12.10
Python: 3.7.5
R: R version 3.6.3 (2020-02-29)
Delta Lake: 0.7.0
Build tool: Maven
Below is our typical workflow:
STEP 0:
Build version 1 of the jar (DemoSparkProject-1.0-SNAPSHOT.jar) with the following object:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    println("EntryObjectOne: This object is from 1.0 SNAPSHOT JAR")
    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
STEP 1:
Uninstall the old jar(s) from the cluster and keep pushing new changes in subsequent versions with small tweaks to the logic. Hence, we push jars with versions 2.0-SNAPSHOT, 3.0-SNAPSHOT, etc.
At some point, we push the same object with the following code in the jar, say DemoSparkProject-4.0-SNAPSHOT.jar:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    println("EntryObjectOne: This object is from 4.0 SNAPSHOT JAR")
    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
When we import this object in the notebook and run the main function, we still get the println statement from the old snapshot version (EntryObjectOne: This object is from 1.0 SNAPSHOT JAR). This forces us to delete dbfs:/FileStore/jars/*, restart the cluster, and push the latest snapshot again to make it work.
In essence, when I run sc.listJars() the active jar on the driver is the latest 4.0-SNAPSHOT jar. Yet I still see the logic from the old snapshot jars, even though they are no longer installed on the cluster at runtime.
Resolutions we tried/implemented:
We tried using the Maven Shade plugin, but unfortunately Scala does not support it (details here).
We delete the old jars from dbfs:/FileStore/jars/*, restart the cluster, and install the new jars regularly. This works, but a better approach would definitely help (details here).
Changing the classpath manually and building the jar with a different groupId in Maven also helps. But with lots of objects and developers working in parallel, it is difficult to keep track of these changes.
Is this the right way of working with multiple jar versions in Databricks? If there is a better way to handle this version-conflict issue in Databricks, it would help us a lot.
You can't do that with libraries packaged as jars - when you install a library, it is put on the classpath and will be removed only when you restart the cluster. The documentation says this explicitly:
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as Uninstall pending restart.
It's the same issue as with "normal" Java programs; Java simply doesn't support this functionality. See, for example, the answers to this question.
For Python and R it's easier because they support notebook-scoped libraries, where different notebooks can have different versions of the same library.
P.S. If you're doing unit/integration testing, my recommendation would be to execute the tests as Databricks jobs - it will be cheaper, and you won't have conflicts between different versions.
In addition to what's mentioned in the docs: when working with notebooks, you can see what has been added on the driver by running this in a notebook cell:
%sh
ls /local_disk0/tmp/ | grep addedFile
This worked for me on Azure Databricks and it will list all the added jars.
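If you prefer to stay in a Scala cell, roughly the same check can be done with plain JVM file APIs (a sketch; the path is taken from the %sh example above and may differ between Databricks runtimes):
// List driver-side files whose names contain "addedFile" (same path as the %sh example)
import java.io.File
val tmpFiles = Option(new File("/local_disk0/tmp").listFiles()).getOrElse(Array.empty[File])
tmpFiles.filter(_.getName.contains("addedFile")).foreach(f => println(f.getName))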
Maybe force a cleanup with init scripts?

How to reuse Ammonite REPL's sc files in a sbt project?

I have some reusable Ammonite REPL sc files that were used in some Jupyter Scala notebooks.
Now I am creating a standalone application built with sbt. I hope I can reuse these existing sc files in the sbt project.
Is it possible to share these sc files between Jupyter Scala/Ammonite REPL and sbt projects? How can I make Scala sources and sc files compile together?
I created Import.scala, a Scala compiler plugin that enables magic imports.
With the help of Import.scala, code snippets in a .sc file can be loaded into a Scala source file in an sbt project with the same syntax as Ammonite or Jupyter Scala:
Given a MyScript.sc file.
// MyScript.sc
val elite = 31337
Magic import it in another file.
import $file.MyScript
It works.
assert(MyScript.elite == 31337)
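For completeness, the compiler plugin has to be enabled in the sbt build. A sketch of what that looks like - the coordinates and version below are assumptions from memory, so check the Import.scala README for the exact, current values:
// build.sbt - sketch only; verify the coordinates/version against the Import.scala README
addCompilerPlugin("com.thoughtworks.import" %% "import" % "latest.release")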

How to debug a scala based Spark program on Intellij IDEA

I am currently setting up my development environment with IntelliJ IDEA. I followed exactly the steps described in http://spark.apache.org/docs/latest/quick-start.html
build.sbt file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
Sample Program File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object MySpark {
  def main(args: Array[String]) {
    val logFile = "/IdeaProjects/hello/testfile.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
If I use command line:
sbt package
and then
spark-submit --class "MySpark" --master local[4] target/scala-2.11/myspark_2.11-1.0.jar
I am able to generate the jar package and Spark runs well.
However, I want to use IntelliJ IDEA to debug the program in the IDE. How can I set up the configuration so that when I click "debug", it automatically builds the jar package and launches the task by executing the "spark-submit" command line?
I just want everything to be as simple as "one click" on the debug button in IntelliJ IDEA.
Thanks.
First define an environment variable as shown below:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7777
Then create a Remote debug configuration in IntelliJ IDEA as follows:
Run -> Edit Configurations -> click "+" in the top-left corner -> Remote -> set the port and name
After the above configuration, run the Spark application with spark-submit or sbt run, then start the debug configuration you just created and add breakpoints for debugging.
If you're using the scala plugin and have your project configured as an sbt project, it should basically work out of the box.
Go to Run->Edit Configurations... and add your run configuration normally.
Since you have a main class, you probably want to add a new Application configuration.
You can also just click on the blue square icon, to the left of your main code.
Once your run configuration is set up, you can use the Debug feature.
I've run into this when switching between 2.10 and 2.11. sbt expects the primary object to be in src->main->scala-2.10 or src->main->scala-2.11, depending on your version.
It is similar to the solution provided here: Debugging Spark Applications.
You create a Remote debug run configuration in Idea and pass Java debug parameters to the spark-submit command.
The only catch is that you need to start the Remote debug configuration in IDEA after triggering the spark-submit command. I read somewhere that a Thread.sleep just before your debug point gives you enough time to do this, and I was able to use that suggestion successfully too.
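A minimal sketch of that trick, with a hypothetical object name and an arbitrary 10-second pause:
object MySparkDebug {
  def main(args: Array[String]): Unit = {
    // Give yourself time to attach the IDEA Remote debug configuration after spark-submit starts
    Thread.sleep(10000)
    // ... the rest of your Spark code; breakpoints below this line will now be hit
  }
}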

eclipse (set up with Scala environment): object apache is not a member of package org

As shown in the image, it gives an error when I import the Spark packages. Please help. When I hover there, it shows "object apache is not a member of package org".
I searched for this error; it says the Spark jars have not been imported. So I imported "spark-assembly-1.4.1-hadoop2.2.0.jar" too, but I still get the same error. Below is what I actually want to run:
import org.apache.spark.{SparkConf, SparkContext}

object ABC {
  def main(args: Array[String]) {
    // Scala main method
    println("Spark Configuration")
    val conf = new SparkConf()
    conf.setAppName("My First Spark Scala Application")
    conf.setMaster("spark://ip-10-237-224-94:7077")
    println("Creating Spark Context")
  }
}
Adding the spark-core jar to your classpath should resolve the issue. Also, if you are using a build tool like Maven or Gradle (and if not, you should, because spark-core has a lot of dependencies and you will keep hitting this kind of problem for different jars), use the Eclipse task provided by the tool to set the classpath in your project properly.
I was also receiving the same error; in my case it was a compatibility issue. Spark 2.2.1 is not compatible with Scala 2.12 (it is compatible with 2.11.8), and my IDE was using Scala 2.12.3.
I resolved my error by:
1) Importing the jar files from the base folder of the Spark installation. The Spark installation directory (for example, the Spark folder on the C drive) contains a jars folder with all the basic jar files.
Go to Eclipse, right click on the project -> Properties -> Java Build Path. Under the Libraries tab, choose Add External JARs, import all the jar files from the jars folder, and click Apply.
2) Then go to Properties -> Scala Compiler -> Scala Installation -> Latest 2.11 bundle (dynamic)*
*Before selecting this option, check the compatibility of your Spark and Scala versions.
The problem is that Scala is NOT backward compatible, so each Spark module is compiled against a specific Scala library. When we run from Eclipse, there is one Scala version that was used to compile the Spark dependency jar we add to the build path, and a second Scala version in the Eclipse runtime environment. The two may conflict.
This is a hard reality: we may wish Scala were backward compatible, or at least that a compiled jar file were backward compatible, but it isn't.
Hence the recommendation: use Maven or a similar tool where dependency versions can be managed.
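As an illustration of what managed versions look like, a minimal sbt build for the Spark 2.2.1 / Scala 2.11.8 combination mentioned above might be (a sketch; the project name is arbitrary):
// build.sbt - sketch: pin a Scala version and a Spark version that are compatible with each other
name := "first-spark-app"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The %% operator appends the Scala binary version (here _2.11) to the artifact name, which is exactly the compatibility detail that goes wrong when jars are picked by hand.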
If you are doing this in the context of Scala within a Jupyter Notebook, you'll get this error. You have to install the Apache Toree kernel:
https://github.com/apache/incubator-toree
and create your notebooks with that kernel.
You also have to start the Jupyter Notebook with:
pyspark

How to compile a spark-cassandra program using Scala?

Lately I started learning Spark and Cassandra. I know that we can use Spark with Python, Scala and Java, and I've read the docs on this website: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md. The thing is, after I create a program named testfile.scala with the code the document shows (I don't know if using .scala is right), I don't know how to compile it. Can anyone guide me on what to do with it?
Here is testfile.scala:
// Imports needed by this snippet (SimpleStreamingActor, actorName and n are placeholders from the docs)
import akka.actor.Props
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ssc.actorStream[String](Props[SimpleStreamingActor], actorName, StorageLevel.MEMORY_AND_DISK)
val wc = stream.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
Scala projects are compiled by scalac, but that is quite low level: you have to set up build paths and manage all dependencies yourself, so most people fall back to a build tool such as sbt, which manages a lot of that for you. The other two commonly used build tools are Maven, which is favored by Java old-schoolers, and Gradle, which is more down to earth.
> how to import spark-cassandra-connector
I've set up an example project. Basically, you define all of your dependencies in build.sbt or its analog; here is how the dependency on spark-cassandra-connector is defined (line #12).
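For reference, such a dependency declaration typically looks like the sketch below (the versions are assumptions; align them with your Spark and Scala versions):
// build.sbt - sketch; the versions are assumptions, match them to your Spark/Scala versions
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-streaming" % "2.0.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0"
)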
> And, is it a rule that we have to code with a class or object?
Yes and no. If you build with sbt, all your code files have to be wrapped in an object (or class), but sbt also lets you code in its console, and code typed there does not need to be wrapped (the same rules as the ordinary Scala REPL). Also, both IDEA and Eclipse have worksheet capabilities, so you can create a test.sc and draft your code there.