Connect from a Windows machine to Spark - Scala

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought would be the easy task of connecting to a Linux machine that has Spark on it, and running some simple code.
When I create a simple Scala code, build a jar from it, place it in the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!

There are two ways of running your code: either submitting your job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark embedded along with your application, in the same JVM.
If you have already built your jars, you can reuse them by specifying the Spark master's URL and adding the required jars to the configuration; this submits the job to a remote cluster.
val conf = new SparkConf()
.setMaster("spark://cyborg:7077")
.setAppName("SubmitJobToCluster Example")
.setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using this SparkConf you can initialize a SparkContext in your application and use it in either a local or a cluster setup.
val sc = new SparkContext(conf)
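To make that concrete, here is a minimal self-contained sketch of such an application; the object name and the trivial job are illustrative, while the master URL and jar path are simply reused from the snippet above:
import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobToCluster {
  def main(args: Array[String]): Unit = {
    // Master URL and jar path reused from the example above; adjust to your setup.
    val conf = new SparkConf()
      .setMaster("spark://cyborg:7077")
      .setAppName("SubmitJobToCluster Example")
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
    val sc = new SparkContext(conf)
    try {
      // A trivial job just to confirm the connection to the cluster works.
      val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
      println(s"Counted $evens even numbers on the cluster")
    } finally {
      sc.stop()
    }
  }
}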
There is an old project, spark-examples, which has sample programs you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
NO
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Yes, you can. The above example does exactly that.
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies, and you can specify the class when submitting the job to Spark. When doing it programmatically you have complete control over it.
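If what you really want is to trigger spark-submit itself from other code (for example from your IDE), Spark also ships a programmatic launcher, org.apache.spark.launcher.SparkLauncher. The sketch below is only illustrative; the Spark home, jar path, class name, and master URL are assumptions you would replace with your own values:
import org.apache.spark.launcher.SparkLauncher

object LaunchFromIde {
  def main(args: Array[String]): Unit = {
    // Programmatic equivalent of running spark-submit from a shell.
    // All paths and names below are placeholders.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")
      .setAppResource("/path/to/spark-example-1.0-SNAPSHOT-driver.jar")
      .setMainClass("com.example.WordCount") // pick the task class inside your single jar
      .setMaster("spark://cyborg:7077")
      .launch()
    process.waitFor() // block until the submitted application finishes
  }
}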

Related

Scala/Java library not installing on execution of Databricks Notebook

At work I have a Scala Databricks notebook that uses many library imports, both from Maven and from some JAR files. The issue is that when I schedule jobs on this notebook, it sometimes fails (seemingly at random, roughly 1 run in 10) because it executes the cells before all the libraries are installed. The job then fails and I have to go launch it manually. This behaviour is a real problem, as I can't use the notebook in production if it fails intermittently.
I tried putting a Thread.sleep() of a minute or so before all my imports, but it does not change anything. For Python there is dbutils.library.installPyPI("library-name"), but there is no equivalent for Scala in the dbutils documentation.
Has anyone had the same issue, and if so, how did you solve it?
Thank you!
Simply put: for scheduled production jobs, use a New Job Cluster and avoid an All-Purpose Cluster.
New Job Clusters are dedicated clusters created and started when you run a task and terminate immediately after the task completes. In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
In the UI, when setting up your notebook job, select a New Job Cluster and afterwards add all the dependent libraries to the job.
The pricing is different for New Job Clusters; I would say it ends up cheaper.
Note: Use Databricks pools to reduce cluster start and auto-scaling times (if it's an issue to begin with).

On Mac, how to start a spark-shell in the same environment as the one running in an IntelliJ project?

I am working on a Spark project using Scala and Maven, and sometimes I feel it would be very helpful if I could run the project in an interactive mode.
My question is whether it is possible (and how) to bring up a Spark environment in the terminal that is the same as the environment running in an IntelliJ project.
Or even better (if it is possible): start a REPL environment, under the IntelliJ debugger, when the code has stopped at a breakpoint, so we can continue to play with all the variables and instances created so far.
Yes, it is possible, though not very straightforward. I first build a fat jar using the sbt assembly plugin (https://github.com/sbt/sbt-assembly) and then use a debug configuration like the one below to start it in the debugger. Note that org.apache.spark.deploy.SparkSubmit is used as the main class, not your application's main class. Your app's main class is specified in the --class parameter instead.
It is a bit tedious to have to create the app jar file before starting each debug session (if sources were changed). I couldn't get SparkSubmit to work directly with the class files compiled by IntelliJ. I'd be happy to hear about alternative ways of doing this.
*Main class:*
org.apache.spark.deploy.SparkSubmit
*VM Options:*
-cp <SPARK_DIR>/conf/:<SPARK_DIR>/jars/* -Xmx6g -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/tmp
*Program arguments:*
--master
local[*]
--class
com.example.YourSparkApp
<PROJECT_DIR>/target/scala-2.11/YourSparkAppFat.jar
<APP_ARGS>
If you don't care much about the initialization phase, or can insert a pause in the code where the app waits for a keystroke or some other signal before continuing, then you can start your app as usual and simply attach IntelliJ to the app process (Run > Attach to Local Process...).
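As a rough sketch of that pause trick (the object name and the trivial job are just assumptions to illustrate the idea):
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.StdIn

object AttachableApp {
  def main(args: Array[String]): Unit = {
    // Give yourself time to do Run > Attach to Local Process... and set breakpoints.
    println("Attach the debugger, then press ENTER to continue...")
    StdIn.readLine()

    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("AttachableApp"))
    println(sc.parallelize(1 to 100).reduce(_ + _)) // breakpoints in this job are now hit in a single thread
    sc.stop()
  }
}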

Spark without SPARK_HOME

I am relatively new to Spark and Scala.
I have a Scala application that runs in local mode both on my Windows box and on a CentOS cluster.
As long as Spark is on my classpath (i.e., in pom.xml), Spark runs in unit tests without the need for a SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of Spark running locally, my unit test application seems to ignore it when in local mode. I do not see any output on the Spark console suggesting it is using the Spark instance I started from the command line (via the spark-shell command). Am I mistaken? If not, how do I get my Scala application to use that instance?
EDITED to include useful info from comments as well
spark-shell is just an interactive shell; it stands alone and is not an "instance" that other processes should connect to. When you run your Spark application through spark-submit (or just run your Spark code) it will start its own instance of Spark. If you need to set any properties, they can be passed in as system properties or through the spark-submit --conf parameters.
spark-submit requires that you first use the Maven assembly plugin to package your application jar together with its dependencies.
This jar should then be deployed to the SPARK_HOME directory.
Then use the spark-submit script, which is also located in SPARK_HOME.
The spark-submit script looks like this:
./bin/spark-submit --class xxx.ml.PipelineStart
--master local[*]
./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, then just use SparkConf.set. It takes a key and a value, and the configurable properties are all found here.
If you're curious about what a property is set to, then you can also use SparkConf.get to check that out.
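As a small illustration (not from the original answer), this is how setting and reading a property such as spark.driver.memory through SparkConf might look; note the caveat about driver memory in the comments:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("UnitTestExample")
  .set("spark.driver.memory", "4g") // caveat: driver memory only takes effect if it is set
                                    // before the driver JVM starts, so for local/unit-test
                                    // runs you may need -Xmx or spark-submit --conf instead

println(conf.get("spark.driver.memory")) // prints 4g, the value stored in the conf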

How to use Eclipse to debug Hadoop WordCount?

I want to use Eclipse to debug the WordCount example, because I want to see how the job runs in the JobTracker. But Hadoop uses proxies, and I don't understand the concrete process of how the job runs in the JobTracker. How should I debug it?
You are better off debugging "locally" against a single-node cluster (e.g. one of the sandboxes supplied by Cloudera or Hortonworks): that way you can truly step through the code as there is only one mapper/reducer in play. That's been my approach at least: usually the problems I had to debug were to do with the contents of specific files; I just copied over the relevant file to my test system and debugged there.

How to build and run Scala Spark locally

I'm attempting to build Apache Spark locally. The reason for this is to debug Spark methods like reduce. In particular I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best method of finding out what the issue is.
So I have cloned the latest from the Spark repo:
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project, so when I create it in Eclipse, here is the structure:
Some of the top-level folders also have pom files:
So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?
Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going into your question in more detail, what you're actually asking is "How to debug a Spark application in Eclipse".
To have debugging in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create your job with its Spark library dependency and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the Spark code.
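(As an aside, if your project happened to use sbt rather than Maven, the equivalent dependency declaration is a one-liner; the versions below are only illustrative assumptions:)
// build.sbt (versions are illustrative; match the Spark and Scala versions you actually use)
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"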
Then, when creating the SparkContext, set the master to local[1] on your SparkConf, like:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
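For example, a minimal driver like the sketch below (names are illustrative) lets you put a breakpoint on the reduce call and step from your own code into Spark's implementation:
import org.apache.spark.{SparkConf, SparkContext}

object SparkDebugExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]") // single thread, so stepping stays in one place
      .setAppName("SparkDebugExample")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000, 4)
    // Set a breakpoint here and "step into" to land in Spark's RDD.reduce implementation.
    val total = data.reduce(_ + _)
    println(s"total = $total")

    sc.stop()
  }
}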
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.