How to build and run Scala Spark locally - eclipse

I'm attempting to build Apache Spark locally. The reason for this is to debug Spark methods like reduce. In particular, I'm interested in how Spark implements and distributes map reduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best way of finding out what the issue is.
So I have cloned the latest from Spark repo :
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project, so when I import it into Eclipse this is the structure I get.
Some of the top-level folders also have pom files.
So should I just be building one of these subprojects? Are these the correct steps for running Spark against a local code base?

Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going into more detail on your question, what you're actually asking is 'How do I debug a Spark application in Eclipse?'.
To debug in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a project with the Spark library as a dependency and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the Spark code.
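If you use sbt rather than Maven, the equivalent dependency declaration is roughly the following (a sketch; the Scala and Spark versions here are assumptions, use whatever you actually run against):
// build.sbt (sketch)
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"
// then let the IDE download/attach sources for spark-core so the debugger can step into it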
Then, when creating the SparkContext, set local[1] as the master, like:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
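For instance, a minimal driver like the one below (a sketch) lets you set a breakpoint on the reduce call and step from your own code straight into Spark's RDD implementation:
import org.apache.spark.{SparkConf, SparkContext}

object SparkDebugExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkDebugExample")
    val sc = new SparkContext(conf)
    // set a breakpoint on the next line and "step into" to land in RDD.reduce
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")
    sc.stop()
  }
}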
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.

Related

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow .
The code in question can be found here. However I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery, and run the pipeline using the DataflowRunner. I have already made all the necessary changes (or so I believe), but I am unable to make it run.
I have already run gcloud auth application-default login, and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the job is submitted in Dataflow. However, after one hour the job fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simply too small to take more than a couple of minutes.)
Checking Stackdriver I can find the following error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some Jackson issue:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.
I also tried to first create a template and run it later with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
Am I missing something obvious?
It's a known issue with sbt 1.3.0 which introduced some breaking change w.r.t. class loaders. Try 1.2.8?
Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt 1.3 runs the application started with run in a new, layered classloader. This allows classes already loaded in the JVM (Predef, for instance) to be reused across runs, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
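In build.sbt that looks roughly like this (a sketch; the Scala version is an assumption):
// build.sbt (sketch)
ThisBuild / scalaVersion := "2.12.10"
// force a flat classloader for `run`, so DataflowRunner stages scala-library and the rest
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
// alternatively, pin sbt.version=1.2.8 in project/build.properties as suggested above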
Hope that helps

Is there a config file when installing spark dependency with scala

I installed Spark with sbt as a project dependency. Now I want to change variables of the Spark environment without doing it within my code with a .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I get an error: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@my-mbp.domain_not_set.invalid:50487, even after trying to change my hostname. Thus, I would like to go deeper into the Spark library and try some things.
I tried pretty much everything that is on this so post : Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile)
SBT was not able to resolve the host even when I exported the variable just before running sbt. I had to put it in the profile to get the right context.
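For completeness, a similar effect can also be achieved in code (which the question wanted to avoid) via the standard spark.driver.host / spark.driver.bindAddress settings on the SparkConf. A sketch, with a made-up app name:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("LocalHostnameExample")
  .set("spark.driver.host", "localhost")        // avoids the "domain_not_set.invalid" Spark URL
  .set("spark.driver.bindAddress", "127.0.0.1")
val sc = new SparkContext(conf)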

How to re-use compiled sources in different machines

To speed up our development workflow we split the tests and run each part on multiple agents in parallel. However, compiling the test sources seems to take most of the time in the testing steps.
To avoid this, we pre-compile the tests using sbt test:compile and build a docker image with compiled targets.
Later, this image is used in each agent to run the tests. However, sbt seems to recompile the tests and application sources even though the compiled classes exist.
Is there a way to make sbt use existing compiled targets?
Update: To give more context
The question strictly relates to scala and sbt (hence the sbt tag).
Our CI process is broken down into multiple phases. It's roughly something like this.
stage 1: Use SBT to compile the Scala project into Java bytecode using sbt compile. We compile the test sources in the same stage using sbt test:compile. The targets are bundled in a docker image and pushed to the remote repository.
stage 2: We use multiple agents to split and run tests in parallel.
The tests run from the built docker image, so the environment is the same. However, running sbt test causes the project to recompile even though the compiled bytecode exists.
To make this clear, I basically want to compile on one machine and run the compiled test sources on another without recompiling.
Update
I don't think https://stackoverflow.com/a/37440714/8261 is the same problem because unlike it, I don't mount volumes or build on the host machine. Everything is compiled and run within docker but in two build stages. The file modified times and paths are retained the same because of this.
The debug output has something like this
Initial source changes:
removed:Set()
added: Set()
modified: Set()
Invalidated products: Set(/app/target/scala-2.12/classes/Class1.class, /app/target/scala-2.12/classes/graph/Class2.class, ...)
External API changes: API Changes: Set()
Modified binary dependencies: Set()
Initial directly invalidated classes: Set()
Sources indirectly invalidated by:
product: Set(/app/Class4.scala, /app/Class5.scala, ...)
binary dep: Set()
external source: Set()
All initially invalidated classes: Set()
All initially invalidated sources:Set(/app/Class4.scala, /app/Class5.scala, ...)
Recompiling all 304 sources: invalidated sources (266) exceeded 50.0% of all sources
Compiling 302 Scala sources and 2 Java sources to /app/target/scala-2.12/classes ...
It has no Initial source changes, but products are invalidated.
Update: Minimal project to reproduce
I created a minimal sbt project to reproduce the issue.
https://github.com/pulasthibandara/sbt-docker-recomplile
As you can see, nothing changes between the build stages, other than running in the second stage in a new step (new container).
While https://stackoverflow.com/a/37440714/8261 pointed at the right direction, the underlying issue and the solution for this was different.
Issue
SBT seems to recompile everything when it's run on different stages of a docker build. This is because docker compresses images created in each stage, which strips out the millisecond portion of the lastModifiedDate from sources.
SBT depends on lastModifiedDate when determining if sources have changed, and since it's different (the milliseconds part), the build triggers a full recompilation.
Solution
Java 8:
Setting -Dsbt.io.jdktimestamps=true when running SBT as recommended in https://github.com/sbt/sbt/issues/4168#issuecomment-417655678 to workaround this issue.
Newer:
Follow the recommendation in https://github.com/sbt/sbt/issues/4168#issuecomment-417658294
I solved the issue by setting the SBT_OPTS env variable in the Dockerfile, like
ENV SBT_OPTS="${SBT_OPTS} -Dsbt.io.jdktimestamps=true"
The test project has been updated with this workaround.
Using SBT:
I think there is already an answer to this here: https://stackoverflow.com/a/37440714/8261
It looks tricky to get exactly right. Good luck!
Avoiding SBT:
If the above approach is too difficult (i.e. getting sbt test to accept that your test classes do not need recompiling), you could avoid sbt and instead run your test suite using java directly.
If you can get sbt to log the java command that it is using to run your test suite (e.g. using debug logging), then you could run that command on your test runner agents directly, which would completely preclude sbt re-compiling things.
(You might need to write the java command into a script file, if the classpath is too long to pass as a command-line argument in your shell. I have previously had to do that for a large project.)
This would be a much hackier approach than the one above, but might be quicker to get working.
A possible solution might be defining your own sbt task without dependencies, or trying to change the test task. For example, you could create a task to run a JUnit runner if that were your testing framework. To define a task see this on Implementing Tasks.
You could even go as far as compiling, shipping the code, and running it on the remote machines from the same task, since a task can run any Scala code you want. From the sbt reference manual:
You could be defining your own task, or you could be planning to redefine an existing task. Either way looks the same; use := to associate some code with the task key
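As a rough sketch of that idea (names are made up and ScalaTest's runner is assumed as the test framework), a custom task could print the exact java command that runs the already-compiled test classes, so the agents can execute it without invoking the compiler at all:
// build.sbt (sketch)
lazy val printTestCommand = taskKey[Unit]("Print a java command that runs the compiled tests directly")
printTestCommand := {
  val cp = (Test / fullClasspath).value
    .map(_.data.getAbsolutePath)
    .mkString(java.io.File.pathSeparator)
  val testClasses = (Test / classDirectory).value
  // org.scalatest.tools.Runner is ScalaTest's runner; swap in your framework's runner if different
  println(s"java -cp $cp org.scalatest.tools.Runner -R $testClasses -o")
}
Running sbt printTestCommand on the build machine and executing the printed command on the agents would also cover the 'Avoiding SBT' suggestion above, since sbt never runs on the agents.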

Connect from a windows machine to Spark

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought to be the easy task of connecting to a Linux machine that has Spark on it, and running some simple code.
When I create a simple Scala code, build a jar from it, place it in the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
Assume that the jar is already on the machine, is there a way to run it (calling spark-submit) directly from a different code through my IDE?
Taking it a bit further, if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two modes of running your code: either submitting your job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark in the same JVM as your application.
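A complete runnable local-mode example might look like this (a sketch; the input data is made up):
import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
    val sc = new SparkContext(conf)
    // count words in a tiny in-memory data set
    val counts = sc.parallelize(Seq("spark is fast", "spark is simple"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}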
If you have already built your jar, you can reuse it: by specifying the Spark master's URL and adding the required jars, you can submit the job to a remote cluster.
val conf = new SparkConf()
.setMaster("spark://cyborg:7077")
.setAppName("SubmitJobToCluster Example")
.setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using the SparkConf you can initialize a SparkContext in your application and use it in either a local or a cluster setup.
val sc = new SparkContext(conf)
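From there, a quick sanity job (a sketch continuing the snippet above, where sc was created with the spark:// master) confirms that the work really runs on the cluster:
val sum = sc.parallelize(1 to 1000).reduce(_ + _)   // executed by the remote workers
println(s"sum = $sum")
sc.stop()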
spark-examples is an old project with sample programs which you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
No.
Assume that the jar is already on the machine, is there a way to run it (calling spark-submit) directly from a different code through my IDE?
Yes, you can. The above example does exactly that.
Taking it a bit further, if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies; you can specify the class when submitting the job to Spark. When doing it programmatically you have complete control over it.
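For example (a sketch; the task names and input file are made up), a single jar can expose several tasks behind one main class and pick the task from the first argument:
import org.apache.spark.{SparkConf, SparkContext}

object JobRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://cyborg:7077")
      .setAppName(args.headOption.getOrElse("job"))
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
    val sc = new SparkContext(conf)
    args.headOption match {
      case Some("wordCount") => println(sc.textFile("input.txt").flatMap(_.split("\\s+")).count())
      case Some("lineCount") => println(sc.textFile("input.txt").count())
      case other             => println(s"unknown task: $other")
    }
    sc.stop()
  }
}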

How to install scala libraries in HDP (Hortonworks Data Platform)

Thanks in advance for the time you spend reading this, and sorry for my bad English.
I am trying to use Spark Streaming for real-time data processing. I have Spark installed in HDP (Hortonworks Data Platform), and for my process I need to install a Scala library for JSON parsing. I read a lot about this on the internet, but everything was for a plain Spark cluster, not for a distribution like HDP or CDH. I tried to adapt those solutions but couldn't; I can't find any Scala files to install the library into. Does anybody know a solution or have any tips to help me?
Thank you
To load dependencies for Spark in Zeppelin you need to create a new cell and use the following:
%dep
// it's a good idea to do a reset first, but not required
z.reset()
// the following line will load directly from the Maven online repo
z.load("org.apache.spark:spark-streaming-karka_2.10:1.6.1")
Additional details on loading dependencies for Zeppelin can be found here:
https://zeppelin.apache.org/docs/latest/interpreter/spark.html#3-dynamic-dependency-loading-via-dep-interpreter
One thing to note here is that dependency loading must be in the first cell you run in your notebook; it will give you an error message if it's not. To get around this, click on the Interpreter tab and click restart on the Spark interpreter, then go back to your notebook and run the cell with the %dep directive first.
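Since your goal is JSON parsing, the same mechanism works for loading a JSON library, for example json4s (a sketch; the exact version is an assumption, check Maven Central for one matching your Spark and Scala versions):
%dep
z.reset()
// load a JSON parsing library directly from the Maven online repo
z.load("org.json4s:json4s-native_2.10:3.2.11")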