I installed Spark as an sbt project dependency. Now I want to change variables of the Spark environment without doing it in my code with .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I get an error: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@my-mbp.domain_not_set.invalid:50487, even after trying to change my hostname. So I would like to dig into the Spark library and try some things.
I tried pretty much everything in this SO post: Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile)
sbt could not resolve the host even when I ran this command in the shell just before starting sbt. I had to put it in the profile so that sbt ran in the right context.
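If editing the shell profile is not an option, one possible alternative is to let sbt pass the variable to a forked JVM. This is only a sketch and was not part of the original fix; it assumes the application or tests are started through sbt (sbt run or sbt test). sbt applies envVars only to forked JVMs, so fork has to be enabled as well.

```scala
// build.sbt fragment (assumption: the app is started with `sbt run` or `sbt test`)

// envVars is only honoured for forked JVMs, so forking must be enabled.
fork := true

// Make Spark resolve the local host name to "localhost".
envVars := Map("SPARK_LOCAL_HOSTNAME" -> "localhost")
```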
Related
I am relatively new to Spark and Scala.
I have a Scala application that runs in local mode both on my Windows box and on a CentOS cluster.
As long as Spark is on my classpath (i.e., in pom.xml), Spark runs in unit tests without the need for a SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of Spark running locally, my unit test application seems to ignore it when in local mode. I do not see any output on the Spark console suggesting it is using the Spark instance I started from the command line (via the spark-shell command). Am I mistaken? If not, how do I get my Scala application to use that instance?
EDITED to include useful info from comments as well
spark-shell is just an interactive shell. It stands alone and is not an "instance" that other processes should connect to. When you run your Spark application through spark-submit (or by just running your Spark code), it will start its own instance of Spark. If you need to set any properties, they can be passed in as system properties or through the spark-submit --conf parameters.
spark-submit requires that you first use the Maven assembly plugin to build a jar containing your application and its dependencies.
This should then be deployed to the SPARK_HOME directory
Then use the submit script which must also be deployed in SPARK_HOME
The spark-submit script looks like this:
./bin/spark-submit --class xxx.ml.PipelineStart \
  --master local[*] \
  ./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, then just use SparkConf.set. It takes a key and a value, and the configurable properties are all found here.
If you're curious about what a property is set to, then you can also use SparkConf.get to check that out.
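As a rough illustration of the points above, here is a minimal sketch using SparkConf. The object and method names (ConfExample, explicitConf, confFromSystemProperties) are made up for this example, and it assumes spark-core is on the classpath. It shows the explicit setters, the generic set/get pair, and the fact that a default SparkConf picks up spark.* JVM system properties, which is one way to avoid calling setMaster in code.

```scala
import org.apache.spark.SparkConf

object ConfExample {
  // Build a configuration explicitly. Passing `false` skips loading
  // spark.* JVM system properties, so only the values set here are present.
  def explicitConf(): SparkConf =
    new SparkConf(false)
      .setMaster("local[2]")            // explicit setter
      .setAppName("ConfExample")        // explicit setter
      .set("spark.ui.enabled", "false") // generic key/value setter

  // Build a configuration from spark.* JVM system properties, for example
  // when the JVM is started with -Dspark.master=local[*].
  def confFromSystemProperties(): SparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val conf = explicitConf()
    println(conf.get("spark.master"))
    println(conf.get("spark.app.name"))
    // getOption avoids an exception when a key has not been set.
    println(conf.getOption("spark.executor.memory"))
    println(conf.get("spark.executor.memory", "not set"))
  }
}
```

Note that spark.driver.memory is a special case: in local or client mode the driver JVM is already running by the time SparkConf is read, so the Spark documentation says to set it through --driver-memory or the defaults file rather than in SparkConf. For unit tests in local mode that effectively means the heap settings of the test JVM.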
I want to use a library (JAI) with Spark to parse some spatial raster files. Unfortunately, there are some strange issues: when executed in Spark, JAI only works when running via the build tool, i.e. sbt run.
When executed via spark-submit the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
Which looks like some native dependency is not called correctly.
Assuming something was wrong with the class path, I tried to run a plain Java/Scala function, but this one works just fine.
In fact, the exact same problem occurs when Nifi is calling the parse function.
Is Spark messing with the class paths? What is different between running the jar natively via java -jar and running it through Spark or NiFi? Both show the same problem even when concurrency is disabled and they run only on a single thread.
JAI vendorname == null is somewhat similar as it shows what can go wrong when running a jar with JAI. I could not identify this as the exact same problem though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process & packaging of native libraries I think it will not be possible to include snippets directly in this posting.
edit
I am pretty convinced this has to do with the assembly merge strategy, but so far I could not find one which works.
Below you can see that the Vectorize operation is missing on Spark's class path
edit 2
I think Spark's / NiFi's class loader will not load some of the required registry files for JAI. A plain Java app works fine with these assembly / fat-jar settings.
I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought to be the easy task of connecting to a linux machine that has Spark on it, and running a simple code.
When I create a simple Scala code, build a jar from it, place it in the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
Assume that the jar is already on the machine, is there a way to run it (calling spark-submit) directly from a different code through my IDE?
Taking it a bit further: if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two ways of running your code: submitting your job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark in-process, along with your application.
If you have already built your jars, you can reuse them: by specifying the Spark master's URL and adding the required jars, you can submit the job to a remote cluster.
val conf = new SparkConf()
  .setMaster("spark://cyborg:7077")
  .setAppName("SubmitJobToCluster Example")
  .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using the spark conf you can initialize SparkContext in your application and use it either in a local or cluster setup.
val sc = new SparkContext(conf)
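To make the local-mode case concrete, here is a minimal word count sketch that can be started from an IDE. It assumes spark-core is on the classpath; the object name LocalWordCount and the helper countWords are made up for this example and do not come from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalWordCount {
  // Counts words in the given lines using an in-process ("local") Spark.
  def countWords(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)
        .flatMap(_.split("\\s+"))   // split each line into words
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)         // sum the counts per word
        .collect()
        .toMap
    } finally {
      sc.stop() // always release the context, even on failure
    }
  }

  def main(args: Array[String]): Unit =
    countWords(Seq("to be or not to be")).foreach(println)
}
```

Switching the same code to a cluster would only require replacing the master URL and adding the jars, as in the cluster configuration shown earlier.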
spark-examples is an old project with sample programs that you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
NO
Assume that the jar is already on the machine, is there a way to run it (calling spark-submit) directly from a different code through my IDE?
Yes you can. The above example does it.
Taking it a bit further: if, let's say, I want to run different tasks, do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies; you can specify the class when submitting the job to Spark. When doing it programmatically you have complete control over it.
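Another possible way to trigger a spark-submit style launch from code is Spark's launcher API (org.apache.spark.launcher.SparkLauncher). The sketch below is not from the original answer. It assumes a Spark distribution is installed on the machine that runs this code, and every path and class name in it is a placeholder. Changing the main class is what selects which task in the single jar gets run.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object SubmitFromIde {
  // All paths and class names below are hypothetical placeholders.
  def buildLauncher(): SparkLauncher =
    new SparkLauncher()
      .setSparkHome("/path/to/spark")                  // placeholder Spark install
      .setAppResource("/path/to/myApp-with-deps.jar")  // placeholder application jar
      .setMainClass("com.example.TaskA")               // placeholder: selects the task
      .setMaster("spark://cyborg:7077")
      .setConf("spark.executor.memory", "2g")
      .addAppArgs("100")

  def main(args: Array[String]): Unit = {
    // Starts spark-submit as a child process and returns a handle to it.
    val handle: SparkAppHandle = buildLauncher().startApplication()
    // Poll until the application reaches a final state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Finished with state ${handle.getState}")
  }
}
```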
Thanks in advance for the time you take to read this, and sorry for my bad English.
I am trying to use Spark Streaming for real-time data processing. I have Spark installed in HDP (Hortonworks Data Platform), and for my process I need to install a Scala library for JSON parsing. I read a lot on the internet about this, but it only covered a simple Spark cluster, not platforms like HDP and CDH. I tried to adapt the solution but could not, and I cannot find any Scala files to install it. Does anybody know a solution or have a tip to help me?
Thank you
To load dependencies for Spark in Zeppelin you need to create a new cell and use the following:
%dep
// it's a good idea to do a reset first, but not required
z.reset()
// the following line will load directly from the Maven online repo
z.load("org.apache.spark:spark-streaming-kafka_2.10:1.6.1")
Additional details on loading dependencies for Zeppelin can be found here:
https://zeppelin.apache.org/docs/latest/interpreter/spark.html#3-dynamic-dependency-loading-via-dep-interpreter
One thing to note here is that dependency loading must be the first cell you run in your notebook; it will give you an error message if it is not. To get around this, click on the Interpreter tab and click restart on the Spark interpreter, then go back to your notebook and run the cell with the %dep
I'm attempting to build Apache Spark locally. Reason for this is to debug Spark methods like reduce. In particular I'm interested in how Spark implements and distributes Map Reduce under the covers as I'm experiencing performance issues and I think running these tasks from source is best method of finding out what the issue is.
So I have cloned the latest from Spark repo :
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project so when I create it in Eclipse here is the structure :
Some of the top level folders also have pom files :
So should I just be building one of these sub projects ? Are these correct steps for running Spark against a local code base ?
Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going in detail into your question, what you're actually asking is 'How to debug a Spark application in Eclipse'.
To have debugging in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a job with its Spark lib dependency and ask Maven 'download sources'. That way you can use the Eclipse debugger to step into the code.
Then, when creating the Spark context, use local[1] as the master, like:
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
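As a small, self-contained target for the debugger, something like the following sketch could be used. It assumes spark-core (with sources attached) is on the classpath; the names add and sumLocally are made up for this example. A breakpoint inside add, or a step-into on the reduce call, lets you follow how Spark executes the operation in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkDebugExample {
  // A named function gives the debugger a convenient breakpoint target.
  def add(a: Int, b: Int): Int = a + b

  // Sums the numbers with RDD.reduce on a single local thread.
  def sumLocally(numbers: Seq[Int]): Int = {
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkDebugExample")
    val sc = new SparkContext(conf)
    try sc.parallelize(numbers).reduce(add) // step into reduce from here
    finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(sumLocally(1 to 10)) // prints 55
}
```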
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.