Using spark-submit without --class argument - scala

I'm new to Spark and Scala, but hopefully this isn't a redundant/stupid question - I haven't been able to find the answer yet.
I have compiled a fat jar with the sbt-assembly tool, and the MANIFEST file includes the line MainClass: com.package.MyMainClass. However, spark-submit still demands that I use the --class argument to define the main class. From this Spark configuration page, I gather that spark-submit gets its configuration details from the conf/spark-defaults.conf file. My other properties (spark.master, spark.app.name) seem to load just fine without command line arguments, but I haven't been able to find a way to specify the project's main class in this file. I've randomly tried things like spark.class main.class and just class, but obviously stabbing in the dark isn't going that well.
Any ideas? I want to avoid having really ugly scripts to deploy applications to clusters when spark-submit MyJar.jar is so clean. Thanks.

Looking at the source code of org.apache.spark.deploy.SparkSubmitArguments.scala here, it looks like it should pick up your Main-Class manifest attribute:
mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
I haven't tested this, but try replacing 'MainClass' with 'Main-Class'.

Related

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow .
The code in question can be found here. However I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirecRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or that is what I believe). But I am unable to make it run.
I already gcloud auth application-default login and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the Jb is submitted in Dataflow. However, after one hour the jobs fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simple too small to take more than a couple of minutes).
Checking the StackDriver I can find the follow error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some jackson thing:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor just at the start. I really do not understand why I can not find the Scala standard library.
I also tried to first create a template and runt it latter with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
I am missing something obvious?
It's a known issue with sbt 1.3.0 which introduced some breaking change w.r.t. class loaders. Try 1.2.8?
Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

Spark without SPARK_HOME

I am relatively new to Spark and Scala.
I have a scala application that runs in local mode both on my windows box and a Centos cluster.
As long as spark is in my classpath (i.e., pom.xml), spark runs as unit tests without the need for a SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of spark running locally, my unit test application seems to ignore it when in local mode. I do not see any output on the spark console suggesting it is using the spark instance I started from the command line (via spark-shell command). Am I mistaken? If not, how do I get my scala application to use that instance?
EDITED to include useful info from comments as well
spark_shell is just an interactive shell, it stands alone and is not an "instance" that other processes should connect to. When you run your spark application through spark-submit (or just running your spark code) it will start its own instance of spark. If you need to set any properties they can be bassed in a system properties or through the spark-submit --conf parameters
spark-submit requires that first you use maven assembly plugin to compile your application jar and dependencies.
This should then be deployed to the SPARK_HOME directory
Then use the submit script which must also be deployed in SPARK_HOME
The spark-submit script looks like this:
./bin/spark-submit --class xxx.ml.PipelineStart
--master local[*]
./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, then just use SparkConf.set. It takes a key and a value, and the configurable properties are all found here.
If you're curious about what a property is set to, then you can also use SparkConf.get to check that out.

JAI can't execute in native spark - only in sbt and as a separate scala function

I want to use a library (JAI) with spark to parse some spatial raster files. Unfortunately, there are some strange issues. JAI only works when running via the build tool i.e. sbt run when executed in spark.
When executed via spark-submit the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
Which looks like some native dependency is not called correctly.
Assuming something is wrong with the class path I tried to run a plain java/scala function. but this one works just fine.
In fact, the exact same problem occurs when Nifi is calling the parse function.
Is spark messing with the class paths? What is different from running the jar natively via java-jar or through spark or NiFi? Both show the same problem even when concurrency is disabled and they run only on a single thread.
JAI vendorname == null is somewhat similar as it shows what can go wrong when running a jar with JAI. I could not identify this as the exact same problem though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process & packaging of native libraries I think it will not be possible to include snippets directly in this posting.
edit
I am pretty convinced this has to do the the assembly merge strategy, so far I could not find one which works.
Below you can see that the Vectorize operation is missing on sparks class path
edit 2
I think spark / NiFis class loader will not load some of the required registry files for JAI. A plain java app works fine with these assembly/ fat-jar settings.

Drop into a Scala interpreter in Spark script?

I'm using Scala 2.11.8 and Spark 2.1.0. I'm totally new to Scala.
Is there a simple way to add a single line breakpoint, similar to Python:
import pdb; pdb.set_trace()
where I'll be dropped into a Scala shell and I can inspect what's going on at that line of execution in the script? (I'd settle for just the end of the script, too...)
I'm currently starting my scripts like so:
$SPARK_HOME/bin/spark-submit --class "MyClassName" --master local target/scala-2.11/my-class-name_2.11-1.0.jar
Is there a way to do this? Would help debugging immensely.
EDIT: The solutions in this other SO post were not very helpful / required lots of boilerplate + didn't work.
I would recommend one of the following two options:
Remote debugging & IntelliJ Idea's "evaluate expression"
The basic idea here is that you debug your app like you would if it was just an ordinary piece of code debugged from within your IDE. The Run->Evaluate expression function allows you to prototype code and you can use most of the debugger's usual variable displays, step (over) etc functionality. However, since you're not running the application from within your IDE, you need to:
Setup the IDE for remote debugging, and
Supply the application with the correct Java options for remote debugging.
For 1, go to Run->Edit configurations, hit the + button in the top right hand corner, select remote, and copy the content of the text field under Command line arguments for running remote JVM (official help).
For 2, you can use the SPARK_SUBMIT_OPTS environment variable to pass those JVM options, e.g.:
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
$SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
./path/to/foo-assembly-1.0.0.jar
Now you can hit the debug button, and set breakpoints etc.
Apache Zeppelin
If you're writing more script-style Scala, you may find it helpful to write it in a Zeppelin Spark Scala interpreter. While it's more like Jupyter/IPython notebooks/the ipython shell than (i)pdb, this does allow you to inspect what's going on at runtime. This will also allow you to graph your data etc. I'd start with these docs.
Caveat
I think the above will only allow debugging code running on the Driver node, not on the Worker nodes (which run your actual map, reduce etc functions). If you for example set a breakpoint inside an anonymous function inside myDataFrame.map{ ... }, it probably won't be hit, since that's executed on some worker node. However, with e.g. myDataFrame.head and the evaluate expression functionality I've been able to fulfil most of my debugging needs. Having said that, I've not tried to specifically pass Java options to executors, so perhaps it's possible (but probably tedious) to get it work.

loading resources in spark-shell

normally, loading resources in the Scala REPL is done like this:
getClass().getClassLoader().getResource("/resource-file")
see here
but this doesn't find resources from jars I load using the usual startup
spark-shell --jars list-of-jars
How are resources loaded in spark-shell? (am I referencing the wrong ClassLoader?)
Please remove the prefix "/". I tested in Spark shell and both getClass().getClassLoader().getResource("resource-file") and Thread.currentThread().getContextClassLoader().getResource("resource-file") worked. However, I would recommend using Thread.currentThread().getContextClassLoader() since it doesn't rely on what getClass() returns.