Normally, loading resources in the Scala REPL is done like this:
getClass().getClassLoader().getResource("/resource-file")
see here
But this doesn't find resources from jars I load with the usual startup command
spark-shell --jars list-of-jars
How are resources loaded in spark-shell? (Am I referencing the wrong ClassLoader?)
Please remove the prefix "/". I tested in Spark shell and both getClass().getClassLoader().getResource("resource-file") and Thread.currentThread().getContextClassLoader().getResource("resource-file") worked. However, I would recommend using Thread.currentThread().getContextClassLoader() since it doesn't rely on what getClass() returns.
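For example, in a spark-shell session started with --jars, something like this should work (a small sketch; the jar name is a placeholder, and "resource-file" stands for a file packaged at the root of one of your jars):
// spark-shell started with: spark-shell --jars my-extra.jar  (placeholder jar name)
// Look the resource up without a leading "/"; getResource returns null if it is not on the classpath
val url = Thread.currentThread().getContextClassLoader().getResource("resource-file")
// Read it as text once the URL has been found
val contents = scala.io.Source.fromURL(url).mkString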
I am trying to run my first Scio pipeline on Dataflow.
The code in question can be found here. However, I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.
Now I am trying to read the files from GCS, write the output to BigQuery, and run the pipeline using the DataflowRunner. I have already made all the necessary changes (or so I believe), but I am unable to make it run.
I have already run gcloud auth application-default login, and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the job is submitted in Dataflow. However, after one hour the job fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note: the job did nothing in all that time, and since this is an experiment the data is simply too small to take more than a couple of minutes.)
Checking Stackdriver, I can find the following error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some Jackson issue:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.
I also tried to first create a template and run it later with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
Am I missing something obvious?
It's a known issue with sbt 1.3.0, which introduced some breaking changes with respect to class loaders. Try 1.2.8?
Also, the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
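For reference, a minimal build.sbt sketch with the setting in place (the project name and versions below are placeholders, not taken from the question):
// build.sbt (sketch; name and versions are placeholders)
name := "scio-dataflow-job"
scalaVersion := "2.12.10"
// Force `sbt run` to use a single flat classloader so that the DataflowRunner
// can detect and stage every jar it needs, including scala-library.
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat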
Hope that helps
I installed Spark with sbt as a project dependency. Now I want to change variables of the Spark environment without doing it within my code with .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I have an error: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver#my-mbp.domain_not_set.invalid:50487 even after trying to change my hostname. Thus, I would like to dig into the Spark library and try some things.
I tried pretty much everything that is in this SO post: Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile)
sbt was not able to resolve the host even when I ran the export command just before running sbt. I had to put it in the profile to get the right context.
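If you would rather keep this out of the shell profile, a possible alternative (a sketch I have not verified for this exact setup) is to let sbt set the variable itself for forked runs, in build.sbt:
// build.sbt (sketch): fork the run and hand the environment variable to the forked JVM
// (envVars only applies to forked processes, hence fork := true)
run / fork := true
run / envVars := Map("SPARK_LOCAL_HOSTNAME" -> "localhost")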
I am relatively new to Spark and Scala.
I have a Scala application that runs in local mode both on my Windows box and on a CentOS cluster.
As long as Spark is on my classpath (i.e., in pom.xml), Spark runs in my unit tests without the need for SPARK_HOME. But then how do I set Spark properties such as spark.driver.memory?
If I do have an instance of Spark running locally, my unit test application seems to ignore it when in local mode. I do not see any output on the Spark console suggesting it is using the Spark instance I started from the command line (via the spark-shell command). Am I mistaken? If not, how do I get my Scala application to use that instance?
EDITED to include useful info from comments as well
spark-shell is just an interactive shell; it stands alone and is not an "instance" that other processes should connect to. When you run your Spark application through spark-submit (or just by running your Spark code), it will start its own instance of Spark. If you need to set any properties, they can be passed as system properties or through the spark-submit --conf parameters.
spark-submit requires that you first use the Maven assembly plugin to package your application jar together with its dependencies.
This jar should then be deployed to the SPARK_HOME directory.
Then use the submit script, which must also be deployed in SPARK_HOME.
The spark-submit script looks like this:
./bin/spark-submit \
  --class xxx.ml.PipelineStart \
  --master local[*] \
  ./xxx/myApp-1.0-SNAPSHOT-jar-with-dependencies.jar 100
You can set options in your SparkConf. Look at the methods available in the documentation.
There are explicit methods like SparkConf.setMaster to set certain properties. However, if you don't see a method to explicitly set a property, then just use SparkConf.set. It takes a key and a value, and the configurable properties are all found here.
If you're curious about what a property is set to, then you can also use SparkConf.get to check that out.
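As a small sketch (the app name and values below are illustrative, not from the question):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative configuration for a local run
val conf = new SparkConf()
  .setMaster("local[*]")               // explicit setter
  .setAppName("unit-test-app")         // explicit setter
  .set("spark.driver.memory", "2g")    // generic key/value setter
// Note: in local mode the driver JVM is already running, so memory settings
// like this one may need to be passed as JVM options (e.g. --driver-memory) instead.

// Read a property back, with a default if it was never set
println(conf.get("spark.driver.memory", "1g"))

val spark = SparkSession.builder().config(conf).getOrCreate()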
I want to use a library (JAI) with Spark to parse some spatial raster files. Unfortunately, there are some strange issues: JAI only works when running via the build tool (i.e. sbt run), not when the code is executed in Spark.
When executed via spark-submit, the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
This looks like some native dependency not being called correctly.
Assuming something was wrong with the classpath, I tried to run a plain Java/Scala function, but that one works just fine.
In fact, the exact same problem occurs when Nifi is calling the parse function.
Is Spark messing with the classpath? What is different between running the jar directly via java -jar and running it through Spark or NiFi? Both show the same problem, even when concurrency is disabled and everything runs on a single thread.
JAI vendorname == null is somewhat similar, as it shows what can go wrong when running a jar with JAI. I could not identify it as the exact same problem, though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process and the packaging of native libraries, I think it will not be possible to include snippets directly in this posting.
edit
I am pretty convinced this has to do with the assembly merge strategy; so far I could not find one that works.
Below you can see that the Vectorize operation is missing from Spark's classpath.
edit 2
I think the Spark / NiFi class loader will not load some of the required registry files for JAI. A plain Java app works fine with these assembly / fat-jar settings.
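One direction I am currently trying (a sketch, not a confirmed fix, and it assumes the JAI operation descriptors live under META-INF/registryFile.jai) is to keep those registry files when building the fat jar instead of letting the default merge strategy discard them:
// build.sbt (sketch, not a confirmed fix)
assembly / assemblyMergeStrategy := {
  // Concatenate JAI registry descriptors instead of dropping duplicates
  case PathList("META-INF", "registryFile.jai")    => MergeStrategy.concat
  case PathList("META-INF", "registryFile.jaiext") => MergeStrategy.concat
  // Keep service loader entries intact as well
  case PathList("META-INF", "services", xs @ _*)   => MergeStrategy.concat
  case other =>
    val previous = (assembly / assemblyMergeStrategy).value
    previous(other)
}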
I'm new to Spark and Scala, but hopefully this isn't a redundant/stupid question - I haven't been able to find the answer yet.
I have compiled a fat jar with the sbt-assembly tool, and the MANIFEST file includes the line MainClass: com.package.MyMainClass. However, spark-submit still demands that I use the --class argument to define the main class. From this Spark configuration page, I gather that spark-submit gets its configuration details from the conf/spark-defaults.conf file. My other properties (spark.master, spark.app.name) seem to load just fine without command line arguments, but I haven't been able to find a way to specify the project's main class in this file. I've randomly tried things like spark.class main.class and just class, but obviously stabbing in the dark isn't going that well.
Any ideas? I want to avoid having really ugly scripts to deploy applications to clusters when spark-submit MyJar.jar is so clean. Thanks.
Looking at the source code of org.apache.spark.deploy.SparkSubmitArguments.scala here, it looks like it should pick up your Main-Class manifest attribute:
mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
I haven't tested this, but try replacing 'MainClass' with 'Main-Class'.
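As a side note, if the fat jar is built with sbt-assembly, one way to end up with the correct attribute (a sketch, assuming sbt-assembly is in use and that com.package.MyMainClass from the question is the entry point) is to declare the main class in build.sbt and let the plugin write the manifest:
// build.sbt (sketch): sbt-assembly writes a Main-Class attribute
// into the fat jar's manifest from this setting
assembly / mainClass := Some("com.package.MyMainClass")
After that, spark-submit MyJar.jar should pick the class up from the manifest, per the SparkSubmitArguments code above.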