spark-submit for a .scala file - scala

I have been running some test Spark Scala code, probably not in the best way, with spark-shell:
spark-shell --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
This would execute my code on spark and pop into the shell when done.
Now that I am trying to run this on a cluster, I think I need to use spark-submit, which I thought would be:
spark-submit --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
but it does not like the .scala file; does it somehow have to be compiled into a class? The Scala code is a simple file with several helper classes defined in it and no real main class, so to speak. I don't see it in the help files, but maybe I am missing it. Can I just spark-submit a file, or do I have to somehow give it the class, thus changing my Scala code?
I did change this in my Scala code too. It went from this:
val conf = new SparkConf().setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
To this:
val sc = new SparkContext(new SparkConf().setMaster("spark://192.20.0.71:7077"))

There are 2 quick and dirty ways of doing this:
Without modifying the scala file
Simply use the spark shell with the -i flag:
$SPARK_HOME/bin/spark-shell -i neo4jsparkCluster.scala
Modifying the scala file to include a main method
a. Compile:
scalac -classpath <location of spark jars on your machine> neo4jsparkCluster.scala
b. Submit it to your cluster:
/usr/lib/spark/bin/spark-submit --class <qualified class name> --master <master url> <path to your jar>
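For the second option, a minimal sketch of what such a main class could look like (the object name is hypothetical, and the Spark-specific lines are shown as comments so the structure stands on its own):

```scala
// Hypothetical entry point: spark-submit requires an object with a main method.
object Neo4jSparkJob {
  def main(args: Array[String]): Unit = {
    // In the real job you would build the context here, e.g.:
    // val sc = new SparkContext(new SparkConf().setAppName("neo4jspark"))
    println(run())
  }

  // Placeholder for the logic currently living in the helper classes.
  def run(): String = "job finished"
}
```

The helper classes from the original file can stay as they are; they just need to be reachable from this main method.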

You will want to package your Scala application with sbt and include Spark as a dependency within your build.sbt file.
See the self-contained applications section of the quick start guide for full instructions: https://spark.apache.org/docs/latest/quick-start.html
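A minimal build.sbt for that approach might look like the sketch below (the versions are assumptions; match them to your cluster):

```scala
name := "neo4jspark"
version := "0.1.0"
scalaVersion := "2.11.12"

// "provided" keeps Spark out of the packaged jar, since the cluster supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"
```

Running sbt package then produces a jar you can hand to spark-submit with --class.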

You can take a look at the following Hello World example for Spark, which packages your application as #zachdb86 already mentioned.
spark-hello-world

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark Standalone Cluster Setup ( 1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in the Spark shell, we are adding the Iceberg dependency while launching the Spark shell as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use the Spark shell at all. For all commands (even non-Iceberg ones) we get the same exception as below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
The simple commands below also throw the same exception:
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency while launching the Spark shell as below, but we still get the same exception:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg version: we chose the Spark 3.2 Iceberg jar but were running Spark 3.1. After using the correct dependency version (i.e. 3.1), we were able to launch the Spark shell with Iceberg. Also, there is no need to specify org.apache.spark jars with --packages, since all of that will be on the classpath anyway:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
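The version rule can be made mechanical: the runtime artifact name encodes the Spark major.minor version and the Scala binary version. A small illustrative helper (not part of Iceberg itself) that derives the --packages coordinate:

```scala
// Illustrative only: builds the Maven coordinate of the Iceberg Spark runtime
// jar from the Spark version, the Scala binary version, and the Iceberg version.
def icebergRuntimeCoordinate(sparkVersion: String,
                             scalaBinary: String,
                             icebergVersion: String): String = {
  // Keep only "major.minor" of the Spark version, e.g. "3.1.2" -> "3.1".
  val sparkMajorMinor = sparkVersion.split("\\.").take(2).mkString(".")
  s"org.apache.iceberg:iceberg-spark-runtime-${sparkMajorMinor}_$scalaBinary:$icebergVersion"
}
```

For the setup in the question, icebergRuntimeCoordinate("3.1.2", "2.12", "0.13.1") yields exactly the coordinate used in the working command above.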

Why does the classpath used by spark-submit unexpectedly have jars from under the python installation?

I have a jar file that contains some Scala (and Java) code that I run using the following spark-submit command:
spark-submit
--verbose
--class mycompany.MyClass
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf -verbose:class"
--conf 'spark.driver.extraJavaOptions=-verbose:class'
--master yarn
--driver-library-path /usr/lib/hadoop-lzo/lib/native/
--jars /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar
--files /home/hadoop/mydir/dev-test.conf
--queue default /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar
<<args to MyClass>>
When I run it, I get the error message "IAMInstanceCredentialsProvider not found", which is caused by a version mismatch. It seems IAMInstanceCredentialsProvider was added to hadoop-aws in version 3.3.0, and we want to use 3.2.1. I've gone through our Maven dependencies and feel confident that we are not trying to use 3.3.x anywhere.
I've attempted to debug the problem by adding some "verbose" arguments to the command, and I've also added some debug code to MyClass to print out the classpath in effect, following the instructions from here.
When I look at the output, the classpath in effect when we run the spark-submit command includes a lot of jars included with Python, including /usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar. Thus far, I've been unable to figure out why we are loading jars from /usr/local/lib/python3.7.
Can anybody explain to me where those dependencies are coming from, or suggest a way that I could debug where those dependencies come from? I thought the python might be a result of some environment variable setting, but if so, it doesn't seem to be set at the top level:
set|grep -i python
doesn't return anything.
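For reference, the classpath-printing debug code mentioned above can be as simple as the sketch below (plain JVM, no Spark needed). It dumps java.class.path one entry per line and flags anything resolved from a Python site-packages directory:

```scala
// Prints the JVM's launcher classpath and marks entries that were
// resolved from a Python installation (site-packages).
object ClasspathDebug {
  def entries(): Seq[String] =
    System.getProperty("java.class.path")
      .split(java.io.File.pathSeparator)
      .toSeq

  def main(args: Array[String]): Unit =
    entries().foreach { e =>
      val marker = if (e.contains("site-packages")) " <-- from Python" else ""
      println(e + marker)
    }
}
```

Note that java.class.path only shows the launcher classpath; jars added later through custom classloaders will not appear in it.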

How to add jar to Spark in Pycharm

I want to debug Spark code in PyCharm because it is easier to debug there. But I need to add a spark-redis.jar, otherwise I get Failed to find data source: redis.
The code to connect to redis is
spark = SparkSession \
.builder \
.appName("Streaming Image Consumer") \
.config("spark.redis.host", self.redis_host) \
.config("spark.redis.port", self.redis_port) \
.getOrCreate()
How do I fix it when using PyCharm?
I have tried adding spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf but it does not work.
I also tried adding the environment variable PYSPARK_SUBMIT_ARGS --jars ... in the run configuration, but it raises another error.
Adding spark.driver.extraClassPath to spark-defaults.conf works for me with Spark 2.3.1
cat /Users/oleksiidiagiliev/Soft/spark-2.3.1-bin-hadoop2.7/conf/spark-defaults.conf
spark.driver.extraClassPath /Users/oleksiidiagiliev/.m2/repository/com/redislabs/spark-redis/2.3.1-SNAPSHOT/spark-redis-2.3.1-SNAPSHOT-jar-with-dependencies.jar
Please note, this is a jar with dependencies (you can build one from sources using mvn clean install -DskipTests).
Also, I added the pyspark libraries and the SPARK_HOME environment variable to the PyCharm project as described here: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f
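If you prefer the PYSPARK_SUBMIT_ARGS route in the PyCharm run configuration, note that the value has to end with pyspark-shell, or PySpark will reject it. A sketch (the jar path is a placeholder):

```
PYSPARK_SUBMIT_ARGS=--jars /path/to/spark-redis-jar-with-dependencies.jar pyspark-shell
```

Forgetting the trailing pyspark-shell token is a common source of the "other error" described in the question.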

Run spark-shell from sbt

The default way of getting a Spark shell seems to be to download the distribution from the website. Yet this Spark issue mentions that it can be installed via sbt. I could not find documentation on this. In an sbt project that uses spark-sql and spark-core, no spark-shell binary was found.
How do you run spark-shell from sbt?
From the following URL:
https://bzhangusc.wordpress.com/2015/11/20/use-sbt-console-as-spark-shell/
If you are already using sbt for your project, it's very simple to set up the sbt console to replace the spark-shell command.
Let's start from the basic case. When you set up the project with sbt, you can simply run the console with sbt console.
Within the console, you just need to initialize a SparkContext and SQLContext to make it behave like the Spark shell:
scala> val sc = new org.apache.spark.SparkContext("local[*]", "shell")
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
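This setup can also be wired into build.sbt so the context is created automatically whenever the console starts. A sketch, assuming sbt 1.x and a spark-sql dependency already in the project:

```scala
// Hypothetical build.sbt fragment: pre-create the Spark entry points
// every time `sbt console` starts, spark-shell style.
Compile / console / initialCommands := """
  val spark = org.apache.spark.sql.SparkSession.builder
    .master("local[*]")
    .appName("sbt-console")
    .getOrCreate()
  val sc = spark.sparkContext
"""
```

With this in place, sbt console drops you into a REPL with spark and sc already defined, much like spark-shell.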

How to import own scala package using spark-shell?

I have written a class for the spark-ml library that uses other classes from it.
To be clear, my class is a wrapper for RandomForestClassifier.
Now I want to have an opportunity to import this class from spark-shell.
So the question is: how do I make a package containing my own class such that it can be imported from spark-shell? Many thanks!
If you want to import uncompiled files like Hello.scala, do the following in the Spark shell:
scala> :load ./src/main/scala/Hello.scala
Read the docs:
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. SonaType) can be passed to the --repositories argument.
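For compiled code, the usual route is to package the wrapper into a jar (e.g. with sbt package) and pass it to spark-shell via --jars, then import it by its package name. A sketch of a hypothetical wrapper class (the real one would delegate to org.apache.spark.ml.classification.RandomForestClassifier; the names here are illustrative):

```scala
// Hypothetical wrapper class; after `sbt package` it can be loaded with
//   spark-shell --jars target/scala-2.12/<your-artifact>.jar
// and then imported inside the shell.
class RandomForestWrapper(val numTrees: Int, val maxDepth: Int) {
  // Placeholder for the real fit/predict delegation to spark-ml.
  def describe: String =
    s"RandomForestWrapper(numTrees=$numTrees, maxDepth=$maxDepth)"
}
```

Once the jar is on the shell's classpath through --jars, a plain import of the wrapper's package makes the class available, exactly as the quoted docs describe.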