How to import own scala package using spark-shell? - scala

I have written a class for the spark-ml library that uses other classes from it. To be specific, my class is a wrapper around RandomForestClassifier.
Now I want to be able to import this class from spark-shell.
So the question is: how do I build a package containing my own class so that it can be imported from spark-shell? Many thanks!

If you want to import uncompiled files like Hello.scala, run the following in spark-shell:
scala> :load ./src/main/scala/Hello.scala
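For example, given a hypothetical Hello.scala like the one below, :load compiles the file into the running session and the class is immediately usable:

// ./src/main/scala/Hello.scala (hypothetical contents)
class Hello {
  def greet(name: String): String = s"Hello, $name"
}

scala> :load ./src/main/scala/Hello.scala
scala> new Hello().greet("spark")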

Read the docs:
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. Sonatype) can be passed to the --repositories argument.
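Putting this together, a minimal sketch of the jar-based workflow (all package, class, and jar names below are assumptions, not taken from the question): put the wrapper class in a package, build a jar, and start spark-shell with --jars.

// src/main/scala/com/example/ml/RandomForestWrapper.scala (hypothetical)
package com.example.ml

import org.apache.spark.ml.classification.RandomForestClassifier

class RandomForestWrapper {
  // the wrapped spark-ml classifier
  val rf = new RandomForestClassifier().setNumTrees(20)
}

Build the jar and launch the shell with it (paths depend on your project name and Scala version):

sbt package
spark-shell --jars target/scala-2.11/my-wrapper_2.11-0.1.jar

scala> import com.example.ml.RandomForestWrapper
scala> val wrapper = new RandomForestWrapper()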

Related

Scala/Spark/Databricks: How to import code from user-created JAR?

I have a JAR that I created with IntelliJ and sbt, which defines a case class and an object. I've uploaded it to my Databricks workspace and attached it to a cluster as a library.
How do I actually import code from it / reference it in my notebook? So far all I can think of is to try
import NameOfJar._
which gives
java.lang.NoClassDefFoundError: DataFrameComparison$
Do I need to have built the jar differently somehow? (Package statement or something?)
You should import packageName._; the jar name is not used in the import statement. It works the same way as in ordinary local Java/Scala code.
You can check this article for details: https://docs.databricks.com/libraries.html
Btw, does your notebook fail on the import itself, or later, when you try to use a class that exists inside the jar?
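To illustrate (the package name com.example.compare and its members below are assumptions), if the jar was built from source like this:

// inside the jar, e.g. src/main/scala/com/example/compare/DataFrameComparison.scala
package com.example.compare

case class ComparisonResult(matches: Boolean)

object DataFrameComparison {
  def compare(a: Seq[Int], b: Seq[Int]): ComparisonResult = ComparisonResult(a == b)
}

then the notebook imports it by package name, not by jar name:

import com.example.compare.DataFrameComparison
val result = DataFrameComparison.compare(Seq(1, 2), Seq(1, 2))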

How to add jar files in pyspark anaconda?

from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g")
sc=SparkContext.getOrCreate(conf)
dfv = sc.textFile("./part-001*.gz")
I have installed pyspark through anaconda and I can import pyspark in anaconda python. But I don't know how to add jar files to the conf.
I tried
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.packages','file:///XXX,jar')
but it doesn't work.
Is there a proper way to add a jar file here?
The docs say:
spark.jars.packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
Instead, you should simply use spark.jars:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
So:
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.files','file:///XXX.jar')

spark-submit for a .scala file

I have been running some test spark scala code using probably a bad way of doing things with spark-shell:
spark-shell --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
This would execute my code on spark and pop into the shell when done.
Now that I am trying to run this on a cluster, I think I need to use spark-submit, which I thought would be:
spark-submit --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
but it does not like the .scala file; does it somehow have to be compiled into a class? The scala code is a simple scala file with several helper classes defined in it and no real main class, so to speak. I don't see it in the help files, but maybe I am missing it. Can I just spark-submit a file, or do I have to somehow give it a class, thus changing my scala code?
I did add this to my scala code too:
went from this
val conf = new SparkConf().setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
To this:
val sc = new SparkContext(new SparkConf().setMaster("spark://192.20.0.71:7077"))
There are 2 quick and dirty ways of doing this:
Without modifying the scala file
Simply use the spark shell with the -i flag:
$SPARK_HOME/bin/spark-shell -i neo4jsparkCluster.scala
Modifying the scala file to include a main method
a. Compile:
scalac -classpath <location of spark jars on your machine> neo4jsparkCluster.scala
b. Submit it to your cluster:
/usr/lib/spark/bin/spark-submit --class <qualified class name> --master <master url> <application jar>
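For option 2, a hedged sketch of what the added entry point might look like (the object name and app name are assumptions):

// hypothetical main object added to neo4jsparkCluster.scala
import org.apache.spark.{SparkConf, SparkContext}

object Neo4jSparkCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("neo4jspark")
    val sc = new SparkContext(conf)
    // ... run the logic that uses the helper classes defined in this file ...
    sc.stop()
  }
}

Package it into a jar (for example with sbt package) and pass that jar together with --class Neo4jSparkCluster to spark-submit.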
You will want to package your Scala application with sbt and include Spark as a dependency within your build.sbt file.
See the self-contained applications section of the quick start guide for full instructions: https://spark.apache.org/docs/latest/quick-start.html
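A minimal build.sbt for that approach might look like this (the name and versions are assumptions; match the Spark and Scala versions to your cluster):

name := "neo4jspark"
version := "0.1"
scalaVersion := "2.11.8"
// "provided" because the cluster supplies the Spark jars at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"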
You can take a look at the following Hello World example for Spark, which packages the application as @zachdb86 already mentioned.
spark-hello-world

Unable to use Apache Commons CLI Option.builder() in Scala

In a spark shell or application (written in Scala, built with Maven), I am unable to use the static builder method from the Apache Commons CLI package. I have confirmed that I am including the jar in the classpath and have access to the Option class along with other classes in the package like Options, DefaultParser, etc. Why can I not use this public static method in Scala?
import org.apache.commons.cli.Option
val opt = Option.builder("foo").build()
error: value builder is not a member of object org.apache.commons.cli.Option
I can however see the static fields Option.UNINITIALIZED and Option.UNLIMITED_VALUES
using commons-cli 1.3.1
Scala version: 2.11.8
Spark version: 2.2.0
command to start the shell: spark-shell --jars .m2/repository/commons-cli/commons-cli/1.3.1/commons-cli-1.3.1.jar
Let me help clarify your problem scenario.
If you open your .idea folder, you will find that the project already has some internal jar dependencies, and commons-cli is in that list, but at version 1.2. Since Option.builder() was only added in commons-cli 1.3, the older jar on the classpath wins and the method cannot be found.
This leads to a class collision.
The solution is straightforward: refer to the 1.2 docs and use the compatible constructor-based API instead.
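A minimal sketch of the 1.2-compatible API, which avoids the builder entirely (the option names and description are made up; the rename to CliOption just avoids clashing with Scala's built-in Option):

import org.apache.commons.cli.{Option => CliOption, Options}

// commons-cli 1.2 style: construct the Option directly instead of calling Option.builder()
val foo = new CliOption("f", "foo", true, "a hypothetical option that takes a value")
val options = new Options()
options.addOption(foo)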

Compiling a scala class and make it available in Scala Context

I have a class written in Scala and I am trying to make it available to the Scala context so that I can make use of it for further processing. The problem is that I need to run this from the shell, and I am having a hard time figuring out how to compile the class and make it available to the context.
I am aware of compiling the class and making use of it directly, but I am not able to figure out how to do the same in the Scala shell. Any pointers in this regard would be great.
In the Scala REPL you can use the command :cp <path> to add a directory or JAR (that contains your compiled Scala class) to the classpath, so that it is available for the REPL to use.
(Of course, replace <path> in that command with the actual directory or JAR path.)
To see what other commands are available in the Scala REPL, use the command :help.
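For example (the jar path and class name below are hypothetical):

scala> :cp /path/to/my-classes.jar
scala> import com.example.MyClass
scala> val x = new MyClass()

If :cp is not available in your Scala version, :require <path to jar> may work instead for adding a jar to the classpath.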