Distributing a jar for use in pyspark - pyspark

I've built a jar that I can use from pyspark by adding it to ${SPARK_HOME}/jars and calling it using
spark._sc._jvm.com.mypackage.myclass.mymethod()
However, what I'd like to do is bundle that jar into a Python wheel so someone can pip install the jar into their running pyspark/jupyter session. I'm not very familiar with Python packaging: is it possible to distribute jars inside a wheel and have that jar be automatically available to pyspark?
I want to put a jar inside a wheel or egg (not even sure if I can do that?) and, upon installation of said wheel/egg, put that jar in a place where it will be available to the JVM.
I guess what I'm really asking is, how do I make it easy for someone to install a 3rd party jar and use it from pyspark?

As you mention, you have presumably already used the --jars option and are able to call the function from pyspark. If I understand your requirement correctly, you want to ship this jar inside the installable package so that the library is available on every node of the cluster.
Databricks has a page about uploading third-party jars, Python eggs, and Python wheels as libraries; see whether it covers what you are looking for:
https://docs.databricks.com/libraries.html#upload-a-jar-python-egg-or-python-wheel
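For reference, the JVM side of such a call can be a plain Scala object. A minimal sketch (the names simply mirror the placeholder com.mypackage.myclass.mymethod from the question):

package com.mypackage

// Compiled into the jar that is shipped alongside the wheel. A top-level Scala
// object gets static forwarder methods, so py4j can call it from PySpark as
// spark._sc._jvm.com.mypackage.myclass.mymethod() once the jar is on the driver classpath.
object myclass {
  def mymethod(): String = "hello from the JVM"
}

Whichever way the jar is distributed, the step that makes it visible is getting it onto the driver's classpath when the session starts, for example via --jars, the spark.jars configuration, or by dropping it into ${SPARK_HOME}/jars as in the question.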

Related

compile scala-spark file to jar file

I'm working on a frequent itemset mining project and I use the FP-Growth algorithm. I depend on the version developed in Scala for Spark:
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
I need to modify this code and recompile it into a jar file that I can include in spark-shell and whose functions I can call in Spark.
The problem is that spark-shell is an interpreter, and it finds errors in this file. I've tried sbt with Eclipse, but it did not succeed.
What I need is a compiler that can use the latest versions of the Scala and spark-shell libraries to compile this file into a jar.
Got your question now!
All you need to do is compile against the dependency jars (Scala, Spark, etc.) that match the machine where you are going to use your own jar. Later on, add the jar to spark-shell and you can use it like below:
spark-shell --jars your_jar.jar
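For the compile step itself, a small standalone sbt project is usually the easiest route to such a jar. A minimal sketch (the project name, file layout, and version numbers are assumptions; match them to your Spark and Scala versions):

// build.sbt -- place the modified FPGrowth.scala under src/main/scala/
name := "my-fpgrowth"

version := "0.1"

scalaVersion := "2.11.8"   // Spark 2.1.0 is built against Scala 2.11

// "provided" keeps Spark itself out of the packaged jar; spark-shell already supplies it
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"
)

Running sbt package then produces a jar under target/scala-2.11/ that you can pass to spark-shell --jars as above. Note that a verbatim copy of FPGrowth.scala references Spark-internal APIs, so you will typically need to rename its package and adjust those references before it compiles outside the Spark source tree.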
Alternatively, follow these steps:
Check out the Spark repository
Modify the files you want to modify
Build the project
Run the ./dev/make-distribution.sh script, which is inside the Spark repository
Run spark-shell from your Spark distribution

Use library in Spark-shell

I want to use this library in spark-shell and/or in a .scala file to manipulate some data. How do I do that? I cannot use Maven.
EDIT for possible duplicate: I also do not have a jar; if that is part of the solution, how do I make a jar from that library?
The library you reference is available on Maven Central, and spark-shell can automatically download libraries from Maven Central and a few other popular repositories if you give it the correct Maven coordinates. You don't need to explicitly use Maven. (In fact, it even lets you specify your own additional Maven repositories and searches those as well.) See http://spark.apache.org/docs/latest/rdd-programming-guide.html#using-the-shell
In your case specifically, the command should be something like
./bin/spark-shell --master local[4] --packages "dk.tbsalling:aismessages:2.2.1"
Note: You can browse https://spark-packages.org/ to find Spark packages.
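If the code lives in a .scala file rather than being typed at the prompt, the same mechanism works: start spark-shell with --packages and load the file into the session with -i. A rough sketch (the import path and file names are assumptions; check the library's documentation for the actual package names):

// analysis.scala -- hypothetical script, loaded into the shell with:
//   ./bin/spark-shell --packages "dk.tbsalling:aismessages:2.2.1" -i analysis.scala
import dk.tbsalling.aismessages._              // assumed top-level package; verify against the library's docs

val sentences = sc.textFile("nmea-input.txt")  // hypothetical input file of raw NMEA sentences
println(s"read ${sentences.count()} raw sentences")
// the library's decoder classes can then be applied to each line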

spark: how to include dependencies with build/sbt compile

I am new to Spark but am trying to do some development. I am following the "Reducing Build Times" instructions from the Spark developer page. After creating the normal assembly, I have written some classes that depend on one specific jar. I test my package in spark-shell, where I have been able to include my jar by defining SPARK_CLASSPATH, but the problem lies in actually compiling my code. What I want to achieve is to include that jar when compiling my added package (with build/sbt compile). Could I do that by adding a path to my jar in the build/sbt file or sbt-launch-lib.bash, and if so, how?
(Side note: I do not want to include the jar in the assembly yet, because I keep making changes to it as I go, so that would be inconvenient. I am using Spark 1.4.)
Any help is appreciated!
Based on the answer in the comments above, it looks like you are trying to add your jar as a dependency to the mllib project as you do development on mllib itself. You can accomplish this by modifying the pom.xml file in the mllib directory within the Spark distribution; since Spark's sbt build derives its configuration from the Maven POMs, a dependency added there should also be picked up by build/sbt compile.
You can find instructions on how to add a local file as a dependency here: http://blog.valdaris.com/post/custom-jar/. I haven't used this approach to including a local file as a dependency myself, but I think it should work.

Scala dependency on Spark installation

I am just getting started with Spark, so I downloaded the Hadoop 1 (HDP1, CDH3) binaries from here and extracted them on an Ubuntu VM. Without installing Scala, I was able to execute the examples in the Quick Start guide from the Spark interactive shell.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
As a side note, I observed that Spark has some of the best documentation among open source projects.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
The project configuration is placed in the project/ folder. In my case it is:
$ ls project/
build.properties plugins.sbt project SparkBuild.scala target
When you run sbt/sbt assembly, it downloads the appropriate version of Scala along with the other project dependencies. Check out the target/ folder, for example:
$ ls target/
scala-2.9.2 streams
Note that the Scala version is 2.9.2 for me.
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
Yes. You can create a single assembly jar as described in the Spark documentation:
If your code depends on other projects, you will need to ensure they are also present on the slave nodes. A popular approach is to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency; it need not be bundled since it is already present on the slaves. Once you have an assembled jar, add it to the SparkContext as shown here. It is also possible to submit your dependent jars one-by-one when creating a SparkContext.
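A minimal sbt-assembly setup along those lines might look like the sketch below (versions and the joda-time example dependency are placeholders to be matched to your environment):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
name := "my-spark-job"

scalaVersion := "2.10.6"   // placeholder; match the Scala version of your Spark build

// Spark is marked "provided" so it is not bundled into the assembly jar;
// the cluster already has it on the classpath.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"

// other (non-Spark) dependencies, like this example one, do get bundled
libraryDependencies += "joda-time" % "joda-time" % "2.3"

Running sbt assembly then produces a single jar containing your code and the non-provided dependencies, which you can add to the SparkContext or submit with your job.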
Praveen -
I checked the fat master jar just now:
/SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar
This jar comes with all the Scala binaries plus the Spark binaries.
You are able to run the examples because this file is added to your CLASSPATH when you run spark-shell.
Check it here: run spark-shell, then open http://machine:4040 > Environment > Classpath Entries.
If you downloaded a pre-built Spark, then you don't need to have Scala on the nodes; having this file on the CLASSPATH on the nodes is enough.
Note: I deleted the last answer I posted because it may mislead someone. Sorry :)
You do need Scala to be available on all nodes. However, with the binary distribution via make-distribution.sh, there is no longer a need to install Scala on all nodes. Keep in mind the distinction between installing Scala, which is necessary to run the REPL, and merely packaging Scala as just another jar file.
Also, as mentioned in make-distribution.sh:
# The distribution contains fat (assembly) jars that include the Scala library,
# so it is completely self contained.
# It does not contain source or *.class files.
So Scala does indeed come along for the ride when you use make-distribution.sh.
From Spark 1.1 onwards, the build configuration has moved away from SparkBuild.scala.
You have to make your changes in pom.xml and build using Maven.

How to build a Scala application that was created with the Eclipse Scala plugin FROM THE COMMAND LINE

I have developed a Scala application for the first time, but I have to deploy it with a "one-click" type script that can build and run the Scala application from source WITHOUT ECLIPSE.
Since I'm completely new to Scala, I don't know how to tell it where all my source files are, etc., to get it to build my app from the command line. I also have two third-party .jar libraries that I need to tell the Scala compiler to link against.
Any documentation on this? Or example command lines? My project hierarchy is:
src/packagename: contains all .scala files
bin/packagename: contains all .class files
libs/: contains the 2 .jar files I will need to import somehow
I'm working on Debian Linux.
EDIT: I found the export feature in Eclipse, so I created a .java file and called my main Scala object from it. Then I exported it as a runnable jar. However, when I go to run the new runnable jar with "sudo java runnable.jar", it says "class not found exception: runnable.jar".
You should take a look at sbt (https://github.com/harrah/xsbt/wiki), which is the common way to build a Scala project. Run through the tutorial in the wiki to learn how you should organise your directory structure so that everything runs fine.
If you want to combine it with Eclipse, check out this plugin: https://github.com/typesafehub/sbteclipse
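As a concrete starting point, with sbt the whole build can be described in one small file. A minimal sketch, assuming your sources are moved into sbt's expected layout and your two jars are copied into a lib/ directory (the name and versions below are placeholders):

// build.sbt -- put your .scala sources under src/main/scala/packagename/
// and copy the two third-party jars into lib/ ; sbt picks up lib/ automatically
// as unmanaged dependencies, so nothing else needs to be declared for them
name := "myapp"          // placeholder project name

version := "0.1"

scalaVersion := "2.9.2"  // placeholder; use the Scala version your code targets

sbt compile then builds the classes, sbt package produces a jar under target/, and sbt run starts the object that defines the main method, all from the command line and without Eclipse.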