Spark how to prefer class from packaged jar - scala

I am using the sbt-assembly plugin to create a fat jar. I need newer versions of some jars that are already part of the default Hadoop/Spark distribution.
I want the Spark worker JVM to prefer the version packaged in my fat jar over the one in the default Hadoop/Spark distribution. How can I do this?

The solution is to set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true in the configuration (the --conf option) when submitting the Spark application, e.g. --conf spark.executor.userClassPathFirst=true. Classes are then loaded from the uber jar first and from the Spark classpath second.
The other solution is to use shading in sbt-assembly: shade (rename) the packages in your uber jar whose older versions are bundled with Spark, so that the two versions can no longer clash.
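A minimal sketch of the shading approach, assuming sbt-assembly is already enabled in project/plugins.sbt. The Guava package and the "shaded" prefix are hypothetical and only illustrate the pattern; substitute the packages of the library you bundle in a newer version.

```scala
// build.sbt (fragment)

// Hypothetical case: the fat jar bundles a newer Guava than the cluster ships.
// Renaming its packages makes the application's bytecode refer to the relocated
// copy, so the older classes on the Spark classpath are never picked up.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.google.common.**" -> "shaded.com.google.common.@1")
    .inAll // apply to the project's own classes and to every dependency jar
)
```

On older sbt versions without the slash syntax, the same setting is written as assemblyShadeRules in assembly := ...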

Related

IntelliJ Scala maven build jar not generating class files

I created a Spark application in Scala (version 2.11) and build it with Maven (version 3) from IntelliJ. The first time, the jar compiled and built successfully, and I was able to test the Spark application on the cluster using that jar. Then I modified some of the existing Scala classes and built again. The code compiled and the jar was generated without any errors, but the latest jar contains no Scala classes. Why does the Maven build not produce the class files, and how can I fix it?
The easiest way to build Scala applications for Spark is to use sbt with the fat jar (sbt-assembly) plugin. The details are already described here:
How to build an Uber JAR (Fat JAR) using SBT within IntelliJ IDEA?
Just don't forget to mark the Spark dependencies as provided so that they are excluded from the fat jar.
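For illustration, the provided scope could look like the fragment below. The Scala and Spark versions and the choice of modules are assumptions, not taken from the question; use the ones that match your cluster.

```scala
// build.sbt (fragment); versions are illustrative assumptions
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // "provided": available on the compile classpath, but left out of the
  // assembled fat jar because the cluster supplies these jars at runtime
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)
```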

How to add external jar files to a spark scala project

I am trying to use a Scala LSH implementation (https://github.com/marufaytekin/lsh-spark) in my Spark project. I cloned the repository and made some changes to its sbt file (added an organization).
To use this implementation, I compiled it with sbt compile, moved the jar file to the "lib" folder of my project, and updated my project's sbt configuration file, which looks like this:
Now when I try to compile my project with sbt compile, it fails to load the external jar file and shows the error message "unresolved dependency: com.lendap.spark.lsh.LSH#lsh-scala_2.10;0.0.1-SNAPSHOT: not found".
Am I following the right steps for adding an external jar file?
How do I solve the dependency issue?
As an alternative, you can build the lsh-spark project and add the resulting jar to your Spark application.
To add external jars, the addJar option can be used when running the Spark application. Refer to "Running spark application on yarn".
This issue isn't related to spark but to sbt configuration.
Make sure you followed the correct folder structure imposed by sbt and added your jar in the lib folder, as explained here - lib folder should be at the same level as build.sbt (cf. this post).
You might also want to check out this SO post.
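A sketch of the layout these answers describe, under the assumption (the asker's build file is not shown) that the error comes from a libraryDependencies entry for the LSH jar. sbt treats every jar under lib/ as an unmanaged dependency and puts it on the classpath automatically, whereas anything listed in libraryDependencies is resolved from repositories, which is where an "unresolved dependency" message comes from. The project name and Spark version below are hypothetical.

```scala
// build.sbt (fragment), located in the same directory as the lib/ folder
name := "my-spark-project" // hypothetical
scalaVersion := "2.10.4"   // a 2.10.x release, matching the _2.10 suffix in the error

// Managed dependency: resolved from repositories.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

// No libraryDependencies entry for the LSH jar: any *.jar under lib/ is
// already on the classpath. This line only spells out sbt's default location.
unmanagedBase := baseDirectory.value / "lib"
```

If you would rather keep a libraryDependencies entry for the library, running sbt publishLocal in the lsh-spark checkout publishes it to the local Ivy repository, where sbt can then resolve it.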

How to build a bundle sbt from source for offline use?

My goal is to have an sbt jar file with all of its dependencies so that I can create a Debian package that can be installed on a machine without checking for or installing packages at first run.
Is sbt-assembly the right choice for building an sbt jar with all dependencies?
The sbt binary distribution doesn't ship with its dependencies; sbt downloads them at first run.
I don't fully understand your use case, but would sbt-native-packager .deb format be a good fit?
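In case it helps to judge that suggestion, here is a rough sketch of an sbt-native-packager Debian setup. It assumes the plugin has already been added in project/plugins.sbt, and every name and description in it is a placeholder. Note that it packages an application built with sbt, which may or may not cover the goal of packaging sbt itself.

```scala
// build.sbt (fragment); all values below are placeholders
enablePlugins(JavaAppPackaging, DebianPlugin)

name := "my-app"
version := "1.0"

// Metadata the Debian packaging expects to be set
maintainer := "Jane Doe <jane.doe@example.com>"
packageSummary := "Example application"
packageDescription := "Example application packaged as a .deb with its dependencies."
```

With this in place, the package should be produced by sbt Debian/packageBin (debian:packageBin on older sbt versions).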

Scala IDE and Apache Spark -- different scala library version found in the build path

I have some main object:
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object Main {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          new SparkConf().setMaster("local").setAppName("FakeProjectName")
        )
      }
    }
...then I add spark-assembly-1.3.0-hadoop2.4.0.jar to the build path in Eclipse from
Project > Properties... > Java Build Path :
...and this warning appears in the Eclipse console:
More than one scala library found in the build path
(C:/Program Files/Eclipse/Indigo 3.7.2/configuration/org.eclipse.osgi/bundles/246/1/.cp/lib/scala-library.jar,
C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar).
This is not an optimal configuration, try to limit to one Scala library in the build path.
FakeProjectName Unknown Scala Classpath Problem
Then I remove Scala Library [2.10.2] from the build path, and it still works. Except now this warning appears in the Eclipse console:
The version of scala library found in the build path is different from the one provided by scala IDE:
2.10.4. Expected: 2.10.2. Make sure you know what you are doing.
FakeProjectName Unknown Scala Classpath Problem
Is this a non-issue? Either way, how do I fix it?
This is often a non-issue, especially when the version difference is small, but there are no guarantees...
The problem is (as stated in the warning) that your project has two Scala libraries on the class path. One is explicitly configured as part of the project; this is version 2.10.2 and is shipped with the Scala IDE plugins. The other copy has version 2.10.4 and is included in the Spark jar.
One way to fix the problem is to install a different version of Scala IDE, that ships with 2.10.4. But this is not ideal. As noted here, Scala IDE requires every project to use the same library version:
http://scala-ide.org/docs/current-user-doc/gettingstarted/index.html#choosing-what-version-to-install
A better solution is to clean up the class path by replacing the Spark jar you are using. The one you have is an assembly jar, which means it includes every dependency used in the build that produced it. If you are using sbt or Maven, then you can remove the assembly jar and simply add Spark 1.3.0 and Hadoop 2.4.0 as dependencies of your project. Every other dependency will be pulled in during your build. If you're not using sbt or Maven yet, then perhaps give sbt a spin - it is really easy to set up a build.sbt file with a couple of library dependencies, and sbt has a degree of support for specifying which library version to use.
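A minimal build.sbt along those lines might look like this. The scalaVersion is the one the warning reports for the Spark jar; using hadoop-client as the Hadoop artifact is an assumption about which Hadoop module the project needs.

```scala
// build.sbt; a minimal sketch that replaces the assembly jar on the build path
name := "FakeProjectName"
scalaVersion := "2.10.4" // the version the warning found inside the Spark 1.3.0 jar

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, so this resolves spark-core_2.10
  "org.apache.spark"  %% "spark-core"    % "1.3.0",
  // plain Java artifact, hence a single %
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0"
)
```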
The easiest solution:
In Eclipse:
1. Right-click the project and choose Properties
2. Go to Scala Compiler
3. Click Use Project Settings
4. Set Scala Installation to a compatible version, generally Fixed Scala Installation 2.XX.X (built-in)
5. Rebuild the project.
There are two types of Spark JAR files, which you can tell apart by name:
- Name includes the word "assembly" and not "core" (has Scala inside)
- Name includes the word "core" and not "assembly" (no Scala inside).
You should include the "core" type (in the version you need) in your Build Path via "Add External Jars", since the Scala IDE already provides a Scala library for you.
Alternatively, you can take advantage of sbt and add the following dependency (again, pay attention to the versions you need):
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
In that case you should NOT forcibly add any Spark JAR to the Build Path.
Happy sparking:
Zar

SBT exclude scala-library.jar from a war file packaged using xsbt-web-plugin

I have a project that only needs to take a proguard-constructed jar file, which is built in a separate SBT project and contains all classes needed to run as a servlet, and create a war file out of it.
The dependency is properly packaged into the war, and the transitive jars are excluded correctly using notTransitive(), but scala-library.jar continues to be placed into the war file as well. This is not desired, since the proguard-built jar contains those scala classes that are necessary for the servlet filter to run. The present project just needs to take that dependent jar, add a web.xml, and package it into a war file.
What is the simplest way (preferably using a build.sbt file) to get the war packaging mechanism from the xsbt-web-plugin to exclude the scala-library.jar?
This should work in your .sbt file:
    autoScalaLibrary := false