Running Scala Spark Jobs on Existing EMR

I have a Spark job jar, aggregationfinal_2.11-0.1.jar, which I am running on my machine. Its composition is as follows:
package deploy

import org.apache.spark.sql.SparkSession

object FinalJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(s"${this.getClass.getSimpleName}")
      .config("spark.sql.shuffle.partitions", "4")
      .getOrCreate()
    // continued code
  }
}
When I run this code in local mode, it runs fine, but when I deploy it to the EMR cluster by placing the jar on the master node, it fails with:
ClassNotFoundException: deploy.FinalJob
What am I missing here?

The best option is to deploy your uber jar (you can use the sbt-assembly plugin to build it) to S3 and add a Spark step to the EMR cluster. Please check: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
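For reference, a minimal sbt-assembly sketch could look like the following; the plugin and Spark versions here are assumptions and should be adjusted to your project:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt
name := "aggregationfinal"
scalaVersion := "2.11.12"
// Spark is marked "provided" because EMR supplies it at runtime;
// only your own classes and non-Spark dependencies go into the uber jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8" % "provided"

Running sbt assembly then produces a single jar under target/scala-2.11/; upload that jar to S3 and reference it from the Spark step with --class deploy.FinalJob, as described in the linked documentation.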

Try extracting the jar into a folder with jar -xvf myapp.jar and look for your class among the extracted files. If the extracted classes do not include the class you are executing, there is an issue with the way you build your jar. I would recommend adding the Maven Assembly Plugin to your pom for packaging.

Related

How to avoid jar conflicts in a databricks workspace with multiple clusters and developers working in parallel?

We are working in an environment where multiple developers upload jars to a Databricks cluster with the following configuration:
DBR: 7.3 LTS
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
Scala: 2.12.10
Python: 3.7.5
R: R version 3.6.3 (2020-02-29)
Delta Lake: 0.7.0
Build tool: Maven
Below is our typical workflow:
STEP 0:
Build version 1 of the jar (DemoSparkProject-1.0-SNAPSHOT.jar) with the following object:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    println("EntryObjectOne: This object is from 1.0 SNAPSHOT JAR")

    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
STEP 1:
Uninstall the old jar(s) from the cluster, and keep pushing new changes in subsequent versions with small changes to the logic. Hence, we push jars with versions 2.0-SNAPSHOT, 3.0-SNAPSHOT etc.
At some point, we push the same object with the following code in a new jar, say DemoSparkProject-4.0-SNAPSHOT.jar:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    println("EntryObjectOne: This object is from 4.0 SNAPSHOT JAR")

    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
When we import this object in the notebook and run the main function, we still get the println statement from the old snapshot jar (EntryObjectOne: This object is from 1.0 SNAPSHOT JAR). This forces us to delete dbfs:/FileStore/jars/*, restart the cluster, and push the latest snapshot again to make it work.
In essence, when I run sc.listJars(), the active jar on the driver is the latest 4.0-SNAPSHOT jar. Yet I still see the logic from the old snapshot jars, even though they are no longer installed on the cluster at runtime.
Resolutions we tried/implemented:
We tried using the maven shade plugin, but unfortunately, Scala does not support it. (details here).
We regularly delete the old jars from dbfs:/FileStore/jars/*, restart the cluster, and install the new jars. This works, but a better approach would definitely help. (details here).
Changing the classpath manually and building the jar with a different groupId using Maven also helps. But with lots of objects and developers working in parallel, it is difficult to keep track of these changes.
Is this the right way of working with multiple jar versions in Databricks? If there is a better way to handle this version-conflict issue in Databricks, it would help us a lot.
You can't do this with libraries packaged as jars: when you install a library, it is put onto the classpath and is removed only when you restart the cluster. The documentation says this explicitly:
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as Uninstall pending restart.
It's the same issue as with "normal" Java programs: Java just doesn't support this functionality. See, for example, the answers to this question.
For Python & R it's easier because they support notebook-scoped libraries, where different notebooks can have different versions of the same library.
P.S. If you're doing unit/integration testing, my recommendation would be to execute the tests as Databricks jobs - it will be cheaper, and you won't have conflicts between different versions.
In addition to what's mentioned in the docs: when working with notebooks, you can see what has been added on the driver by running this in a notebook cell:
%sh
ls /local_disk0/tmp/ | grep addedFile
This worked for me on Azure Databricks, and it will list all the added jars.
Maybe force a cleanup with init scripts?
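As an additional diagnostic, a sketch using the EntryObjectOne object from the question (substitute your own class): you can ask the JVM on the driver which jar a class was actually loaded from. If it still points at the 1.0-SNAPSHOT artifact, the stale jar is still winning on the classpath:

// Run in a Scala notebook cell on the cluster.
// getCodeSource may be null for bootstrap-classloader classes,
// but for user jars it returns the jar the class came from.
val loadedFrom = EntryObjectOne.getClass
  .getProtectionDomain
  .getCodeSource
  .getLocation
println(s"EntryObjectOne loaded from: $loadedFrom")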

Getting class not found exception when trying to use external jar in yarn cluster

I am trying to use an external jar in my code. It works fine in my local Eclipse set-up because I have added the jar under Referenced Libraries. However, when I create a jar of my code and deploy it via an Azure Data Factory Spark activity, the external jar cannot be found on the cluster (on the driver node or the executor nodes, I am not sure which).
I create the Spark session as below:
val spark = SparkSession.builder()
  .appName("test_cluster")
  .master("yarn")
  .getOrCreate()
And I add the external jar later in the code as below:
spark.sparkContext.addJar(jarPath)
spark.conf.set("spark.driver.extraClassPath", jarPath)
spark.conf.set("spark.executor.extraClassPath", jarPath)
Please let me know where I am going wrong, as I am getting the error message below:
java.lang.ClassNotFoundException: Failed to find data source

Run spark-shell from sbt

The default way of getting the Spark shell seems to be to download the distribution from the website. Yet this Spark issue mentions that it can be installed via sbt. I could not find documentation on this. In an sbt project that uses spark-sql and spark-core, no spark-shell binary was found.
How do you run spark-shell from sbt?
From the following URL:
https://bzhangusc.wordpress.com/2015/11/20/use-sbt-console-as-spark-shell/
If you are already using sbt for your project, it's very simple to set up the sbt console to replace the spark-shell command.
Let's start from the basic case. When you set up the project with sbt, you can simply start the console with sbt console.
Within the console, you just need to initialize a SparkContext and SQLContext to make it behave like the Spark shell:
scala> val sc = new org.apache.spark.SparkContext("local[*]", "sbt-console")
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
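For Spark 2.x, a rough equivalent (a sketch, assuming sbt 1.x and spark-sql already on the project classpath) is to create a SparkSession instead; you can even have sbt do it automatically when the console starts by setting initialCommands in build.sbt:

// build.sbt
Compile / console / initialCommands :=
  """
    |import org.apache.spark.sql.SparkSession
    |val spark = SparkSession.builder()
    |  .master("local[*]")
    |  .appName("sbt-console")
    |  .getOrCreate()
    |import spark.implicits._
  """.stripMargin

After that, sbt console drops you into a REPL with spark and the implicits already in scope, much like spark-shell.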

Run spark application on a different version of spark remotely

I have a few Spark tests that I run fine remotely through Maven on Spark 1.6.0, using Scala. Now I want to run these tests on Spark 2. The problem is that Cloudera uses Spark 1.6 by default. Where does Cloudera take this version from, and what do I need to do to change the default version of Spark? Also, Spark 1.6 and Spark 2 are present on the same cluster, both running on top of YARN. The Hadoop config files are present on the cluster which I am using to run the tests in the test environment, and this is how I am getting the Spark context:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  new SparkContext(conf)
}
Is there any way I can specify the version in the conf files or in Cloudera itself?
When submitting a new Spark job, there are two places where you have to change the Spark version:
Set SPARK_HOME to the (local) path that contains the correct Spark installation. (Sometimes - especially for minor release changes - the version in SPARK_HOME does not have to be 100% correct, although I would recommend keeping things clean.)
Inform your cluster where the Spark jars are located. By default, spark-submit will upload the jars in SPARK_HOME to your cluster (this is one of the reasons why you should not mix the versions). But you can skip this upload process by telling the cluster manager to use jars located in HDFS. As you are using Cloudera, I assume that your cluster manager is Yarn. In this case, either set spark.yarn.jars or spark.yarn.archive to the path where the jars for the correct Spark version are located. Example: --conf spark.yarn.jars=hdfs://server:port/<path to your jars with the desired Spark version>
In any case, you should make sure that the Spark version you are using at runtime is the same as at compile time. The version you specify in your Maven, Gradle or sbt configuration should always match the version referenced by SPARK_HOME or spark.yarn.jars.
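To illustrate that last point (shown in sbt syntax for brevity; the same idea applies to a Maven pom), a sketch that pins the compile-time Spark version to the cluster's Spark 2 version and marks it provided might look like this. The version numbers are assumptions and must match your cluster:

scalaVersion := "2.11.12"
// Match the Spark version installed on the cluster and mark it "provided",
// so the runtime jars from SPARK_HOME / spark.yarn.jars are the ones actually used.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided"
)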
I was able to successfully run it on Spark 2.3.0. The reason I was unable to run it on Spark 2.3.0 earlier was that I had added the spark-core dependency in pom.xml with version 1.6. That is why, no matter what jar location we specified, it defaulted to Spark 1.6 (still figuring out why). After changing the library version, I was able to run it.

NoClassDefFoundError while running Spark depending neo4j jar(scala)

I have a program, testApp.scala, that connects to a Neo4j database and runs on Spark. I package it into a.jar using sbt package, with dependencies declared according to this_contribution (I already have the neo4j-spark-connector-2.0.0-M2.jar):
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
However, when I try spark-submit --class "testApp" a.jar, it fails with a NoClassDefFoundError:
Exception in thread "main" java.lang.NoClassDefFoundError: org/neo4j/spark/Neo4j$ at the line val n = Neo4j(sc)
There are two more things I should mention:
1) I used jar vtf to check the contents of a.jar; it contains only testApp.class and no Neo4j classes, yet the packaging process succeeded (does that mean neo4j-spark-connector-2.0.0-M2.jar is not packaged in?)
2) I can use spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2 and type in the code from testApp.scala with no problem (e.g. the failing line above, val n = Neo4j(sc), works in spark-shell)
You may try using the --jars option with spark-submit. For example
./bin/spark-submit --class "fully-qualified-class-name" --master "master-url" --jars "path-of-your-dependency-jar"
Or you can also use spark.driver.extraClassPath="jars-class-path" to solve the issue. Hope this helps.
Since the content of the .jar does not include the Neo4j classes, it is a packaging problem.
What we should modify is the sbt command: instead of sbt package, we should use sbt clean assembly. This creates a .jar that contains all the dependencies.
If you use only sbt package, compilation succeeds, but the neo4j-*.jar dependencies are not packed into your .jar, so at runtime it throws a NoClassDefFoundError.
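A sketch of the corresponding sbt setup (the sbt-assembly plugin version and the Spark version are assumptions; the connector coordinates are the ones from the question): Spark itself is marked provided so that sbt clean assembly bundles the Neo4j connector but not Spark:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  // provided by spark-submit at runtime, so not bundled into the assembly
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  // bundled into the assembly jar so org.neo4j.spark.Neo4j is found at runtime
  "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
)

Then submit the assembly jar produced under target/scala-2.11/ instead of the jar produced by sbt package.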