SparkSession does not pull down packages from repo in pytest suite - pyspark

I have a pytest suite. In the pytest suite, I have a majority of tests that do not need additional packages pulled down from maven via "spark.jars.packages", and 2 tests that do.
My thought was for the 2 tests that do, I will:
Stop the existing SparkSession used for the tests that do not require additional packages with spark.stop() using pytest tearDown().
Create an new SparkSession that has "spark.jars.packages" and "spark.jars.repositories" specified to download the additional required packages using pytest setUp().
This is not working as expected, and the new SparkSession that has "spark.jars.packages" and "spark.jars.repositories" does not download the additional packages.
To my surprise, the pytest suite will only pass if I run the tests that require a SparkSession with "spark.jars.packages" and "spark.jars.repositories" first in the test suite. If I run it second, I get a package not found error, and the required packages were clearly not pulled down from their repo.
My initial assumption was that on the second run, pyspark was referencing the initial SparkSession without "spark.jars.packages" and "spark.jars.repositories". However, what I have confirmed is:
At the point where the first SparkSession is stopped, and the second will be created, a SparkSession does not exist.
At the point where the first SparkContext is stopped, and the second will be created, a SparkContext does not exist.
When the second SparkSession is created, "spark.jars.packages" and "spark.jars.repositories" are correctly registered in the Spark Configuration.
What am I missing here? I dug through the pyspark source code to check why this is the case and could not figure it out. Thanks so much.

Related

How to avoid jar conflicts in a databricks workspace with multiple clusters and developers working in parallel?

We are working in an environment where multiple developers upload jars to a Databricks cluster with the following configuration:
DBR: 7.3 LTS
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
Scala: 2.12.10
Python: 3.7.5
R: R version 3.6.3 (2020-02-29)
Delta Lake: 0.7.0
Build tool: Maven
Below is our typical workflow:
STEP 0:
Build version 1 of the jar (DemoSparkProject-1.0-SNAPSHOT.jar) with the following object:
object EntryObjectOne {
def main(args:Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("BatchApp")
.master("local[*]")
.getOrCreate()
import spark.implicits._
println("EntryObjectOne: This object is from 1.0 SNAPSHOT JAR")
val df: DataFrame = Seq(
(1,"A","2021-01-01"),
(2,"B","2021-02-01"),
(3,"C","2021-02-01")
).toDF("id","value", "date")
df.show(false)
}
}
STEP 1:
Uninstall the old jar(s) from the cluster, and keep pushing new changes in subsequent versions with small changes to the logic. Hence, we push jars with versions 2.0-SNAPSHOT, 3.0-SNAPSHOT etc.
At a point in time, when we push the same object with the following code in the jar say (DemoSparkProject-4.0-SNAPSHOT.jar):
object EntryObjectOne {
def main(args:Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("BatchApp")
.master("local[*]")
.getOrCreate()
import spark.implicits._
println("EntryObjectOne: This object is from 4.0 SNAPSHOT JAR")
val df: DataFrame = Seq(
(1,"A","2021-01-01"),
(2,"B","2021-02-01"),
(3,"C","2021-02-01")
).toDF("id","value", "date")
df.show(false)
}
}
When we import this object in the notebook and run the main function we still get the old snapshot version jar println statement (EntryObjectOne: This object is from 1.0 SNAPSHOT JAR). This forces us from running a delete on the dbfs:/FileStore/jars/* and restarting the cluster and pushing the latest snapshot again to make it work.
In essence when I run sc.listJars() the active jar in the driver is the latest 4.0-SNAPSHOT jar. Yet, I still see the logic from old snapshot jars even though they are not installed on the cluster at runtime.
Resolutions we tried/implemented:
We tried using the maven shade plugin, but unfortunately, Scala does not support it. (details here).
We delete the old jars from dbfs:/FileStore/jars/* and restart the cluster and install the new jars regularly. This works, but a better approach can definitely help. (details here).
Changing the classpath manually and building the jar with different groupId using maven also helps. But with lots of objects and developers working in parallel, it is difficult to keep track of these changes.
Is this the right way of working with multiple jar versions in DataBricks? If there is a better way to handle this version conflict issue in DataBricks it will help us a lot.
You can't do with libraries packaged as Jar - when you install library, it's put into classpath, and will be removed only when you restart the cluster. Documentation says explicitly about that:
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as Uninstall pending restart.
It's the same issue as with "normal" Java programs, Java just doesn't support this functionality. See, for example, answers to this question.
For Python & R it's easier because they support notebook-scoped libraries, where different notebooks can have different versions of the same library.
P.S. If you're doing unit/integration testing, my recommendation would be to execute tests as Databricks jobs - it will be cheaper, and you won't have conflict between different versions.
In addition to what's mentioned in the docs: when working with notebooks you could understand what's added on the driver by running this in a notebook cell:
%sh
ls /local_disk0/tmp/ | grep addedFile
This worked for me on Azure Databricks and it will list you all added jars.
Maybe force a cleanup with init scripts ?

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a scala jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
gives error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
Which from what I can tell means that it runs the bash command, but then still cannot find the retrieved package.
I am doing this on an EMR Notebook running a spark scala kernel.
Do I have to set some sort of spark library path in the jupyter environment?
That simply shouldn't work. What your code does is a simple attempt to start a new independent Spark shell. Furthermore Spark packages have to loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or use equivalent in your SparkConf / SparkSessionBuilder.config before SparkSession is initialized.

How to add jar files in pyspark anaconda?

from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g")
sc=SparkContext.getOrCreate(conf)
dfv = sc.textFile("./part-001*.gz")
I have install pyspark thru anaconda and I can import pyspark in anaconda python. But I don't know how to add jar files in conf.
I tried
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.packages','file:///XXX,jar')
but it doesn't work.
Any proper way to add jar file here ?
The docs say:
spark.jars.packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
Instead, you should simply use spark.jars:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
So:
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.files','file:///XXX.jar')

Running Scala Spark Jobs on Existing EMR

I have Spark Job aggregationfinal_2.11-0.1 jar which I am running on my machine.the composition of it is as follows :
package deploy
object FinalJob {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName(s"${this.getClass.getSimpleName}")
.config("spark.sql.shuffle.partitions", "4")
.getOrCreate()
//continued code
}
}
When I am running this code in local mode, it is running fine but when I am deploying this on the EMR cluster with putting its jar in main node.It is giving error as :
ClassNotFoundException : deploy.FinalJob
What am i missing here?
The best option is to deploy your uber jar(you can use sbt assembly plugin to build jar) to s3 and add spark step to EMR cluster. Please check:http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
try to unjar it to some folder and look for target/classes using the below command jar -xvf myapp.jar. If the target classes is not containing the class you are executing then there is an issue with the way you build your jar. I would recommend maven assembly to be in your pom for packaging.

SparkContext cannot be initialized in 'yarn-client' mode called from Scala-IDE

I have installed Cloudera VM (Single node) and inside this VM i have Spark running on top of Yarn. I would like to use Eclipse IDE (with scala plugin) for testing/learning with Spark.
If i instantiate SparkContext as following, everything works as i expected
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
However, if i want now to connect to local server by changing the master to 'yarn-client' then it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically im getting following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things i have tried so far:
1. Dependencies
I added all the dependencies through Maven repository
Cloudera version is 5.5 and corresponding Hadoop version is 2.6.0 and Spark version is 1.5.0.
2. Configurations
I added 3 path variables into Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody clarify me what is the problem here and ways to solve it?
I worked around it! I still don't understand what the exact problems is but i created a folder with my username in hadoop , i.e. /user/myusername directory and it worked. Anyway now i switched to Hortonworks distribution and i found it much more smoother to get started with than the Cloudera distribution.