How to create a Spark connection string based on configuration? - mongodb

I have the following config:
Databricks Runtime Version: 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Is this a correct connection string for Spark? I've never created one before.
from pyspark.sql import SparkSession
# Comma-separated Maven coordinates; the _2.11 suffix matches the runtime's Scala version
conn_str = "org.apache.spark:spark-avro_2.11:2.4.3,org.mongodb.spark:mongo-spark-connector_2.11:2.4.2"
spark = (
    SparkSession.builder
    .config("spark.jars.packages", conn_str)
    .config("spark.ui.showConsoleProgress", False)
    .getOrCreate()
)

If you're using the Databricks platform, the SparkSession is already initialized when the cluster starts, so setting spark.jars.packages in your own code may come too late to install the packages. It's better to install these libraries one at a time through the Libraries tab of the created cluster: use the Maven coordinates option to install org.apache.spark:spark-avro_2.11:2.4.3 and org.mongodb.spark:mongo-spark-connector_2.11:2.4.2 separately. See the documentation for details.
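Outside Databricks, where your own code creates the session, the value given to spark.jars.packages is a comma-separated list of Maven coordinates whose Scala suffix has to match the runtime. A minimal sketch of composing that value from the versions in the question; the helper name maven_coordinate is hypothetical and not part of any Spark API:

```python
def maven_coordinate(group: str, artifact: str, scala_version: str, version: str) -> str:
    """Return 'group:artifact_<scala>:version' for a Scala cross-built artifact."""
    return f"{group}:{artifact}_{scala_version}:{version}"


# Scala version taken from the runtime description in the question.
SCALA_VERSION = "2.11"

# Join the individual coordinates with commas, as spark.jars.packages expects.
packages = ",".join([
    maven_coordinate("org.apache.spark", "spark-avro", SCALA_VERSION, "2.4.3"),
    maven_coordinate("org.mongodb.spark", "mongo-spark-connector", SCALA_VERSION, "2.4.2"),
])
print(packages)
```

The resulting string is what would be passed as the second argument of .config("spark.jars.packages", ...), or split into separate entries for the Libraries tab.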

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark Standalone Cluster Setup ( 1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, we add the Iceberg dependency when launching the Spark shell, as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use it at all. Every command (even ones unrelated to Iceberg) fails with the same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Even the simple command below throws the same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency when launching the Spark shell, as below, but we still get the same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg runtime: we chose the Iceberg jar built for Spark 3.2 while running Spark 3.1. After switching to the matching dependency (i.e. the 3.1 runtime), we are able to launch the Spark shell with Iceberg. There is also no need to list org.apache.spark jars in --packages, since they are already on the classpath.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
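The fix comes down to deriving the runtime artifact from the Spark major.minor version actually installed. A small sketch of that rule, assuming the artifact naming pattern seen in the two coordinates in this thread (other Iceberg releases may name their artifacts differently); the function name iceberg_runtime_coordinate is hypothetical:

```python
def iceberg_runtime_coordinate(spark_version: str, scala_version: str, iceberg_version: str) -> str:
    """Build the --packages coordinate from the full Spark version string.

    Only the major.minor part of the Spark version ends up in the artifact name.
    """
    major, minor = spark_version.split(".")[:2]
    artifact = f"iceberg-spark-runtime-{major}.{minor}_{scala_version}"
    return f"org.apache.iceberg:{artifact}:{iceberg_version}"


# Versions from the environment listed in the question.
coordinate = iceberg_runtime_coordinate("3.1.2", "2.12", "0.13.1")
print(f"spark-shell --packages {coordinate}")
```

With the environment from the question (Spark 3.1.2, Scala 2.12, Iceberg 0.13.1) this yields the 3.1 runtime used in the working command above, not the 3.2 one.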

How to avoid jar conflicts in a Databricks workspace with multiple clusters and developers working in parallel?

We are working in an environment where multiple developers upload jars to a Databricks cluster with the following configuration:
DBR: 7.3 LTS
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
Scala: 2.12.10
Python: 3.7.5
R: R version 3.6.3 (2020-02-29)
Delta Lake: 0.7.0
Build tool: Maven
Below is our typical workflow:
STEP 0:
Build version 1 of the jar (DemoSparkProject-1.0-SNAPSHOT.jar) with the following object:
import org.apache.spark.sql.{DataFrame, SparkSession}
object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // Marker line used to tell which jar version is actually loaded
    println("EntryObjectOne: This object is from 1.0 SNAPSHOT JAR")
    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
STEP 1:
Uninstall the old jar(s) from the cluster, and keep pushing new changes in subsequent versions with small changes to the logic. Hence, we push jars with versions 2.0-SNAPSHOT, 3.0-SNAPSHOT etc.
At some point, we push the same object with the following code in a new jar, say DemoSparkProject-4.0-SNAPSHOT.jar:
import org.apache.spark.sql.{DataFrame, SparkSession}
object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // Marker line used to tell which jar version is actually loaded
    println("EntryObjectOne: This object is from 4.0 SNAPSHOT JAR")
    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
When we import this object in the notebook and run the main function, we still get the println statement from the old snapshot jar (EntryObjectOne: This object is from 1.0 SNAPSHOT JAR). This forces us to delete dbfs:/FileStore/jars/*, restart the cluster, and push the latest snapshot again to make it work.
In essence, when I run sc.listJars(), the active jar on the driver is the latest 4.0-SNAPSHOT jar. Yet I still see the logic from the old snapshot jars, even though they are no longer installed on the cluster at runtime.
Resolutions we tried/implemented:
We tried using the maven shade plugin, but unfortunately, Scala does not support it. (details here).
We delete the old jars from dbfs:/FileStore/jars/* and restart the cluster and install the new jars regularly. This works, but a better approach can definitely help. (details here).
Changing the classpath manually and building the jar with different groupId using maven also helps. But with lots of objects and developers working in parallel, it is difficult to keep track of these changes.
Is this the right way of working with multiple jar versions in Databricks? If there is a better way to handle this version conflict issue in Databricks, it would help us a lot.
You can't do this with libraries packaged as jars: when you install a library, it is put on the classpath and is removed only when you restart the cluster. The documentation says so explicitly:
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as Uninstall pending restart.
It's the same issue as with "normal" Java programs: the JVM does not support replacing classes that are already loaded from the classpath. See, for example, answers to this question.
For Python & R it's easier because they support notebook-scoped libraries, where different notebooks can have different versions of the same library.
P.S. If you're doing unit/integration testing, my recommendation would be to execute the tests as Databricks jobs: it will be cheaper, and you won't have conflicts between different versions.
In addition to what's mentioned in the docs: when working with notebooks, you can see what has been added on the driver by running this in a notebook cell:
%sh
ls /local_disk0/tmp/ | grep addedFile
This worked for me on Azure Databricks; it lists all added jars.
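The same listing can be sketched in a Python notebook cell, which makes it easier to filter or compare the result programmatically. This assumes the directory from the shell command above exists on the driver; the function name added_files is hypothetical:

```python
import os


def added_files(tmp_dir: str = "/local_disk0/tmp/") -> list:
    """Return the sorted entries of tmp_dir whose names contain 'addedFile'.

    Equivalent to `ls <tmp_dir> | grep addedFile`. The default path is the one
    from the shell cell above and may differ on other setups.
    """
    return sorted(name for name in os.listdir(tmp_dir) if "addedFile" in name)


# Usage in a notebook cell on the driver:
#     for name in added_files():
#         print(name)
```

If an old snapshot jar still shows up in this list after being uninstalled, the cluster has not been restarted since the uninstall.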
Maybe force a cleanup with init scripts?

Connecting to Hive from Spark/Scala

I have installed Hadoop-3.3.0 and Hive-3.1.2 in Ubuntu on WSL (Windows Subsystem for Linux).
All Hadoop, YARN and HiveServer2 daemons are running in Ubuntu WSL.
On my Windows host OS, I open Scala IDE. From Spark/Scala, I would like to connect to the Hive tables that are available in Ubuntu WSL.
On Windows, I have nothing related to Hadoop/Hive installed. Everything is available only in Ubuntu WSL.
Can someone please help me do this in Scala IDE?
I build the project with Maven.
Code I use:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("My APP")
  .config("spark.sql.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("show tables").show()
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport
Thanks!

Run Spark application on a different version of Spark remotely

I have a few Spark tests, written in Scala, that run fine remotely through Maven on Spark 1.6.0. Now I want to run these tests on Spark 2. The problem is that Cloudera uses Spark 1.6 by default. Where does Cloudera take this version from, and what do I need to do to change the default Spark version? Spark 1.6 and Spark 2 are both present on the same cluster, on top of YARN. The Hadoop config files are present on the cluster that I use to run the tests in the test environment, and this is how I get the Spark context:
def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  new SparkContext(conf)
}
Is there any way I can specify the version in the conf files or in Cloudera itself?
When submitting a new Spark job, there are two places where you have to change the Spark version:
1. Set SPARK_HOME to the (local) path that contains the correct Spark installation. (Sometimes, especially for minor release changes, the version in SPARK_HOME does not have to match 100%, although I would recommend keeping things clean.)
2. Tell your cluster where the Spark jars are located. By default, spark-submit uploads the jars in SPARK_HOME to your cluster (this is one of the reasons why you should not mix versions). You can skip this upload by pointing the cluster manager at jars already stored in HDFS. As you are using Cloudera, I assume that your cluster manager is YARN. In this case, set either spark.yarn.jars or spark.yarn.archive to the path where the jars for the correct Spark version are located. Example: --conf spark.yarn.jars=hdfs://server:port/<path to your jars with the desired Spark version>
In any case you should make sure that the Spark version that you are using at runtime is the same as at compile time. The version you specified in your Maven, Gradle or Sbt configuration should always match the version referenced by SPARK_HOME or spark.yarn.jars.
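The two settings above can be combined in one launch step. A minimal sketch that only assembles the spark-submit command line and environment; the helper name spark_submit_command and every path, class name and jar name in the usage comment are hypothetical placeholders, not values from this thread:

```python
import os


def spark_submit_command(spark_home: str, yarn_jars: str, app_jar: str, main_class: str):
    """Return (argv, env) for a spark-submit run pinned to one Spark installation.

    spark_home: local path of the Spark version to use (exported as SPARK_HOME).
    yarn_jars:  value for spark.yarn.jars, e.g. an HDFS location of that version's jars.
    """
    # Copy the current environment and override SPARK_HOME for the child process.
    env = dict(os.environ, SPARK_HOME=spark_home)
    argv = [
        f"{spark_home}/bin/spark-submit",
        "--master", "yarn",
        "--conf", f"spark.yarn.jars={yarn_jars}",
        "--class", main_class,
        app_jar,
    ]
    return argv, env


# Usage (placeholder values):
#     import subprocess
#     argv, env = spark_submit_command(
#         "/opt/spark2", "hdfs:///spark2/jars/*", "my-tests.jar", "com.example.Main")
#     subprocess.run(argv, env=env, check=True)
```

Taking the spark-submit binary from the same spark_home that is exported as SPARK_HOME keeps the local launcher and the environment variable from drifting apart.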
I was able to run it successfully on Spark 2.3.0. The reason I could not run it on Spark 2.3.0 earlier was that I had added the spark-core dependency for version 1.6 in pom.xml. That is why, no matter which jar location we specified, it picked Spark 1.6 by default (still figuring out why). After changing the library version, I was able to run it.

Spark unit tests with Hive on a local metastore

I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The tests should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
val spark = SparkSession
  .builder
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
The creation of the Spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in Spark.
Is there another way to solve this?
I wouldn't want to keep track of the library and update it with every new Spark version.