Change Hadoop version for Spark - scala

How can I set a Hadoop version for a Spark application without submitting a jar and defining a specific Hadoop binary? Is it even possible?
I am just not really sure how the Hadoop version can be changed when submitting a Spark application.
Something like this does not work:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder
  .master("local[*]")
  .appName("SparkJobHDFSApp")
  .getOrCreate()
sparkSession.sparkContext.hadoopConfiguration.set("hadoop.common.configuration.version", "2.7.4")

It can't be done. The Spark Master and Workers each have their own Hadoop JARs on the classpath, and your own application must be compatible with those.
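As a quick sanity check (a minimal sketch, assuming the standard Hadoop client jars that ship with Spark are on the classpath), you can print the Hadoop version Spark is actually running against; the hadoop.common.configuration.version key above only writes a configuration value and does not swap any binaries:
import org.apache.hadoop.util.VersionInfo

// Prints the version of the Hadoop jars actually on the classpath;
// the hadoopConfiguration.set(...) call above has no effect on this.
println(s"Hadoop on classpath: ${VersionInfo.getVersion}")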

Related

How to initialise SparkSession in Spark 3.x

I've been trying to learn Spark & Scala, and have an environment set up in IntelliJ.
I'd previously been using SparkContext to initialise my Spark instance successfully, using the following code:
import org.apache.spark._
val sc = new SparkContext("local[*]", "SparkTest")
When I tried to start loading .csv data, most of the information I found used spark.read.format("csv").load("filename.csv"), but this requires initialising a SparkSession object using:
val spark = SparkSession
  .builder()
  .master("local")
  .appName("Test")
  .getOrCreate()
But when I tried to use this, there didn't seem to be any SparkSession in org.apache.spark._ in my version of Spark 3.x.
As far as I'm aware, SparkContext is the Spark 1.x approach, and SparkSession is the Spark 2.x approach, where spark.sql is built into the SparkSession object.
My question is whether I'm trying to load SparkSession incorrectly, or whether there's a different way to initialise Spark (and load .csv files) in Spark 3.
Spark version: 3.3.0
Scala version: 2.13.8
If you are using a Maven-type project, try adding the Spark dependencies to the POM file. Otherwise, for the sake of troubleshooting, create a new Maven-type project, add the dependencies, and check whether you still have the same issue.
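For reference, a minimal sketch of what usually resolves this, assuming the spark-sql artifact for your Scala version (e.g. spark-sql_2.13 for Spark 3.3.0) is declared as a dependency: SparkSession lives in org.apache.spark.sql rather than org.apache.spark, and builder() has to come before master():
import org.apache.spark.sql.SparkSession

// Spark 3.x: SparkSession wraps the SparkContext and exposes spark.read / spark.sql
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Test")
  .getOrCreate()

val df = spark.read.format("csv").load("filename.csv")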

How to create Spark connection str based on configuration?

I have the following config:
Databricks Runtime Version
5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Is this a correct connection string for Spark? I've never created one before.
from pyspark.sql import SparkSession

connection_str = "org.apache.spark:spark-avro_2.11:2.4.3,org.mongodb.spark:mongo-spark-connector_2.11:2.4.2"
spark = (
    SparkSession.builder
    .config("spark.jars.packages", connection_str)
    .config("spark.ui.showConsoleProgress", False)
    .getOrCreate()
)
If you're using the Databricks platform, then the SparkSession is already initialized when the cluster starts, so it could be too late to install packages this way. It's better to install these libraries one by one, using the Libraries tab of the created cluster - use the Maven coordinates option to install org.apache.spark:spark-avro_2.11:2.4.3 and org.mongodb.spark:mongo-spark-connector_2.11:2.4.2 separately. See the documentation for details.

Run spark application on a different version of spark remotely

I have a few Spark tests that I am running fine remotely through Maven on Spark 1.6.0, using Scala. Now I want to run these tests on Spark 2. The problem is Cloudera, which uses Spark 1.6 by default. Where is Cloudera taking this version from, and what do I need to do to change the default version of Spark? Also, Spark 1.6 and Spark 2 are present on the same cluster, both running on top of YARN. The Hadoop config files are present on the cluster which I am using to run the tests in the test environment, and this is how I am getting the Spark context:
def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  val sc = new SparkContext(conf)
  sc
}
Is there any way I can specify the version in the conf files or cloudera itself ?
When submitting a new Spark job, there are two places where you have to change the Spark version:
Set SPARK_HOME to the (local) path that contains the correct Spark installation. (Sometimes - especially for minor release changes - the version in SPARK_HOME does not have to be 100% correct, although I would recommend keeping things clean.)
Inform your cluster where the Spark jars are located. By default, spark-submit will upload the jars in SPARK_HOME to your cluster (this is one of the reasons why you should not mix versions). But you can skip this upload process by hinting the cluster manager to use jars already located in HDFS. As you are using Cloudera, I assume that your cluster manager is YARN. In this case, set either spark.yarn.jars or spark.yarn.archive to the path where the jars for the correct Spark version are located. Example: --conf spark.yarn.jars=hdfs://server:port/<path to your jars with the desired Spark version>
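For illustration, the same hint can also be set programmatically before the context is created; a sketch with a placeholder HDFS path, assuming the jars of the desired Spark version have already been uploaded there:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkTest")
  .setMaster("yarn")
  // Placeholder: HDFS directory holding the jars of the Spark version you want at runtime
  .set("spark.yarn.jars", "hdfs://server:port/<path to your jars with the desired Spark version>/*.jar")
val sc = new SparkContext(conf)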
In any case, you should make sure that the Spark version you are using at runtime is the same as at compile time. The version you specified in your Maven, Gradle, or sbt configuration should always match the version referenced by SPARK_HOME or spark.yarn.jars.
I was able to run it successfully on Spark 2.3.0. The reason I was unable to run it on Spark 2.3.0 earlier was that I had added the spark-core dependency in pom.xml for version 1.6. That's why, no matter what jar location we specified, it took Spark 1.6 by default (still figuring out why). After changing the library version, I was able to run it.
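For reference, the fix described above expressed as an sbt dependency (the project in question used Maven, but the idea is the same): pin spark-core to the runtime version, and optionally mark it provided so the cluster's own jars are used at runtime.
// sbt equivalent of the pom.xml change described above (illustrative)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"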

Spark unit tests with hive on local metastore

I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The tests should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
val spark = SparkSession
  .builder
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
The creation of the Spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in spark.
Is there another way to solve this?
I wouldn't want to have to keep track of this library and update it with every new Spark version.
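For context, a minimal sketch of the kind of local-metastore setup the programming guide describes (paths are illustrative, and this does not by itself address the DataNucleus error): when no hive-site.xml is on the classpath, Spark falls back to an embedded Derby metastore created as metastore_db in the working directory, and spark.sql.warehouse.dir controls where table data is written.
import org.apache.spark.SparkConf

// Illustrative conf for Hive-enabled unit tests with a purely local metastore and warehouse
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("hive-unit-test")
  .set("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") // placeholder local path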

Spark 2.0 set jars

I am upgrading from Spark 1.6 to 2.0 in a Play-Scala application and am not quite sure how to set the jar files I want. Previously a SparkConf would be defined, and one of the methods I could call was setJars, which allowed me to specify all of the jar files I wanted. Now I am using the SparkSession builder to construct my Spark conf and Spark context, and I do not see any similar method for specifying the jar files. How can I do this?
Here is how I previously created my sparkconf:
val sparkConf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName(sparkAppName)
  .set("spark.yarn.jar", "hdfs:///user/hadoop/spark-assembly-1.6.1-hadoop2.7.2.jar")
  .set("spark.eventLog.dir", "hdfs:///var/log/spark/apps")
  .set("spark.eventLog.enabled", "true")
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/jre-1.8.0-openjdk")
  .setJars(Seq(
    "ALL JAR FILES LISTED HERE"
  ))
What can I do using the SparkSession builder to accomplish the same thing as setJars?
You can use the .config(key, value) method to set spark.jars:
SparkSession.builder
  .appName(sparkAppName)
  .master(sparkMaster)
  .config("spark.jars", commaSeparatedListOfJars)
  .config(/* other stuff */)
  .getOrCreate()
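As a usage note, spark.jars expects a comma-separated list of jar paths and is the configuration-key counterpart of setJars (SparkConf.setJars writes the same spark.jars entry under the hood); it is distinct from spark.yarn.jar(s), which points YARN at the Spark runtime jars themselves.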