How to initialise SparkSession in Spark 3.x - scala

I've been trying to learn Spark and Scala, and have an environment set up in IntelliJ.
I'd previously been using SparkContext to initialise my Spark instance successfully, using the following code:
import org.apache.spark._
val sc = new SparkContext("local[*]", "SparkTest")
When I tried to start loading .csv data, most of the information I found used spark.read.format("csv").load("filename.csv"), but this requires initialising a SparkSession object using:
val spark = SparkSession
  .builder()
  .master("local")
  .appName("Test")
  .getOrCreate()
But when I try to use this, there doesn't seem to be any SparkSession in org.apache.spark._ in my version of Spark 3.x.
As far as I'm aware, using SparkContext directly is the Spark 1.x approach, while SparkSession is the Spark 2.x+ approach, with spark.sql built into the SparkSession object.
My question is whether I'm incorrectly trying to load SparkSession, or whether there's a different way to initialise Spark (and load .csv files) in Spark 3?
Spark version: 3.3.0
Scala version: 2.13.8

If you are using a Maven-type project, try adding the Spark dependencies to the POM file. Otherwise, for the sake of troubleshooting, create a new Maven project, add the dependencies, and check whether you still have the same issue.
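In particular, SparkSession lives in org.apache.spark.sql, which comes from the spark-sql artifact rather than spark-core, so it is not reachable through import org.apache.spark._ alone, and builder() has to be called before master() or appName(). A minimal sketch of the Spark 3.x initialisation, assuming spark-sql is on the classpath and that filename.csv is just a placeholder path:
// SparkSession comes from the spark-sql module, not spark-core
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()                 // builder() first, then master()/appName()
  .master("local[*]")
  .appName("Test")
  .getOrCreate()

// spark.read is then available for loading CSV files
val df = spark.read
  .format("csv")
  .option("header", "true")  // assumes the file has a header row
  .load("filename.csv")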

Related

NoSuchMethodError on new StreamingContext

I'm trying to create a StreamingContext, but it keeps throwing an exception on the line that creates the StreamingContext.
Here's my code
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Minutes, StreamingContext}

val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Minutes(15))
And here's the stack trace
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
at org.apache.spark.streaming.scheduler.JobGenerator.liftedTree1$1(JobGenerator.scala:52)
at org.apache.spark.streaming.scheduler.JobGenerator.<init>(JobGenerator.scala:51)
at org.apache.spark.streaming.scheduler.JobScheduler.<init>(JobScheduler.scala:55)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:184)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:76)
at vn.fpt.fplay.kafka.StreamConsumer$.main(StreamConsumer.scala:19)
at vn.fpt.fplay.kafka.StreamConsumer.main(StreamConsumer.scala)
I've searched everywhere but can't find out what this error means. Does anybody know? Any help would be much appreciated.
As the stack trace shows, org.apache.spark.streaming.scheduler.JobGenerator (a class in Spark Streaming) is trying to call a method on org.apache.spark.util.Utils (a class in Spark Core).
There are two likely reasons for this:
Spark Core is not added to your project dependencies.
There is a version mismatch between the spark-core and spark-streaming libraries.
Check your sbt/Maven configuration and add the missing dependency or change it to a matching version.
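For example, with sbt the artifacts would be pinned to the same Spark version (the version number below is only illustrative; use whatever your cluster runs):
// build.sbt: keep spark-core and spark-streaming on the same Spark version
val sparkVersion = "3.3.0"  // illustrative

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)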

error not found value spark import spark.implicits._ import spark.sql

I am using Hadoop 2.7.2, HBase 1.4.9, Spark 2.2.0, Scala 2.11.8, and Java 1.8 on a Hadoop cluster composed of one master and two slaves.
When I run spark-shell after starting the cluster, it works fine.
I am trying to connect to HBase using Scala by following this tutorial: https://www.youtube.com/watch?v=gGwB0kCcdu0.
But when I try, as he does, to run spark-shell with those jars passed as arguments, I get this error:
spark-shell --jars
"hbase-annotations-1.4.9.jar,hbase-common-1.4.9.jar,hbase-protocol-1.4.9.jar,htrace-core-3.1.0-incubating.jar,zookeeper-3.4.6.jar,hbase-client-1.4.9.jar,hbase-hadoop2-compat-1.4.9.jar,metrics-json-3.1.2.jar,hbase-server-1.4.9.jar"
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
And after that, even if I log out and run spark-shell again, I have the same issue.
Can anyone please tell me what the cause is and how to fix it?
In your import statement, spark should be an object of type SparkSession. In spark-shell that object is normally created for you; otherwise you need to create it yourself (see the Spark docs). I didn't watch your tutorial video.
The point is that it doesn't have to be called spark. It could, for instance, be called sparkSession, and then you can do import sparkSession.implicits._.
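A minimal sketch of creating the session yourself and importing its implicits (the names here are illustrative):
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("HBaseTest")   // illustrative app name
  .getOrCreate()

// the implicits and sql come from the instance, so the name must match
import sparkSession.implicits._
import sparkSession.sql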

Change Hadoop version for Spark

How can I set a Hadoop version for a Spark application without submitting a jar and defining a specific Hadoop binary? Is that even possible?
I am just not really sure how the Hadoop version can be changed while submitting a Spark application.
Something like this does not work:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder
  .master("local[*]")
  .appName("SparkJobHDFSApp")
  .getOrCreate()
sparkSession.sparkContext.hadoopConfiguration
  .set("hadoop.common.configuration.version", "2.7.4")
It can't be done. The Spark master and workers each have their own Hadoop JARs on the classpath, with which your own application must be compatible.
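If the goal is just to see which Hadoop version the application actually runs against, one way is to ask Hadoop itself from the driver (a minimal sketch):
// Reports the Hadoop version found on the driver's classpath;
// it cannot be swapped out at runtime via configuration
import org.apache.hadoop.util.VersionInfo

println(s"Hadoop version on the classpath: ${VersionInfo.getVersion}")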

Spark unit tests with hive on local metastore

I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The tests should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .config(conf)           // conf is defined elsewhere in my code
  .enableHiveSupport()
  .getOrCreate()
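For reference, a minimal stand-in for conf that keeps the warehouse on the local disk (illustrative values only, using standard Spark settings) would look roughly like this:
import java.io.File
import org.apache.spark.SparkConf

// Illustrative only: points the Hive warehouse at a local directory;
// the embedded Derby metastore is created in the working directory by default
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("HiveUnitTest")
  .set("spark.sql.warehouse.dir", new File("spark-warehouse").getAbsolutePath)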
The creation of the Spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in Spark.
Is there another way to solve this?
I wouldn't want to have to keep track of the library and update it with every new Spark version.

Spark 2.0 set jars

I am upgrading from Spark 1.6 to 2.0 in a Play/Scala application and am not quite sure how to set the jar files I want. Previously a SparkConf would be defined, and one of the methods I could call was setJars, which allowed me to specify all of the jar files I wanted. Now I am using the SparkSession builder to construct my Spark conf and Spark context, and I do not see any similar method for specifying the jar files. How can I do this?
Here is how I previously created my sparkConf:
val sparkConf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName(sparkAppName)
  .set("spark.yarn.jar", "hdfs:///user/hadoop/spark-assembly-1.6.1-hadoop2.7.2.jar")
  .set("spark.eventLog.dir", "hdfs:///var/log/spark/apps")
  .set("spark.eventLog.enabled", "true")
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/jre-1.8.0-openjdk")
  .setJars(Seq(
    "ALL JAR FILES LISTED HERE"
  ))
What can I do using the SparkSession builder to accomplish the same thing as setJars?
You can use the .config(key, value) method to set spark.jars:
SparkSession.builder
  .appName(sparkAppName)
  .master(sparkMaster)
  .config("spark.jars", commaSeparatedListOfJars)
  .config(/* other stuff */)
  .getOrCreate()
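spark.jars takes a comma-separated list of jar paths, which is what setJars used to receive as a Seq. Alternatively, the builder also accepts a whole SparkConf, so an existing conf built with setJars can be passed through unchanged (a sketch, assuming sparkConf is the object from the question above):
import org.apache.spark.sql.SparkSession

// Builder.config(SparkConf) copies every setting from the conf, including
// the jars registered via setJars, into the new session
val spark = SparkSession.builder
  .config(sparkConf)
  .getOrCreate()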