Spark 2.0 set jars - scala

I am upgrading a Play/Scala application from Spark 1.6 to Spark 2.0 and am not quite sure how to set the jar files I want. Previously I would define a SparkConf, and one of the methods I could call was setJars, which allowed me to specify all of the jar files I wanted. Now I am using the SparkSession builder to construct my Spark conf and Spark context, and I do not see any similar method for specifying the jar files. How can I do this?
Here is how I previously created my sparkconf:
val sparkConf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName(sparkAppName)
  .set("spark.yarn.jar", "hdfs:///user/hadoop/spark-assembly-1.6.1-hadoop2.7.2.jar")
  .set("spark.eventLog.dir", "hdfs:///var/log/spark/apps")
  .set("spark.eventLog.enabled", "true")
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/jre-1.8.0-openjdk")
  .setJars(Seq(
    "ALL JAR FILES LISTED HERE"
  ))
What can I do using sparksession builder to accomplish the same thing as "setJars"?

You can use the .config(key, value) method to set spark.jars:
SparkSession.builder
.appName(sparkAppName)
.master(sparkMaster)
.config("spark.jars", commaSeparatedListOfJars)
.config(/* other stuff */)
.getOrCreate()
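Alternatively, if you already have a SparkConf built the old way, you can hand it to the builder as a whole; setJars simply populates the spark.jars property, so the two approaches end up equivalent. A minimal sketch, reusing the names from the question:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build the conf exactly as before, including setJars...
val sparkConf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName(sparkAppName)
  .setJars(Seq("ALL JAR FILES LISTED HERE"))

// ...and pass the whole SparkConf to the SparkSession builder
val spark = SparkSession.builder
  .config(sparkConf)
  .getOrCreate()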

Related

How to initialise SparkSession in Spark 3.x

I've been trying to learn Spark & Scala, and have an environment setup in IntelliJ.
I'd previously been using SparkContext to initialise my Spark instance successfully, using the following code:
import org.apache.spark._
val sc = new SparkContext("local[*]", "SparkTest")
When I tried to start loading .csv data in, most information I found used spark.read.format("csv").load("filename.csv") but this requires initialising a SparkSession object using:
val spark = SparkSession
  .builder()
  .master("local")
  .appName("Test")
  .getOrCreate()
But when I tried to use this, there doesn't seem to be any SparkSession in org.apache.spark._ in my version of Spark 3.x.
As far as I'm aware, the use of SparkContext is the Spark 1.x method, and SparkSession is Spark 2.x where spark.sql is built-in to the SparkSession object.
My question is whether I'm incorrectly trying to load SparkSession or if there's a separate way to approach initialising Spark (and loading .csv files) in Spark 3?
Spark version: 3.3.0
Scala version: 2.13.8
If you are using a Maven project, try adding the Spark dependencies to the POM file, in particular spark-sql, which is the module that provides SparkSession (it lives in org.apache.spark.sql, not org.apache.spark._). Otherwise, for the sake of troubleshooting, create a new Maven project, add the dependencies, and check whether you still have the same issue.
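If the project uses sbt rather than Maven, a hedged sketch of the equivalent dependency (shown as a comment) plus the corrected builder call would look roughly like this; the version is taken from the question, so adjust it to your setup:
// build.sbt (sketch): SparkSession is provided by the spark-sql artifact
// libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"

import org.apache.spark.sql.SparkSession // note the org.apache.spark.sql package

val spark = SparkSession
  .builder()            // builder() comes first, then master()/appName()
  .master("local[*]")
  .appName("Test")
  .getOrCreate()

// loading the CSV mentioned in the question
val df = spark.read.format("csv").load("filename.csv")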

NoSuchMethodError on new StreamingContext

I'm trying to create a StreamingContext, but it keeps throwing an exception on the line that creates the StreamingContext.
Here's my code
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Minutes, StreamingContext}

val spark = SparkSession
  .builder()
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Minutes(15))
And here's the stack trace
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
at org.apache.spark.streaming.scheduler.JobGenerator.liftedTree1$1(JobGenerator.scala:52)
at org.apache.spark.streaming.scheduler.JobGenerator.<init>(JobGenerator.scala:51)
at org.apache.spark.streaming.scheduler.JobScheduler.<init>(JobScheduler.scala:55)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:184)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:76)
at vn.fpt.fplay.kafka.StreamConsumer$.main(StreamConsumer.scala:19)
at vn.fpt.fplay.kafka.StreamConsumer.main(StreamConsumer.scala)
I've searched everywhere but can't find out what this error means. Does anybody know? Any help would be much appreciated.
As the stack trace shows, org.apache.spark.streaming.scheduler.JobGenerator (a class in spark-streaming) is trying to call a method on org.apache.spark.util.Utils (a class in spark-core).
There are two likely reasons for this:
spark-core is not added to your project dependencies, or
there is a version mismatch between the spark-core and spark-streaming libraries.
Check your sbt/Maven build and either add the missing dependency or align the versions.
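For example, a build.sbt sketch that declares the Spark version once so the two artifacts cannot drift apart (the version number is illustrative; use the release you are actually running):
// build.sbt sketch: a single sparkVersion keeps spark-core and spark-streaming aligned
val sparkVersion = "2.4.8" // illustrative; match your cluster/runtime

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)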

Change Hadoop version for Spark

How can I set the Hadoop version for a Spark application without submitting a jar and pointing at a specific Hadoop binary? Is that even possible?
I am just not sure how the Hadoop version can be changed while submitting a Spark application.
Something like this does not work:
val sparkSession = SparkSession
.builder
.master("local[*]")
.appName("SparkJobHDFSApp")
.getOrCreate()
sparkSession.sparkContext.hadoopConfiguration.set("hadoop.common.configuration.version", "2.7.4")
It can't be done. The Spark master and workers each have their own Hadoop JARs on the classpath, and your own application must be compatible with those.
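If it helps, you can at least confirm which Hadoop version those classpath JARs provide from inside the application; this is only an inspection sketch, not a way to change the version:
// Inspect (not change) the Hadoop version Spark was launched with
import org.apache.hadoop.util.VersionInfo

println(s"Hadoop version on the classpath: ${VersionInfo.getVersion}")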

Spark unit tests with hive on local metastore

I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The tests should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
val spark = SparkSession
.builder
.config(conf)
.enableHiveSupport()
.getOrCreate()
the creation of the spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in spark.
Is there another way to solve this?
I wouldn't want to have to keep track of that library and update it with every new Spark version.
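For reference, a minimal sketch of the kind of session the programming guide describes for a local, on-disk metastore; the warehouse path and app name here are illustrative, not taken from the question:
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hive-enabled session for tests backed by a local metastore.
// Derby's metastore_db is created in the working directory unless
// a hive-site.xml says otherwise; only the warehouse dir is set here.
val warehouseDir = Files.createTempDirectory("spark-warehouse").toString

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("hive-unit-test")
  .config("spark.sql.warehouse.dir", warehouseDir)
  .enableHiveSupport()
  .getOrCreate()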

How to compile spark-cassandra programs using Scala?

Lately I started learning Spark and Cassandra. I know that we can use Spark with Python, Scala, and Java, and I've read the docs on this page: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md. The thing is, after creating a program named testfile.scala with the code the document shows (I don't know if I'm right to use .scala), I don't know how to compile it. Can anyone guide me on what to do with it?
Here is testfile.scala:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import akka.actor.Props
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
val ssc = new StreamingContext(sc, Seconds(n))
val stream = ssc.actorStream[String](Props[SimpleStreamingActor], actorName, StorageLevel.MEMORY_AND_DISK)
val wc = stream.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
Scala projects are compiled by scalac, but it's quite low-level: you have to set up build paths and manage all the dependencies yourself, so most people fall back to a build tool such as sbt, which manages a lot of that for you. The other two commonly used build tools are Maven, which is favored by Java old-schoolers, and Gradle, which is more down to earth.
> how to import spark-cassandra-connector
I've set up an example project. Basically, you define all of your dependencies in build.sbt or its analog; here is how the dependency on spark-cassandra-connector is defined (line #12).
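For illustration, a minimal build.sbt along those lines might look like this (the version numbers are illustrative and must be aligned with the Spark version you run against):
// build.sbt sketch for a Spark + Cassandra project; versions are illustrative
name := "spark-cassandra-test"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.6.1",
  "org.apache.spark"   %% "spark-streaming"           % "1.6.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)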
> And, is it a rule that we have to code with class or object
Yes and no. If you build with sbt, all of your code files have to be wrapped in an object, but sbt also lets you code in its shell, where the input is not required to be wrapped (the same rules as in the ordinary Scala REPL). Next, both IDEA and Eclipse have worksheet capabilities, so you can create a test.sc and draft your code there.
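To make this concrete, here is a sketch of the snippet wrapped in an object for sbt (the object name is arbitrary, and only the batch part of the example is kept):
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// testfile.scala wrapped in an object so sbt/scalac can compile it;
// build a jar with `sbt package` and run it with spark-submit
object TestFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)

    val rdd = sc.cassandraTable("test", "kv")
    println(rdd.count)
    println(rdd.first)

    sc.stop()
  }
}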