Registering Kryo classes is not working - Scala

I have the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
new conf.registerKryoClasses(new Class<?>[]{
Class.forName("org.apache.hadoop.io.LongWritable"),
Class.forName("org.apache.hadoop.io.Text")
});
But I am bumping into the following error:
')' expected but '[' found.
[error] new conf.registerKryoClasses(new Class<?>[]{
How can I solve this problem?

You're mixing Scala and Java syntax. In Scala, you pass an Array[Class[_]] (instead of a Java Class<?>[]):
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](
    Class.forName("org.apache.hadoop.io.LongWritable"),
    Class.forName("org.apache.hadoop.io.Text")
  ))
val sc = new SparkContext(conf)
We can do even better: instead of risking typos in string literals, we can import the classes and use classOf to obtain their Class objects:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](
    classOf[LongWritable],
    classOf[Text]
  ))
val sc = new SparkContext(conf)
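As an optional extra (not part of the original answer), Spark's spark.kryo.registrationRequired setting can be turned on so that serialization fails loudly whenever an unregistered class is encountered, instead of silently falling back to writing full class names. A minimal sketch combining it with the registration above:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: throw an exception when an unregistered class hits the Kryo serializer.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array[Class[_]](
    classOf[LongWritable],
    classOf[Text]
  ))
val sc = new SparkContext(conf)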

Related

How to configure the gcs-connector properly in a local environment

I'm trying to configure the gcs-connector in my Scala project, but I always get java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
Here is my project config:
val sparkConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.driver.memory", "4g")
  .set("temporaryGcsBucket", "some-bucket")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("spark://spark-master:7077")
  .getOrCreate()
val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.gs.auth.service.account.enable", "true")
hadoopConfig.set("fs.gs.auth.service.account.json.keyfile", "./path-to-key-file.json")
hadoopConfig.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConfig.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
I tried to add the gcs-connector using both:
.set("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.6")
.set("spark.driver.extraClassPath", ":/home/celsomarques/Desktop/gcs-connector-hadoop2-2.1.6.jar")
But neither of them loads the specified class onto the classpath.
Could you point out what I'm doing wrong, please?
The following config worked:
val sparkConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.driver.memory", "4g")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("local")
  .getOrCreate()
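For reference, combining the working configuration above with the Hadoop settings from the question (same key file path and class names), the full setup would look roughly like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.driver.memory", "4g")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("local")
  .getOrCreate()

// GCS settings carried over from the question's Hadoop configuration.
val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.gs.auth.service.account.enable", "true")
hadoopConfig.set("fs.gs.auth.service.account.json.keyfile", "./path-to-key-file.json")
hadoopConfig.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConfig.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")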

Can SparkContext and StreamingContext co-exist in the same program?

I am trying to set up Spark Streaming code which reads lines from a Kafka server but processes them using rules written in another local file. I am creating a StreamingContext for the streaming data and a SparkContext for applying all other Spark features, like string manipulation, reading local files, etc.
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")
val ssc = new StreamingContext(sparkConf, Seconds(15))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val sentence = lines.toString
val conf = new SparkConf().setAppName("Bi Gram").setMaster("local[2]")
val sc = new SparkContext(conf)
val stringRDD = sc.parallelize(Array(sentence))
But this throws the following error:
Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
One application can only have ONE SparkContext. A StreamingContext is created on top of a SparkContext, so you just need to create the StreamingContext from the existing SparkContext:
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(15))
If you use the following constructor:
StreamingContext(conf: SparkConf, batchDuration: Duration)
It internally creates another SparkContext:
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
The SparkContext can be obtained from the StreamingContext via
ssc.sparkContext
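Applied to the code in the question, a rough sketch with a single shared context looks like this (the Kafka values and the rules file name are placeholders standing in for the question's zkQuorum, group, topics, numThreads and local file):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder Kafka settings; substitute your own values.
val zkQuorum = "localhost:2181"
val group = "my-consumer-group"
val topics = "my-topic"
val numThreads = "1"

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")

// One SparkContext for the whole application...
val sc = new SparkContext(sparkConf)

// ...and the StreamingContext built on top of it, not from a second SparkConf.
val ssc = new StreamingContext(sc, Seconds(15))
ssc.checkpoint("checkpoint")

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

// sc is still available for ordinary RDD work, e.g. reading a local rules file.
val rules = sc.textFile("rules.txt")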
Yes, you can do it. You have to first start a Spark session and then use its context to start any number of streaming contexts:
val spark = SparkSession.builder()
  .appName("someappname")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
Simple!!!

Importing Spark libraries using Intellij IDEA

I would like to use Spark SQL in an IntelliJ IDEA SBT project.
Even though I have imported the library, the code does not seem to import it.
Spark Core seems to be working, however.
You can't create a DataFrame from a Scala List[A]. You first need to create an RDD[A] and then transform that into a DataFrame. You also need an SQLContext:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val test = sc.parallelize(List(1, 2, 3, 4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()

    import spark.sqlContext.implicits._
  }
}
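Tying it back to the question, a minimal sketch of actually creating a DataFrame with the Spark 2.0 session (reusing the toy list from the first example):
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()

    import spark.sqlContext.implicits._

    // Same idea as the Spark 1.x example: parallelize a local List, then convert it.
    val test = spark.sparkContext.parallelize(List(1, 2, 3, 4)).toDF
    test.show()

    spark.stop()
  }
}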

Spark Streaming StreamingContext error

Hi, I have started learning Spark Streaming but I can't run a simple application.
My code is here:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
And I am getting an error like the following:
scala> val newscc = new StreamingContext(conf, Seconds(1))
15/10/21 13:41:18 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
Thanks
If you are using the spark-shell, and it seems like you are, you should not instantiate a StreamingContext from a SparkConf object; you should pass the shell-provided sc directly.
This means:
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
becomes,
val ssc = new StreamingContext(sc, Seconds(1))
It looks like you are working in the Spark shell.
There is already a SparkContext defined for you there, so you don't need to create a new one. The SparkContext in the shell is available as sc.
If you need a StreamingContext you can create one using the existing SparkContext:
val ssc = new StreamingContext(sc, Seconds(1))
You only need the SparkConf and SparkContext if you create an application.
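For completeness, a minimal sketch of the same code as a standalone application (outside the shell), reusing the master URL and socket parameters from the question:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    // Outside the shell there is no pre-built sc, so create the conf and context yourself.
    val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}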

Class not found in simple spark application

I'm new to Spark and wrote a very simple Spark application in Scala as below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object test2object {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"
    val sc = new SparkContext("local", "Simple App", "/path/to/spark-0.9.1-incubating",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))
  }
}
I'm coding in Scala IDE and have included spark-assembly.jar in my project. I generate a jar file from my project and submit it to my local Spark cluster using the command spark-submit --class test2object --master local[2] ./file.jar, but I get this error message:
Exception in thread "main" java.lang.NoSuchMethodException: test2object.main([Ljava.lang.String;)
at java.lang.Class.getMethod(Class.java:1665)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:649)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What is wrong here?
P.S. My source code is under the project root directory (project/test2object.scala).
I haven't used Spark 0.9.1 before, but I believe the problem comes from this line of code:
val sc = new SparkContext("local", "Simple App", "/path/to/spark-0.9.1-incubating", List("target/scala-2.10/simple-project_2.10-1.0.jar"))
If you change it to this:
val conf = new SparkConf().setAppName("Simple App")
val sc = new SparkContext(conf)
This will work.
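Put together, a rough sketch of the revised application, keeping the file path and logic from the question (spark-submit supplies the master URL and ships the application jar, so neither needs to be hard-coded):
import org.apache.spark.{SparkConf, SparkContext}

object test2object {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"

    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)

    val logData = sc.textFile(logFile, 2).cache()
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))

    sc.stop()
  }
}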