Class not found in simple spark application - scala

I'm new to Spark and wrote a very simple Spark application in Scala as below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object test2object {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"
    val sc = new SparkContext("local", "Simple App", "/path/to/spark-0.9.1-incubating",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))
  }
}
I'm coding in Scala IDE and have added spark-assembly.jar to my project. I build a jar file from the project and submit it to my local Spark cluster with the command spark-submit --class test2object --master local[2] ./file.jar, but I get this error message:
Exception in thread "main" java.lang.NoSuchMethodException: test2object.main([Ljava.lang.String;)
at java.lang.Class.getMethod(Class.java:1665)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:649)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What is wrong here?
p.s. my source code is under the project root directory (project/test2object.scala)

I haven't used Spark 0.9.1, but I believe the problem comes from this line of code:
val sc = new SparkContext("local", "Simple App", "/path/to/spark-0.9.1-incubating", List("target/scala-2.10/simple-project_2.10-1.0.jar"))
If you change it to this:
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Simple App")
val sc = new SparkContext(conf)
it should work.
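For completeness, here is a minimal sketch of the whole application using the SparkConf approach from the answer above. It assumes a Spark version that ships SparkConf and leaves the master unset so that spark-submit --master supplies it. The helper countLinesContaining is a hypothetical name introduced only for illustration; it is not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object test2object {
  // Hypothetical helper: counts the lines that contain the given substring.
  def countLinesContaining(lines: RDD[String], word: String): Long =
    lines.filter(line => line.contains(word)).count()

  // The stack trace shows spark-submit looking up main(Array[String]) via reflection,
  // so the entry point keeps exactly that signature.
  def main(args: Array[String]): Unit = {
    val logFile = "src/data/sample.txt"
    // No setMaster here: the master comes from spark-submit --master.
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)
    try {
      val logData = sc.textFile(logFile, 2).cache()
      println("Lines with the: %s".format(countLinesContaining(logData, "the")))
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

It can be submitted the same way as in the question, for example spark-submit --class test2object --master local[2] ./file.jar.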

Related

scala-submit java.lang.ClassNotFoundException

Spark 2.7, Scala 2.12.7. When I use spark-submit to submit a simple WordCount project, I still get the error below, even though I have checked that the package and class name are correct:
java.lang.ClassNotFoundException
Details:
1. The submit command: ./bin/spark-submit --master spark://localhost.localdomain:7077 --class sparkTes.WordCount.scala /java/spark/scala.jar
2. The Spark code:
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("wordcount");
  val sc = new SparkContext(conf)
  val input = sc.textFile("/java/text/scala.md", 2).cache()
  val lines = input.flatMap(line=>line.split(" "))
  val count = lines.map(word => (word,1)).reduceByKey{case (x,y)=>x+y}
  val output = count.saveAsTextFile("/java/text/WordCount")
}

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it by scalatest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .config("spark.master", "local[2]")
  .getOrCreate()
This code works fine in unit tests.
But when I run it with spark-submit, the cluster options have no effect.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I found that the spark-submit arguments are applied once I remove the config("spark.master", "local[2]") part of the code above.
But without the master setting, the unit tests do not run.
I tried to split the creation of the spark (SparkSession) object between test and main.
But many code blocks need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there a best practice for creating the spark object so that the same code works both in tests and under spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")
  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _
  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")
    ss = SparkSession.builder().config(conf).getOrCreate()
    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }
  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
Passing --master yarn to spark-submit sets YARN as the master.
This conflicts with a master("x") call in your code, even if it is master("yarn").
If you want to use import sparkSession.implicits._ for toDF, toDS, or other functions,
you can use a local sparkSession variable created like this:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
Do not call master("x") when the job is submitted with spark-submit --master yarn; set it only when running on your local machine.
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps.
Good luck!
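One possible way to keep a single code path is to set the local master only as a fallback. The sketch below assumes that spark-submit exposes --master to the driver as the spark.master property, which new SparkConf() picks up from system properties, and relies on SparkConf.setIfMissing to leave such a value alone. The names SessionFactory and create are hypothetical and not part of Spark.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark.
object SessionFactory {
  def create(appName: String, fallbackMaster: String = "local[2]"): SparkSession = {
    // new SparkConf() loads spark.* system properties, so a master passed via
    // spark-submit --master is already present and is left untouched.
    val conf = new SparkConf()
      .setAppName(appName)
      .setIfMissing("spark.master", fallbackMaster)
    SparkSession.builder().config(conf).getOrCreate()
  }
}
```

In a unit test, SessionFactory.create("test") would fall back to local[2]; under spark-submit --master yarn the same call should keep yarn, because the fallback is applied only when no master has been set.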
How about defining an object with a method that creates a singleton SparkSession instance, like MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have a different configuration.
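A rough sketch of that idea follows. MySparkSession.get() is the name suggested above; the Transforms object, its words and addLength functions, the Main object, and the column names are all hypothetical, made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length}

// Test-only singleton: the local master is hard-coded here, never in main.
object MySparkSession {
  private lazy val session: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .config("spark.ui.enabled", "false")
    .getOrCreate()

  def get(): SparkSession = session
}

object Transforms {
  // Code that needs the implicits takes the session as a parameter.
  def words(spark: SparkSession, values: Seq[String]): DataFrame = {
    import spark.implicits._
    values.toDF("word")
  }

  // A DataFrame-to-DataFrame transform needs no session of its own.
  def addLength(df: DataFrame): DataFrame =
    df.withColumn("length", length(col("word")))
}

object Main {
  def main(args: Array[String]): Unit = {
    // No master set: spark-submit --master decides where this runs.
    val spark = SparkSession.builder().appName("YourName").getOrCreate()
    Transforms.addLength(Transforms.words(spark, Seq("spark", "scala"))).show()
    spark.stop()
  }
}
```

A test would call Transforms.words(MySparkSession.get(), ...) and Transforms.addLength(...) directly, so neither function has to know which master is in use.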

Importing Spark libraries using Intellij IDEA

I would like to use spark SQL in an Intellij IDEA SBT project.
Even though I have imported the library the code does not seem to import it.
Spark Core seems to be working however.
You can't create a DataFrame from a scala List[A]. You need first to create an RDD[A], and then transform that to a DataFrame. You also need an SQLContext:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1,2,3,4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL looks like:
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
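With a SparkSession the conversion from a local collection can be shorter, because in Spark 2.x spark.implicits._ also provides toDF on local sequences, which makes the intermediate RDD optional. A minimal sketch, assuming Spark 2.x with the spark-sql module on the classpath; the object name ListToDataFrame, the build function, and the column name n are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ListToDataFrame {
  def build(spark: SparkSession): DataFrame = {
    // Brings toDF/toDS into scope for local Seqs and for RDDs.
    import spark.implicits._
    List(1, 2, 3, 4).toDF("n")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```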

Spark Streaming StreamingContext error

Hi, I have started learning Spark Streaming, but I can't run a simple application.
My code is here:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
And I get the following error:
scala> val newscc = new StreamingContext(conf, Seconds(1))
15/10/21 13:41:18 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
Thanks
If you are using spark-shell, and it seems that you are, you should not instantiate StreamingContext from a SparkConf object; pass the shell-provided sc directly.
This means:
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
becomes:
val ssc = new StreamingContext(sc, Seconds(1))
It looks like you work in the Spark Shell.
There is already a SparkContext defined for you there, so you don't need to create a new one. The SparkContext in the shell is available as sc.
If you need a StreamingContext you can create one using the existing SparkContext:
val ssc = new StreamingContext(sc, Seconds(1))
You only need the SparkConf and SparkContext if you create an application.
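Putting this together for the shell, here is a sketch that completes the word count from the question on top of an existing SparkContext. The wrapper object ShellWordCount and its build method are hypothetical names; host and port are the ones from the question, and something has to be listening there (for example nc -lk 9999) for any output to appear.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical wrapper, introduced only for illustration.
object ShellWordCount {
  // Builds the streaming job on an existing SparkContext without starting it.
  def build(sc: SparkContext): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    counts.print() // at least one output operation is required before start()
    ssc
  }
}
```

In spark-shell this would be used as:
val ssc = ShellWordCount.build(sc)
ssc.start()
ssc.awaitTermination()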

What are All the ways we can run a scala code in Apache-Spark?

I know there are two ways to run Scala code in Apache Spark:
1- Using spark-shell
2- Building a jar file from our project and using spark-submit to run it
Is there any other way to run Scala code in Apache Spark? For example, can I run a Scala object (e.g. object.scala) in Apache Spark directly?
Thanks
1. Using spark-shell
2. Building a jar file from our project and using spark-submit to run it
3. Running a Spark job programmatically (Java):
String sourcePath = "hdfs://hdfs-server:54310/input/*";
SparkConf conf = new SparkConf().setAppName("TestLineCount");
conf.setJars(new String[] { App.class.getProtectionDomain()
    .getCodeSource().getLocation().getPath() });
conf.setMaster("spark://spark-server:7077");
conf.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> log = sc.textFile(sourcePath);
JavaRDD<String> lines = log.filter(x -> {
    return true;
});
System.out.println(lines.count());
Scala version:
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
  def main(args: Array[String]) {
    // Silence Spark and Akka logging so only the result is printed
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    val logFile = "/tmp/logs.txt"
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache
    println("line count: " + logData.count())
  }
}
for more detail refer to this blog post.