SBT: Running Spark job on remote cluster from sbt - scala

I have a Spark job (let's call it WordCount) written in Scala, which I am able to run in the following ways:
Run on a local Spark instance from within sbt:
sbt> runMain WordCount [InputFile] [OutputDir] local[*]
Run on a remote Spark cluster by packaging the jar and using spark-submit:
sbt> package
$> spark-submit --master spark://192.168.1.1:7077 --class WordCount target/scala-2.10/wordcount_2.10-1.5.0-SNAPSHOT.jar [InputFile] [OutputDir]
Code:
// get arguments
val inputFile = args(0)
val outputDir = args(1)
// if a 3rd argument is given, use it as the master URL
val conf =
  if (args.length == 3) new SparkConf().setAppName("WordCount").setMaster(args(2))
  else new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
How can I run this job on a remote Spark cluster from sbt?

There is an sbt plugin for spark-submit: https://github.com/saurfang/sbt-spark-submit
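The plugin is enabled from project/plugins.sbt as described in its README. If you prefer to avoid the plugin, a minimal sketch of another route (an assumption on my part, not the plugin's mechanism) is to point the job at the remote master and ship the packaged jar to the executors with SparkConf.setJars; the jar path below is the one from the question:
// sketch: target the remote master and ship the packaged jar so executors can load WordCount
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://192.168.1.1:7077")
  .setJars(Seq("target/scala-2.10/wordcount_2.10-1.5.0-SNAPSHOT.jar"))
val sc = new SparkContext(conf)
After sbt> package, running sbt> runMain WordCount [InputFile] [OutputDir] should then run the driver inside sbt while the tasks execute on the cluster; you may also need fork := true in build.sbt so the job runs in a separate JVM.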

Related

SparkException: Cannot load main class from JAR file:/root/master

I want to use spark-submit to submit my Spark application. The Spark version is 2.4.3. I can run the application with java -jar scala.jar, but an error occurs when I run spark-submit master local --class HelloWorld scala.jar.
I have tried changing the submit method, including local and spark://ip:port, but with no result. The error below is always thrown, no matter how I modify the path of the jar.
Here is the code of my application:
import org.apache.spark.{SparkConf, SparkContext}
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("begin~!")
    def conf = new SparkConf().setAppName("first").setMaster("local")
    def sc = new SparkContext(conf)
    def rdd = sc.parallelize(Array(1, 2, 3))
    println(rdd.count())
    println("Hello World")
    sc.stop()
  }
}
When I use spark-submit, the error below occurs.
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/root/master
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am very sorry; the error happened because I forgot to add '--' before master. I ran the application with spark-submit --master local --class HelloWorld scala.jar, and now it works fine.

Error in running Scala in terminal: "object apache is not a member of package org"

I'm using Sublime to write my first Scala program, and I'm using the terminal to run it.
First I use the scalac assignment2.scala command to compile it, but it shows the error message: "error: object apache is not a member of package org".
What can I do to fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("assignment2")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))
  }
}
Where are you trying to submit the job from? To run any Spark application you need to submit it via bin/spark-submit in your Spark installation directory, or you need SPARK_HOME set in your environment so you can refer to it when submitting.
Actually, you can't compile a Spark Scala file directly with scalac, because compiling your Scala class requires the Spark library on the classpath. One way to execute your Scala file is through spark-shell; to do that, follow the steps below (an sbt-based alternative is sketched after them):
Open your spark-shell using the following command:
spark-shell --master yarn-client
Load your file with its exact location:
:load File_Name_With_Absolute_path
Run your main method using the class name:
ClassName.main(null)
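The sbt route is a small sketch like the following; the Scala and Spark versions are assumptions, so adjust them to match your cluster. In build.sbt:
name := "assignment2"
version := "0.1"
scalaVersion := "2.11.12"
// "provided" because spark-submit supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" % "provided"
Then package and submit (the jar path is what sbt typically produces under target/):
$> sbt package
$> spark-submit --class assignment2 --master local[*] target/scala-2.11/assignment2_2.11-0.1.jar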

ClassNotFoundException: com.databricks.spark.csv.DefaultSource

I am trying to export data from Hive using Spark with Scala, but I am getting the following error.
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
My Scala script is like below.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also tried this command, but the issue is still not resolved. Please help me.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
If you wish to run that script the way you are running it, you'll need to use the --jars option for local jars or --packages for a remote repository when you run the command.
So running the script should look like this:
spark-shell -i /path/to/script.scala --packages com.databricks:spark-csv_2.10:1.5.0
If you also want to stop the spark-shell after the job is done, you'll need to add:
System.exit(0)
at the end of your script.
PS: You won't need to fetch this dependency with Spark 2.x, where CSV support is built in.
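For reference, a minimal sketch of the same export on Spark 2.x using the built-in CSV source, so no external package is needed; spark here is the SparkSession that spark-shell provides, and the output path is just a placeholder:
// Spark 2.x: built-in CSV writer, no com.databricks:spark-csv package required
val df = spark.sql("SELECT * FROM sparksdata")
// writes a directory of part files at the given path
df.write.option("header", "true").csv("/root/Desktop/home_csv")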

Scala Word count jar is not running in spark

I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna and created a Maven project in Eclipse. I was able to run the program within Eclipse with the Run As configuration option and got the output successfully. But when I create the jar for the following program and try to run it from the spark shell, I get the following error.
error: ';' expected but 'class' found.
Command used to run the jar:
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program which I tried:
package com.kirthi.spark.proj.sparkexamples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordsCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val words = textFile.flatMap(line => line.split(","))
    val counts = words.map(word => (word, 1))
    val wordcount = counts.reduceByKey(_ + _)
    val wordcount_sorted = wordcount.sortByKey()
    wordcount_sorted.foreach(println)
    wordcount_sorted.saveAsTextFile(args(1))
  }
}
Kindly help me out with this, as I am stuck on my initial Spark program.
I am using the Cloudera QuickStart CDH 5.5.
As shown in the comments, you ran the above command in the Scala REPL, whereas you should run it from a regular Linux shell.
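The message error: ';' expected but 'class' found comes from the Scala REPL trying to compile the spark-submit command line as Scala code (most likely tripping over the class keyword in --class). The same command from the question works when typed at the operating-system prompt instead:
$> spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco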

Failed to load com.databricks.spark.csv while running with spark-submit

I am trying to run my code with spark-submit using the command below.
spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar
And my sbt file has the dependencies below:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.2.0"
My code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SQLContext
object SampleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sample App").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/root/input/Account.csv", "header" -> "true"))
    val column_names = df.columns
    val row_count = df.count
    val column_count = column_names.length
    var pKeys = ArrayBuffer[String]()
    for (i <- column_names) {
      if (row_count == df.groupBy(i).count.count) {
        pKeys += df.groupBy(i).count.columns(0)
      }
    }
    pKeys.foreach(print)
  }
}
The error:
16/03/11 04:47:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1253)
My Spark Version is 1.4.1 and Scala is 2.11.7
(I am following this link: http://www.nodalpoint.com/development-and-deployment-of-spark-applications-with-scala-eclipse-and-sbt-part-1-installation-configuration/)
I have tried the following versions of spark-csv:
spark-csv_2.10 1.2.0
1.4.0
1.3.1
1.3.0
1.2.0
1.1.0
1.0.3
1.0.2
1.0.1
1.0.0
etc.
Please help!
Since you are running the job in local mode, add the external jar paths using the --jars option:
spark-submit --class "SampleApp" --master local[2] --jars file:[path-of-spark-csv_2.11.jar],file:[path-of-other-dependency-jar] target/scala-2.11/sample-project_2.11-1.0.jar
e.g.
spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar
Another thing you can do is create a fat jar. In sbt you can follow proper-way-to-make-a-spark-fat-jar-using-sbt, and in Maven refer to create-a-fat-jar-file-maven-assembly-plugin.
Note: mark the scope of the Spark jars (i.e. spark-core, spark-streaming, spark-sql, etc.) as provided, otherwise the fat jar will become too fat to deploy.
A better solution is to use the --packages option, as below:
spark-submit --class "SampleApp" --master local[2] --packages com.databricks:spark-csv_2.10:1.5.0 target/scala-2.11/sample-project_2.11-1.0.jar
Make sure that the --packages option precedes the application jar.
You have added the spark-csv library to your sbt config, which means you can compile your code against it,
but that still doesn't mean it's present at runtime (spark-sql and spark-core are there by default).
So try using the --jars option of spark-submit to add the spark-csv jar to the runtime classpath, or build a fat jar (a sketch of doing that with sbt follows).
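A minimal fat-jar sketch with sbt-assembly; the plugin version, project name and library versions are assumptions aligned with the question, so verify them against the sbt-assembly README and your cluster. In project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
In build.sbt, mark Spark itself as provided and bundle only spark-csv:
name := "sample-project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.2.0"
)
Then build and submit the assembled jar:
$> sbt assembly
$> spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project-assembly-1.0.jar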
You are using the Spark 1.3 syntax of loading the CSV file into a dataframe.
If you check the repository here, you should use the following syntax on Spark 1.4 and higher:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("cars.csv")
I was looking for an option where I could skip the --packages option and provide the dependency directly in the assembly jar. The reason I faced this exception was
sqlContext.read.format("csv"), which assumes Spark already knows a data source called csv. Instead, to point at where the CSV format actually lives, use sqlContext.read.format("com.databricks.spark.csv") so it knows where to look for it and does not throw an exception.
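For illustration, the distinction described above in code (the path is a placeholder):
// fails here: "csv" is not a registered data source in this setup
val df1 = sqlContext.read.format("csv").load("/path/to/file.csv")
// resolves to the Databricks package bundled in the assembly jar
val df2 = sqlContext.read.format("com.databricks.spark.csv").load("/path/to/file.csv")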