Error while submitting a Spark Scala job using spark-submit

I have written a simple app in Scala using Eclipse -> New Scala Project.
I am using Scala 2.10.6 and Spark 2.0.2. The app compiles without errors and I have also exported the jar file.
I am using the following command to execute the JAR:
spark-submit TowerTest.jar --class com.IFTL.EDI.LocateTower MobLocationData Output1
The Scala code snippet is as follows:
package com.IFTL.EDI
import scala.math.pow
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object LocateTower {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("TowerCount"))

    // helper used in finding the tower centroid: adds two (x, y) locations
    def addLocations(p1: (Double, Double), p2: (Double, Double)) = {
      (p1._1 + p2._1, p1._2 + p2._2)
    }
  }
This is not the full code. When I run this, I get the following error:
[cloudera@quickstart ~]$ spark-submit --class com.IFTL.EDI.LocateTower TowerTest.jar MobLocationData LocationOut1
java.lang.ClassNotFoundException: com.IFTL.EDI.LocateTower
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:176)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am new to Spark and Scala, so I'm not sure what I'm missing.

Try with this argument order; if you want to pass some parameters, you can put them after the jar file. Make sure you specify the path to the jar, or run the command from the jar's location:
spark-submit --class com.IFTL.EDI.LocateTower /Users/myJarFilePath/TowerTest.jar
Try it like that first; once you have it working you can add the command-line arguments.
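With the arguments from the question added back after the jar, the command would look like this (keeping the same placeholder jar path as above):
spark-submit --class com.IFTL.EDI.LocateTower /Users/myJarFilePath/TowerTest.jar MobLocationData Output1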

Related

How do I set up Scala in VS Code to run a simple "Hello World!" statement?

I am trying to run a simple hello-world statement in Scala in VS Code, but I get this output:
PS C:\Users\thmga\OneDrive\Documents\Notebook\vscode_test> scala "c:\Users\thmga\OneDrive\Documents\Notebook\vscode_test\test.scala"
Exception in thread "main" dotty.tools.scripting.StringDriverException: No main methods detected for [c:\Users\thmga\OneDrive\Documents\Notebook\vscode_test\test.scala]
at dotty.tools.scripting.StringDriverException$.apply(StringDriver.scala:47)
at dotty.tools.scripting.Util$.detectMainClassAndMethod(Util.scala:51)
at dotty.tools.scripting.ScriptingDriver.compileAndRun(ScriptingDriver.scala:28)
at dotty.tools.scripting.Main$.main(Main.scala:45)
at dotty.tools.MainGenericRunner$.run$1(MainGenericRunner.scala:249)
at dotty.tools.MainGenericRunner$.main(MainGenericRunner.scala:268)
at dotty.tools.MainGenericRunner.main(MainGenericRunner.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at coursier.bootstrap.launcher.a.a(Unknown Source)
at coursier.bootstrap.launcher.Launcher.main(Unknown Source)
The code:
@main def hello(): Unit =
  println("Hello, World!")
I tried with the file named "test.sc" and with "test.scala". When I used "test.scala", the Scala icon appeared beside the file.
I have installed Scala Syntax (official) and Scala (Metals). Does anyone have any clue? Thanks.

SparkException: Cannot load main class from JAR file:/root/master

I want to use spark-submit to submit my Spark application. The Spark version is 2.4.3. I can run the application with java -jar scala.jar, but there is an error when I run spark-submit master local --class HelloWorld scala.jar.
I have tried changing the submit method, including local and spark://ip:port, with no result. The error below is always thrown, however I modify the path of the jar.
Here is the code of my application:
import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("begin~!")
    def conf = new SparkConf().setAppName("first").setMaster("local")
    def sc = new SparkContext(conf)
    def rdd = sc.parallelize(Array(1, 2, 3))
    println(rdd.count())
    println("Hello World")
    sc.stop()
  }
}
When I use spark-submit, the error below occurs.
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/root/master
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am very sorry; the error happened because I forgot to add '--' before master. After running the application with spark-submit --master local --class HelloWorld scala.jar, it works fine.
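The difference is just the two dashes: without them spark-submit treats the word master as the application JAR (hence file:/root/master in the error) and everything after it as application arguments.
# wrong: 'master' is parsed as the application JAR, resolved to /root/master
spark-submit master local --class HelloWorld scala.jar
# right: '--master local' is parsed as an option and scala.jar as the application JAR
spark-submit --master local --class HelloWorld scala.jar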

Executing HDFS commands from inside a Scala script

I'm trying to execute an HDFS-specific command from inside a Scala script executed by Spark in cluster mode. Below is the command:
val cmd = Seq("hdfs","dfs","-copyToLocal","/tmp/file.dat","/path/to/local")
val result = cmd.!!
The job fails at this stage with the error as below:
java.io.FileNotFoundException: /var/run/cloudera-scm-agent/process/2087791-yarn-NODEMANAGER/log4j.properties (Permission denied)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:262)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:108)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
However, when I run the same command separately in Spark shell, it executes just fine and the file is copied as well.
scala> val cmd = Seq("hdfs","dfs","-copyToLocal","/tmp/file_landing_area/file.dat","/tmp/local_file_area")
cmd: Seq[String] = List(hdfs, dfs, -copyToLocal, /tmp/file_landing_area/file.dat, /tmp/local_file_area)
scala> val result = cmd.!!
result: String = ""
I don't understand the permission denied error, especially since it displays as a FileNotFoundException. Totally confusing.
Any ideas?
As per the error, it is trying to read a file under the /var folder, which I suspect is a configuration issue, or it is not pointing to the correct location.
Using Seq to execute HDFS shell commands is not good practice. It is useful only in the spark-shell; using the same approach in application code is not advisable. Instead, try the Hadoop FileSystem API below to move data to or from HDFS. Please check the sample code below, just for reference, which might help you.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val hdfspath = new Path("hdfs:///user/nikhil/test.csv")
val localpath = new Path("file:///home/cloudera/test/")
val fs = hdfspath.getFileSystem(conf)

fs.copyToLocalFile(hdfspath, localpath)
Please use the link below for more reference on the Hadoop FileSystem API:
https://hadoop.apache.org/docs/r2.9.0/api/org/apache/hadoop/fs/FileSystem.html#copyFromLocalFile(boolean,%20boolean,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.Path)
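The linked copyFromLocalFile overload goes in the opposite direction (local to HDFS); here is a minimal sketch, assuming a hypothetical local file and reusing the example HDFS path from above:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
val localfile = new Path("file:///home/cloudera/test/test.csv") // hypothetical local file
val hdfsdir = new Path("hdfs:///user/nikhil/")                  // hypothetical target directory
val fs = hdfsdir.getFileSystem(conf)

// delSrc = false keeps the local copy, overwrite = true replaces an existing HDFS file
fs.copyFromLocalFile(false, true, localfile, hdfsdir)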

Running a spark structured streaming application scala code written in a file using nohup

I have Spark structured streaming Scala code written to run in batch mode. I am trying to run it using:
nohup spark2-shell -i /home/sandeep/spark_test.scala --master yarn --deploy-mode client
Here is the spark_test.scala file:
import org.apache.spark.sql._
import org.apache.spark.sql.types.StructType
import org.apache.spark.SparkConf
val data_schema1 = new StructType().add("a","string").add("b","string").add("c","string")
val data_schema2 = new StructType().add("d","string").add("e","string").add("f","string")
val data1 = spark.readStream.option("sep", ",").schema(data_schema1).csv("/tmp/data1/")
val data2 = spark.readStream.option("sep", ",").schema(data_schema2).csv("/tmp/data2/")
data1.createOrReplaceTempView("sample_data1")
data2.createOrReplaceTempView("sample_data2")
val df = sql("select sd1.a,sd1.b,sd2.e,sd2.f from sample_data1 sd1,sample_Data2 sd2 ON sd1.a = sd2.d")
df.writeStream.format("csv").option("format", "append").option("path", "/tmp/output").option("checkpointLocation", "/tmp/output_cp").outputMode("append").start()
I need the application to run in the background even when the terminal is closed.
This is a very small application and I don't want to submit it using spark-submit. The code runs fine without nohup, but when I use nohup I get the error below.
java.io.IOException: Bad file descriptor
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:229)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
at java.io.BufferedInputStream.read(BufferedInputStream.java:246)
at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at mypackage.MyXmlParser.parseFile(MyXmlParser.java:397)
at mypackage.MyXmlParser.access$500(MyXmlParser.java:51)
at mypackage.MyXmlParser$1.call(MyXmlParser.java:337)
at mypackage.MyXmlParser$1.call(MyXmlParser.java:328)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:665)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:690)
at java.lang.Thread.run(Thread.java:799)
Add & at the end of your nohup command.
The "&" symbol at the end of the command instructs bash to run nohup mycommand in the background:
nohup spark2-shell -i /home/sandeep/spark_test.scala --master yarn --deploy-mode client &
Refer to this link for more details regarding the nohup command.
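If you also want to capture the driver's console output somewhere other than the default nohup.out, you can redirect it; the log path here is just an example:
nohup spark2-shell -i /home/sandeep/spark_test.scala --master yarn --deploy-mode client > /home/sandeep/spark_test.log 2>&1 &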

ClassNotFoundException: com.databricks.spark.csv.DefaultSource

I am trying to export data from Hive using Spark Scala, but I am getting the following error.
Caused by: java.lang.ClassNotFoundException:com.databricks.spark.csv.DefaultSource
My Scala script is like below:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also tried this command, but the issue is still not resolved. Please help me:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
If you wish to run that script the way you are running it, you'll need to use --jars for local jars or --packages for a remote repo when you run the command.
So running the script should be like this:
spark-shell -i /path/to/script/scala --packages com.databricks:spark-csv_2.10:1.5.0
If you also want to stop the spark-shell after the job is done, you'll need to add:
System.exit(0)
at the end of your script.
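For example, the tail of the script would then look like this (reusing the write call from the question):
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")

// terminate the spark-shell once the export has finished
System.exit(0)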
PS: You won't need to fetch this dependency with Spark 2.+.
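For reference, in Spark 2.+ the csv source is built in, so (assuming a SparkSession named spark, as provided by the spark-shell) the same export needs no external package:
val df = spark.sql("SELECT * FROM sparksdata")
df.write.format("csv").save("/root/Desktop/home.csv")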