I am new to spark and scala and I am having a hard time to submit a Spark job as YARN client. Doing this via the spark shell (spark submit) is no problem same goes for: first creating a spark job in eclipse then compile it into jar and use spark submit via the kernel shell, like:
spark-submit --class ebicus.WordCount /u01/stage/mvn_test-0.0.1.jar
However using Eclipse to directly compile and submit it to YARN seems difficult.
My project setup is the following: My cluster is running CDH cloudera 5.6. I have a maven project, using scala,My classpath / which is in sinc with my pom.xml
My code is as follows:
package test
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.TaskContext;
import akka.actor
import org.apache.spark.deploy.yarn.ClientArguments
import org.apache.spark.deploy.ClientArguments
object WordCount {
def main(args: Array[String]): Unit = {
// val workaround = new File(".");
System.getProperties().put("hadoop.home.dir", "c:\\winutil\\");
System.setProperty("SPARK_YARN_MODE", "true");
val conf = new SparkConf()
.setAppName("WordCount")
.setMaster("yarn-client")
.set("spark.hadoop.fs.defaultFS", "hdfs://namecluster.com:8020/user/username")
.set("spark.hadoop.dfs.nameservices", "namecluster.com:8020")
.set("spark.hadoop.yarn.resourcemanager.hostname", "namecluster.com")
.set("spark.hadoop.yarn.resourcemanager.address", "namecluster:8032")
.set("spark.hadoop.yarn.application.classpath",
"/etc/hadoop/conf,"
+"/usr/lib/hadoop/*,"
+"/usr/lib/hadoop/lib/*,"
+"/usr/lib/hadoop-hdfs/*,"
+"/usr/lib/hadoop-hdfs/lib/*,"
+"/usr/lib/hadoop-mapreduce/*,"
+"/usr/lib/hadoop-mapreduce/lib/*,"
+"/usr/lib/hadoop-yarn/*,"
+"/usr/lib/hadoop-yarn/lib/*,"
+"/usr/lib/spark/*,"
+"/usr/lib/spark/lib/*,"
+"/usr/lib/spark/lib/*"
)
.set("spark.driver.host","localhost");
val sc = new SparkContext(conf);
val file = sc.textFile("hdfs://namecluster.com:8020/user/root/testdir/test.csv")
//Count number of words from a hive table (split is based on char 001)
val counts = file.flatMap(line => line.split(1.toChar)).map(word => (word, 1)).reduceByKey(_ + _)
//swap key and value with count value and sort from high to low
val test = counts.map(_.swap).sortBy(word =>(word,1), false, 5)
test.saveAsTextFile("hdfs://namecluster.com:8020/user/root/test1")
}
}
I'm receiving the next error message on the log files of the hadoop resource manager
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>/etc/hadoop/conf<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/hadoop-hdfs/*<CPS>/usr/lib/hadoop-hdfs/lib/*<CPS>/usr/lib/hadoop-mapreduce/*<CPS>/usr/lib/hadoop-mapreduce/lib/*<CPS>/usr/lib/hadoop-yarn/*<CPS>/usr/lib/hadoop-yarn/lib/*<CPS>/usr/lib/spark/*<CPS>/usr/lib/spark/lib/*<CPS>/usr/lib/spark/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH
SPARK_LOG_URL_STDERR -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1461679867178_0026
SPARK_YARN_CACHE_FILES_FILE_SIZES -> 520473
SPARK_USER -> hadriaans
SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE
SPARK_YARN_MODE -> true
SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1462288779267
SPARK_LOG_URL_STDOUT -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stdout?start=-4096
SPARK_YARN_CACHE_FILES -> hdfs://cloudera-003.fusion.ebicus.com:8020/user/hadriaans/.sparkStaging/application_1461679867178_0026/spark-yarn_2.10-1.5.0.jar#__spark__.jar
command:
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=49961' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver#10.29.51.113:49961/user/CoarseGrainedScheduler --executor-id 4 --hostname cloudera-002.fusion.ebicus.com --cores 1 --app-id application_1461679867178_0026 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
16/05/03 17:19:58 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cloudera-002.fusion.ebicus.com:8041
16/05/03 17:20:01 INFO yarn.YarnAllocator: Completed container container_1461679867178_0026_01_000005 (state: COMPLETE, exit status: 1)
16/05/03 17:20:01 INFO yarn.YarnAllocator: Container marked as failed: container_1461679867178_0026_01_000005. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1461679867178_0026_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any tips or advises are welcome.
From my previous experience there are two possible scenarios which might be causing this not very descriptive error (I am submitting jobs from Eclipse, but using Java)
I noticed that you are not passing the JAR to the configuration of the SparkContext. If I remove the line which passed the JAR when submitting from Eclipse my code fails with exactly the same error. So basically you setup the path to the not-yet existing JAR into your code, then you export your project as Runnable JAR, which would package all the transitive dependencies into it, to the path you have previously set in your code. This is how it looks in Java:
SparkConf sparkConfiguration = new SparkConf();
sparkConfiguration.setJars(new String[] { "path to your jar" });
Check if your cluster is healthy, you might have your tmp directories full. Check all hadoop logging files, some of them (sorry cannot remember which ones) are giving more details (some warnings) when this happens.
Related
We are using Spark-Shell REPL Mode to test various use-cases and connecting to multiple sources/sinks
We need to add custom drivers/jars in spark-defaults.conf file, I have tried to add multiple jars separated by comma
like
spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
But its not working, Can anyone please provide details for correct syntax
Note: Verified in Linux Mint and Spark 3.0.1
If you are setting properties in spark-defaults.conf, spark will take those settings only when you submit your job using spark-submit.
Note: spark-shell and pyspark need to verify.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal run your job say wordcount.py
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE then you should use config() method. Here we will set Kafka jar packages and avro package. Also if you want to include log4j.properties, then use extraJavaOptions.
AppName and master can be provided in 2 way.
use .appName() and .master()
use .conf file
file: hellospark.py
from logger import Log4j
from util import get_spark_app_config
from pyspark.sql import SparkSession
# first approach.
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages",
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,
org.apache.spark:spark-avro_2.12:3.0.1") \
.config("spark.driver.extraJavaOptions",
"-Dlog4j.configuration=file:log4j.properties "
"-Dspark.yarn.app.container.log.dir=app-logs "
"-Dlogfile.name=hello-spark") \
.getOrCreate()
# second approach.
conf = get_spark_app_config()
spark = SparkSession.builder \
.config(conf=conf)
.config("spark.jars.packages",
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
logger = Log4j(spark)
file: logger.py
from pyspark.sql import SparkSession
class Log4j(object):
def __init__(self, spark: SparkSession):
conf = spark.sparkContext.getConf()
app_name = conf.get("spark.app.name")
log4j = spark._jvm.org.apache.log4j
self.logger = log4j.LogManager.getLogger(app_name)
def warn(self, message):
self.logger.warn(message)
def info(self, message):
self.logger.info(message)
def error(self, message):
self.logger.error(message)
def debug(self, message):
self.logger.debug(message)
file: util.py
import configparser
from pyspark import SparkConf
def get_spark_app_config(enable_delta_lake=False):
"""
It will read configuration from spark.conf file to create
an instance of SparkConf(). Can be used to create
SparkSession.builder.config(conf=conf).getOrCreate()
:return: instance of SparkConf()
"""
spark_conf = SparkConf()
config = configparser.ConfigParser()
config.read("spark.conf")
for (key, value) in config.items("SPARK_APP_CONFIGS"):
spark_conf.set(key, value))
if enable_delta_lake:
for (key, value) in config.items("DELTA_LAKE_CONFIGS"):
spark_conf.set(key, value)
return spark_conf
file: spark.conf
[SPARK_APP_CONFIGS]
spark.app.name = Hello Spark
spark.master = local[3]
spark.sql.shuffle.partitions = 3
[DELTA_LAKE_CONFIGS]
spark.jars.packages = io.delta:delta-core_2.12:0.7.0
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
As an example in addition to Prateek's answer, I have had some success by adding the following to the spark-defaults.conf file to be loaded when starting a spark-shell session in client mode.
spark.jars jars_added/aws-java-sdk-1.7.4.jar,jars_added/hadoop-aws-2.7.3.jar,jars_added/sqljdbc42.jar,jars_added/jtds-1.3.1.jar
Adding the exact line to the spark-defaults.conf file will load the three jar files as long as they are stored in the jars_added folder when spark-shell is run from the specific directory (doing this for me seems to mitigate the need to have the jar files loaded onto the slaves in the specified locations as well). I created the folder 'jars_added' in my $SPARK_HOME directory so whenever I run spark-shell I must run it from this directory (I have not yet worked out how to change the location the spark.jars setting uses as the initial path, it seems to default to the current directory when launching spark-shell). As hinted at by Prateek the jar files need to be comma separated.
I also had to set SPARK_CONF_DIR to $SPARK_HOME/conf (export SPARK_CONF_DIR = "${SPARK_HOME}/conf") for spark-shell to recognise the location of my config file (i.e. spark-defaults.conf). I'm using PuTTY to ssh onto the master.
Just to clarify once I have added the spark.jars jar1, jar2, jar3 to my spark-defaults.conf file I type the following to start my spark-shell session:
cd $SPARK_HOME //navigate to the spark home directory which contains the jars_added folder
spark-shell
On start up the spark-shell then loads the specified jar files from the jars_added folder
I want to use spark-submit to submit my spark application. The version of spark is 2.4.3. I can run the application by java -jar scala.jar.But there has some error when I run spark-submit master local --class HelloWorld scala.jar.
I am trying to change the submit-method including local, spark://ip:port but has not result. there is always throwing the error below when I modify path of jar anyway.
There is the code of my application.
import org.apache.spark.{SparkConf, SparkContext}
object HelloWorld {
def main(args: Array[String]): Unit = {
println("begin~!")
def conf = new SparkConf().setAppName("first").setMaster("local")
def sc = new SparkContext(conf)
def rdd = sc.parallelize(Array(1,2,3))
println(rdd.count())
println("Hello World")
sc.stop()
}
}
When I use spark-submit the error below will happen.
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/root/master
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am very sorry, the reason of the error happens is I forget to add '--' before master. So I try to run application by spark-submit --master local --class HelloWorld scala.jar. Finally, it is work fine.
I'm using sublime to write my first Scala program, and I'm using terminal to run it.
First I use scalac assignment2.scala command to compile it, but it show error message:"error: object apache is not a member of package org"
How can I do to fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("assignment2")
val sc = new SparkContext(conf)
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
}
}
Where are you trying to submit the job. To run any spark application you need to submit it from bin/spark-submit in your spark installation directory or you need to have spark-home set in your environment, which you can refer while submitting.
Actually you can't run spark-scala file directly because for compilation your scala class, you need spark library. So for executing scala file you required spark-shell. For executing your spark scala file inside spark-shell, please find the below steps:
Open your spark-shell using next command-
'spark-shell --master yarn-client'
load your file with exact location-
':load File_Name_With_Absoulte_path'
Run you main method using class name- 'ClassName.main(null)'
I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna. I created a maven project in the eclipse. I was to run the program within the eclipse with run as configuration option and able to get the output successfully. But when i create the jar for the following program and tried to run the spark shell am getting the following error.
error: ';' expected but 'class' found.
Command use to run the jar
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program which i tried
package com.kirthi.spark.proj.sparkexamples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordsCount {
def main(args: Array[String]){
val conf = new SparkConf()
.setAppName("Word Count")
.setMaster("local")
val sc = new SparkContext(conf)
val textFile = sc.textFile(args(0))
val words = textFile.flatMap(line => line.split(","))
val counts = words.map(word => (word,1))
val wordcount = counts.reduceByKey(_+_)
val wordcount_sorted = wordcount.sortByKey()
wordcount_sorted.foreach(println)
wordcount_sorted.saveAsTextFile(args(1))
}
}
Kindly help me out in this as I am struck with initial program for spark.
I am using cloudera quickstart CDH 5.5
As shown in the comments, you ran the above command in the Scala REPL, and which you should run it from a regular linux shell.
I have to run a simple wordcount on a cluster Hdinsight in Azure. I have created a cluster with hadoop and spark and i have already the jar file with the code, the problem that i don't know how to set-up the cluster and the right line of code to launch spark on Azure,I want to try different combination of nodes(workers , 2-4-8) to see how the program scale.
Every time i launch the app with spark-submit in mode yarn-client, it work but always with 2 executor and 1 core taking for 1gb input text file around 3 minute,also if i set more executor and more core he take the settings but he don't use that,so i think that the problem it's with the RDD, it don't split the input file in the right mode because it create only 2 task that start in 2 worknode and the other nodes remain inactive.
The jar file it's created with sbt package.
Command to launch Spark:
spark-submit --class "SimpleApp" --master yarn-client --num-executors 2 simpleapp_2.10-1.0.jar
WordCount Code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.io._
import org.apache.hadoop.fs
import org.apache.spark.rdd.RDD
object SimpleApp {
def main(args: Array[String]){
//settingsparkcontext
val conf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(conf)
//settingthewordtosearch
val word = "word"
//settingtime
val now = System.nanoTime
//settingtheinputfile
val input = sc.textFile("wasb://xxx#storage.blob.core.windows.net/dizionario1gb.txt")
//wordlookup
val splittedLines = input.map(line=>line.split(""))
val find = System.nanoTime
val tot = splittedLines.map(x => x.equals(word)).count()
val w=(System.nanoTime-find)/1000000
val rw=(System.nanoTime-now)/1000000
//reportingtheresultofexecutioninatxtfile
val writer = new FileWriter("D:\\Users\\user\\Desktop\\File\\output.txt",true)
try {
writer.write("Word found "+tot+" time total "+rw+" mstimesearch "+w+" time read "+(rw-w)+"\n")
}
finally writer.close()
//terminatingthesparkserver
sc.stop()
}}
Level of Parallelism
"Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc) You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster."
Source:
https://spark.apache.org/docs/1.3.1/tuning.html